Hi there! I have a Scrapy setup that uses Selenium in the callbacks when a page contains JavaScript-loaded data. However, obtaining this data can take a very long time: I often scrape paginated pages with Selenium, and since every request creates a new Selenium session I end up with 10+ open browsers, which is quite memory intensive. It seems I need to change the point at which the scheduler marks a request as inactive: instead of doing it once the download has finished, do it once the response callbacks have finished.
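For concreteness, the per-callback pattern I mean looks roughly like this (a minimal, simplified sketch; the spider name, URLs and selectors are just placeholders):

import scrapy
from scrapy.selector import Selector
from selenium import webdriver


class PaginatedSpider(scrapy.Spider):
    name = 'paginated'
    start_urls = ['http://example.com/page/1']

    def parse(self, response):
        # a fresh Selenium session is started for every paginated page,
        # so a deep pagination chain quickly means 10+ browser launches
        driver = webdriver.Firefox()
        driver.get(response.url)
        sel = Selector(text=driver.page_source)
        for title in sel.css('.item ::text').extract():
            yield {'title': title}
        next_page = sel.css('a.next::attr(href)').extract_first()
        driver.quit()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)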
However, regarding this kind of functionality extension, the Scrapy documentation (http://scrapy.readthedocs.org/en/1.0/topics/api.html#scrapy.crawler.Crawler.engine) says: "The execution engine, which coordinates the core crawling logic between the scheduler, downloader and spiders. Some extension may want to access the Scrapy engine, to inspect or modify the downloader and scheduler behaviour, although this is an advanced use and this API is not yet stable."

How should I write such an extension? I believe I would have to extend the engine class and have the extension overwrite the crawler's engine attribute, something along the lines of the sketch at the end of this message. Or is there another approach to my problem that does not require changing how Scrapy behaves?

Best Regards,
Leonardo
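P.S. This is roughly what I imagine such an extension would look like (a rough, untested sketch; apart from crawler.engine, which the docs mention, the attribute names are my guesses and the engine API is documented as unstable):

from scrapy import signals


class EngineTweaker(object):

    def __init__(self, crawler):
        self.crawler = crawler
        # wait until the engine and spider exist before touching anything
        crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_opened(self, spider):
        engine = self.crawler.engine  # the execution engine the docs refer to
        # here I would inspect or patch the scheduler/downloader behaviour,
        # e.g. look at engine.downloader and the scheduler the engine holds
        spider.logger.info('engine: %r, downloader: %r', engine, engine.downloader)

It would be enabled through the EXTENSIONS setting, e.g. EXTENSIONS = {'myproject.extensions.EngineTweaker': 500}, but I am not sure this is the intended way to change when a request stops being counted as active.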
