Hi there! I have a Scrapy setup that uses Selenium in the callbacks when a page contains JavaScript-loaded data. However, obtaining this data can take a very long time: I often scrape paginated pages with Selenium, and since every request creates a new Selenium session I end up with 10+ open browsers, which is quite memory intensive. It seems I need to change the point at which the scheduler marks a request as inactive: instead of doing it once the download has finished, do it once the response callbacks have finished.
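For concreteness, the per-callback pattern I mean looks roughly like this (a minimal, simplified sketch; the spider name, URLs and selectors are just placeholders):

import scrapy
from scrapy.selector import Selector
from selenium import webdriver


class PaginatedSpider(scrapy.Spider):
    name = 'paginated'
    start_urls = ['http://example.com/page/1']

    def parse(self, response):
        # a fresh Selenium session is started for every paginated page,
        # so a deep pagination chain quickly means 10+ browser launches
        driver = webdriver.Firefox()
        driver.get(response.url)
        sel = Selector(text=driver.page_source)
        for title in sel.css('.item ::text').extract():
            yield {'title': title}
        next_page = sel.css('a.next::attr(href)').extract_first()
        driver.quit()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)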
However, regarding this kind of functionality extension, the Scrapy documentation (http://scrapy.readthedocs.org/en/1.0/topics/api.html#scrapy.crawler.Crawler.engine) says: "The execution engine, which coordinates the core crawling logic between the scheduler, downloader and spiders. Some extension may want to access the Scrapy engine, to inspect or modify the downloader and scheduler behaviour, although this is an advanced use and this API is not yet stable."

How should I write such an extension? I believe I would have to extend the engine class and have the extension overwrite the crawler's engine attribute, something along the lines of the sketch at the end of this message. Or is there another approach to my problem that does not require changing how Scrapy behaves?

Best Regards,
Leonardo
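P.S. This is roughly what I imagine such an extension would look like (a rough, untested sketch; apart from crawler.engine, which the docs mention, the attribute names are my guesses and the engine API is documented as unstable):

from scrapy import signals


class EngineTweaker(object):

    def __init__(self, crawler):
        self.crawler = crawler
        # wait until the engine and spider exist before touching anything
        crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_opened(self, spider):
        engine = self.crawler.engine  # the execution engine the docs refer to
        # here I would inspect or patch the scheduler/downloader behaviour,
        # e.g. look at engine.downloader and the scheduler the engine holds
        spider.logger.info('engine: %r, downloader: %r', engine, engine.downloader)

It would be enabled through the EXTENSIONS setting, e.g. EXTENSIONS = {'myproject.extensions.EngineTweaker': 500}, but I am not sure this is the intended way to change when a request stops being counted as active.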
