"Slowness"; Scrapy has an async downloader, this means that it will be 
always downloading requests through all its slots. So imagine this 
situation when a request gets stuck till timeout is reached; it will use 
time of your spider leaving the rest of the slots idle, because the spider 
already "finished" but it is waiting for this particular request to finish.

On Sunday, January 11, 2015 at 2:25:48 AM UTC-2, user12345 wrote:
>
> So I think the best way is to connect to `signals.spider_idle` and raise 
> `DontCloseSpider`. Feel free to point out any flaws here.
>
> On Saturday, January 10, 2015 at 5:43:29 PM UTC-8, user12345 wrote:
>>
>> I'm planning to have a daemon CrawlWorker (subclassing 
>> multiprocessing.Process) that monitors a queue for scrape requests.
>>
>> The responsibility of this worker is to take scrape requests from the 
>> queue and feed them to spiders. In order to avoid implementing batching 
>> logic (like waiting for N requests before creating a new spider), would it 
>> make sense to keep all my spiders alive, *add* more scrape requests to 
>> each spider whenever it goes idle, and keep it open even when there are 
>> no more scrape requests? 
>>
>> What would be the best, simplest, and most elegant way to implement this? 
>> It seems, given attributes like `start_urls`, that a spider is meant to be 
>> instantiated with an initial work list, do its work, and then die.
>>
>
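For completeness, a rough sketch of the worker side described in the quoted 
question, under the same assumptions -- CrawlWorker is just an illustrative 
name, not Scrapy API, and LongLivedSpider is the spider from the sketch above, 
assumed to be importable from your project:

import multiprocessing

from scrapy.crawler import CrawlerProcess

# e.g. from myproject.spiders.long_lived import LongLivedSpider  (illustrative path)


class CrawlWorker(multiprocessing.Process):
    """Illustrative daemon worker: owns the queue, runs one long-lived spider."""

    def __init__(self, work_queue):
        super(CrawlWorker, self).__init__()
        self.daemon = True
        self.work_queue = work_queue

    def run(self):
        # The Twisted reactor lives entirely inside this child process;
        # start() blocks until the spider is finally closed.
        process = CrawlerProcess({"LOG_LEVEL": "INFO"})
        process.crawl(LongLivedSpider, work_queue=self.work_queue)
        process.start()


if __name__ == "__main__":
    queue = multiprocessing.Queue()
    worker = CrawlWorker(queue)
    worker.start()
    # The parent can now push scrape requests at any time; the idle handler
    # picks them up and keeps the spider alive between batches.
    queue.put("http://example.com/")
    worker.join()  # in a real service the parent would keep feeding the queue instead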
