Hi, I'm trying to use Scrapy to do a broad crawl. I follow every link I find on every page, as long as the domain matches a rule. I seed the spider with a few public directories of websites and use very high concurrency. At the beginning it's fast (first minute: 1381 pages/min), but then the speed drops every minute, down to 50 pages/min after 6 minutes. From then on it's slow but stable ;)
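For context, this is roughly the shape of the spider (domain names and URLs here are placeholders, and this sketch assumes a recent Scrapy with CrawlSpider/LinkExtractor; my real code is in the gist linked below):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class BroadSpider(CrawlSpider):
        name = 'broad'
        # Seeded with a few public web directories (placeholder URLs).
        start_urls = [
            'http://example-directory-1.com/',
            'http://example-directory-2.com/',
        ]
        # Follow every link whose domain matches the rule.
        rules = (
            Rule(LinkExtractor(allow_domains=['example.com']),
                 follow=True, callback='parse_item'),
        )

        def parse_item(self, response):
            # No heavy parsing, just record the page.
            yield {'url': response.url}

and the concurrency part of my settings looks roughly like this (exact values varied between runs):

    CONCURRENT_REQUESTS = 256
    CONCURRENT_REQUESTS_PER_IP = 8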
CPU, memory, network and disk usage are all very low after the initial peak. I noticed more than 100k requests in the queue after a few minutes, but only a few thousand crawled pages. I tried different values for CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_IP and added more start_urls, but that only increased the initial peak. Is there something wrong with my settings? https://gist.github.com/vad/3c3859ee17c07bcb3636 In the gist I also put the stack trace I see in every thread (I used the debugging middleware). Regards