Hello everyone! Apologies if this topic appears twice; my first attempt to 
post it didn't seem to show up in the group.  

Anyway, this is my first Scrapy project, and I'm trying to crawl multiple 
domains (about 100), which has presented a scheduling issue.  To be polite 
to the sites I'm crawling, I've set a reasonable download delay and limited 
the per-IP concurrency to 1 for any particular domain.  What I think is 
happening is that the URL queue fills up with many URLs for a single 
domain, which of course drags the overall crawl rate down to about 
15 pages/minute.  I've been thinking about writing a scheduler that would 
return the next URL based on a heap sorted by the earliest time each 
domain can be crawled next.  However, I'm sure others have faced a similar 
problem, and as I'm a total beginner with Scrapy I wanted to hear some 
different opinions on how to resolve this.  Thanks!
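
To make the idea concrete, here's a toy sketch of the heap approach I have 
in mind. This is standalone Python, not a real Scrapy scheduler, and all 
the names are made up; it just shows the data structure: a heap of 
(next-allowed-time, domain) pairs, plus a per-domain URL queue.

```python
import heapq
import time


class DomainHeap:
    """Toy sketch (NOT a real Scrapy scheduler): track the earliest time
    each domain may be crawled next, and always hand back a URL from the
    domain that becomes available soonest."""

    def __init__(self, delay):
        self.delay = delay   # per-domain politeness delay, in seconds
        self.queues = {}     # domain -> list of pending URLs
        self.heap = []       # heap of (next_allowed_time, domain)

    def push(self, domain, url, now=None):
        now = time.time() if now is None else now
        if domain not in self.queues:
            # First URL for this domain: it may be crawled immediately.
            self.queues[domain] = []
            heapq.heappush(self.heap, (now, domain))
        self.queues[domain].append(url)

    def pop(self, now=None):
        """Return (seconds_to_wait, url) for the soonest-ready domain,
        or None when every queue is empty."""
        now = time.time() if now is None else now
        while self.heap:
            next_time, domain = heapq.heappop(self.heap)
            urls = self.queues.get(domain)
            if not urls:
                continue  # domain already drained; discard its heap entry
            url = urls.pop(0)
            if urls:
                # Re-schedule the domain after the politeness delay.
                heapq.heappush(
                    self.heap, (max(next_time, now) + self.delay, domain)
                )
            return max(0.0, next_time - now), url
        return None
```

With two domains queued, pops interleave across domains instead of 
draining one domain first, which is exactly the behavior I'm after.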

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.