So the project is scraping several university websites. I've profiled the crawl as it runs, watching the engine and downloader slots, which eventually converge to a single domain that all the urls come from. Looking at the download latency in the headers, I don't see any degradation in response times. The drift toward an extremely long series of responses from a single domain is what led me to think I need a different scheduler. If there's any other info I can provide that would be more useful, let me know.
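To make the heap idea from my original post concrete, here is a minimal sketch in plain Python (heapq only, no scrapy integration — class name, delay value, and domains are all made up for illustration, and scrapy's real scheduler interface is different): domains sit in a heap keyed by the earliest time each one may be hit again, so the crawler always pulls from whichever domain is due next instead of draining one domain's backlog.

```python
import heapq
import time
from collections import defaultdict, deque

class DomainScheduler:
    """Sketch: serve the next url from whichever domain can be
    crawled earliest, enforcing a per-domain politeness delay."""

    def __init__(self, delay=4.0):
        self.delay = delay                 # seconds between hits to one domain
        self.queues = defaultdict(deque)   # domain -> pending urls
        self.heap = []                     # (next_allowed_time, domain)

    def enqueue(self, domain, url, now=None):
        now = time.monotonic() if now is None else now
        if not self.queues[domain]:
            # First url for this domain: it is crawlable immediately.
            heapq.heappush(self.heap, (now, domain))
        self.queues[domain].append(url)

    def next_request(self, now=None):
        now = time.monotonic() if now is None else now
        if not self.heap:
            return None
        next_time, domain = self.heap[0]
        if next_time > now:
            return None                    # no domain is due yet
        heapq.heappop(self.heap)
        url = self.queues[domain].popleft()
        if self.queues[domain]:
            # Re-schedule the domain for after its politeness delay.
            heapq.heappush(self.heap, (now + self.delay, domain))
        return url
```

With two urls queued for one domain and one for another, the scheduler interleaves them rather than blocking on the first domain's delay, which is exactly the behavior I'm after.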
On Sunday, February 7, 2016 at 4:09:03 PM UTC-6, Travis Leleu wrote:
>
> What site are you scraping? Lots of sites have good caching on common
> pages, but if you go a link or two deep, the site has to recreate the page.
>
> What I'm getting at is this - I think scrapy should handle this situation
> out of the box, and I'm wondering if the remote server is throttling you.
>
> Have you profiled the scrape of the urls to determine if there's
> throttling or timing issues?
>
> On Sat, Feb 6, 2016 at 8:25 PM, kris brown <kris.br...@gmail.com> wrote:
>
>> Hello everyone! Apologies if this topic appeared twice; my first attempt
>> to post it did not seem to show up in the group.
>>
>> Anyway, this is my first scrapy project and I'm trying to crawl multiple
>> domains (about 100), which has presented a scheduling issue. In trying to
>> be polite to the sites I'm crawling, I've set a reasonable download delay
>> and limited the IP concurrency to 1 for any particular domain. What I
>> think is happening is that the url queue fills up with many urls for a
>> single domain, which of course ends up dragging the crawl rate down to
>> about 15/minute. I've been thinking about writing a scheduler that would
>> return the next url based on a heap sorted by the earliest time a domain
>> can be crawled next. However, I'm sure others have faced a similar
>> problem, and as I'm a total beginner to scrapy I wanted to hear some
>> different opinions on how to resolve this. Thanks!
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to scrapy-users...@googlegroups.com.
>> To post to this group, send email to scrapy...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
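For reference, the politeness setup described in my original post above corresponds to something like this in the project's settings.py (the delay value is illustrative; DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_IP are standard scrapy settings, and per scrapy's docs a non-zero CONCURRENT_REQUESTS_PER_IP makes the delay apply per IP as well):

```python
# settings.py (fragment) - be polite to each of the ~100 domains
DOWNLOAD_DELAY = 4            # seconds between requests to the same slot
CONCURRENT_REQUESTS_PER_IP = 1  # at most one in-flight request per IP
```

The global CONCURRENT_REQUESTS limit stays at its default, so the crawl as a whole can still run in parallel across domains — the bottleneck is only that the default queue keeps feeding urls from one domain.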