Just a quick, ugly tip... set the download timeout <http://doc.scrapy.org/en/latest/topics/settings.html#download-timeout> to 3 seconds, get the responsive websites handled first, and then try another approach with the slower ones (or skip them altogether?)
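To make that concrete, here's a minimal sketch of both ways to set it, assuming a standard Scrapy project. The spider name, example URL and the handle_timeout errback are placeholders of mine, not anything Scrapy ships with:

# settings.py -- project-wide (the default DOWNLOAD_TIMEOUT is 180 seconds)
DOWNLOAD_TIMEOUT = 3

# ...or per request, via the download_timeout meta key:
import scrapy
from twisted.internet.error import TimeoutError

class UniversitySpider(scrapy.Spider):
    name = "universities"                       # placeholder name
    start_urls = ["http://example.edu/"]        # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={"download_timeout": 3},   # overrides the global setting for this request
                errback=self.handle_timeout,
            )

    def handle_timeout(self, failure):
        # log the slow host and move on; you can retry it later with a larger timeout
        if failure.check(TimeoutError):
            self.logger.info("Timed out: %s", failure.request.url)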
Also, don't be over-polite... if you could do something with a browser, I think it's fair to do it with Scrapy. (Re. the heap-based scheduler idea in the quoted thread, there's a rough sketch of that ordering logic at the bottom of this message.)

On Monday, February 8, 2016 at 1:01:19 AM UTC, kris brown wrote:
>
> So the project is scraping several university websites. I've profiled the
> crawl as it's going, to see the engine and downloader slots, which eventually
> converge to just having a single domain that urls come from. Having looked
> at the download latency on the headers, I don't see any degradation of
> response times. The drift towards an extremely long series of responses
> from a single domain is what led me to think I need a different scheduler.
> If there's any other info I can provide that would be more useful, let me
> know.
>
> On Sunday, February 7, 2016 at 4:09:03 PM UTC-6, Travis Leleu wrote:
>>
>> What site are you scraping? Lots of sites have good caching on common
>> pages, but if you go a link or two deep, the site has to recreate the page.
>>
>> What I'm getting at is this - I think Scrapy should handle this situation
>> out of the box, and I'm wondering if the remote server is throttling you.
>>
>> Have you profiled the scrape of the urls to determine if there's
>> throttling or timing issues?
>>
>> On Sat, Feb 6, 2016 at 8:25 PM, kris brown <kris.br...@gmail.com> wrote:
>>
>>> Hello everyone! Apologies if this topic appears twice; my first attempt
>>> to post it did not seem to show up in the group.
>>>
>>> Anyway, this is my first Scrapy project and I'm trying to crawl
>>> multiple domains (about 100), which has presented a scheduling issue. In
>>> trying to be polite to the sites I'm crawling, I've set a reasonable
>>> download delay and limited the IP concurrency to 1 for any particular
>>> domain. What I think is happening is that the url queue fills up with many
>>> urls for a single domain, which of course ends up dragging the crawl rate
>>> down to about 15/minute. I've been thinking about writing a scheduler that
>>> would return the next url based on a heap sorted by the earliest time a
>>> domain can be crawled next. However, I'm sure others have faced a similar
>>> problem, and as I'm a total beginner to Scrapy I wanted to hear some
>>> different opinions on how to resolve this. Thanks!
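For what it's worth, here's a rough sketch of the per-domain heap idea from the original post. It's only the queueing/ordering logic, not a drop-in Scrapy scheduler (wiring it in would mean implementing enqueue_request/next_request yourself), and the class and parameter names are mine:

import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class DomainRoundRobin:
    """Hand out URLs ordered by the earliest time each domain may be hit again."""

    def __init__(self, delay=2.0):
        self.delay = delay                   # per-domain politeness delay, in seconds
        self.queues = defaultdict(deque)     # domain -> pending URLs (FIFO)
        self.heap = []                       # (next_allowed_time, domain)
        self.scheduled = set()               # domains currently sitting on the heap

    def push(self, url):
        domain = urlparse(url).netloc
        self.queues[domain].append(url)
        if domain not in self.scheduled:
            heapq.heappush(self.heap, (time.time(), domain))
            self.scheduled.add(domain)

    def pop(self):
        """Return the next URL whose domain is allowed to be crawled, or None."""
        while self.heap:
            next_time, domain = self.heap[0]
            if next_time > time.time():
                return None                  # nothing is ready yet; try again later
            heapq.heappop(self.heap)
            queue = self.queues[domain]
            if not queue:
                self.scheduled.discard(domain)
                continue                     # this domain is drained, look at the next one
            url = queue.popleft()
            # put the domain back, keyed by the earliest time it may be crawled again
            heapq.heappush(self.heap, (time.time() + self.delay, domain))
            return url
        return None

That said, you might get most of the way there without a custom scheduler: with ~100 domains, raising CONCURRENT_REQUESTS well above the default while keeping CONCURRENT_REQUESTS_PER_DOMAIN / CONCURRENT_REQUESTS_PER_IP at 1 lets the other domain slots keep working even when the queue is dominated by one slow host.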