Dimitris, 
    Thanks for the tips.  I've taken your advice and set a download 
timeout on my crawl along with a download delay of a couple of seconds, 
but I still face the issue of a long string of URLs for a single domain 
sitting in my queues.  Since I'm only making a single concurrent request 
to any given IP address, this drags the overall crawl rate down even 
though each individual request downloads quickly.  So I'm wondering what 
my best option is for doing a broad crawl.  Implement my own scheduling 
queue?  Run multiple spiders?  I'm not sure which is best.  For my 
current project, with no pipelines implemented, I'm maxing out at about 
200-300 crawls per minute.  I definitely need to get that number much 
higher for the crawl to perform reasonably.  Thanks again for everyone's 
advice!
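
For context, here's roughly what the relevant knobs look like in my 
settings.py at the moment (a minimal sketch; these are all standard 
Scrapy settings, but the values are just my current guesses):

    # settings.py -- rough sketch; values are what I'm currently trying
    CONCURRENT_REQUESTS = 100        # global ceiling across all domains
    CONCURRENT_REQUESTS_PER_IP = 1   # the politeness limit I mentioned
    DOWNLOAD_DELAY = 2               # enforced per IP when a per-IP limit is set
    DOWNLOAD_TIMEOUT = 3             # per Dimitris's tip below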

On Monday, February 8, 2016 at 2:29:22 PM UTC-6, Dimitris Kouzis - Loukas 
wrote:
>
> Just a quick ugly tip... Set download timeout 
> <http://doc.scrapy.org/en/latest/topics/settings.html#download-timeout> to 
> 3 seconds... get done with handling the responsive websites and then try 
> another approach with the slower ones (or skip them altogether?)
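>
> In settings.py that's just a one-liner (and I believe the same thing 
> works per-request via the download_timeout key in Request.meta):
>
>     DOWNLOAD_TIMEOUT = 3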
>
> Also, don't be over-polite... if you could do something with a browser, 
> I think it's fair to do it with Scrapy.
>
>
> On Monday, February 8, 2016 at 1:01:19 AM UTC, kris brown wrote:
>>
>> So the project is scraping several university websites.  I've profiled 
>> the crawl as it runs, watching the engine and downloader slots, which 
>> eventually converge to URLs coming from just a single domain.  Looking 
>> at the download latency reported for each response, I don't see any 
>> degradation of response times.  The drift towards an extremely long 
>> series of responses from a single domain is what led me to think I 
>> need a different scheduler.  If there's any other info I can provide 
>> that would be more useful, let me know.
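>>
>> In case it's useful, this is roughly how I'm reading the latency 
>> (a sketch; download_latency is a standard key Scrapy fills in on 
>> response.meta):
>>
>>     def parse(self, response):
>>         # seconds spent downloading this response, as measured by Scrapy
>>         latency = response.meta['download_latency']
>>         self.logger.info('%s took %.2fs', response.url, latency)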
>>
>> On Sunday, February 7, 2016 at 4:09:03 PM UTC-6, Travis Leleu wrote:
>>>
>>> What site are you scraping?  Lots of sites have good caching on common 
>>> pages, but if you go a link or two deep, the site has to recreate the page.
>>>
>>> What I'm getting at is this - I think Scrapy should handle this 
>>> situation out of the box, and I'm wondering if the remote server is 
>>> throttling you.
>>>
>>> Have you profiled the scrape of the URLs to determine if there are 
>>> throttling or timing issues?
>>>
>>> On Sat, Feb 6, 2016 at 8:25 PM, kris brown <kris.br...@gmail.com> wrote:
>>>
>>>> Hello everyone! Apologies if this topic appears twice; my first 
>>>> attempt to post it did not seem to show up in the group.  
>>>>
>>>> Anyway, this is my first Scrapy project and I'm trying to crawl 
>>>> multiple domains (about 100), which has presented a scheduling 
>>>> issue.  In trying to be polite to the sites I'm crawling, I've set 
>>>> a reasonable download delay and limited the IP concurrency to 1 for 
>>>> any particular domain.  What I think is happening is that the URL 
>>>> queue fills up with many URLs for a single domain, which of course 
>>>> ends up dragging the crawl rate down to about 15/minute.  I've been 
>>>> thinking about writing a scheduler that would return the next URL 
>>>> based on a heap sorted by the earliest time a domain can be crawled 
>>>> next (rough sketch below).  However, I'm sure others have faced a 
>>>> similar problem, and as I'm a total beginner to Scrapy I wanted to 
>>>> hear some different opinions on how to resolve this.  Thanks!
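>>>>
>>>> To make that concrete, here's the rough shape of the heap I have in 
>>>> mind (a standalone sketch with names of my own, not wired into 
>>>> Scrapy's actual scheduler interface):
>>>>
>>>>     import heapq
>>>>     import time
>>>>
>>>>     class DomainHeap(object):
>>>>         """Yield the URL whose domain may be crawled soonest."""
>>>>         def __init__(self, delay):
>>>>             self.delay = delay   # politeness delay per domain, seconds
>>>>             self.heap = []       # (next_allowed_time, domain)
>>>>             self.pending = {}    # domain -> queued URLs
>>>>
>>>>         def push(self, domain, url):
>>>>             if domain not in self.pending:
>>>>                 # a newly seen domain is allowed immediately
>>>>                 heapq.heappush(self.heap, (time.time(), domain))
>>>>                 self.pending[domain] = []
>>>>             self.pending[domain].append(url)
>>>>
>>>>         def pop(self):
>>>>             if not self.heap or self.heap[0][0] > time.time():
>>>>                 return None      # no domain is ready yet
>>>>             _, domain = heapq.heappop(self.heap)
>>>>             url = self.pending[domain].pop(0)
>>>>             if self.pending[domain]:
>>>>                 # more URLs queued: domain becomes ready again later
>>>>                 heapq.heappush(self.heap,
>>>>                                (time.time() + self.delay, domain))
>>>>             else:
>>>>                 del self.pending[domain]
>>>>             return url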
>>>>
