Let me give you a very clear example. Let's assume you have 10k fast URLs 
where each takes 1 second and 10k slow URLs where each takes 15 seconds. The 
fast ones need about 2.8 hours of download time, while the slow ones need 
about 42 hours.
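
If you want to sanity-check those figures, the back-of-the-envelope 
arithmetic is just this (plain Python, nothing Scrapy-specific):

    # Rough arithmetic behind the figures above (strictly sequential crawl).
    fast_urls, slow_urls = 10000, 10000
    fast_time, slow_time = 1, 15          # seconds per response

    print(fast_urls * fast_time / 3600)   # ~2.8 hours for the fast URLs
    print(slow_urls * slow_time / 3600)   # ~41.7 hours for the slow URLs
    print(slow_urls * 3 / 3600)           # ~8.3 hours if slow requests are
                                          # cut off by the 3s timeout below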

That's your workload - there isn't much you can do about it if you want to 
do all of that work. By increasing the number of Requests you run in 
parallel, e.g. from 1 to 10, you divide those numbers by 10. By using a 
cleverer scheduling algorithm you can bring the fast URLs forward, so after 
about 3 hours you have most of the fast URLs and only a few slow ones. By 
setting the download timeout to e.g. 3 seconds, you trim the slow job from 
about 42 hours to about 8 hours (and of course you potentially lose many 
URLs).
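
If it helps, here is a minimal sketch of what those knobs look like in 
settings.py - the concrete values (16 total requests, 2 per IP, a 3-second 
timeout, a half-second delay) are just placeholders to tune for your own 
crawl, not recommendations:

    # settings.py - illustrative values only
    CONCURRENT_REQUESTS = 16          # total requests kept in flight
    CONCURRENT_REQUESTS_PER_IP = 2    # parallel requests against any one IP
    DOWNLOAD_TIMEOUT = 3              # give up on a response after 3 seconds
    DOWNLOAD_DELAY = 0.5              # pause between requests to the same site

For the "fast URLs first" idea, the simplest hook is the per-request 
priority (e.g. scrapy.Request(url, priority=10) for domains you already know 
respond quickly) - higher-priority requests are pulled from the scheduler 
earlier.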

The most effective way to attack the problem you have is to find ways to do 
less work in the first place.
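
As a concrete example of "doing less": skipping pages you fetched recently 
(the advice quoted below) can start out as a tiny filter in your spider. 
This is only a sketch - the seen.json file and the 30-day window are made-up 
placeholders, and a real broad crawl would want something sturdier than a 
flat JSON file:

    # Sketch: skip URLs fetched within the last 30 days (placeholder logic).
    import json, time

    try:
        with open("seen.json") as f:
            seen = json.load(f)        # {url: last_fetch_timestamp}
    except FileNotFoundError:
        seen = {}

    MAX_AGE = 30 * 24 * 3600           # 30 days, in seconds

    def should_fetch(url):
        last = seen.get(url)
        return last is None or time.time() - last > MAX_AGE

In the spider you would then only yield scrapy.Request(url) when 
should_fetch(url) is true, and record time.time() for each URL you actually 
download.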

On Tuesday, February 9, 2016 at 10:34:10 AM UTC, Dimitris Kouzis - Loukas 
wrote:
>
> Yes - certainly bring your concurrency level to 8 requests per IP and be 
> less nice. See if it fixes the problem and if anyone complains. Beyond 
> that, make sure you don't download stuff you've already downloaded 
> recently. If a site is slow, it likely doesn't have much updated content... 
> not even comments... so don't recrawl the same boring old static pages. Or 
> crawl them once a month, instead of on every crawl.
>
> On Monday, February 8, 2016 at 11:19:15 PM UTC, kris brown wrote:
>>
>> Dimitris, 
>>     Thanks for the tips.  So I have taken your advice and put a download 
>> timeout on my crawl and a download delay of just a couple of seconds, but 
>> I still face the issue of a long string of URLs for a single domain in my 
>> queues.  Since I'm only making a single request to a given IP address, 
>> this seems to bring down the crawl rate despite the quick download time of 
>> any single request.  This is where I'm wondering what my best option is 
>> for doing a broad crawl.  Implement my own scheduling queue?  Run multiple 
>> spiders?  Not sure what the best option is.  For my current project, with 
>> no pipelines implemented, I'm maxing out at about 200-300 crawls per 
>> minute.  I definitely need to get that number much higher to have any kind 
>> of reasonable performance on the crawl.  Thanks again for everyone's advice!
>>
>> On Monday, February 8, 2016 at 2:29:22 PM UTC-6, Dimitris Kouzis - Loukas 
>> wrote:
>>>
>>> Just a quick ugly tip... Set download timeout 
>>> <http://doc.scrapy.org/en/latest/topics/settings.html#download-timeout> to 
>>> 3 seconds... get done with handling the responsive websites and then try 
>>> another approach with the slower ones (or skip them altogether?)
>>>
>>> Also don't be over-polite... if you could do something with a browser, I 
>>> think, it's fair to do it with Scrapy.
>>>
>>>
>>> On Monday, February 8, 2016 at 1:01:19 AM UTC, kris brown wrote:
>>>>
>>>> So the project is scraping several university websites.  I've profiled 
>>>> the crawl as it's going, watching the engine and downloader slots, which 
>>>> eventually converge to just having a single domain that URLs come from. 
>>>> Having looked at the download latency in the headers, I don't see any 
>>>> degradation of response times.  The drift towards an extremely long 
>>>> series of responses from a single domain is what led me to think I need 
>>>> a different scheduler.  If there's any other info I can provide that 
>>>> would be more useful, let me know.
>>>>
>>>> On Sunday, February 7, 2016 at 4:09:03 PM UTC-6, Travis Leleu wrote:
>>>>>
>>>>> What site are you scraping?  Lots of sites have good caching on common 
>>>>> pages, but if you go a link or two deep, the site has to recreate the 
>>>>> page.
>>>>>
>>>>> What I'm getting at is this - I think Scrapy should handle this 
>>>>> situation out of the box, and I'm wondering if the remote server is 
>>>>> throttling you.
>>>>>
>>>>> Have you profiled the scrape of the URLs to determine if there are 
>>>>> throttling or timing issues?
>>>>>
>>>>> On Sat, Feb 6, 2016 at 8:25 PM, kris brown <kris.br...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>>> Hello everyone! Apologies if this topic appeared twice - my first 
>>>>>> attempt to post it did not seem to show up in the group.  
>>>>>>
>>>>>> Anyways, this is my first Scrapy project and I'm trying to crawl 
>>>>>> multiple domains (about 100), which has presented a scheduling issue. 
>>>>>> In trying to be polite to the sites I'm crawling, I've set a 
>>>>>> reasonable download delay and limited the IP concurrency to 1 for any 
>>>>>> particular domain.  What I think is happening is that the URL queue 
>>>>>> fills up with many URLs for a single domain, which of course ends up 
>>>>>> dragging the crawl rate down to about 15/minute.  I've been thinking 
>>>>>> about writing a scheduler that would return the next URL based on a 
>>>>>> heap sorted by the earliest time a domain can be crawled next. 
>>>>>> However, I'm sure others have faced a similar problem, and as I'm a 
>>>>>> total beginner with Scrapy I wanted to hear some different opinions on 
>>>>>> how to resolve this.  Thanks!
>>>>>>
