Just a quick, ugly tip... set the download timeout <http://doc.scrapy.org/en/latest/topics/settings.html#download-timeout> to 3 seconds, get the responsive websites handled first, and then try another approach with the slower ones (or skip them altogether?)
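To make that concrete, here's a minimal sketch of both ways to set it, assuming a standard Scrapy project. The spider name, example URL and the handle_timeout errback are placeholders of mine, not anything Scrapy ships with:

# settings.py -- project-wide (the default DOWNLOAD_TIMEOUT is 180 seconds)
DOWNLOAD_TIMEOUT = 3

# ...or per request, via the download_timeout meta key:
import scrapy
from twisted.internet.error import TimeoutError

class UniversitySpider(scrapy.Spider):
    name = "universities"                       # placeholder name
    start_urls = ["http://example.edu/"]        # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={"download_timeout": 3},   # overrides the global setting for this request
                errback=self.handle_timeout,
            )

    def handle_timeout(self, failure):
        # log the slow host and move on; you can retry it later with a larger timeout
        if failure.check(TimeoutError):
            self.logger.info("Timed out: %s", failure.request.url)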
Also, don't be over-polite... if you could do something with a browser, I think it's fair to do it with Scrapy. (Re. the heap-based scheduler idea in the quoted thread, there's a rough sketch of that ordering logic at the bottom of this message.)

On Monday, February 8, 2016 at 1:01:19 AM UTC, kris brown wrote:
>
> So the project is scraping several university websites. I've profiled the
> crawl as it's going, to see the engine and downloader slots, which eventually
> converge to just having a single domain that urls come from. Having looked
> at the download latency on the headers, I don't see any degradation of
> response times. The drift towards an extremely long series of responses
> from a single domain is what led me to think I need a different scheduler.
> If there's any other info I can provide that would be more useful, let me
> know.
>
> On Sunday, February 7, 2016 at 4:09:03 PM UTC-6, Travis Leleu wrote:
>>
>> What site are you scraping? Lots of sites have good caching on common
>> pages, but if you go a link or two deep, the site has to recreate the page.
>>
>> What I'm getting at is this - I think Scrapy should handle this situation
>> out of the box, and I'm wondering if the remote server is throttling you.
>>
>> Have you profiled the scrape of the urls to determine if there's
>> throttling or timing issues?
>>
>> On Sat, Feb 6, 2016 at 8:25 PM, kris brown <kris.br...@gmail.com> wrote:
>>
>>> Hello everyone! Apologies if this topic appears twice; my first attempt
>>> to post it did not seem to show up in the group.
>>>
>>> Anyway, this is my first Scrapy project and I'm trying to crawl
>>> multiple domains (about 100), which has presented a scheduling issue. In
>>> trying to be polite to the sites I'm crawling, I've set a reasonable
>>> download delay and limited the IP concurrency to 1 for any particular
>>> domain. What I think is happening is that the url queue fills up with many
>>> urls for a single domain, which of course ends up dragging the crawl rate
>>> down to about 15/minute. I've been thinking about writing a scheduler that
>>> would return the next url based on a heap sorted by the earliest time a
>>> domain can be crawled next. However, I'm sure others have faced a similar
>>> problem, and as I'm a total beginner to Scrapy I wanted to hear some
>>> different opinions on how to resolve this. Thanks!
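For what it's worth, here's a rough sketch of the per-domain heap idea from the original post. It's only the queueing/ordering logic, not a drop-in Scrapy scheduler (wiring it in would mean implementing enqueue_request/next_request yourself), and the class and parameter names are mine:

import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class DomainRoundRobin:
    """Hand out URLs ordered by the earliest time each domain may be hit again."""

    def __init__(self, delay=2.0):
        self.delay = delay                   # per-domain politeness delay, in seconds
        self.queues = defaultdict(deque)     # domain -> pending URLs (FIFO)
        self.heap = []                       # (next_allowed_time, domain)
        self.scheduled = set()               # domains currently sitting on the heap

    def push(self, url):
        domain = urlparse(url).netloc
        self.queues[domain].append(url)
        if domain not in self.scheduled:
            heapq.heappush(self.heap, (time.time(), domain))
            self.scheduled.add(domain)

    def pop(self):
        """Return the next URL whose domain is allowed to be crawled, or None."""
        while self.heap:
            next_time, domain = self.heap[0]
            if next_time > time.time():
                return None                  # nothing is ready yet; try again later
            heapq.heappop(self.heap)
            queue = self.queues[domain]
            if not queue:
                self.scheduled.discard(domain)
                continue                     # this domain is drained, look at the next one
            url = queue.popleft()
            # put the domain back, keyed by the earliest time it may be crawled again
            heapq.heappush(self.heap, (time.time() + self.delay, domain))
            return url
        return None

That said, you might get most of the way there without a custom scheduler: with ~100 domains, raising CONCURRENT_REQUESTS well above the default while keeping CONCURRENT_REQUESTS_PER_DOMAIN / CONCURRENT_REQUESTS_PER_IP at 1 lets the other domain slots keep working even when the queue is dominated by one slow host.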