So the project is scraping several university websites. I've profiled the crawl as it runs, watching the engine and downloader slots, which eventually converge to a single domain that all the urls come from. Looking at the download latency in the headers, I don't see any degradation in response times. The drift toward an extremely long series of responses from a single domain is what led me to think I need a different scheduler. If there's any other info I can provide that would be more useful, let me know.
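To make the heap idea from my original post concrete, here is a minimal sketch in plain Python (heapq only, no scrapy integration — class name, delay value, and domains are all made up for illustration, and scrapy's real scheduler interface is different): domains sit in a heap keyed by the earliest time each one may be hit again, so the crawler always pulls from whichever domain is due next instead of draining one domain's backlog.

```python
import heapq
import time
from collections import defaultdict, deque

class DomainScheduler:
    """Sketch: serve the next url from whichever domain can be
    crawled earliest, enforcing a per-domain politeness delay."""

    def __init__(self, delay=4.0):
        self.delay = delay                 # seconds between hits to one domain
        self.queues = defaultdict(deque)   # domain -> pending urls
        self.heap = []                     # (next_allowed_time, domain)

    def enqueue(self, domain, url, now=None):
        now = time.monotonic() if now is None else now
        if not self.queues[domain]:
            # First url for this domain: it is crawlable immediately.
            heapq.heappush(self.heap, (now, domain))
        self.queues[domain].append(url)

    def next_request(self, now=None):
        now = time.monotonic() if now is None else now
        if not self.heap:
            return None
        next_time, domain = self.heap[0]
        if next_time > now:
            return None                    # no domain is due yet
        heapq.heappop(self.heap)
        url = self.queues[domain].popleft()
        if self.queues[domain]:
            # Re-schedule the domain for after its politeness delay.
            heapq.heappush(self.heap, (now + self.delay, domain))
        return url
```

With two urls queued for one domain and one for another, the scheduler interleaves them rather than blocking on the first domain's delay, which is exactly the behavior I'm after.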
On Sunday, February 7, 2016 at 4:09:03 PM UTC-6, Travis Leleu wrote:
>
> What site are you scraping? Lots of sites have good caching on common
> pages, but if you go a link or two deep, the site has to recreate the page.
>
> What I'm getting at is this - I think scrapy should handle this situation
> out of the box, and I'm wondering if the remote server is throttling you.
>
> Have you profiled the scrape of the urls to determine if there's
> throttling or timing issues?
>
> On Sat, Feb 6, 2016 at 8:25 PM, kris brown <kris.br...@gmail.com> wrote:
>
>> Hello everyone! Apologies if this topic appeared twice; my first attempt
>> to post it did not seem to show up in the group.
>>
>> Anyway, this is my first scrapy project and I'm trying to crawl multiple
>> domains (about 100), which has presented a scheduling issue. In trying to
>> be polite to the sites I'm crawling, I've set a reasonable download delay
>> and limited the IP concurrency to 1 for any particular domain. What I
>> think is happening is that the url queue fills up with many urls for a
>> single domain, which of course ends up dragging the crawl rate down to
>> about 15/minute. I've been thinking about writing a scheduler that would
>> return the next url based on a heap sorted by the earliest time a domain
>> can be crawled next. However, I'm sure others have faced a similar
>> problem, and as I'm a total beginner to scrapy I wanted to hear some
>> different opinions on how to resolve this. Thanks!
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to scrapy-users...@googlegroups.com.
>> To post to this group, send email to scrapy...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
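For reference, the politeness setup described in my original post above corresponds to something like this in the project's settings.py (the delay value is illustrative; DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_IP are standard scrapy settings, and per scrapy's docs a non-zero CONCURRENT_REQUESTS_PER_IP makes the delay apply per IP as well):

```python
# settings.py (fragment) - be polite to each of the ~100 domains
DOWNLOAD_DELAY = 4            # seconds between requests to the same slot
CONCURRENT_REQUESTS_PER_IP = 1  # at most one in-flight request per IP
```

The global CONCURRENT_REQUESTS limit stays at its default, so the crawl as a whole can still run in parallel across domains — the bottleneck is only that the default queue keeps feeding urls from one domain.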