Hello,

I have a database of urls and the xpaths needed to extract elements from
those urls. There can be many urls from one site, e.g. example.com/article1,
example.com/article2..., which share the same xpaths.
I want to scrape those urls as fast as possible, but I don't want to send
many requests to a single domain at once; there should be some delay.
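
As far as I understand, Scrapy throttles per domain, so settings like these
should express what I want (standard Scrapy setting names; the values are
just examples):

custom_settings = {'CONCURRENT_REQUESTS': 30,            # global cap across all domains
                   'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # at most one request per domain at a time
                   'DOWNLOAD_DELAY': 1}                  # seconds between requests to the same domain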

For this purpose, I've created a GenericScraper. Each instance of this
spider gets a list of urls and an xpath as arguments. When I want to get
the data, I instantiate all the spiders and run them.

The problem is that when I do this for all spider instances at once, I get
a lot of timeouts (more than half of the requests end with a timeout).

But when I run only 50 spiders at a time, everything works correctly.

So my idea is to instantiate and crawl the first 50 spiders, then the next
50, and so on, but the second batch raises ReactorNotRestartable (code
below; a sketch of a possible workaround is at the end of this mail).


I'm new to scrapy, so I appreciate any advice; maybe this is not the best
solution. Thanks


class GenericScraper(scrapy.Spider):
    download_timeout = 20
    name = 'will_be_overridden'
    custom_settings = {'CONCURRENT_REQUESTS': 30,
                       'DOWNLOAD_DELAY': 1}
    def __init__(self, occs_occurence_scanning_id_map_dict):
        super(GenericScraper, self).__init__()
        ...

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse,
                                 errback=self.err,
                                 meta={'handle_httpstatus_all': True})

    def err(self, failure):
        ...


    def parse(self, response):
        ...
        # response.xpath() replaces the deprecated HtmlXPathSelector;
        # self.xpath is the xpath this spider was given
        data = response.xpath(self.xpath).extract()
        # ... save the result to the database


And this is the method in which I instantiate the spiders and run the crawl:


def run_spiders():
    from scrapy.crawler import CrawlerProcess
    ...
    process = CrawlerProcess({'TELNETCONSOLE_ENABLED': 0,
                              'EXTENSIONS': {
                                  'scrapy.telnet.TelnetConsole': None
                              },
                              'LOG_FILE': 'scrapylog.log',
                              'CONCURRENT_REQUESTS': 30,
                              'REACTOR_THREADPOOL_MAXSIZE': 20,
                              'ROBOTSTXT_OBEY': False,
                              'USER_AGENT': ua.chrome,
                              'LOG_LEVEL': 'INFO',
                              'COOKIES_ENABLED': False})
    


    # Variant 1: crawl all sites at once.
    # This scrapes less than half of the urls; the rest end with timeouts.


    for s in Site.objects.all():  # each site holds a list of urls and an xpath
        ...
        process.crawl(occurence_spider.GenericScraper, site)

    process.start()



    # Variant 2: crawl in batches of 50 spiders.
    # This scrapes the first 50 sites without timeouts, but the second
    # call to process.start() raises:
    #
    #   File "C:\Users..., line 730, in startRunning
    #     raise error.ReactorNotRestartable()
    #   ReactorNotRestartable

    st = 0
    while st < sites_count:
        for s in Site.objects.all()[st:st + 50]:
            ...
            process.crawl(occurence_spider.GenericScraper,
                          occs_occurence_scanning_id_map_dict)
        st += 50
        process.start()  # the second call raises ReactorNotRestartable
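
From the docs it looks like CrawlerRunner with chained deferreds could avoid
restarting the reactor. Would a rough, untested sketch like this be the
right direction? It reuses my GenericScraper, and "settings" stands for the
same settings dict I pass to CrawlerProcess above:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging({'LOG_FILE': 'scrapylog.log', 'LOG_LEVEL': 'INFO'})
runner = CrawlerRunner(settings)  # same settings dict as above

@defer.inlineCallbacks
def crawl_in_batches():
    sites = list(Site.objects.all())
    for start in range(0, len(sites), 50):
        # start one batch of 50 spiders and wait until all of them finish
        crawls = [runner.crawl(occurence_spider.GenericScraper, s)
                  for s in sites[start:start + 50]]
        yield defer.DeferredList(crawls)
    reactor.stop()

crawl_in_batches()
reactor.run()  # the reactor is started (and stopped) exactly once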
