Hello, I have a database of URLs and XPaths for extracting elements from those URLs. Many URLs can belong to the same site, e.g. example.com/article1, example.com/article2..., and those share the same XPaths. I want to scrape these URLs as fast as possible, but I don't want to send many requests to a single domain at once; there should be some delay between them.
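For reference, the per-domain throttling I'm after maps onto these Scrapy settings (the numbers below are illustrative placeholders, not recommendations):

```python
# Per-domain throttling via Scrapy settings; values here are only examples.
custom_settings = {
    'CONCURRENT_REQUESTS': 30,            # global cap across all domains
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,  # cap on one domain at a time
    'DOWNLOAD_DELAY': 1,                  # seconds between requests to one domain
    'AUTOTHROTTLE_ENABLED': True,         # optional: adapt the delay to latency
}
```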
For this purpose, I've created a GenericScraper spider. Each instance gets a list of URLs and an XPath as arguments. When I want the data, I instantiate all the spiders and run them. The problem is that when I do this for all spider instances at once, I get a lot of timeouts (more than half of the requests end with a timeout). But when I run only 50 spiders, everything works correctly. So my idea was to instantiate and crawl the first 50 spiders, then the next 50, and so on, but that raises ReactorNotRestartable. I'm new to Scrapy, so I'd appreciate any advice; maybe this isn't the best approach. Thanks.

class GenericScraper(scrapy.Spider):
    name = 'will_be_overriden'
    download_timeout = 20
    custom_settings = {'CONCURRENT_REQUESTS': 30, 'DOWNLOAD_DELAY': 1}

    def __init__(self, occs_occurence_scanning_id_map_dict):
        super(GenericScraper, self).__init__()
        ...

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse, errback=self.err,
                                 meta={'handle_httpstatus_all': True})

    def err(self, failure):
        ...

    def parse(self, response):
        ...
        # note: HtmlXPathSelector is deprecated; response.xpath() does the same
        hxs = HtmlXPathSelector(response)
        # ... save result to the database

And this is the method in which I instantiate the spiders and run the crawl:

def run_spiders():
    from scrapy.crawler import CrawlerProcess
    ...
    process = CrawlerProcess({
        'TELNETCONSOLE_ENABLED': 0,
        'EXTENSIONS': {'scrapy.telnet.TelnetConsole': None},
        'LOG_FILE': 'scrapylog.log',
        'CONCURRENT_REQUESTS': 30,
        'REACTOR_THREADPOOL_MAXSIZE': 20,
        'ROBOTSTXT_OBEY': False,
        'USER_AGENT': ua.chrome,
        'LOG_LEVEL': 'INFO',
        'COOKIES_ENABLED': False,
    })

    # Variant 1: everything at once. THIS SCRAPES LESS THAN HALF OF THE URLS;
    # THE REST ENDS WITH TIMEOUTS.
    for s in Site.objects.all():  # a Site holds a list of URLs and an XPath
        ...
        process.crawl(occurence_spider.GenericScraper, site)
    process.start()

    # Variant 2: batches of 50. THIS SCRAPES THE FIRST 50 SITES (without
    # timeouts), THEN IT RAISES:
    #   File "C:\Users..., line 730, in startRunning
    #     raise error.ReactorNotRestartable()
    st = 0
    while st < sites_count:
        for s in Site.objects.all()[st:st + 50]:
            ...
            process.crawl(occurence_spider.GenericScraper,
                          occs_occurence_scanning_id_map_dict)
        process.start()  # the second call to start() is what raises
        st += 50

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.