Hello, I'd like the ability to cancel spiders before they finish, and obviously there are several ways to accomplish this. For example, I can send a SIGINT or SIGTERM to the spider process, and I see that the default handler for those signals performs a "graceful shutdown" on the first signal received and a more forceful shutdown on the second. Of course, I could use scrapyd, but scrapyd appears to simply send a SIGTERM, so I don't think my question below applies to it.
When the spider is cancelled with a "graceful shutdown", the behavior seems to be as follows: whatever Request objects remain in the queue are completed (and their associated callbacks called), and only then is the spider closed and any registered handlers for the signals.spider_closed event called.

What I'm really looking for, however, is a faster "graceful shutdown" in which the queue is emptied first, no further Request callbacks are executed, and the spider is closed immediately. How can that be achieved?

For example, note how in the attached spider, if a SIGINT is received during the first parse() call (which contains three sleeps so there's time to do so while testing), the spider is closed as soon as that single parse() call completes, since start_urls contains only one URL. However, at the end of that first parse() call I add four Request objects to the queue (either via the "yield technique" or the "return list technique"), so if a SIGINT is received after the first parse() completes, the spider is not closed until four more parse() calls complete, one for each Request added.

Is there any way to avoid this behavior, so the spider can be closed immediately, without waiting on those four pending requests? One partial idea I've been toying with is sketched at the end of this message, after the attached example.

Thanks a bunch,

Drew
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
import time
import signal

class Spidey(Spider):
    name = "spidey"
    allowed_domains = ["abc.com"]
    start_urls = ["http://www.abc.com/"]
    I = 0
    X = ["http://www.abc.com/shows/" + str(x)
         for x in ["black-box", "castle", "the-chew", "nashville"]]

    def __init__(self, *args, **kwargs):
        super(Spidey, self).__init__(*args, **kwargs)
        #signal.signal(signal.SIGTERM, self.yo)
        #signal.signal(signal.SIGINT, self.yo)
        dispatcher.connect(self.close, signals.spider_closed)

    def yo(self, signum, _):
        self.log("yoyo!")

    def close(self, spider, reason):
        self.log("close! spider[%s] reason[%s]" % (str(spider), str(reason)))

    def parse(self, response):
        # Sleep a few seconds so there's time to send a SIGINT while the
        # first parse() call is still running.
        for i in range(3):
            self.log("hi there!")
            time.sleep(1)
        self.log("more requests please!!!")
        # Only the first parse() call queues the four extra requests.
        if self.I == 0:
            self.I = 1
            for x in self.X:
                yield Request(x, callback=self.parse)
            #return [Request(x) for x in self.X]
            #else:
            #    return []
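
P.S. Here's the rough shape of the idea I mentioned above, as a minimal, untested sketch: install my own SIGINT handler that just records the signal, and have parse() raise CloseSpider (from scrapy.exceptions) once the flag is set. My understanding is that raising CloseSpider asks the engine to close the spider with the given reason rather than working through the remaining queue, but I'm not certain whether requests already handed to the downloader are skipped. Also, I haven't verified whether Scrapy/Twisted re-installs its own signal handlers when the reactor starts, in which case the handler would need to be installed later than __init__. The names here (AbortableSpidey, _abort, _on_sigint) are of course just mine.

import signal
from scrapy.spider import Spider
from scrapy.exceptions import CloseSpider

class AbortableSpidey(Spider):
    name = "abortable_spidey"
    start_urls = ["http://www.abc.com/"]

    def __init__(self, *args, **kwargs):
        super(AbortableSpidey, self).__init__(*args, **kwargs)
        self._abort = False  # set by the handler below, checked in parse()
        # Note: this replaces the default two-stage shutdown handler, and
        # may itself be replaced if the reactor installs its own handlers
        # after spider construction.
        signal.signal(signal.SIGINT, self._on_sigint)

    def _on_sigint(self, signum, frame):
        # Only record the request here; doing real work inside a signal
        # handler is risky, so the actual shutdown happens in parse().
        self._abort = True

    def parse(self, response):
        if self._abort:
            # CloseSpider should make the engine close the spider with this
            # reason instead of draining the rest of the request queue.
            raise CloseSpider('sigint-received')
        # ... normal parsing / yielding of further Requests goes here ...

If this works the way I hope, the spider_closed handlers would fire with reason 'sigint-received' on the next callback after the signal, regardless of how many Requests are still queued.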