OK, I realized the behavior I was looking for can be accomplished by using a downloader middleware and then overriding the default SIGINT and SIGTERM handlers (see attached code). My only remaining questions are:
1) Is there a better way to do this without overriding the POSIX signal handlers?
2) Is this (overriding the POSIX signal handlers) definitely safe?

On Monday, May 12, 2014 2:10:52 PM UTC-7, drew wrote:
>
> Hello,
>
> I'd like the ability to cancel spiders before they are finished, and
> obviously, there are many ways to accomplish this. For example, I can send a
> SIGINT or SIGTERM to the spider, and I see that the default signal handler for
> those causes a "graceful shutdown" on the first signal received and a more
> "forceful shutdown" on the second. Of course, I could use scrapyd, but
> scrapyd seems to simply send a SIGTERM, so my following question does not
> apply to scrapyd, I think.
>
> When the spider is cancelled with a "graceful shutdown", the behavior
> seems to be as follows: whatever Request objects remain in the queue will
> be completed (and their associated callbacks called), and only then will the
> spider be closed and any registered handlers for the signals.spider_closed
> event called. What I'm really looking for, however, is a faster "graceful
> shutdown" whereby the queue is first emptied, no more Request callbacks are
> executed, and the spider is closed "immediately." How can that be achieved?
>
> For example, note in the attached example that if a SIGINT is
> received during the first parse() call (with 3 sleeps inserted so there's
> time to do so in testing), the spider will be closed when that single parse()
> call completes, as start_urls only contained 1 URL. However, at the end of
> the first parse() call, I add 4 Request objects to the queue (either via
> the "yield technique" or the "return list technique"), so if a SIGINT is
> received after that first parse() completes, the spider will not be closed
> until 4 more parse() calls complete, one for each Request added. Is there
> any way to avoid this behavior, so the spider can be closed immediately,
> without worrying about those 4 pending requests?
>
> Thanks a bunch,
> Drew
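For reference, the attached spider also imports CloseSpider; raising it from a callback is the one alternative I know of that avoids touching the POSIX handlers, but a callback only runs when a response arrives, so it can't react to a signal on its own. A minimal sketch (the spider name, URL, and flag below are illustrative only, not part of the attached code):

from scrapy.spider import Spider
from scrapy.exceptions import CloseSpider

class FlagSpider(Spider):
    name = "flagspider"
    start_urls = ["http://example.com/"]
    closing = 0  # assume something else (e.g. a signal handler) sets this to 1

    def parse(self, response):
        if self.closing:
            # CloseSpider asks the engine to close the spider; requests
            # still waiting in the scheduler should not be dispatched once
            # the close starts
            raise CloseSpider('cancelled')
        self.log("parsed %s" % response.url)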
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
import time
import signal

class Spidey(Spider):
    name = "spidey"
    allowed_domains = ["abc.go.com"]
    start_urls = ["http://abc.go.com/"]
    I = 0
    X = ["http://abc.go.com/shows/" + x
         for x in ["black-box", "castle", "the-chew", "nashville"]]
    closing = 0

    def __init__(self, *args, **kwargs):
        super(Spidey, self).__init__(*args, **kwargs)
        # signal.signal() returns the previously installed handler; keep it
        # under a distinct name so it can be chained to without shadowing
        # the handler methods themselves
        self.prev_term_handler = signal.signal(signal.SIGTERM, self.term_handler)
        self.prev_int_handler = signal.signal(signal.SIGINT, self.int_handler)
        dispatcher.connect(self.close, signals.spider_closed)

    def int_handler(self, signum, frame):
        self.log('got SIGINT !!!')
        self.closing = 1
        self.prev_int_handler(signum, frame)  # chain to the original handler

    def term_handler(self, signum, frame):
        self.log('got SIGTERM !!!')
        self.closing = 1
        self.prev_term_handler(signum, frame)  # chain to the original handler

    def close(self, spider, reason):
        self.log("close! spider[%s] reason[%s]" % (spider, reason))

    def parse(self, response):
        # sleep for 3 seconds so there's time to send a signal during
        # the first parse() call while testing
        for i in range(3):
            self.log("hi there!")
            time.sleep(1)
        self.log("more requests please!!!")
        if self.I == 0:
            self.I = 1
            for x in self.X:
                yield Request(x, callback=self.parse)
            # alternatively, the "return list technique" (in a
            # non-generator parse): return [Request(x) for x in self.X]
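To test: run "scrapy crawl spidey" and press Ctrl-C (or kill -TERM the process) during the sleep loop of the first parse() call. The handler sets closing to 1, and the middleware below then raises IgnoreRequest for each of the 4 queued requests, so their callbacks never run.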
from scrapy.exceptions import IgnoreRequest

class CancelMiddleware(object):
    """Downloader middleware that discards requests once spider.closing is set."""

    def process_request(self, request, spider):
        print "Middleware.process_request(): closing[%d]" % spider.closing
        if spider.closing:
            # drop the request before it is downloaded
            raise IgnoreRequest()

    def process_response(self, request, response, spider):
        print "Middleware.process_response(): closing[%d]" % spider.closing
        if spider.closing:
            # drop the response so its callback never runs
            raise IgnoreRequest()
        return response
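For completeness, the middleware also has to be enabled in the project settings. A minimal sketch, where the module path "myproject.middlewares" and the priority 543 are placeholders for whatever your project actually uses:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # module path and priority are placeholders -- adjust to your project
    'myproject.middlewares.CancelMiddleware': 543,
}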