OK, I realized the behavior I was looking for can be accomplished by using a downloader middleware and then overriding the default SIGINT and SIGTERM handlers (see attached code). My only remaining questions are:
1) Is there a better way to do this without overriding the POSIX signal handlers?
2) Is this (overriding the POSIX signal handlers) definitely safe?

On Monday, May 12, 2014 2:10:52 PM UTC-7, drew wrote:
>
> Hello,
>
> I'd like the ability to cancel spiders before they are finished, and
> obviously, there are many ways to accomplish this. For example, I can send a
> SIGINT or SIGTERM to the spider, and I see that the default signal handler for
> those causes a "graceful shutdown" on the first signal received and a more
> "forceful shutdown" on the second. Of course, I could use scrapyd, but
> scrapyd seems to simply send a SIGTERM, so my following question does not
> apply to scrapyd, I think.
>
> When the spider is cancelled with a "graceful shutdown", the behavior
> seems to be as follows: whatever Request objects remain in the queue will
> be completed (and their associated callbacks called), and only then will the
> spider be closed and any registered handlers for the signals.spider_closed
> event called. What I'm really looking for, however, is a faster "graceful
> shutdown" whereby the queue is first emptied, no more Request callbacks are
> executed, and the spider is closed "immediately." How can that be achieved?
>
> For example, note in the attached example that if a SIGINT is
> received during the first parse() call (with 3 sleeps inserted so there's
> time to do so in testing), the spider will be closed when that single parse()
> call completes, as start_urls only contained 1 URL. However, at the end of
> the first parse() call, I add 4 Request objects to the queue (either via
> the "yield technique" or the "return list technique"), so if a SIGINT is
> received after that first parse() completes, the spider will not be closed
> until 4 more parse() calls complete, one for each Request added. Is there
> any way to avoid this behavior, so the spider can be closed immediately,
> without worrying about those 4 pending requests?
>
> Thanks a bunch,
> Drew
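For reference, the attached spider also imports CloseSpider; raising it from a callback is the one alternative I know of that avoids touching the POSIX handlers, but a callback only runs when a response arrives, so it can't react to a signal on its own. A minimal sketch (the spider name, URL, and flag below are illustrative only, not part of the attached code):

from scrapy.spider import Spider
from scrapy.exceptions import CloseSpider

class FlagSpider(Spider):
    name = "flagspider"
    start_urls = ["http://example.com/"]
    closing = 0  # assume something else (e.g. a signal handler) sets this to 1

    def parse(self, response):
        if self.closing:
            # CloseSpider asks the engine to close the spider; requests
            # still waiting in the scheduler should not be dispatched once
            # the close starts
            raise CloseSpider('cancelled')
        self.log("parsed %s" % response.url)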
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
import time
import signal

class Spidey(Spider):
    name = "spidey"
    allowed_domains = ["abc.go.com"]
    start_urls = ["http://abc.go.com/"]
    I = 0
    X = ["http://abc.go.com/shows/" + x
         for x in ["black-box", "castle", "the-chew", "nashville"]]
    closing = 0

    def __init__(self, *args, **kwargs):
        super(Spidey, self).__init__(*args, **kwargs)
        # signal.signal() returns the previously installed handler; keep it
        # under a distinct name so it can be chained to without shadowing
        # the handler methods themselves
        self.prev_term_handler = signal.signal(signal.SIGTERM, self.term_handler)
        self.prev_int_handler = signal.signal(signal.SIGINT, self.int_handler)
        dispatcher.connect(self.close, signals.spider_closed)

    def int_handler(self, signum, frame):
        self.log('got SIGINT !!!')
        self.closing = 1
        self.prev_int_handler(signum, frame)  # chain to the original handler

    def term_handler(self, signum, frame):
        self.log('got SIGTERM !!!')
        self.closing = 1
        self.prev_term_handler(signum, frame)  # chain to the original handler

    def close(self, spider, reason):
        self.log("close! spider[%s] reason[%s]" % (spider, reason))

    def parse(self, response):
        # sleep for 3 seconds so there's time to send a signal during
        # the first parse() call while testing
        for i in range(3):
            self.log("hi there!")
            time.sleep(1)
        self.log("more requests please!!!")
        if self.I == 0:
            self.I = 1
            for x in self.X:
                yield Request(x, callback=self.parse)
            # alternatively, the "return list technique" (in a
            # non-generator parse): return [Request(x) for x in self.X]
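To test: run "scrapy crawl spidey" and press Ctrl-C (or kill -TERM the process) during the sleep loop of the first parse() call. The handler sets closing to 1, and the middleware below then raises IgnoreRequest for each of the 4 queued requests, so their callbacks never run.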
from scrapy.exceptions import IgnoreRequest

class CancelMiddleware(object):
    """Downloader middleware that discards requests once spider.closing is set."""

    def process_request(self, request, spider):
        print "Middleware.process_request(): closing[%d]" % spider.closing
        if spider.closing:
            # drop the request before it is downloaded
            raise IgnoreRequest()

    def process_response(self, request, response, spider):
        print "Middleware.process_response(): closing[%d]" % spider.closing
        if spider.closing:
            # drop the response so its callback never runs
            raise IgnoreRequest()
        return response
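For completeness, the middleware also has to be enabled in the project settings. A minimal sketch, where the module path "myproject.middlewares" and the priority 543 are placeholders for whatever your project actually uses:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # module path and priority are placeholders -- adjust to your project
    'myproject.middlewares.CancelMiddleware': 543,
}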