Middleware to avoid downloading new requests while parse callbacks are still being processed

Leonardo Casanova Fri, 12 Jun 2015 09:43:17 -0700

Hi,

As the title says my problem is that I need to limit the amount of parse 
callbacks being processed concurrently as the are quite memory intensive 
(some of them create a selenium instance). 
To avoid this I want to prevent scrapy from downloading new requests while 
parse callbacks are still being processed.


I have written this middleware to reschedule requests if they exceed a 
certain limit. 
However at some point the scraping seems to stop and no new requests are 
made. 
Can you help me check if I missed something?

Best Regards
Leo

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.

from scrapy.exceptions import NotConfigured
from scrapy.http.response import Response
from scrapy.http.request import Request
from scrapy.xlib.pydispatch import dispatcher

from alascrapy.signals import response_processed

import uuid
import time

class LimitRequestsMiddleware(object):

    def __init__(self, settings):
        self.active_requests = []
        self.max_requests = settings.getint('MAX_REQUESTS', 1)
        dispatcher.connect(self._response_finished, response_processed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def _response_finished(self, **kwargs):
        request = kwargs['named']['request']
        if 'uuid' in request.meta:
            if request.meta['uuid'] in self.active_requests:
                self.active_requests.remove(request.meta['uuid'])

    def process_request(self, request, spider):
        if 'uuid' not in request.meta:
            uid = uuid.uuid4()
            request.meta['uuid'] = uid
        else:
            uid=request.meta['uuid']

        if len(self.active_requests) >= self.max_requests:
            new_request = request.replace(priority=request.priority+1)
            new_request.meta['uuid']=uid
        else:
            self.active_requests.append(request.meta['uuid'])
            return None

    def process_response(self, request, response, spider):
        return response

Middleware to avoid downloading new requests while parse callbacks are still being processed

Reply via email to