You could implement a spider middleware that fetches a domain whitelist from Redis during initialization and then uses Redis Pub/Sub to pick up later changes to the whitelist. If a URL's domain isn't in the whitelist, discard the Request, no matter where it came from.
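A rough sketch of that idea, assuming redis-py; the Redis set name "domain_whitelist", the channel name "whitelist_updates", and the REDIS_URL setting are illustrative choices, not standard names:

```python
# Spider middleware that drops requests whose domain is not in a
# Redis-backed whitelist. Assumes redis-py; key/channel names are
# illustrative. Matching mimics OffsiteMiddleware (subdomains allowed).
import threading
from urllib.parse import urlparse


class RedisWhitelistMiddleware:
    WHITELIST_KEY = "domain_whitelist"
    CHANNEL = "whitelist_updates"

    def __init__(self, client):
        self.client = client
        self.allowed = self._load()
        # Background thread re-reads the whitelist whenever an update is
        # published; rebinding self.allowed is atomic, so no lock needed.
        threading.Thread(target=self._listen, daemon=True).start()

    @classmethod
    def from_crawler(cls, crawler):
        import redis  # redis-py, imported lazily
        url = crawler.settings.get("REDIS_URL", "redis://localhost:6379/0")
        return cls(redis.Redis.from_url(url))

    def _load(self):
        return {d.decode() for d in self.client.smembers(self.WHITELIST_KEY)}

    def _listen(self):
        pubsub = self.client.pubsub()
        pubsub.subscribe(self.CHANNEL)
        for msg in pubsub.listen():
            if msg.get("type") == "message":
                self.allowed = self._load()

    def _is_allowed(self, url):
        domain = urlparse(url).netloc
        return any(domain == d or domain.endswith("." + d)
                   for d in self.allowed)

    def process_spider_output(self, response, result, spider):
        for item in result:
            # Requests carry a .url attribute; scraped items do not.
            if getattr(item, "url", None) is not None \
                    and not self._is_allowed(item.url):
                continue  # drop off-whitelist request
            yield item
```

Enable it via SPIDER_MIDDLEWARES in settings.py; because it filters the spider's output, it catches requests regardless of which callback produced them.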
2015-12-17 22:03 GMT+08:00 somewhatofftheway <benjamins...@gmail.com>:
> Exactly, that's what I'm trying to work around. My solution does work, I
> was just interested in whether anybody had tried other approaches.
>
> On Thursday, December 17, 2015 at 2:59:58 AM UTC, lnxpgn wrote:
>>
>> If the dont_filter argument in Request is True or the spider's
>> allowed_domains is empty, OffsiteMiddleware does nothing.
>>
>> On 15-12-15 2:52 AM, somewhatofftheway wrote:
>>
>> I'm trying to implement a spider which will:
>>
>> a. Pull URLs from a queue of some sort
>> b. Only crawl those sites
>>
>> It's essentially a broad crawl, in that it is designed to look at any
>> site, but I want to be able to limit the sites rather than letting it
>> crawl the whole web.
>>
>> I had experimented with a RabbitMQ-based solution, but have recently
>> been trying scrapy-redis. This seems to generally work very well.
>> However, it attempts to crawl sites other than those specified, as
>> self.allowed_domains does not get set and therefore the
>> OffsiteMiddleware does not trigger.
>>
>> I implemented a workaround for this; I wanted to present it both in
>> case it is useful and to see if anybody has found better solutions to
>> this problem.
>>
>> What I did was:
>>
>> 1. Modify the parse_start_url function to add the domain in question
>> 2. Use a filter_links callback to only allow links from that domain
>>
>> I guess I could also override make_requests_from_url or similar to
>> achieve the same thing?
>>
>> In any case, any comments on this approach are welcome, as are
>> suggestions on how to achieve the above.
>>
>> Thanks
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send
>> an email to scrapy-users...@googlegroups.com.
>> To post to this group, send email to scrapy...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
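For reference, the two-step workaround described in the quoted message could be sketched like this; in a real spider these would be methods on a CrawlSpider subclass, with filter_links wired in through Rule's process_links argument, but the filtering logic itself is plain Python (all names here mirror the poster's description and are assumptions):

```python
# Scrapy-free sketch of the quoted workaround:
# step 1 registers each start URL's domain, step 2 filters links to it.
from urllib.parse import urlparse

allowed_domains = set()


def parse_start_url(url):
    """Step 1: record the domain of each start URL pulled from the queue."""
    allowed_domains.add(urlparse(url).netloc)


def filter_links(links):
    """Step 2: keep only links pointing at an already-recorded domain."""
    return [link for link in links if urlparse(link).netloc in allowed_domains]
```

One caveat with this per-spider-state approach: under scrapy-redis the set lives in each spider process, so with multiple workers every process has to see the start URL before it will accept links for that domain, which is one reason a shared Redis whitelist may be more robust.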