I'm trying to implement a spider which will:

a. Pull URLs from a queue of some sort
b. Only crawl those sites
It's essentially a broad crawl in that it is designed to look at any site, but I want to be able to limit it to the queued sites rather than letting it crawl the whole web. I had experimented with a RabbitMQ-based solution, but have recently been trying scrapy-redis. This generally works very well; however, it attempts to crawl sites other than those specified, because self.allowed_domains never gets set and therefore the OffsiteMiddleware never triggers.

I implemented a workaround for this, and I wanted to present it both in case it is useful and to see if anybody has found a better solution to this problem. What I did was:

1. Modify the parse_start_url function to record the domain in question
2. Use a filter_links callback (via the Rule's process_links argument) to only allow links from that domain

(A rough sketch of both steps is at the end of this message.)

I guess I could also override make_requests_from_url or similar to achieve the same thing? In any case, any comments on this approach are welcome, as are suggestions on better ways to achieve the above.

Thanks,
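For reference, here is roughly what the workaround looks like. This is an untested, simplified sketch rather than my actual spider: the spider name, redis_key, the crawl_domains set, the filter_links name and the parse_page callback are all placeholders, and it assumes scrapy-redis's RedisCrawlSpider.

from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class LimitedBroadSpider(RedisCrawlSpider):
    name = 'limited_broad'
    redis_key = 'limited_broad:start_urls'  # placeholder queue key

    rules = (
        # Step 2: process_links points at the filter defined below
        Rule(LinkExtractor(), callback='parse_page',
             process_links='filter_links', follow=True),
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Domains seen as start URLs; step 1 fills this in
        self.crawl_domains = set()

    def parse_start_url(self, response):
        # Step 1: remember the domain of each URL pulled from the queue,
        # since allowed_domains is never populated in a broad crawl
        self.crawl_domains.add(urlparse(response.url).netloc)
        return self.parse_page(response)

    def filter_links(self, links):
        # Step 2: drop any extracted link whose host was never queued
        return [link for link in links
                if urlparse(link.url).netloc in self.crawl_domains]

    def parse_page(self, response):
        yield {'url': response.url,
               'title': response.css('title::text').get()}

The point is that the domain check lives entirely in the spider, so it works even though the OffsiteMiddleware never sees a populated allowed_domains.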