Exactly, that's what I'm trying to work around. My solution does work; I was just interested in whether anybody had tried other approaches.
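For anyone following along, here is a rough sketch of the two-step workaround described in the quoted message below: record the domain when a start URL is handled, then filter extracted links against it. It is not the actual spider code. The names `DomainLimiter`, `add_start_url`, `domain_of` and `is_on_domain` are made up for illustration, and `Link` is only a stand-in for Scrapy's link objects (just the `.url` attribute is used) so the snippet runs without Scrapy installed. It also assumes that subdomains of a queued site should count as on-site.

```python
from collections import namedtuple
from urllib.parse import urlparse

# Stand-in for Scrapy's Link objects so the sketch runs on its own;
# only the .url attribute is used here.
Link = namedtuple("Link", ["url"])


def domain_of(url):
    """Return the lowercased host of a URL, without any port."""
    return (urlparse(url).hostname or "").lower()


def is_on_domain(url, allowed_domain):
    """True if the URL's host is allowed_domain or one of its subdomains."""
    host = domain_of(url)
    allowed = allowed_domain.lower()
    return host == allowed or host.endswith("." + allowed)


class DomainLimiter:
    """Hypothetical helper: remembers start-URL domains and filters links."""

    def __init__(self):
        self.allowed_domains = set()

    def add_start_url(self, url):
        # Step 1: meant to be called when a start URL is processed.
        host = domain_of(url)
        if host:
            self.allowed_domains.add(host)

    def filter_links(self, links):
        # Step 2: keep only links that point at a recorded domain.
        return [
            link for link in links
            if any(is_on_domain(link.url, d) for d in self.allowed_domains)
        ]


limiter = DomainLimiter()
limiter.add_start_url("http://example.com/start")
links = [Link("http://example.com/a"), Link("http://other.org/b")]
print([link.url for link in limiter.filter_links(links)])
```

In a CrawlSpider, `parse_start_url` could call `add_start_url(response.url)`, and a `Rule` could be given `filter_links` through its `process_links` argument. Note that this version accumulates domains across all queued URLs, so links between two queued sites would be followed; filtering strictly per site would need the domain carried along with each request instead.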
On Thursday, December 17, 2015 at 2:59:58 AM UTC, lnxpgn wrote:
> If the dont_filter argument in Request is True or the spider's
> allowed_domains is empty, OffsiteMiddleware does nothing.
>
> On 15-12-15 2:52 AM, somewhatofftheway wrote:
>> I'm trying to implement a spider which will:
>>
>> a. Pull URLs from a queue of some sort
>> b. Only crawl those sites
>>
>> It's essentially a broad crawl in that it is designed to look at any
>> site, but I want to be able to limit the sites rather than letting it
>> crawl the whole web.
>>
>> I had experimented with a RabbitMQ-based solution, but have recently
>> been trying scrapy-redis. This seems to generally work very well.
>> However, it attempts to crawl sites other than those specified, as
>> self.allowed_domains does not get set and therefore the
>> OffsiteMiddleware does not trigger.
>>
>> I implemented a workaround for this; I wanted to present it both in
>> case it is useful and to see if anybody has found better solutions to
>> this problem.
>>
>> What I did was:
>>
>> 1. Modify the parse_start_url function to add the domain in question
>> 2. Use a filter_links callback to only allow links from that domain
>>
>> I guess I could also override make_requests_from_url or similar to
>> achieve the same thing?
>>
>> In any case, any comments on this approach, or suggestions on how to
>> achieve the above, are welcome.
>>
>> Thanks,
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send
>> an email to scrapy-users...@googlegroups.com.
>> To post to this group, send email to scrapy...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.