Exactly, that's what I'm trying to work around. My solution does work; I 
was just interested in whether anybody had tried other approaches.

On Thursday, December 17, 2015 at 2:59:58 AM UTC, lnxpgn wrote:
>
>
> If the dont_filter argument in Request is True or the spider's 
> allowed_domains is empty, OffsiteMiddleware does nothing: every request 
> is let through unfiltered.
>
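To make that rule concrete, here is a minimal sketch of the decision as 
described above. It is a simplified model written with the standard library 
only, not Scrapy's own code; the function name passes_offsite_check is 
hypothetical, and it assumes the entries in allowed_domains are lowercase 
host names.

```python
from urllib.parse import urlparse


def passes_offsite_check(url, allowed_domains, dont_filter=False):
    """Simplified model of the rule quoted above, not Scrapy's own code."""
    if dont_filter or not allowed_domains:
        # Filtering is bypassed, or there is nothing to compare against,
        # so every request is let through.
        return True
    host = (urlparse(url).hostname or "").lower()
    # Accept an allowed domain itself or any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

With an empty allowed_domains list the check returns True for any URL, which 
is why a spider fed from a queue ends up following offsite links.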
> On 15-12-15 2:52 AM, somewhatofftheway wrote:
>
> I'm trying to implement a spider which will: 
>
> a. Pull URLs from a queue of some sort
> b. Only crawl those sites
>
> It's essentially a broad crawl in that it is designed to look at any site, 
> but I want to be able to limit the sites rather than letting it crawl the 
> whole web.
>
> I had experimented with a RabbitMQ-based solution, but have recently been 
> trying scrapy-redis. This seems to generally work very well. However, it 
> attempts to crawl sites other than those specified, as self.allowed_domains 
> does not get set and therefore the OffsiteMiddleware does not trigger.
>
> I implemented a workaround for this; I wanted to present it both in case 
> it is useful and to see if anybody has found better solutions to this 
> problem.
>
> What I did was:
>
> 1. Modify the parse_start_url function to add the domain in question
> 2. Use a filter_links callback to only allow links from that domain
>
> I guess I could also override make_requests_from_url or similar to achieve 
> the same thing?
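For anyone who wants to try the two-step workaround listed above, here is a 
rough sketch of the idea using only the standard library. The names 
SeedDomainFilter, add_seed and host_of are hypothetical, and Link is a 
stand-in for Scrapy's link objects (only the .url attribute is used). In a 
CrawlSpider, add_seed would be called from parse_start_url with response.url, 
and filter_links would be passed to a Rule as its process_links callback.

```python
from collections import namedtuple
from urllib.parse import urlparse

# Stand-in for Scrapy's link objects; only the .url attribute is used here.
Link = namedtuple("Link", ["url"])


def host_of(url):
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


class SeedDomainFilter:
    """Remembers the hosts of seed URLs and filters links against them."""

    def __init__(self):
        self.allowed = set()

    def add_seed(self, url):
        # Step 1: record the domain of each start URL as it is processed.
        host = host_of(url)
        if host:
            self.allowed.add(host)

    def filter_links(self, links):
        # Step 2: keep only links whose host is a recorded seed domain.
        return [link for link in links if self._is_allowed(host_of(link.url))]

    def _is_allowed(self, host):
        # Accept a seed host itself and any of its subdomains.
        return any(host == d or host.endswith("." + d) for d in self.allowed)
```

One thing to watch: if a seed URL redirects, the host recorded in step 1 is 
that of the final response, which may or may not be what you want.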
>
> In any case, any comments on this approach welcomed or suggestions on how 
> to achieve the above.
>
> Thanks,
>
> -- 
> You received this message because you are subscribed to the Google Groups 
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to scrapy-users...@googlegroups.com.
> To post to this group, send email to scrapy...@googlegroups.com.
> Visit this group at https://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>
>
>
