I'm trying to implement a spider which will:

a. Pull URLs from a queue of some sort
b. Only crawl those sites

It's essentially a broad crawl in that it is designed to look at any site, 
but I want to be able to limit the sites rather than letting it crawl the 
whole web.

I had experimented with a RabbitMQ-based solution, but have recently been 
trying scrapy-redis. This generally works very well. However, the spider 
attempts to crawl sites other than those specified, because 
self.allowed_domains never gets set, and so the OffsiteMiddleware never 
filters anything.
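
For what it's worth, the relevant part of the setup is that the start URLs 
come from a Redis list at runtime rather than being declared on the spider, 
so there is nowhere obvious to set allowed_domains ahead of time. Roughly 
(spider and key names here are just illustrative):

    from scrapy_redis.spiders import RedisSpider


    class SiteSpider(RedisSpider):
        name = 'site_spider'
        # start URLs are popped from this Redis list as they are pushed,
        # so the spider has no allowed_domains when it starts up
        redis_key = 'site_spider:start_urls'

        def parse(self, response):
            # extraction logic goes here
            pass

with URLs queued via something like:

    redis-cli lpush 'site_spider:start_urls' 'http://example.com/'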

I implemented a workaround for this; I wanted to present it both in case it 
is useful and to see whether anybody has found a better solution to this 
problem.

What I did was:

1. Modify the parse_start_url function to add the domain in question
2. Use a filter_links callback to only allow links from that domain

I guess I could also override make_requests_from_url or similar to achieve 
the same thing?
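
Roughly, the workaround looks like this (a simplified sketch; the spider 
name, redis_key and parse_page callback are placeholders, and I'm assuming 
a reasonably recent Scrapy/scrapy-redis). As far as I can tell the 
OffsiteMiddleware only reads allowed_domains once, when the spider opens, 
so updating it at runtime doesn't re-arm the middleware; the filter_links 
check does the actual filtering:

    from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule
    from scrapy_redis.spiders import RedisCrawlSpider


    class QueueSpider(RedisCrawlSpider):
        name = 'queue_spider'
        redis_key = 'queue_spider:start_urls'

        rules = (
            # filter_links is the process_links hook from step 2 above
            Rule(LinkExtractor(), callback='parse_page',
                 process_links='filter_links', follow=True),
        )

        def __init__(self, *args, **kwargs):
            super(QueueSpider, self).__init__(*args, **kwargs)
            # domains are only known at runtime, one per queued start URL
            self.allowed_domains = set()

        def parse_start_url(self, response):
            # step 1: remember the domain of each start URL pulled from Redis
            self.allowed_domains.add(urlparse(response.url).netloc)
            return self.parse_page(response)

        def filter_links(self, links):
            # step 2: keep only links whose host matches a queued domain
            # (exact host match; subdomains would need extra handling)
            return [link for link in links
                    if urlparse(link.url).netloc in self.allowed_domains]

        def parse_page(self, response):
            # actual extraction goes here
            pass

Each new domain becomes crawlable as soon as the first response for its 
queued URL comes back through parse_start_url.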

In any case, any comments on this approach, or suggestions on other ways to 
achieve the above, are welcome.

Thanks,
