You could implement a spider middleware that fetches a domain whitelist from
Redis during initialization and, when the whitelist changes later, uses
Redis Pub/Sub to pick up the update. If a URL's domain isn't in the
whitelist, discard the Request, no matter where it comes from.
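
A minimal sketch of such a middleware (the Redis key 'domain:whitelist', the
channel 'domain:whitelist:updates', and the '+domain'/'-domain' message
convention are all assumptions here, not anything scrapy or scrapy-redis
provides):

    import redis
    from urllib.parse import urlparse
    from scrapy import Request

    class RedisWhitelistMiddleware(object):
        """Drop any Request whose domain is not in a Redis-backed whitelist."""

        def __init__(self, redis_url):
            self.client = redis.StrictRedis.from_url(redis_url)
            # Initial whitelist, loaded once from an (assumed) Redis set.
            self.whitelist = {d.decode() for d in
                              self.client.smembers('domain:whitelist')}
            # Listen for later changes on an (assumed) pub/sub channel.
            self.pubsub = self.client.pubsub()
            self.pubsub.subscribe('domain:whitelist:updates')

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.get('REDIS_URL',
                                            'redis://localhost:6379'))

        def _drain_updates(self):
            # Apply any pending pub/sub messages, e.g. "+example.com" to add
            # a domain, "-example.com" to remove one.
            while True:
                msg = self.pubsub.get_message(ignore_subscribe_messages=True)
                if msg is None:
                    return
                data = msg['data'].decode()
                if data.startswith('+'):
                    self.whitelist.add(data[1:])
                elif data.startswith('-'):
                    self.whitelist.discard(data[1:])

        def _allowed(self, url):
            domain = urlparse(url).netloc
            return any(domain == d or domain.endswith('.' + d)
                       for d in self.whitelist)

        def process_spider_output(self, response, result, spider):
            self._drain_updates()
            for item in result:
                # Filter Requests only; items pass through untouched.
                if isinstance(item, Request) and not self._allowed(item.url):
                    continue
                yield item

Enabled through SPIDER_MIDDLEWARES, this filters the Requests coming out of
every spider callback, regardless of dont_filter or allowed_domains.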

2015-12-17 22:03 GMT+08:00 somewhatofftheway <benjamins...@gmail.com>:

> Exactly, that's what I'm trying to work around. My solution does work; I
> was just interested in whether anybody had tried other approaches.
>
> On Thursday, December 17, 2015 at 2:59:58 AM UTC, lnxpgn wrote:
>>
>>
>> If the dont_filter argument in Request is True, or the spider's
>> allowed_domains is empty, OffsiteMiddleware does nothing.
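>>
>> For example (a sketch), with allowed_domains set the offsite filter is
>> active, yet a Request created with dont_filter=True still escapes it:
>>
>>     import scrapy
>>
>>     class MySpider(scrapy.Spider):
>>         name = 'example'
>>         allowed_domains = ['example.com']
>>
>>         def parse(self, response):
>>             # Bypasses OffsiteMiddleware despite allowed_domains:
>>             yield scrapy.Request('http://other.com/', dont_filter=True)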
>>
>> On 15-12-15 2:52 AM, somewhatofftheway wrote:
>>
>> I'm trying to implement a spider which will:
>>
>> a. Pull URLs from a queue of some sort
>> b. Only crawl those sites
>>
>> It's essentially a broad crawl in that it is designed to look at any
>> site, but I want to be able to limit the sites rather than letting it crawl
>> the whole web.
>>
>> I had experimented with a RabbitMQ-based solution, but have recently been
>> trying scrapy-redis. This seems to generally work very well. However, it
>> attempts to crawl sites other than those specified, as self.allowed_domains
>> does not get set and therefore the OffsiteMiddleware does not trigger.
>>
>> I implemented a workaround for this; I wanted to present it both in case
>> it is useful and to see if anybody has found a better solution to this
>> problem.
>>
>> What I did was:
>>
>> 1. Modify the parse_start_url function to add the domain in question
>> 2. Use a filter_links callback to only allow links from that domain
>>
>> I guess I could also override make_requests_from_url or similar to
>> achieve the same thing?
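>>
>> For reference, steps 1 and 2 look roughly like this (a sketch; the names
>> QueueSpider and parse_page are just illustrative, and with scrapy-redis
>> the base class would be RedisCrawlSpider, but the idea is the same):
>>
>>     from urllib.parse import urlparse
>>     from scrapy.linkextractors import LinkExtractor
>>     from scrapy.spiders import CrawlSpider, Rule
>>
>>     class QueueSpider(CrawlSpider):
>>         name = 'queue_spider'
>>         rules = (
>>             Rule(LinkExtractor(), callback='parse_page',
>>                  process_links='filter_links', follow=True),
>>         )
>>
>>         def __init__(self, *args, **kwargs):
>>             super().__init__(*args, **kwargs)
>>             self.allowed = set()
>>
>>         def parse_start_url(self, response):
>>             # 1. record the domain of each start URL pulled off the queue
>>             self.allowed.add(urlparse(response.url).netloc)
>>             return []
>>
>>         def filter_links(self, links):
>>             # 2. only keep links whose domain was recorded above
>>             return [link for link in links
>>                     if urlparse(link.url).netloc in self.allowed]
>>
>>         def parse_page(self, response):
>>             pass  # extraction logic goes here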
>>
>> In any case, any comments on this approach are welcome, as are suggestions
>> on how better to achieve the above.
>>
>> Thanks,
>>
