Hi Antoine,

You can override the start_requests method of your spider. The default 
implementation 
<https://github.com/scrapy/scrapy/blob/ebef6d7c6dd8922210db8a4a44f48fe27ee0cd16/scrapy/spiders/__init__.py#L68> 
explicitly disables filtering:

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        # dont_filter=True tells the scheduler to skip the dupe filter
        return Request(url, dont_filter=True)

You can change it to the following (Request defaults to dont_filter=False 
<https://github.com/scrapy/scrapy/blob/d42a98d3b590515bae30fb698e7aba2d7511608e/scrapy/http/request/__init__.py#L21>), 
so the start URLs go through the dupe filter like any other request:

    def start_requests(self):
        for url in self.start_urls:
            # no dont_filter here, so duplicate start URLs are dropped
            yield Request(url)
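
For example, a complete spider could look like this (a minimal sketch; the 
spider name and URLs are made-up placeholders):

    import scrapy
    from scrapy import Request

    class MySpider(scrapy.Spider):
        # hypothetical name and start_urls, for illustration only
        name = "myspider"
        start_urls = [
            "http://example.com/page",
            "http://example.com/page",  # duplicate: now filtered out
        ]

        def start_requests(self):
            # no dont_filter=True, so RFPDupeFilter applies to start_urls too
            for url in self.start_urls:
                yield Request(url, callback=self.parse)

        def parse(self, response):
            self.logger.info("Crawled %s", response.url)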



Regards,
Paul.

On Monday, May 2, 2016 at 10:04:34 PM UTC+2, Antoine Brunel wrote:
>
> Hello,
>
> I found out that Scrapy's duplicate url filter RFPDupeFilter is disabled 
> for urls set in start_urls. 
> How can I enable it?
>
> Thanks!
>
