Re: RFPDupeFilter doesn't work

Paul Tremberth Tue, 15 Jul 2014 02:21:13 -0700

Hello,

RFPDupeFilter is enabled by default, and dont_filter is set to False by
default when instantiating Request objects; see
https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/__init__.py#L22


> After some investigation, I realized that overridden request_seen() 
method is never called, and this is happening because dont_filter variable 
is set to True.

The one notable place where it is set to True explicitly is for start URLs,
where make_requests_from_url() is used:
https://github.com/scrapy/scrapy/blob/master/scrapy/spider.py#L52

Are you also seeing unfiltered duplicates for URLs that are not in start_urls?

If you want start_urls to be filtered as well, one way is to override
start_requests().
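
To make the mechanism concrete, here is a minimal, standard-library-only
sketch. It is a toy model, not Scrapy's actual code: the Request, DupeFilter
and Scheduler classes and the helper functions are hypothetical stand-ins, and
only the condition inside enqueue_request() is taken from the scheduler line
quoted below in this thread. It shows that request_seen() is never consulted
when dont_filter is True, and that building start requests without that flag
lets the filter do its job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for a request: a URL plus the dont_filter flag."""
    url: str
    dont_filter: bool = False  # default is False, as described in this thread


class DupeFilter:
    """Toy duplicate filter that remembers the URLs it has seen."""

    def __init__(self):
        self.seen = set()
        self.calls = 0  # how many times request_seen() was consulted

    def request_seen(self, request):
        self.calls += 1
        if request.url in self.seen:
            return True
        self.seen.add(request.url)
        return False


class Scheduler:
    """Toy scheduler that applies the check quoted in this thread."""

    def __init__(self, dupefilter):
        self.df = dupefilter
        self.queue = []

    def enqueue_request(self, request):
        # Short-circuit: if dont_filter is True, request_seen() is never called.
        if not request.dont_filter and self.df.request_seen(request):
            return False
        self.queue.append(request)
        return True


def default_start_requests(start_urls):
    """Models the default behaviour: start URLs bypass the filter."""
    return [Request(url, dont_filter=True) for url in start_urls]


def filtered_start_requests(start_urls):
    """Models an overridden start_requests(): dont_filter keeps its default."""
    return [Request(url) for url in start_urls]


def crawl(requests):
    """Enqueue all requests; return (number queued, request_seen() calls)."""
    df = DupeFilter()
    scheduler = Scheduler(df)
    for request in requests:
        scheduler.enqueue_request(request)
    return len(scheduler.queue), df.calls
```

In a real spider, the equivalent override would presumably have
start_requests() yield a Request for each URL in self.start_urls without
passing dont_filter=True, so that the default of False applies and a custom
request_seen() gets called.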


On Tuesday, July 15, 2014 12:59:44 AM UTC+2, Sungmin Lee wrote:
>
> Hi,
>
> Last night, I was trying to use RFPDupeFilter to discard duplicate urls.
>
> I implemented a class inheriting RFPDupeFilter and overrode request_seen() 
> method.
> After linking the custom class to settings.py, I tested the code, but the 
> crawler still scraped all duplicate urls.
>
> After some investigation, I realized that overridden request_seen() method 
> is never called, and this is happening because dont_filter variable is set 
> to True.
>
> This is odd, because according to the Scrapy documentation it is supposed 
> to be set to False:
>
>    - *dont_filter* (*boolean*) – indicates that this request should not 
>    be filtered by the scheduler. This is used when you want to perform an 
>    identical request multiple times, to ignore the duplicates filter. Use it 
>    with care, or you will get into crawling loops. Default to False.
>
> Just to test, I ended up changing a bit of scrapy code at 
> https://github.com/scrapy/scrapy/blob/master/scrapy/core/scheduler.py#L48
> from 
>     if not request.dont_filter and self.df.request_seen(request):
> to
>     if self.df.request_seen(request):
>
> After that change, the dupefilter finally started to work.
>
>
> Why is this happening? Why is dont_filter value set to True by default?
>
> Is there a neater solution than changing the Scrapy library itself?
>
