Hello, RFPDupeFilter is enabled by default, and dont_filter defaults to False when instantiating Request objects: see https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/__init__.py#L22
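For reference, a minimal sketch of how a custom dupe filter is usually wired up; the module path myproject/dupefilters.py and the class name MyDupeFilter are just placeholders, and on the 0.24 line the import is scrapy.dupefilter (it became scrapy.dupefilters in later releases):

    # myproject/dupefilters.py  (module path and class name are placeholders)
    import logging

    from scrapy.dupefilter import RFPDupeFilter  # scrapy.dupefilters in newer Scrapy

    logger = logging.getLogger(__name__)


    class MyDupeFilter(RFPDupeFilter):
        """request_seen() is the hook the scheduler calls for every
        request whose dont_filter attribute is False."""

        def request_seen(self, request):
            seen = super(MyDupeFilter, self).request_seen(request)
            if seen:
                # The fingerprint was already recorded, so the scheduler
                # will drop this request.
                logger.debug("Duplicate request dropped: %s", request.url)
            return seen

and in settings.py:

    DUPEFILTER_CLASS = 'myproject.dupefilters.MyDupeFilter'

The scheduler only consults this filter for requests created with dont_filter=False, which is why start URLs behave differently (see below).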
> After some investigation, I realized that overridden request_seen() method is never called, and this is happening because dont_filter variable is set to True.

The one notable place where it is set to True explicitly is for start URLs, where make_requests_from_url() is used: https://github.com/scrapy/scrapy/blob/master/scrapy/spider.py#L52

Are you seeing this non-filtering for URLs other than start_urls? If you want to filter start_urls as well, one way is to override start_requests() (see the sketch at the bottom of this message).

On Tuesday, July 15, 2014 12:59:44 AM UTC+2, Sungmin Lee wrote:
>
> Hi,
>
> Last night I was trying to use RFPDupeFilter to discard duplicate URLs. I implemented a class inheriting from RFPDupeFilter and overrode the request_seen() method. After linking the custom class in settings.py, I tested the code, but the crawler still scraped all the duplicate URLs.
>
> After some investigation, I realized that the overridden request_seen() method is never called, and this is happening because the dont_filter variable is set to True.
>
> That is weird; according to the Scrapy documentation, it is supposed to be set to False:
>
> - *dont_filter* (*boolean*) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
>
> Just to test, I ended up changing a bit of Scrapy code at https://github.com/scrapy/scrapy/blob/master/scrapy/core/scheduler.py#L48 from
>
>     if not request.dont_filter and self.df.request_seen(request):
>
> to
>
>     if self.df.request_seen(request):
>
> and finally the dupefilter started to work.
>
> Why is this happening? Why is dont_filter set to True by default? Is there any neater solution than changing the original Scrapy library?
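To make the start_requests() suggestion concrete, here is a rough sketch; the spider name and URLs are made up, and the point is simply that the requests keep the default dont_filter=False:

    import scrapy


    class DedupSpider(scrapy.Spider):
        name = 'dedup_example'  # placeholder name
        start_urls = [
            'http://example.com/page',
            'http://example.com/page',  # duplicate on purpose
        ]

        def start_requests(self):
            # Unlike make_requests_from_url(), which sets dont_filter=True,
            # these requests keep the default dont_filter=False, so the
            # dupe filter also applies to the start URLs.
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            pass  # your parsing logic

With this, the second start URL is discarded by RFPDupeFilter (or your subclass) instead of being downloaded twice.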