> > Dennis Kubes wrote:
> > If you are using more than one index then dedup will not work across
> > indexes.
>
> This is incorrect. DeleteDuplicates works just fine with multiple
> indexes, assuming you process all indexes in the same run of
> DeleteDuplicates, so that it has a global view of all input indexes.
>
> > A single index should dedup correctly unless the pages are not
> > exact duplicates but near duplicates. The dedup process works on url
> > and byte hash. If the content is even 1 byte different, it doesn't work.
>
> This depends on the implementation of Signature. Indeed, the default
> MD5HashSignature works this way.
>
> > Near duplicate detection is another set of algorithms that haven't been
> > implemented in Nutch yet.
>
> Well, the existing TextProfileSignature can be used as a form of (crude)
> near-duplicate detection, precisely because it is tolerant to small
> changes in the input text.

Thanks Andrzej. How do you tell Nutch to use the TextProfileSignature instead of MD5HashSignature for deduplicating?

> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
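
To illustrate the distinction discussed in this thread, here is a minimal Java sketch of an exact byte-hash signature next to a crude text-profile signature. It is not Nutch's actual MD5HashSignature or TextProfileSignature code: the class name SignatureSketch, the method names, and the minLen/minCount parameters are all hypothetical, and the profile logic is a deliberate simplification (hash only the words that occur often enough) meant to show why such a signature tolerates small changes in the input text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: hypothetical names, not part of the Nutch API.
final class SignatureSketch {

    // Hex-encoded MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Exact signature: a one-byte difference in content changes the result.
    static String exactSignature(String content) {
        return md5Hex(content.getBytes(StandardCharsets.UTF_8));
    }

    // Crude profile signature (assumed simplification): hash the sorted list
    // of words that are at least minLen characters long and occur at least
    // minCount times. Rare words, such as a changing date, are ignored.
    static String profileSignature(String content, int minLen, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        String lower = content.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^\\p{L}\\p{Nd}]+")) {
            if (token.length() >= minLen) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<String> frequent = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                frequent.add(e.getKey());
            }
        }
        Collections.sort(frequent);
        return md5Hex(String.join("\n", frequent).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Two pages that differ only in a trailing date.
        String a = "Nutch crawls the web. Nutch indexes the web. Updated 2008-01-01.";
        String b = "Nutch crawls the web. Nutch indexes the web. Updated 2008-01-02.";
        System.out.println("exact match:   "
                + exactSignature(a).equals(exactSignature(b)));
        System.out.println("profile match: "
                + profileSignature(a, 3, 2).equals(profileSignature(b, 3, 2)));
    }
}
```

With these two sample pages the exact signatures differ, because one byte of the date changed, while the profile signatures agree, because the date tokens occur too rarely to enter the profile. The thresholds 3 and 2 are arbitrary choices for the example.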
