Edward Quick wrote:
Dennis Kubes wrote:
If you are using more than one index then dedup will not work across
indexes.
This is incorrect. DeleteDuplicates works just fine with multiple
indexes, assuming you process all indexes in the same run of
DeleteDuplicates, so that it has a global view of all input indexes.
A single index should dedup correctly unless the pages are near
duplicates rather than exact duplicates. The dedup process works on URL
and byte hash. If the content differs by even one byte, the pages are
not treated as duplicates.
This depends on the implementation of Signature. Indeed, the default
MD5HashSignature works this way.
Near-duplicate detection requires a different set of algorithms, which
haven't been implemented in Nutch yet.
Well, the existing TextProfileSignature can be used as a form of (crude)
near-duplicate detection, precisely because it is tolerant to small
changes in the input text.
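
To illustrate the difference, here is a minimal Python sketch that contrasts
a byte-level hash with a signature computed from a quantized term-frequency
profile. It is a simplified illustration of the general idea only, not
Nutch's actual TextProfileSignature code: the function names, the `quant`
and `min_token_len` parameters, and the quantization rule are all
assumptions made for this example.

```python
import hashlib
import re
from collections import Counter


def exact_signature(content: bytes) -> str:
    """Byte-level hash: any single-byte change yields a different signature."""
    return hashlib.md5(content).hexdigest()


def profile_signature(text: str, quant: int = 2, min_token_len: int = 2) -> str:
    """Hash a quantized term-frequency profile instead of the raw bytes.

    Hypothetical, simplified scheme: the parameters and the quantization
    rule are assumptions, not Nutch's real settings.
    """
    # Lowercase, keep only alphanumeric tokens, drop very short ones
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", text.lower())
        if len(t) >= min_token_len
    ]
    counts = Counter(tokens)
    # Round each frequency down to a multiple of `quant`; rare tokens fall to 0
    profile = {tok: (n // quant) * quant for tok, n in counts.items()}
    profile = {tok: n for tok, n in profile.items() if n > 0}
    # Stable ordering: frequency descending, then token ascending
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    canonical = "\n".join(f"{tok} {n}" for tok, n in ordered)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Two pages that differ only in one rarely used word
page_a = "Nutch crawls the web. Nutch indexes the web. Updated Monday."
page_b = "Nutch crawls the web. Nutch indexes the web. Updated Tuesday."

same_exact = exact_signature(page_a.encode()) == exact_signature(page_b.encode())
same_profile = profile_signature(page_a) == profile_signature(page_b)
```

In this sketch the two pages get different byte-level signatures but the
same profile signature, because the one word that differs occurs too
rarely to survive quantization.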
Thanks Andrzej.
How do you tell Nutch to use the TextProfileSignature instead of
MD5HashSignature for deduplicating?
Set the db.signature.class property in your nutch-site.xml.
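
A sketch of what that override might look like in nutch-site.xml. The
fully qualified class name below is an assumption based on the usual
Nutch package layout; check nutch-default.xml in your own installation
for the exact value before relying on it.

```xml
<!-- Assumed class name; verify against nutch-default.xml for your version -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```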
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com