NG-Marketing, M.Schneider wrote:
DeleteDuplicates removes documents having the same digest or the same
url. If you use TextProfileSignature instead of MD5Signature, it
will also remove near-duplicate documents. The MD5Signature class sets
the digest to the MD5 of the entire content, whereas
TextProfileSignature sets the digest to the MD5 of the significant
terms. Check the class for implementation details. Also look at
SignatureFactory for how to change the configuration.
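
To illustrate the difference between the two kinds of digest, here is a
minimal sketch in plain Java. It is not the Nutch code: the class
DigestSketch, its methods, and the minLen/minCount parameters are all
hypothetical, and the "significant terms" rule used here (lowercase
tokens of a minimum length that occur a minimum number of times) is a
simplifying assumption. Check the TextProfileSignature source for what
it actually does.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch.
public class DigestSketch {

    // Hex-encoded MD5 of a string (UTF-8 bytes).
    public static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Exact digest: any change in the content changes the digest.
    public static String exactDigest(String content) {
        return md5Hex(content);
    }

    // Profile digest (assumed rule): keep lowercase tokens of at least
    // minLen characters that occur at least minCount times, sort them,
    // and hash the resulting term list instead of the raw content.
    public static String profileDigest(String content, int minLen, int minCount) {
        Map<String, Integer> freq = new HashMap<>();
        for (String tok : content.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (tok.length() >= minLen) {
                freq.merge(tok, 1, Integer::sum);
            }
        }
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() >= minCount) {
                terms.add(e.getKey());
            }
        }
        Collections.sort(terms);
        return md5Hex(String.join(" ", terms));
    }

    public static void main(String[] args) {
        String a = "Nutch crawls the web. Nutch indexes the web.";
        String b = "NUTCH crawls the web!  Nutch indexes the web, quickly.";
        System.out.println("exact digests equal:   "
                + exactDigest(a).equals(exactDigest(b)));
        System.out.println("profile digests equal: "
                + profileDigest(a, 3, 2).equals(profileDigest(b, 3, 2)));
    }
}
```

With a digest of the second kind, two pages that differ only in case,
punctuation, or a few rare words end up with the same digest, so a
digest-based dedup step treats them as duplicates.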

DeleteDuplicates does NOT delete documents with the same URL; it compares
only the digest. See NUTCH-371:
http://www.mail-archive.com/[email protected]/msg04635.html

Erhm, please see the followup here: http://issues.apache.org/jira/browse/NUTCH-371 . This issue is fixed now.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

