> > DeleteDuplicates removes documents having the same digest or the same URL. If you use TextProfileSignature instead of MD5Signature, it will also remove near-duplicate documents. The MD5Signature class sets the digest to the MD5 of the entire content, whereas TextProfileSignature sets the digest to the MD5 of the significant terms. You should check the classes for implementation details. Also look at SignatureFactory for how to change the configuration.
DeleteDuplicates does NOT delete documents with the same URL; it compares only the digest. See NUTCH-371: http://www.mail-archive.com/[email protected]/msg04635.html

In fact, some of my important URLs appear in every single segment. This should not happen, because I generate with the topN option; maybe topN doesn't consult the crawldb, or something along those lines.
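
To make the digest-only behaviour concrete, here is a minimal Java sketch. It is not Nutch's code: the class DigestDedupSketch, the Doc holder, and the token-count "profile" are hypothetical simplifications that I am assuming only for illustration. The real MD5Signature and TextProfileSignature classes should be consulted for the actual algorithms.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class DigestDedupSketch {

    /** Hypothetical document holder: a URL plus its fetched text. */
    public static final class Doc {
        public final String url;
        public final String content;
        public Doc(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    /** Hex-encoded MD5 of the UTF-8 bytes of the given text. */
    public static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Exact signature: digest of the whole content, so any change alters it. */
    public static String exactSignature(String content) {
        return md5Hex(content);
    }

    /**
     * Simplified "profile" signature (an assumption, not Nutch's algorithm):
     * lowercase the text, split on non-alphanumerics, count each token,
     * and digest the sorted token:count list. Case, punctuation and
     * whitespace differences therefore do not change the signature.
     */
    public static String profileSignature(String content) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tok : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (tok.length() >= 2) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            profile.append(e.getKey()).append(':').append(e.getValue()).append(' ');
        }
        return md5Hex(profile.toString());
    }

    /**
     * Digest-only deduplication: keeps the first document seen for each
     * digest. The URL is never compared, so two documents with the same
     * URL but different content both survive.
     */
    public static List<Doc> dedupByDigest(List<Doc> docs) {
        Map<String, Doc> byDigest = new LinkedHashMap<>();
        for (Doc d : docs) {
            byDigest.putIfAbsent(exactSignature(d.content), d);
        }
        return new ArrayList<>(byDigest.values());
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("http://example.com/a", "hello"));
        docs.add(new Doc("http://example.com/a", "hello world")); // same URL, new content
        docs.add(new Doc("http://example.com/b", "hello"));       // other URL, same content
        for (Doc d : dedupByDigest(docs)) {
            System.out.println(d.url + " " + exactSignature(d.content));
        }
    }
}
```

In this sketch the two example.com/a entries both survive because their digests differ, while example.com/b is dropped because its content matches the first document. That mirrors the situation I described: the same URL can show up in several segments and digest-only comparison will not collapse it unless the content is identical.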
