It will remove the one with the lowest score in the crawldb, as set by
the scoring filters. Dedup first removes duplicates by URL, then by
content hash. If the content is changed even slightly, though, it will
*not* be detected as a duplicate. Solving that problem is called
near-duplicate detection (NDD) and uses an algorithm called shingling,
which isn't currently implemented in Nutch (but hopefully will be in
the near future).
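To illustrate the shingling idea mentioned above (this is just a minimal
sketch, not Nutch code): each document is broken into overlapping w-word
"shingles", and two documents are compared by the Jaccard similarity of
their shingle sets. A near-duplicate with a one-word change still shares
most of its shingles, so it scores well above an unrelated document.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles of `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"  # one word changed

# Identical documents score 1.0; the near-duplicate still scores 0.4
# here, while unrelated text would score near 0. A real system would
# pick a threshold (and usually hash the shingles) to flag duplicates.
print(jaccard(shingles(doc1), shingles(doc2)))
```

A production implementation would hash shingles (e.g. MinHash) rather
than compare raw sets, but the comparison logic is the same.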
Dennis
Patrick Markiewicz wrote:
Hi,
If I have a URL http://www.example.com/index.html stored in
my index with the content "EMPTY FILE", and I have a file
http://www.domain.com/index.html with the same content "EMPTY FILE",
then the two files are duplicates. Which one will the de-duplication
process remove from the index? Thanks.
Patrick