It will remove the one with the lower score in the crawldb, as assigned by the scoring filters. Dedup first removes duplicates by URL, then by content hash. Note, though, that if the content changes even slightly, the pages will *not* be detected as duplicates, because the hashes no longer match. Solving that problem is called near-duplicate detection (NDD) and uses an algorithm called shingling, which isn't currently implemented in Nutch (but hopefully will be in the near future).
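
For illustration, here's a minimal sketch of the "keep the highest score
per content hash" step. This is not Nutch's actual DeleteDuplicates code;
the Doc class, the dedup() method, and the scores are made up for the
example:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class DedupSketch {

    // Stand-in for an indexed page: a URL, its content, and the
    // score assigned by the scoring filters.
    static class Doc {
        final String url;
        final String content;
        final float score;
        Doc(String url, String content, float score) {
            this.url = url; this.content = content; this.score = score;
        }
    }

    // Hex MD5 of the page content, the signature used for exact dedup.
    static String md5(String content) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
            .digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // For each content hash, keep only the highest-scoring document;
    // every lower-scoring duplicate is dropped.
    static Map<String, Doc> dedup(Doc[] docs) throws Exception {
        Map<String, Doc> best = new HashMap<String, Doc>();
        for (Doc d : docs) {
            String hash = md5(d.content);
            Doc kept = best.get(hash);
            if (kept == null || d.score > kept.score) {
                best.put(hash, d);
            }
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        Doc[] docs = {
            new Doc("http://www.example.com/index.html", "EMPTY FILE", 0.8f),
            new Doc("http://www.domain.com/index.html",  "EMPTY FILE", 0.5f),
        };
        // Both pages hash identically, so only one survives: the
        // example.com page, because its score is higher in this example.
        for (Doc d : dedup(docs).values()) System.out.println(d.url);
    }
}

And here's a toy version of the shingling idea (again hypothetical, not
anything currently in Nutch): split each page into overlapping word
k-grams and compare the two sets. A one-word edit still leaves many
shingles in common, whereas the exact MD5 signatures would share nothing:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ShinglingSketch {

    // All overlapping k-word shingles of the text.
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().split("\\s+");
        Set<String> out = new HashSet<String>();
        for (int i = 0; i + k <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return out;
    }

    // Jaccard similarity: |intersection| / |union| of the shingle sets.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<String>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> a = shingles("the quick brown fox jumps over the lazy dog", 3);
        Set<String> b = shingles("the quick brown fox leaps over the lazy dog", 3);
        // Prints 0.4: substantial overlap despite the edit, while the
        // exact content hashes of the two sentences differ completely.
        System.out.println(jaccard(a, b));
    }
}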

Dennis

Patrick Markiewicz wrote:
Hi,

If I have a URL http://www.example.com/index.html stored in my index with
the content "EMPTY FILE", and I have a file http://www.domain.com/index.html
with the same content "EMPTY FILE", then the two files are duplicates.
Which one will the de-duplication process remove from the index?  Thanks.

Patrick

