Hi,

I'm trying to implement a duplicate detection method that doesn't simply drop duplicate pages. Essentially, I want to be able to display all the duplicate URLs for a page in the search results instead of just the one that was kept in the index.

There are at least two ways I can think of to implement this.

1. Offline duplicate detection that deletes the duplicate pages from the index but stores references to them with the copy that is kept. The search results can then display all the URLs that share the same content.

2. Duplicate detection at search time that groups identical/similar pages together. This method has the advantage that the grouping could be made sensitive to the query terms; however, it would add a performance penalty to every search. (The grouping step is sketched below.)
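
To make option 2 concrete, here is a plain-Java sketch of the grouping step, independent of the actual Nutch searcher API. SearchHit, its url/signature fields, and groupBySignature are hypothetical names used only for illustration:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

  // Hypothetical stand-in for a search hit: a URL plus its content signature.
  static class SearchHit {
    final String url;
    final String signature;  // e.g. hex-encoded MD5 of the page content

    SearchHit(String url, String signature) {
      this.url = url;
      this.signature = signature;
    }
  }

  // Group hits that share a content signature, preserving ranking order.
  // Each map entry is one result to display, with all duplicate URLs attached.
  static Map<String, List<String>> groupBySignature(List<SearchHit> hits) {
    Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();
    for (SearchHit hit : hits) {
      List<String> urls = groups.get(hit.signature);
      if (urls == null) {
        urls = new ArrayList<String>();
        groups.put(hit.signature, urls);
      }
      urls.add(hit.url);
    }
    return groups;
  }
}

Each map entry would then become one displayed result, with all the duplicate URLs listed under it.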

I'm not very familiar with the Nutch API, though I know there is an MD5-signature-based dedup step in place and a Signature class that can be extended for offline duplicate detection (a sketch of what I mean is below). I was wondering if anyone has tried search-time deduping and what the good places to implement it would be.
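
For example, something along these lines is what I imagine for a custom Signature, assuming the org.apache.nutch.crawl.Signature base class exposes an abstract byte[] calculate(Content, Parse) method as in the Nutch sources. The class name NormalizedTextSignature and the whitespace/case normalization are made up purely for illustration:

import java.security.MessageDigest;

import org.apache.nutch.crawl.Signature;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

public class NormalizedTextSignature extends Signature {

  public byte[] calculate(Content content, Parse parse) {
    // Collapse whitespace and lower-case the parsed text so that pages
    // differing only in markup or whitespace hash to the same signature.
    String text = parse.getText().toLowerCase().replaceAll("\\s+", " ").trim();
    try {
      return MessageDigest.getInstance("MD5").digest(text.getBytes("UTF-8"));
    } catch (Exception e) {
      throw new RuntimeException("MD5 digest failed", e);
    }
  }
}

The idea would be that near-identical pages produce the same signature, which either the offline dedup or a search-time grouping step could key on.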

Any other suggestions/advice would be great.

Thanks,
  - Shailesh

