[Nutch-dev] Re: Duplicate Detection: Offline vs. Search Time

Shailesh Kochhar Sun, 16 Apr 2006 17:21:14 -0700

Doug Cutting wrote:

Shailesh Kochhar wrote:
I'm not very familiar with the Nutch API, though I know there's an MD5 signature-based deduping method in place and a Signature class to extend for offline duplicate detection. I was wondering if anyone had tried search-time deduping and what would be good places to try to implement it.
Nutch already does search-time deduping. By default it limits things to two hits per host, but you can dedup by other fields and with other per-dup counts. This is available through NutchBean:
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/NutchBean.html#search(org.apache.nutch.searcher.Query,%20int,%20int,%20java.lang.String)
and through the OpenSearch servlet.

If I understand this correctly, you can only dedup by one field. This would mean that if you were to implement and use content-based deduplication, you'd have to give up limiting the number of hits per host.


Is this correct, or did I miss something?

  - Shailesh
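
To make the question concrete, here is a minimal sketch of what limiting hits by one field at search time looks like, and how two such passes could be chained so that content-based deduping and a per-host limit both apply. This is plain Java for illustration only, not Nutch code: the Hit class, the limitPerKey method, and the signature values are all hypothetical, and it assumes each hit carries a content signature computed at index time. Whether NutchBean can do the equivalent in a single call is exactly what is being asked above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SearchTimeDedup {

    /** Hypothetical stand-in for a search hit; not a Nutch class. */
    public static final class Hit {
        public final String url;
        public final String host;
        public final String signature; // assumed: content hash stored at index time

        public Hit(String url, String host, String signature) {
            this.url = url;
            this.host = host;
            this.signature = signature;
        }
    }

    /**
     * Keeps at most maxPerKey hits for each distinct key value,
     * preserving the incoming rank order.
     */
    public static List<Hit> limitPerKey(List<Hit> ranked,
                                        Function<Hit, String> key,
                                        int maxPerKey) {
        Map<String, Integer> seen = new HashMap<>();
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : ranked) {
            // Count how many hits with this key value have been seen so far.
            int count = seen.merge(key.apply(hit), 1, Integer::sum);
            if (count <= maxPerKey) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Hit> ranked = Arrays.asList(
            new Hit("http://a.example/1", "a.example", "sig1"),
            new Hit("http://a.example/2", "a.example", "sig2"),
            new Hit("http://b.example/1", "b.example", "sig1"), // same content as a.example/1
            new Hit("http://a.example/3", "a.example", "sig3"),
            new Hit("http://b.example/2", "b.example", "sig4"));

        // Pass 1: one hit per content signature. Pass 2: two hits per host.
        List<Hit> byContent = limitPerKey(ranked, h -> h.signature, 1);
        List<Hit> result = limitPerKey(byContent, h -> h.host, 2);

        for (Hit h : result) {
            System.out.println(h.url);
        }
    }
}
```

Chaining the passes after retrieval means the second pass can leave fewer hits than were requested, so a real implementation would probably need to over-fetch and re-query, which is one reason doing both inside the searcher would be preferable.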



