Doug Cutting wrote:
Chetan Sahasrabudhe wrote:

I do some processing on my master index.
"dedup" does not guarantee that it will delete newly inserted record or old record.


Newer records are in fact preferred. When there are multiple records with the same url, dedup keeps the record from the last index in the list of segments returned by NutchFileSystem.listFiles(segmentsDir), which is typically the most recent.
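
A minimal sketch of that url pass, assuming segments are visited in listFiles() order (UrlDedup, SegmentRecord, and keepNewest are illustration names, not Nutch's actual API):

import java.util.HashMap;
import java.util.Map;

// Sketch only, not Nutch's actual code: because a later segment's
// record overwrites an earlier one in the map, the last segment in
// listFiles() order (typically the newest) wins for each url.
class UrlDedup {
  static class SegmentRecord {          // hypothetical illustration type
    String url;
    SegmentRecord(String url) { this.url = url; }
  }

  static Map<String, SegmentRecord> keepNewest(SegmentRecord[] inListOrder) {
    Map<String, SegmentRecord> byUrl = new HashMap<>();
    for (SegmentRecord rec : inListOrder)
      byUrl.put(rec.url, rec);          // last write wins
    return byUrl;
  }
}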

When there are multiple records whose content has the same MD5 hash, the record with the higher page score (either from link analysis or simply the number of incoming links) is kept. If page scores are the same (i.e., when no link analysis has been done and link counts are not used), then the record with the shortest url is selected.
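
The tie-breaking in that content-hash pass could be expressed as a comparator along these lines (a sketch only; DedupRecord and its fields are hypothetical names, and the real DeleteDuplicates code differs in detail):

import java.util.Comparator;

// Sketch of the MD5-hash tie-breaking described above; names are
// hypothetical, not Nutch's actual classes.
class DedupRecord {
  float score;   // page score: link analysis or incoming-link count
  String url;

  // Among records sharing the same content MD5, the record that sorts
  // first is the one kept: higher score wins, then the shorter url.
  static final Comparator<DedupRecord> KEEP_FIRST =
      Comparator.comparingDouble((DedupRecord r) -> -r.score)
                .thenComparingInt(r -> r.url.length());
}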

The de-duplication algorithm should be abstracted and separated into a utility method/class - currently both DeleteDuplicates and SegmentMergeTool perform de-duplication, but I'm afraid that each follows a slightly different, hardcoded routine...


SegmentMergeTool uses a simpler algorithm: for all documents with equal url hash or content hash, keep only the latest document. Neither link scores nor url length is taken into account... :-(
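
In rough pseudocode terms, something like this (MergeDedup and Doc are illustration names only, not the tool's real API):

import java.util.HashMap;
import java.util.Map;

// Sketch of the simpler rule: per key (url hash or content hash), the
// latest document overwrites earlier ones; a document survives only if
// it remains the latest under both keys. Names are hypothetical.
class MergeDedup {
  static class Doc {
    long urlHash, contentHash;
  }

  static Map<Long, Doc> latestPerKey(Doc[] inTimeOrder, boolean byContent) {
    Map<Long, Doc> kept = new HashMap<>();
    for (Doc d : inTimeOrder)
      kept.put(byContent ? d.contentHash : d.urlHash, d);  // newest wins
    return kept;
  }
}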

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


