Chetan Sahasrabudhe wrote:
I do some processing on my master index.
"dedup" does not guarantee that it will delete newly inserted record or old 
record.

Newer records are in fact preferred. When there are multiple records with the same url, dedup keeps the one from the last index in the list of segments returned by NutchFileSystem.listFiles(segmentsDir), which is typically the most recent.
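
To illustrate the url rule, here is a minimal sketch. It is not the actual dedup code: the Rec class and dedupByUrl method are hypothetical names made up for this example, and the segments are assumed to arrive already in the order that the listing returns them.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these names come from Nutch.
final class UrlDedupSketch {

    // Hypothetical record: a url plus the name of the segment it was read from.
    static final class Rec {
        final String url;
        final String segment;

        Rec(String url, String segment) {
            this.url = url;
            this.segment = segment;
        }
    }

    // Segments are visited in listing order, so a record from a later
    // segment replaces any earlier record that has the same url.
    static Map<String, Rec> dedupByUrl(List<List<Rec>> segmentsInListOrder) {
        Map<String, Rec> kept = new LinkedHashMap<>();
        for (List<Rec> segment : segmentsInListOrder) {
            for (Rec r : segment) {
                kept.put(r.url, r);
            }
        }
        return kept;
    }
}
```

Note that "last in the list" only means "newest" if the listing order happens to match creation order, which is why the newest record is typically, but not necessarily, the one that survives.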


When there are multiple records whose content has the same MD5 hash, the record with the higher page score (either from link analysis or just the number of incoming links) is kept. If the page scores are the same (i.e., when no link analysis has been done, and link counts are not used), then the record with the shortest url is selected.
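
The content-hash rule can be sketched the same way. Again this is only an illustration under assumptions, not the real implementation: Page, PREFERENCE and dedupByHash are hypothetical names, the md5 field is assumed to be precomputed, and what happens when both the score and the url length tie is not covered by the description above (this sketch simply keeps the first one seen).

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these names come from Nutch.
final class HashDedupSketch {

    // Hypothetical page record with a precomputed content hash and score.
    static final class Page {
        final String url;
        final String md5;
        final double score;

        Page(String url, String md5, double score) {
            this.url = url;
            this.md5 = md5;
            this.score = score;
        }
    }

    // Preferred pages sort first: higher score, then shorter url.
    static final Comparator<Page> PREFERENCE =
        Comparator.comparingDouble((Page p) -> p.score).reversed()
                  .thenComparingInt(p -> p.url.length());

    // Keeps one page per content hash, chosen by PREFERENCE.
    // On a full tie the page seen first is kept (an assumption).
    static Map<String, Page> dedupByHash(List<Page> pages) {
        Map<String, Page> kept = new HashMap<>();
        for (Page p : pages) {
            kept.merge(p.md5, p, (a, b) -> PREFERENCE.compare(a, b) <= 0 ? a : b);
        }
        return kept;
    }
}
```

With all scores equal, as when no link analysis has been run, this reduces to "shortest url wins" for each group of identical content.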

Doug
