deleting duplicate documents from my index

gekkokid Sat, 28 Jan 2006 09:00:29 -0800

Hi, I'm trying to delete duplicate documents from my index. The unique 
identifier is the document's URL (the field "url").


My initial thought on how to accomplish this is to open the index with a reader, 
sort the documents by URL, and then iterate through them, comparing each 
document with the previous one. If the URLs match, I would delete the current 
document, and so on.
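
To make the idea concrete, here is a rough sketch of the sort-and-compare pass 
in plain Java. It does not touch the index at all: DocEntry, findDuplicates and 
the sample URLs are made-up names for illustration, and I am assuming the 
(document id, url) pairs have already been read out of the index somehow. The 
actual deletes would still have to go through the reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DuplicateFinder {

    /** Hypothetical stand-in for one indexed document: its id and its "url" field. */
    static final class DocEntry {
        final int docId;
        final String url;

        DocEntry(int docId, String url) {
            this.docId = docId;
            this.url = url;
        }
    }

    /**
     * Sorts a copy of the entries by url (ties broken by document id), then
     * walks the sorted list and collects the id of every entry whose url
     * equals the url of the entry before it. The first document seen for
     * each url (the one with the lowest id) is kept.
     */
    static List<Integer> findDuplicates(List<DocEntry> entries) {
        List<DocEntry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparing((DocEntry e) -> e.url)
                              .thenComparingInt(e -> e.docId));

        List<Integer> toDelete = new ArrayList<>();
        String previousUrl = null;
        for (DocEntry e : sorted) {
            if (e.url.equals(previousUrl)) {
                toDelete.add(e.docId);   // same url as the previous entry: duplicate
            } else {
                previousUrl = e.url;     // first time this url is seen: keep it
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from a real index.
        List<DocEntry> docs = Arrays.asList(
                new DocEntry(0, "http://example.com/a"),
                new DocEntry(1, "http://example.com/b"),
                new DocEntry(2, "http://example.com/a"),
                new DocEntry(3, "http://example.com/a"));

        System.out.println(findDuplicates(docs));   // prints [2, 3]
    }
}
```

The sort is what makes this potentially expensive on a large index, which is 
why I am asking about alternatives below.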

What other methods that are not too taxing could I try?

How could I sort the documents by URL internally? What classes should I be 
looking at to do this?


Thanks,
_gk
