Re: deleting duplicate documents from my index

Chris Hostetter Sun, 29 Jan 2006 21:57:30 -0800

: Hi, I'm trying to delete duplicate documents from my index; the unique
: identifier is the document's URL (aka field "url").
:
: My initial thought on how to accomplish this is to open the index via a
: reader, sort the documents by URL, and then iterate through them,
: comparing each document with the previous one. If they match, I would
: delete the current document, etc.


Assuming your "url" field is a keyword field (indexed, not tokenized), then
take a look at IndexReader.terms(Term), which returns a TermEnum. If you
start with new Term("url","") and iterate until the field is no longer
"url", you'll be iterating over every url Term in your index.  For each
one, check docFreq, and if it's more than 1 you've got a dup.

Then look at IndexReader.termDocs for an easy way to find out which docs
share that url.
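
A rough, self-contained sketch of that walk, in case it helps. It does not use
Lucene at all: the sorted TreeMap is only a stand-in for the term dictionary
and its posting lists, and the names UrlDedupSketch, Term (a toy class here,
not Lucene's), and findDuplicateDocs are made up for illustration. Keeping the
first doc for each url and flagging the rest is also just an assumption about
which copy you want to keep.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class UrlDedupSketch {

    /** Toy stand-in for a term: ordered by field name, then by text. */
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    /**
     * Walks the terms of one field in sorted order and returns the doc ids
     * that duplicate an earlier doc with the same term (first doc is kept).
     */
    static List<Integer> findDuplicateDocs(TreeMap<Term, List<Integer>> index, String field) {
        List<Integer> toDelete = new ArrayList<>();
        // Seek to the first term >= (field, ""), like starting at Term("url", "").
        for (Map.Entry<Term, List<Integer>> e : index.tailMap(new Term(field, ""), true).entrySet()) {
            if (!e.getKey().field.equals(field)) {
                break; // walked past the last term of this field
            }
            List<Integer> docs = e.getValue();
            if (docs.size() > 1) { // the "docFreq > 1" check
                // Assumption: keep the first doc, flag the others.
                toDelete.addAll(docs.subList(1, docs.size()));
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        TreeMap<Term, List<Integer>> index = new TreeMap<>();
        index.put(new Term("title", "hello"), Arrays.asList(0, 1, 2, 3));
        index.put(new Term("url", "http://example.com/a"), Arrays.asList(0, 2));
        index.put(new Term("url", "http://example.com/b"), Arrays.asList(1));
        index.put(new Term("url", "http://example.com/c"), Arrays.asList(3));
        index.put(new Term("zzz_other", "x"), Arrays.asList(0, 1));
        System.out.println(findDuplicateDocs(index, "url")); // prints [2]
    }
}
```

With a real index, the docs flagged this way would then be deleted through
the same IndexReader.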



-Hoss

