Re: Index Dedupe

Erick Erickson Tue, 02 Oct 2007 06:32:30 -0700

Here's a couple of fragments, alter to suit....
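    // Needs org.apache.lucene.index.{IndexReader, Term, TermDocs, TermEnum}
    // and org.apache.lucene.store.Directory; "reader" is an IndexReader field.
    public void doRemove(Directory dir) throws Exception
    {
        this.reader = IndexReader.open(dir);
        try {
            // Position the enumeration at the first term of the unique-key
            // field ("doc_id" here, substitute your own field name).
            TermEnum theTerms = this.reader.terms(new Term("doc_id", ""));
            try {
                do {
                    Term term = theTerms.term();

                    // Stop once we run past the last term of the field.
                    if ((term == null) || ! term.field().equals("doc_id")) {
                        break;
                    }

                    // More than one doc carries this key: remove the extras.
                    if (theTerms.docFreq() > 1) {
                        this.removeDupsForTerm(term);
                    }
                } while (theTerms.next());
            } finally {
                theTerms.close();
            }
        } finally {
            // Closing the reader is what commits the deletions.
            this.reader.close();
        }
    }

    private void removeDupsForTerm(Term term) throws Exception
    {
        TermDocs td = this.reader.termDocs(term);
        try {
            // Keep the first document for the term, delete every later one.
            for (int idx = 0; td.next(); ++idx) {
                if (idx > 0) {
                    this.reader.deleteDocument(td.doc());
                }
            }
        } finally {
            td.close();
        }
    }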
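
If it helps to see the "keep the first, delete the rest" rule on its own,
here is a rough Lucene-free sketch. It is only an illustration: the
DedupeSketch class, the docsToDelete method and the TreeMap standing in for
the term dictionary are all made up for this example and are not part of the
Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the index: each unique-key value maps to the
// ids of the documents that contain it, in increasing order.
class DedupeSketch {

    // Returns the ids to delete: every document after the first for each key.
    static List<Integer> docsToDelete(Map<String, List<Integer>> postings) {
        List<Integer> toDelete = new ArrayList<>();
        for (List<Integer> docs : postings.values()) {
            // Same rule as removeDupsForTerm: keep index 0, drop the rest.
            for (int idx = 1; idx < docs.size(); ++idx) {
                toDelete.add(docs.get(idx));
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        // A sorted map, so keys are visited in term order.
        Map<String, List<Integer>> postings = new TreeMap<>();
        postings.put("A17", Arrays.asList(0, 4, 9));
        postings.put("B02", Arrays.asList(1));
        postings.put("C33", Arrays.asList(2, 5));
        System.out.println(docsToDelete(postings)); // prints [4, 9, 5]
    }
}
```

Keys that occur only once contribute nothing, which is the same shortcut the
docFreq() > 1 check takes in the fragments above.
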
On 10/2/07, Johnny R. Ruiz III <[EMAIL PROTECTED]> wrote:
>
> Hi Daniel,
>
> Thanks, but forgive my ignorance. Can you give me some sample code to
> do it? :) I have never used termDocs() before.
>
> Thanks,
> Johnny
>
> ----- Original Message ----
> From: Daniel Noll <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, October 2, 2007 12:00:07 PM
> Subject: Re: Index Dedupe
>
> On Tuesday 02 October 2007 12:25:47 Johnny R. Ruiz III wrote:
> > Hi,
> >
> > I can't seem to find a way to delete duplicates in a Lucene index.  I
> > have a unique key, so it seems to be straightforward.  But I can't find
> > a simple way to do it except for putting each record in the index into
> > a HashMap.  Is there any method in the Lucene package that I could use?
>
> I would use termDocs() to iterate all the terms in that field.  Then skip
> the first doc for each term and delete all subsequent ones.
>
> Daniel