Erick:
    Thanks for your suggestion. I think another solution would be keeping an 
list of keywords that could uniquely identify a document in a database, and 
search for keywords before adding a new document. As querying database is fast, 
this probaly wouldn't cost much time. But this would request maintaining a 
database while indexing. I just wondered if lucene offers an interface 
identifying duplicates. I think identifying duplicate URLs when indexing web 
would be common. 
    Best
Wishes.
----- Original Message ----- 
From: "Erick Erickson" <erickerick...@gmail.com>
To: <java-user@lucene.apache.org>
Sent: Saturday, March 07, 2009 10:58 PM
Subject: Re: Search while indexing


> First, you'll probably want to search the user list archive for this issue,
> as
> it's been discussed and you'll find more information than I can remember
> off the top of my head. That said:
> 
> 1> changes to an index are not visible until you reopen the reader. You
>     probably have to flush the writer in the meantime. And this will
>    be costly to do for every document.
> 
> 2> How do you identify duplicates? If it's a short enough signature,
>      you could consider keeping an in-memory list and check that
>      while indexing. If you needed to update your index you could
>      simply use TermEnum/TermDocs to read all the values into
>      memory before adding to it.
> 
> 3> You could consider using some kind of calculated signature of
>     the whole file for your key, but that may not suit your app.
> 
> Best
> Erick
> 
> 
> 
> On Sat, Mar 7, 2009 at 12:21 AM, sonfon <son...@gmail.com> wrote:
> 
>> Dear All,
>>    Now, I'm considering to build index for my application with lucene.
>> However, as the document sources I'm going to index has many duplications,
>> so before adding a document to an IndexWriter, I hope search in the index
>> database first to see if a same document copy has already been added. I used
>> IndexSearcher to search the same Dir while IndexWriter writing to it.
>> However, it seem that IndexSearcher returned no result though I'm sure there
>> are duplicate copies indexed already. And after the indexing procedure, I
>> can get the search results, so I'm sure I didn't write the wrong code.
>> Anyone could offer some help? Some example codes are appreciated.
>>    Best
>> Wishes.
>

Reply via email to