Erick:
Thanks for your suggestion. I think another solution would be to keep a
list of keywords that uniquely identify each document in a database, and
to search it for those keywords before adding a new document. Since querying
a database is fast, this probably wouldn't cost much time, but it would
require maintaining a database while indexing. I just wondered whether Lucene
offers an interface for identifying duplicates. I would think that identifying
duplicate URLs is a common need when indexing the web.
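
As a rough illustration of the keyword-lookup idea, here is a minimal sketch
in plain Java. It uses an in-memory HashSet where a real database would go,
and the class name SeenKeys, the normalize step, and the example URLs are all
made up for the illustration; nothing in it is a Lucene API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: remembers the unique key (here a
// URL) of every document handed to the indexer so duplicates can be skipped.
class SeenKeys {
    private final Set<String> keys = new HashSet<>();

    // Assumed, deliberately simple normalization: trim whitespace and drop a
    // single trailing slash. Real URL canonicalization needs more than this.
    static String normalize(String url) {
        String s = url.trim();
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }

    // Returns true the first time a key is seen, false for a duplicate.
    boolean markIfNew(String url) {
        return keys.add(normalize(url));
    }

    int size() {
        return keys.size();
    }

    public static void main(String[] args) {
        SeenKeys seen = new SeenKeys();
        String[] urls = {
            "http://example.com/a",
            "http://example.com/a/",
            "http://example.com/b"
        };
        for (String u : urls) {
            if (seen.markIfNew(u)) {
                // This is where the document would be added to the index.
                System.out.println("index: " + u);
            } else {
                System.out.println("skip duplicate: " + u);
            }
        }
    }
}
```

The check happens before the document reaches the writer, so it does not
depend on the index reader seeing uncommitted changes.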
Best wishes.
----- Original Message -----
From: "Erick Erickson" <[email protected]>
To: <[email protected]>
Sent: Saturday, March 07, 2009 10:58 PM
Subject: Re: Search while indexing
> First, you'll probably want to search the user list archive for this
> issue, as it's been discussed and you'll find more information than I
> can remember off the top of my head. That said:
>
> 1> changes to an index are not visible until you reopen the reader. You
> probably have to flush the writer in the meantime. And this will
> be costly to do for every document.
>
> 2> How do you identify duplicates? If it's a short enough signature,
> you could consider keeping an in-memory list and check that
> while indexing. If you needed to update your index you could
> simply use TermEnum/TermDocs to read all the values into
> memory before adding to it.
>
> 3> You could consider using some kind of calculated signature of
> the whole file for your key, but that may not suit your app.
>
> Best
> Erick
>
>
>
> On Sat, Mar 7, 2009 at 12:21 AM, sonfon <[email protected]> wrote:
>
>> Dear All,
>> I'm considering building an index for my application with Lucene.
>> However, the document sources I'm going to index contain many duplicates,
>> so before adding a document to an IndexWriter, I want to search the index
>> first to see whether a copy of the same document has already been added.
>> I used an IndexSearcher to search the same directory while the IndexWriter
>> was writing to it. However, the IndexSearcher returned no results even
>> though I'm sure duplicate copies had already been indexed. Once the
>> indexing run has finished, I do get the search results, so I don't think
>> my code is wrong.
>> Could anyone offer some help? Some example code would be appreciated.
>> Best wishes.
>
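
Point 3 in the quoted reply, a calculated signature of the whole file, could
look roughly like the sketch below. SHA-256 is an assumption (the thread does
not name a hash), and the class and method names are hypothetical. The
resulting hex string is short enough to keep in an in-memory set as described
in point 2.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Lucene: derives a fixed-length signature
// from a document's raw bytes so identical files map to the same key.
class ContentSignature {

    // Returns the SHA-256 digest of the content as 64 lowercase hex characters.
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] first = "same text".getBytes(StandardCharsets.UTF_8);
        byte[] second = "same text".getBytes(StandardCharsets.UTF_8);
        // Byte-identical content always yields the same signature.
        System.out.println(sha256Hex(first).equals(sha256Hex(second)));
    }
}
```

A byte-level hash only catches exact copies; two pages that differ by a
timestamp or a whitespace change would get different signatures, which is one
reason this approach may not suit every application.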