Another solution that works well in some applications is to rely on the document number. A document's number stays the same for the life of an IndexReader, and documents added later always get larger numbers. So, given two documents with the same ID, the one with the highest document number is the latest, and the rest can be deleted. An easy way to store the list of documents to delete is a filter (which could also be serialized to disk if needed). Such a filter is only valid for the IndexReader used to create it.
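To make the "highest document number wins" rule concrete, here is a minimal sketch in plain Java. It does not call Lucene at all: the array index stands in for the document number, and the class and method names (SupersededDocs, findSuperseded) are made up for illustration. It only shows how a bitset of superseded document numbers could be built in one pass.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; none of these names come from Lucene.
class SupersededDocs {
    // idsByDocNumber[n] is the application-level ID of the document
    // whose document number is n.
    static BitSet findSuperseded(String[] idsByDocNumber) {
        BitSet superseded = new BitSet(idsByDocNumber.length);
        // Highest document number seen so far for each ID.
        Map<String, Integer> latest = new HashMap<>();
        for (int docNum = 0; docNum < idsByDocNumber.length; docNum++) {
            // put() returns the previously recorded document number, if any;
            // that older document has just been superseded.
            Integer older = latest.put(idsByDocNumber[docNum], docNum);
            if (older != null) {
                superseded.set(older);
            }
        }
        return superseded;
    }
}
```

For the IDs a, b, a, c, b, a (document numbers 0 through 5), the bits set are 0, 1 and 2: every copy of a and b except the last one.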

So here's a modified sequence of operations, perhaps a bit more efficient than the one proposed by Christoph:
1) Open an IndexReader for searching - S. Keep it open until the transaction is committed.
2) Open a second IndexReader for deletions - D.
3) Create a filter bitset F (or use any other mechanism for storing the document numbers to be deleted).
4) Open an IndexWriter for new documents - W.
5) As documents come in, add them using W. Find their old versions in D and record their document numbers in F. D will not show any new documents, only documents present at the time D was created.
6) Close W.
7) Use D to delete all documents marked in F.
8) Close D.


Step 8 commits the transaction. At this point, another IndexReader S2 can be created and all new searches can go to that. Once all searches using S are done, S can be closed.
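The sequence above can be modelled with a toy in-memory index to check the bookkeeping. This is a sketch under loud assumptions: ToyIndex, applyBatch and liveDocNumbers are hypothetical names, list positions play the role of document numbers, and nothing here is the Lucene API. The "reader D" is simply the size of the list at the moment the batch starts.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy in-memory stand-in for an index; this is not the Lucene API.
class ToyIndex {
    // Position in this list plays the role of the document number.
    final List<String> ids = new ArrayList<>();
    // Document numbers deleted by committed transactions.
    final BitSet deleted = new BitSet();

    // Runs steps 2 through 8 for one batch of incoming document IDs.
    void applyBatch(List<String> incomingIds) {
        int visibleToD = ids.size();        // D only sees documents present now
        BitSet f = new BitSet(visibleToD);  // F: old versions to delete
        for (String id : incomingIds) {
            ids.add(id);                    // step 5: add the new version via W
            for (int docNum = 0; docNum < visibleToD; docNum++) {
                if (!deleted.get(docNum) && ids.get(docNum).equals(id)) {
                    f.set(docNum);          // step 5: record the old version in F
                }
            }
        }
        deleted.or(f);                      // steps 7 and 8: delete and commit
    }

    // Document numbers of the live documents carrying the given ID.
    List<Integer> liveDocNumbers(String id) {
        List<Integer> live = new ArrayList<>();
        for (int docNum = 0; docNum < ids.size(); docNum++) {
            if (!deleted.get(docNum) && ids.get(docNum).equals(id)) {
                live.add(docNum);
            }
        }
        return live;
    }
}
```

In this model, a batch that contains the same ID twice leaves both new copies live, because D cannot see documents added during the batch. A real implementation would probably have to detect duplicates within a batch separately.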

Would this work? I think it might. Does anyone see any holes in this? It could even allow multiple Ws to be used concurrently, and perhaps multiple machines writing to the same index, but I'm not sure that is desirable.

Yeah, this would be a great thing to have available in Lucene...
Dmitry.


Christoph Goller wrote:

Giulio Cesare Solaroli wrote:

I have been thinking about this for a while, but could not find a
reasonable solution.
The basic problems are:
- where do I (safely) store the index of the documents that need to be deleted?
- how can I uniquely identify the Lucene documents that I have to
delete, given that there are different Lucene documents matching a
single "real" document?


The second problem could be "easily" solved by adding a kind of version
field (stored in the Lucene index) that is incremented every time a
new version of a document is inserted. In this way, when searching for
duplicate documents (using the "real" document ID) I will find a set
of Lucene documents and I could delete all but the one with the
highest version number.


You need unique document ids. They may either be produced by the
full-text index (example 1) or they may come from outside (example 2):

1) You could use a unique id for every document added to the Lucene index
(a kind of counter for the number of added documents). You have to provide
this number yourself. It is not provided by Lucene! We are doing this
in some applications. This unique id is stored in a dedicated field and in
your database you associate this unique id with your document. If you change
your document in the database, you find the unique id there and thus you know
which document to delete in the Lucene index. When the changed document is added
to the Lucene index, you get a new unique id and store this one with the changed
document in your database.


2) In another application we store the URL of each document in the Lucene index.
If the document underlying the URL has changed, we know which document to delete
in the Lucene index simply via the URL, and we store the new version of the
document again with a URL field.
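As a rough illustration of example 1, the bookkeeping on the database side might look like the sketch below. Everything in it is assumed for illustration: UniqueIdRegistry, reassign and currentId are hypothetical names, and a HashMap stands in for the database column that associates a record with its unique id. The actual delete and add calls against the index are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the map stands in for the database association.
class UniqueIdRegistry {
    // Counter of documents added so far; supplied by the application, not by Lucene.
    private long nextId = 0;
    private final Map<String, Long> idByRecord = new HashMap<>();

    // Assigns a fresh unique id to the record and returns the id it had
    // before (the document to delete from the index), or null if the
    // record was never indexed.
    Long reassign(String recordKey) {
        long fresh = nextId++;
        return idByRecord.put(recordKey, fresh);
    }

    // Unique id currently associated with the record, or null if none.
    Long currentId(String recordKey) {
        return idByRecord.get(recordKey);
    }
}
```

Example 2 needs no such counter: the URL is already unique, so it serves as the lookup key directly.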


Christoph


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



