Giulio Cesare Solaroli wrote:
I have been thinking about this for a while, but could not find out a
reasonable solution.
The basic problems are:
- where do I (safely) store the index of the documents that needs to be deleted?
- how can I uniquely identify the Lucene documents that I have to
delete, given that there are different Lucene document matching a
single "real" document?

The second problem could be "easily" solved adding a kind of version
field (stored in the Lucene index) that is incremented every time a
new version of a document is inserted. In this way, when searching for
duplicated documents (using the "real" document ID) I will find a set
of Lucene documents and I could delete all but the one with the
highest version number.

You need unique document ids. They may either be produced by the fulltext-Index (example 1) or they may come from outside (example 2):

1) You could use a unique id for every doucment added to the Lucene index
(a kind of counter for the number of added documents). You have to provide
this number by yourself. It is not provided by Lucene! We are doing this
in some applications. This unique id is stored in a dedicated field and in
your database you associate this unique id with your document. If you change
your document in the database, you find the unique id there and thus you know
which document to delete in the Lucene index. If the changed document is added
to the Lucene-Index, you get a new unique id and store this one with the changed
document in your database.

2) In another application we store a url of each document in the Lucene index.
If the document underlying the url has changed, we know which document to delete
in the Lucene index simply via the url and we store the new version of the document again with a url-field.


Christoph


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to