Why go through all this effort when it's easy to make your own unique ID? Add a new field, say "myuniqueid", to each document and fill it in yourself. It'll never change then, no matter how Lucene renumbers its internal doc IDs. Something like the sketch below.
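A minimal sketch against the Lucene 2.0 API (the field name "myuniqueid", the helper name, and the zero-padding width are illustrative choices, not anything Lucene requires):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Give each document an application-controlled ID that survives merges.
    public static Document withUniqueId(Document doc, long nextId) {
        String myId = String.format("%010d", nextId);   // zero-pad so lexicographic term order matches numeric order
        doc.add(new Field("myuniqueid", myId,
                          Field.Store.YES,              // stored, so you can read it back from search hits
                          Field.Index.UN_TOKENIZED));   // indexed as a single term, not run through an analyzer
        return doc;
    }

Indexing it untokenized keeps the whole ID as a single term, which is also what makes the TermEnum approach further down work.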
The complex coordination way: to coordinate things, you could keep the last ID used (and maybe other information) in a single metadata document. Remember that Lucene doesn't require that every doc has the same fields, so it's possible to keep metadata in a document without ever getting that document confused with your "real" documents. You could even keep the Lucene docID <-> your-new-ID mapping in this document if it would help. Imagine a doc like this:

    metadata (contains some dummy string to make finding this doc easy)
    lastuniqueid (contains the last document ID you assigned)
    mapping (contains, say, a bunch of strings of the form ####:@@@@, where #### is the Lucene ID and @@@@ is your ID)

I suspect you don't really need this. Whenever you add documents to your index, read the metadata document, use lastuniqueid + 1 for your new document, and write the metadata document back.

The easy coordination way: alternatively, at the beginning of an indexing run, just use a TermEnum to find the maximum already-used uniqueid and assign incremental IDs to new documents (see the sketch below). I was surprised at how quickly you can enumerate a field in a Lucene index. Look around at the various TermEnum classes to see whether you can easily find the largest term rather than iterating over all of them. This may not be a good idea if you're indexing small chunks often, though.

Of course, I may misunderstand your problem space enough that this is useless. If so, please tell us the problem you're trying to solve and maybe wiser heads than mine will have better suggestions....
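In case it helps, here's roughly what I mean by the enumeration (untested, a minimal sketch against the 2.0 API; it assumes the IDs live in an untokenized "uniqueid" field and were zero-padded when indexed, so the last term in sorted order is also the numerically largest):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    // Walk the terms of the "uniqueid" field in sorted order and remember the last one seen.
    IndexReader reader = IndexReader.open("/path/to/index");    // path is illustrative
    TermEnum terms = reader.terms(new Term("uniqueid", ""));    // positioned at the first "uniqueid" term
    String maxId = null;
    try {
        while (terms.term() != null && "uniqueid".equals(terms.term().field())) {
            maxId = terms.term().text();                        // last assignment wins: the largest term
            if (!terms.next()) {
                break;
            }
        }
    } finally {
        terms.close();
        reader.close();
    }
    long nextId = (maxId == null) ? 0 : Long.parseLong(maxId) + 1;

If the IDs weren't zero-padded you'd have to compare them numerically inside the loop instead of just keeping the last term.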
Erick

On 10/17/06, Johan Stuyts <[EMAIL PROTECTED]> wrote:

Hi,

(I am using Lucene 2.0.0)

I have been looking at a way to use stable IDs with Lucene. The reason I want this is so I can efficiently store and retrieve information outside of Lucene for filtering search results. It looks like this would require most of Lucene to be rewritten, so I gave up on that approach.

I have a new idea: I want the document IDs to change only at a specific moment instead of whenever Lucene chooses to do so. This way the document IDs remain stable and I can use them in the external data. I want to merge the segments of the index at a specific moment because updating the external data to match the new document IDs is too expensive to do continuously. At the moment that I merge the segments of the index, causing the document IDs to change, I can also update my external data so the correct data stays attached to the correct Lucene document ID. If I understand correctly, merging only shifts document IDs to close the gaps left by deleted documents, so I can do the same shifting with the external data by getting the set of deleted documents before the merge.

I have already set 'mergeFactor' and 'maxBufferedDocs' to very high values so all documents of a batch will be stored in RAM. The problem I am facing is that the IndexWriter merges the segments in RAM with the segments on disk when I close the IndexWriter. What I need instead is for the IndexWriter to create a new segment on disk containing the data from the segment(s) in RAM. This way the document IDs of the existing disk segments are not affected. Creating a new segment instead of merging with the existing ones will also cause lots of segments with a variable number of documents to be created on disk, but I believe the IndexReader/IndexSearcher is able to handle this. I only have to make sure that the number of segments does not become too high (i.e. merge regularly), because that might cause 'too many open files' errors.

So my questions are: is there a way to prevent the IndexWriter from merging, forcing it to create a new segment for each indexing batch? And if so, will it still be possible to merge the disk segments when I want to?

Kind regards,

Johan Stuyts
Hippo
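For reference, the mergeFactor/maxBufferedDocs setup described above looks roughly like this with the 2.0 API (the index path, analyzer, and the value 1000000 are illustrative assumptions, not recommendations):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // Raise the thresholds so a whole batch stays buffered in the RAM segment(s).
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.setMergeFactor(1000000);      // keep existing disk segments from being merged during the batch
    writer.setMaxBufferedDocs(1000000);  // keep all documents of the batch buffered in RAM
    // ... addDocument() calls for the batch ...
    writer.close();  // this is where, as described above, the RAM segments get merged with the segments already on disk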