Why go through all this effort when it's easy to make your own unique ID?
Add a new field, say "myuniqueid", to each document and fill it in yourself.
It'll never change then.
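One detail worth settling up front (my suggestion, not something Lucene requires): store the ID as a zero-padded string, so that lexicographic term order matches numeric order. A self-contained sketch; the width of 10 digits is an arbitrary assumption:

```java
// Sketch of generating stable IDs as zero-padded strings. Padding makes
// lexicographic order match numeric order, which matters later because
// Lucene enumerates terms lexicographically.
public class UniqueIds {
    // Format an ID to a fixed width so that, e.g.,
    // "0000000042" < "0000000100" both numerically and lexicographically.
    public static String pad(long id) {
        return String.format("%010d", id);
    }

    public static void main(String[] args) {
        System.out.println(pad(42)); // 0000000042
    }
}
```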


The complex coordination way.
To coordinate things, you could keep the last ID used (and maybe other
information) in a unique doc. Remember that Lucene doesn't require that
every doc has the same fields. So it's possible to keep meta-data in a
document without ever getting that document confused with your "real"
documents. You could even keep the Lucene docID <-> your new ID mapping in
this document if it would help. Imagine a doc like this...

metadata (contains some dummy string to make finding this doc easy).
lastuniqueid (contains the last document ID you assigned).
mapping (contains, say, a bunch of strings of the form ####:@@@@, where
#### is the Lucene ID and @@@@ is your ID). I suspect you don't really
need this.


Whenever you add documents to your index, read the meta-data document, use
lastuniqueid + 1 for your new document, and write the meta-data document
back.
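An untested sketch of that read/bump/write cycle against the Lucene 2.0 API. The field names follow the example above; the marker value, class name, and analyzer choice are assumptions. Note that in 2.0, deletes go through IndexReader while adds go through IndexWriter (updateDocument arrived in a later release):

```java
// Hypothetical sketch only -- requires the Lucene 2.0.0 jar on the classpath.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;

public class MetadataDoc {
    // The dummy string that makes the meta-data doc easy to find (assumed).
    static final Term MARKER = new Term("metadata", "marker");

    // Read the last assigned ID from the meta-data document, or -1 if absent.
    static long readLastId(Directory dir) throws Exception {
        IndexSearcher searcher = new IndexSearcher(dir);
        try {
            Hits hits = searcher.search(new TermQuery(MARKER));
            return hits.length() == 0 ? -1L
                    : Long.parseLong(hits.doc(0).get("lastuniqueid"));
        } finally {
            searcher.close();
        }
    }

    // Replace the meta-data document with an updated lastuniqueid.
    static void writeLastId(Directory dir, long lastId) throws Exception {
        IndexReader reader = IndexReader.open(dir);
        reader.deleteDocuments(MARKER);  // drop the old meta-data doc
        reader.close();

        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
        Document meta = new Document();
        meta.add(new Field("metadata", "marker",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        meta.add(new Field("lastuniqueid", Long.toString(lastId),
                Field.Store.YES, Field.Index.NO));
        writer.addDocument(meta);
        writer.close();
    }
}
```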


The easy coordination:
Alternatively, at the beginning of an indexing run, just use a TermEnum to
find the maximum already-used uniqueid and assign incremental IDs to new
documents. I was surprised at how very quickly you can enumerate a field in
a Lucene index. Look around at the various TermEnum classes to see if you
can easily find the largest one rather than iterating all of the terms. This
may not be a good idea if you're indexing small chunks often....
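A rough, untested sketch of that enumeration with the Lucene 2.0 TermEnum API. The field name "myuniqueid" and the numeric parsing are assumptions; this version visits every term of the field rather than seeking straight to the largest:

```java
// Hypothetical sketch only -- requires the Lucene 2.0.0 jar on the classpath.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;

public class MaxId {
    // Scan the "myuniqueid" field and return the largest ID seen, or -1.
    public static long maxUsedId(Directory dir) throws Exception {
        IndexReader reader = IndexReader.open(dir);
        // Position the enum at the first term of the field.
        TermEnum terms = reader.terms(new Term("myuniqueid", ""));
        long max = -1;
        try {
            do {
                Term t = terms.term();
                // The enum continues into later fields; stop at field's end.
                if (t == null || !"myuniqueid".equals(t.field())) break;
                max = Math.max(max, Long.parseLong(t.text()));
            } while (terms.next());
        } finally {
            terms.close();
            reader.close();
        }
        return max;
    }
}
```

Since TermEnum walks terms in lexicographic order, zero-padded IDs would let you take the last term of the field as the maximum instead of parsing each one.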

Of course, I may misunderstand your problem space enough that this is
useless. If so, please tell us the problem you're trying to solve and maybe
wiser heads than mine will have better suggestions....

Erick

On 10/17/06, Johan Stuyts <[EMAIL PROTECTED]> wrote:

Hi,

(I am using Lucene 2.0.0)

I have been looking at a way to use stable IDs with Lucene. The reason I
want this is so I can efficiently store and retrieve information outside
of Lucene for filtering search results. It looks like this is going to
require most of Lucene to be rewritten, so I gave up on that approach.

I have a new idea where I want the document IDs to only change at a
specific moment instead of whenever Lucene chooses to do so. This way the
document IDs remain stable and I can use these IDs in the external data.
I want to merge the segments of the index at a specific moment because
updating the external data to match the new document IDs is too
expensive to do continuously. At the moment that I want to merge the
segments of the index causing the document IDs to change, I can also
update my external data so the correct data is attached to the correct
Lucene document ID. If I understand correctly, merging only shifts
document IDs to remove deleted document IDs, so I can do the same
shifting with the external data by getting the set of deleted documents
before the merge.

I already set 'mergeFactor' and 'maxBufferedDocs' to very high values so
all documents of a batch will be stored in RAM. The problem I am facing
is that the IndexWriter merges the segments in RAM with the segments on
disk when I close the IndexWriter. What I need instead is that the
IndexWriter will create a new segment on disk containing the data from
the segment(s) in RAM. This way the document IDs of the existing disk
segments are not affected.

Creating a new segment instead of merging with the existing ones will
also cause lots of segments with a variable number of documents to be
created on disk, but I believe the IndexReader/IndexSearcher is able to
handle this. I only have to make sure that the number of segments does
not become too high (i.e. merge regularly) because this might cause 'too
many open files' errors.

So my questions are: is there a way to prevent the IndexWriter from
merging, forcing it to create a new segment for each indexing batch? And
if so, will it still be possible to merge the disk segments when I want
to?

Kind regards,

Johan Stuyts
Hippo
