Hi, I've searched the mailing lists and documentation for a clear answer to the following question but haven't found one, so here goes:
We use Lucene to index and search a constant stream of messages: our index is always growing. In the past, when we added features to the software that required the index to be rebuilt (adopting an accent-insensitive analyzer, for instance, or adding a field to every Lucene Document), we would build an entirely new index from all the messages we had stored, and then swap the old one out for the new one.

Recently, we've had a couple of clients whose message stores are so large that this strategy is no longer viable: building a new index from scratch takes, for various reasons not related to Lucene, upwards of 48 hours, and that period will only increase as client message stores grow bigger and bigger. What I would like instead is to update the index piecemeal, starting with the most recently added documents (i.e. the most recent messages, since clients usually care about those the most). That way, most users will see the new functionality in their searches fairly quickly, and the older material, which doesn't matter so much, will get reindexed at a later date.

However, I'm unclear on the best/most performant way to accomplish this. There are a few strategies I've thought of, and I was wondering if anyone could help me decide which would be the best idea (or whether there are other, better methods I haven't thought of). I should also say that every message in the system has a unique identifier (GUID) that can be used to tell whether two different Lucene documents represent the same message.

1. Simply iterate over all messages in the message store, convert each one to a Lucene document, and call IndexWriter.updateDocument() for each (keyed on the GUID).

2. Iterate over all messages in small steps (say 1000 at a time), and for each batch first delete the existing documents from the index, then call IndexWriter.addDocument() for all messages in the batch (this is essentially strategy 1, split into small parts and with the delete and insert steps batched).

3.
Iterate over all messages in small steps, and for each batch build a separate index (let's say a RAM index), delete all the old documents from the main index, and merge the separate index into the main one.

4. Same as 3, except merge first, and then remove the old duplicates.

Any help on this issue would be much appreciated.

Thanks in advance,
Maarten

--
View this message in context: http://www.nabble.com/Best-strategy-for-reindexing-large-amount-of-data-tp25791659p25791659.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
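To make strategy 1 concrete, here is a minimal sketch against the Lucene 2.x-era API (the `Message` class and `toDocument()` conversion below are hypothetical stand-ins for the poster's own code; field names like `"guid"` are assumptions). `updateDocument()` atomically deletes any existing document matching the Term and adds the new one, so the old copy and the reindexed copy never coexist in the index:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

public class Strategy1 {

    // Hypothetical stand-in for a stored message; the real message type will differ.
    static class Message {
        final String guid, body;
        Message(String guid, String body) { this.guid = guid; this.body = body; }
    }

    // Stand-in for the existing message-to-Document conversion code.
    static Document toDocument(Message m) {
        Document doc = new Document();
        // The guid must be indexed un-analyzed so a Term query matches it exactly.
        doc.add(new Field("guid", m.guid, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", m.body, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }

    // Strategy 1: one updateDocument() call per message, newest first,
    // keyed on the guid so duplicates are replaced rather than accumulated.
    static void reindex(Directory dir, List<Message> newestFirst) throws IOException {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);
        try {
            for (Message m : newestFirst) {
                writer.updateDocument(new Term("guid", m.guid), toDocument(m));
            }
        } finally {
            writer.close();  // close() commits the pending changes
        }
    }
}
```

One consequence of this approach: because each message commits independently, a searcher reopening the index sees the new documents as soon as they land, which is exactly the "recent messages first" behavior described above.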
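Independent of which Lucene calls win out, the batching loop itself (the shape shared by strategies 2-4) can be sketched in plain Java. The `Index` interface below is a hypothetical stand-in for the real calls: `deleteByGuid` maps to `IndexWriter.deleteDocuments(Term)`, `add` to `IndexWriter.addDocument(Document)`, and `commit` to a per-batch commit or close. Messages are represented as hypothetical `{guid, content}` pairs:

```java
import java.util.List;

public class BatchedReindex {

    // Stand-in for the per-batch Lucene operations; swap in a real IndexWriter.
    interface Index {
        void deleteByGuid(String guid);
        void add(String guid, String content);
        void commit();
    }

    // Strategies 2-4 in outline: walk the store newest-first in fixed-size
    // batches; for each batch, delete the stale copies first, then insert
    // the rebuilt documents, then commit so searchers see the batch.
    static void reindex(List<String[]> messagesNewestFirst, Index index, int batchSize) {
        for (int start = 0; start < messagesNewestFirst.size(); start += batchSize) {
            int end = Math.min(start + batchSize, messagesNewestFirst.size());
            List<String[]> batch = messagesNewestFirst.subList(start, end);
            for (String[] msg : batch) {
                index.deleteByGuid(msg[0]);   // msg = {guid, content}
            }
            for (String[] msg : batch) {
                index.add(msg[0], msg[1]);
            }
            index.commit();
        }
    }
}
```

Deleting before inserting within each batch (rather than after, as in strategy 4) keeps the window in which a message is missing from the index as short as one batch, at the cost of never having duplicates rather than briefly tolerating them.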