Hi, I've searched the mailing lists and documentation for a clear answer to the following question but haven't found one, so here goes:
We use Lucene to index and search a constant stream of messages: our index is always growing. In the past, when we added features to the software that required the index to be rebuilt (adopting an accent-insensitive analyzer, for instance, or adding a field to every Lucene Document), we would build an entirely new index from all the messages we had stored, and then swap the old one out for the new one.

Recently, we've had a couple of clients whose message stores are so large that this strategy is no longer viable: building a new index from scratch takes, for various reasons not related to Lucene, upwards of 48 hours, and that period will only increase as client message stores grow bigger and bigger. What I would like instead is to update the index piecemeal, starting with the most recently added documents (i.e. the most recent messages, since clients usually care about those the most). That way, most users will see the new functionality in their searches fairly quickly, and the older material, which doesn't matter so much, will get reindexed at a later date.

However, I'm unclear on the best/most performant way to accomplish this. There are a few strategies I've thought of, and I was wondering if anyone could help me decide which would be the best idea (or whether there are other, better methods I haven't thought of). I should also say that every message in the system has a unique identifier (GUID) that can be used to tell whether two different Lucene documents represent the same message.

1. Simply iterate over all messages in the message store, convert each one to a Lucene document, and call IndexWriter.updateDocument() for each (keyed on the GUID).

2. Iterate over all messages in small steps (say 1000 at a time), and for each batch first delete the existing documents from the index, then call IndexWriter.addDocument() for all messages in the batch (this is essentially strategy 1, split into small parts and with the delete and insert steps batched).

3.
Iterate over all messages in small steps, and for each batch build a separate index (let's say a RAM index), delete all the old documents from the main index, and merge the separate index into the main one.

4. Same as 3, except merge first, and then remove the old duplicates.

Any help on this issue would be much appreciated.

Thanks in advance,
Maarten

--
View this message in context: http://www.nabble.com/Best-strategy-for-reindexing-large-amount-of-data-tp25791659p25791659.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
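To make strategy 1 concrete, here is a minimal sketch against the Lucene 2.x-era API (the `Message` class and `toDocument()` conversion below are hypothetical stand-ins for the poster's own code; field names like `"guid"` are assumptions). `updateDocument()` atomically deletes any existing document matching the Term and adds the new one, so the old copy and the reindexed copy never coexist in the index:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

public class Strategy1 {

    // Hypothetical stand-in for a stored message; the real message type will differ.
    static class Message {
        final String guid, body;
        Message(String guid, String body) { this.guid = guid; this.body = body; }
    }

    // Stand-in for the existing message-to-Document conversion code.
    static Document toDocument(Message m) {
        Document doc = new Document();
        // The guid must be indexed un-analyzed so a Term query matches it exactly.
        doc.add(new Field("guid", m.guid, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", m.body, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }

    // Strategy 1: one updateDocument() call per message, newest first,
    // keyed on the guid so duplicates are replaced rather than accumulated.
    static void reindex(Directory dir, List<Message> newestFirst) throws IOException {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);
        try {
            for (Message m : newestFirst) {
                writer.updateDocument(new Term("guid", m.guid), toDocument(m));
            }
        } finally {
            writer.close();  // close() commits the pending changes
        }
    }
}
```

One consequence of this approach: because each message commits independently, a searcher reopening the index sees the new documents as soon as they land, which is exactly the "recent messages first" behavior described above.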
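Independent of which Lucene calls win out, the batching loop itself (the shape shared by strategies 2-4) can be sketched in plain Java. The `Index` interface below is a hypothetical stand-in for the real calls: `deleteByGuid` maps to `IndexWriter.deleteDocuments(Term)`, `add` to `IndexWriter.addDocument(Document)`, and `commit` to a per-batch commit or close. Messages are represented as hypothetical `{guid, content}` pairs:

```java
import java.util.List;

public class BatchedReindex {

    // Stand-in for the per-batch Lucene operations; swap in a real IndexWriter.
    interface Index {
        void deleteByGuid(String guid);
        void add(String guid, String content);
        void commit();
    }

    // Strategies 2-4 in outline: walk the store newest-first in fixed-size
    // batches; for each batch, delete the stale copies first, then insert
    // the rebuilt documents, then commit so searchers see the batch.
    static void reindex(List<String[]> messagesNewestFirst, Index index, int batchSize) {
        for (int start = 0; start < messagesNewestFirst.size(); start += batchSize) {
            int end = Math.min(start + batchSize, messagesNewestFirst.size());
            List<String[]> batch = messagesNewestFirst.subList(start, end);
            for (String[] msg : batch) {
                index.deleteByGuid(msg[0]);   // msg = {guid, content}
            }
            for (String[] msg : batch) {
                index.add(msg[0], msg[1]);
            }
            index.commit();
        }
    }
}
```

Deleting before inserting within each batch (rather than after, as in strategy 4) keeps the window in which a message is missing from the index as short as one batch, at the cost of never having duplicates rather than briefly tolerating them.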