Hello,
I need some advice regarding incremental index updates. There are three cases I need to handle when iterating over the source files (the files that need to be indexed):

1. A file did not change since the last update
2. A file did change since the last update
3. A file was removed since the last update

Case 1 is easy. Case 2 as well: just remove the old file from the index and add the new one. Case 3 is bugging me. How can I find out whether a file that is listed in the index no longer exists?

The blunt solution would be to retrieve *all* file paths from the index and check whether each one exists. If it does, go on; if the file does not exist on disk, remove it from the index. The problem I have with this is that I am possibly pulling a lot of data from the Lucene index. I will also do a lot of local filesystem checks. Sloooow?!

Another idea I had is to introduce an "index version" integer. This number is unique for each run of the indexer, so each time my indexer program is started a new "index version" is created. Each file which exists in the index and gets processed will then have the current "index version" number stored as a document field. This way all newly added, modified and unchanged-but-still-present documents will carry an up-to-date "index version" after indexing is complete. To remove all physically deleted files from the index, I would select all documents which still have an old "index version" stored inside them. Every document with such an old number can be safely removed.

The problem with this solution is that *every* document in the index gets updated: first the old index version field is removed, then the new field is added. On the plus side, removing deleted files will be very fast.

What would you recommend for keeping the index incrementally up to date? I fear the first approach will be utterly slow for small updates, whereas the second will be a lot faster at finding deleted files, though indexing itself gets slower because of the additional field update for every document.
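To make the two ideas concrete, here is a rough sketch in Python. It is not Lucene code: a plain dict (path -> stored fields) stands in for the index, and the names `prune_by_existence`, `update_with_version`, `mtime` and `version` are made up for illustration. Change detection by modification time is also just an assumption.

```python
import os

def prune_by_existence(index):
    """Approach 1: pull every indexed path and check whether it is still on disk.

    `index` is a hypothetical stand-in for the real index: a dict mapping
    file path -> dict of stored fields.
    """
    removed = [path for path in index if not os.path.exists(path)]
    for path in removed:
        del index[path]
    return removed

def update_with_version(index, source_files, run_version):
    """Approach 2: stamp every file seen in this run, then drop stale stamps.

    `run_version` must differ from the version used by any earlier run.
    """
    for path in source_files:
        mtime = os.path.getmtime(path)
        doc = index.get(path)
        if doc is None or doc["mtime"] != mtime:
            # Case 2 (or a new file): replace the old entry with a fresh one.
            index[path] = {"mtime": mtime, "version": run_version}
        else:
            # Case 1: content unchanged, but the entry is still touched
            # so that it carries the current run's version.
            doc["version"] = run_version

    # Case 3: anything not stamped in this run was not seen on disk.
    stale = [p for p, doc in index.items() if doc["version"] != run_version]
    for path in stale:
        del index[path]
    return stale
```

The `else` branch is the cost I am worried about: in the dict it is a cheap assignment, but in the real index it means touching every unchanged document on every run. The first function avoids that, at the price of one filesystem check per indexed path.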
Thanks for your advice, Johannes :-)
