I would solve your problem external to the index ... every time you run your incremental process, as you walk your directory tree of files (adding the new ones, deleting/re-adding the modified ones), record every file and save that list somewhere. When you are all done, compare the list from this run with the list from the last run -- any file in the old list and not in the new list is a document to be deleted.
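A rough sketch of that bookkeeping, in case it helps. Everything here is an assumption for illustration: the file names, the one-path-per-line list format, and the `delete_from_index` callback, which stands in for whatever call you use to remove a document from your index by its path. The list file should live outside the directory tree being indexed so it does not show up in its own listing.

```python
import os


def walk_files(root):
    """Collect every file under root as a path relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def load_file_list(list_path):
    """Read the list saved by the previous run (empty set on the first run)."""
    if not os.path.exists(list_path):
        return set()
    with open(list_path, encoding="utf-8") as fh:
        return {line.rstrip("\n") for line in fh if line.strip()}


def save_file_list(list_path, files):
    """Write one path per line (assumes no newlines in file names)."""
    with open(list_path, "w", encoding="utf-8") as fh:
        for path in sorted(files):
            fh.write(path + "\n")


def incremental_pass(root, list_path, delete_from_index):
    """Delete index entries for files seen last run but missing now.

    delete_from_index is a hypothetical callback taking a relative path.
    Returns the set of paths that were reported as deleted.
    """
    previous = load_file_list(list_path)
    current = walk_files(root)
    deleted = previous - current  # in the old list, not in the new list
    for path in sorted(deleted):
        delete_from_index(path)
    # Save only after the deletes went through, so a failed run is retried.
    save_file_list(list_path, current)
    return deleted
```

The only per-run cost beyond the directory walk you are already doing is one set difference, and nothing has to be read back out of the index to find the removed files.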

: Date: Tue, 8 Aug 2006 15:48:16 +0200
: From: "Leimbach, Johannes" <[EMAIL PROTECTED]>
: Reply-To: [email protected]
: To: [email protected]
: Subject: Need advice for doing incremental Index updates
:
: Hello,
:
: I need some advice regarding incremental index updates.
:
: There are three cases I need to handle when iterating over the
: source files (files that need to be indexed):
:
: 1. A file did not change since the last update
: 2. A file did change since the last update
: 3. A file was removed since the last update
:
: Case 1. is easy...
:
: Case 2. as well.. just remove the old file and add the new one
:
: Case 3. is bugging me..
:
: How can I find out if a file which is specified in the index does not
: exist anymore?
:
: The blunt solution would be to retrieve *all* file paths from the index,
: and check whether each one exists. If so - go on, if the file does not
: exist on disk, remove it from the index. The problem I have with this
: is that I am possibly pulling a lot of data from the lucene index. I
: will also do a lot of local filesystem checks. Sloooow?!
:
: Another idea I had is about introducing an "index version" integer. This
: number will be unique for each start of the parsing process. So each
: time my indexer program is started a new "index version" is created. Now
: each file which exists in the index and gets processed will have the
: "index version" number stored as a document field.
:
: This way all newly added and modified documents will have an up to date
: "index version" flag after indexing is complete.
:
: Now, to remove all physically deleted files from the index, I would
: select all documents which have an old "index version" flag stored
: inside them. Every document with such an old number can be safely
: removed.
:
: Problem with this solution is that *every* document in the index will
: get updated: First the old index version field is removed, then the new
: field is added.
:
: On the plus side, removing deleted files will be very fast.
:
: What would you recommend for keeping an incremental update?
:
: I fear the first version will be utterly slow for small updates whereas
: the second version will be a lot faster - though adding stuff is slower
: because of the additional field update for every document.
:
: Thanks for your advice,
:
: Johannes :-)

-Hoss
