Doug Cutting wrote:

>>From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]
>>
>>>It seems that either a) deletes should be write-through, or
>>>b) deletes should
>>>be done by the writer, or c) writer should not optimize
>>>non-RAM segments unless
>>>asked to. As a client, I like option b) the best, though,
>>>this is not the easiest option to implement. My $0.02
>>>
>>Or maybe
>>d) when merging, a writer should share an in-memory image of segment1
>>and prohibit any deletes on segment one while merge is in progress?
>>
>
>Or maybe:
>e) Deleting from a reader while an IndexWriter is open on the same index
>should throw an exception. This just requires the delete code to obtain the
>write.lock.
>
I don't think this would address the reported problem. I have not verified it, but per the bug report it seems that the IndexReader caches the deletes, so it is possible for an IndexWriter to perform optimization while an IndexReader is still holding delete information in memory (although from the application's point of view the delete was performed prior to the addition).
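To make the reported scenario concrete, here is a toy model in Java. It is only a sketch of the behaviour described in the bug report as I understand it: the classes Segment, Index, Reader and Writer below are hypothetical stand-ins and not Lucene's implementation. The one assumption that matters is that the reader buffers deletions in memory and flushes them only on close().

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: hypothetical classes, not Lucene's code.
class StaleDeleteDemo {

    /** A read-only run of documents plus the deletions flushed for it. */
    static final class Segment {
        final List<String> docs;
        final Set<Integer> deletedOnDisk = new HashSet<>();
        Segment(List<String> docs) { this.docs = docs; }
    }

    /** The shared index: just the current list of segments. */
    static final class Index {
        List<Segment> segments = new ArrayList<>();

        /** Documents in the current segments that are not marked deleted. */
        List<String> liveDocs() {
            List<String> live = new ArrayList<>();
            for (Segment s : segments)
                for (int i = 0; i < s.docs.size(); i++)
                    if (!s.deletedOnDisk.contains(i)) live.add(s.docs.get(i));
            return live;
        }
    }

    /** Assumption: deletions are buffered in memory and flushed only on close(). */
    static final class Reader {
        private final List<Segment> snapshot;
        private final List<Set<Integer>> pending = new ArrayList<>();

        Reader(Index index) {
            snapshot = new ArrayList<>(index.segments);
            for (int i = 0; i < snapshot.size(); i++) pending.add(new HashSet<>());
        }

        void delete(String doc) {
            for (int s = 0; s < snapshot.size(); s++) {
                int n = snapshot.get(s).docs.indexOf(doc);
                if (n >= 0) pending.get(s).add(n);
            }
        }

        void close() {
            for (int s = 0; s < snapshot.size(); s++)
                snapshot.get(s).deletedOnDisk.addAll(pending.get(s));
        }
    }

    /** Adds each document as a new segment; optimize() merges what is flushed. */
    static final class Writer {
        private final Index index;
        Writer(Index index) { this.index = index; }

        void add(String doc) {
            List<String> one = new ArrayList<>();
            one.add(doc);
            index.segments.add(new Segment(one));
        }

        void optimize() {
            // Only deletions already flushed by a closed reader are honoured.
            Segment merged = new Segment(index.liveDocs());
            List<Segment> next = new ArrayList<>();
            next.add(merged);
            index.segments = next;
        }
    }

    /** Delete "a", then add "b" and optimize; the flag controls when the reader closes. */
    static List<String> run(boolean closeReaderFirst) {
        Index index = new Index();
        new Writer(index).add("a");

        Reader reader = new Reader(index);
        reader.delete("a");
        if (closeReaderFirst) reader.close();

        Writer writer = new Writer(index);
        writer.add("b");
        writer.optimize();

        // Closing late flushes into a segment the index no longer references.
        if (!closeReaderFirst) reader.close();
        return index.liveDocs();
    }

    public static void main(String[] args) {
        System.out.println("reader closed first: " + run(true));  // [b]
        System.out.println("reader closed late:  " + run(false)); // [a, b]
    }
}
```

In this model the delete of "a" is silently lost whenever the reader is closed after the writer has merged, even though the application issued the delete before the add.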
If there is one "user" performing additions and deletions, then the two can be ordered. But if an application allows multiple people to initiate index updates of various kinds, it may be much harder to order additions and deletions.

>
>Deletions and additions must happen serially. In particular, the intended
>order of operations is:
>  reader.open();
>  reader.deleteDocument(...);
>  reader.close();
>  writer.open();
>  writer.addDocument(...);
>  writer.close();
>
>The bug is that this is not enforced, nor is it well documented. Let's fix
>that first. Another bug might be that IndexWriter is a misnomer: it should
>really be called something like DocumentAdder.
>
>>Personally, I would also like to see deletion moved into the writer.
>>
>
>And I'd like to see cars outlawed.
>
Cars?!?! Ok, so we'll all ride the Ginger! :)

>
>Yes, this would be a cleaner API, but it would also encourage folks to write
>less efficient index updating code. The most efficient approach is to batch
>deletions and additions separately. Intermingling them will never be as
>fast. The current API encourages one to do things this way. Also,
>currently the deletion code is very simple and easy to maintain. Optimizing
>intermingled additions and deletions would require adding a lot of new code,
>substantially complicating Lucene, and likely introducing bugs.
>
>Some background: To delete a document we need an IndexReader to find its
>document number. To add a document we just need to add a new segment,
>opening no readers. Periodically a subset of the segments are opened by a
>reader to merge them.
>
Yes, this is one of the more ingenious design ideas in Lucene, I have to say! It makes a world of difference that segments are read-only and that document additions never have to update anything - only create new files.

>
>
>If deletion were added to an IndexWriter it would need to have an
>IndexReader opened on all segments, in order to find the document number and
>mark it as deleted.
>Each time a document is added or segments are merged
>this reader must be invalidated. It would be very inefficient to re-open
>this IndexReader each time a document is deleted, so code would need to be
>added to incrementally update a SegmentsReader in light of document
>additions and merges. Such a reader could also be optimized to only open
>those files that are required for deletion. Still, intermingling inserts
>and deletes would be less efficient, since it would require the dictionaries
>for each altered segment to be re-read in order to find the document number.
>
These are all excellent points, and I had not realized most of them. I also agree that DocumentAdder would be a clearer name for the IndexWriter. And +1 on documenting the preferred operation order and enforcing it if possible.

However, there are applications where this becomes very awkward. I think the main need for doing delete + add in one operation is when replacing documents with more up-to-date copies. Wouldn't it be great if IndexWriter provided a way not simply to add a document, but to add it as a replacement for another document (yes, I know that the document numbers are unstable, but let's forget that for the moment)? What would happen is that the existing IndexReaders would continue to use the document from the older segment where it still exists, but new IndexReaders would perform an in-memory merge in which they would discover that the document is now deleted. The same thing would happen during optimization. I think this would make replacing documents very easy, atomic, and probably thread-safe.

But the problem is how to identify a document from an older segment when replacing it. Document ids are, of course, segment-specific and change during a merge. This makes searches fast, but it makes it impossible to identify a document. But what if we add a "special" field, or add a unique document id in some other way?
The searches would still use the segment-specific ids and remain fast, but there would be a unique id assigned to each document that applications could use if needed and that the replace operation in the IndexWriter could also use. Obviously, we would have to make sure that these ids can be created quickly by multiple writers without any possibility of duplicate ids. Would this work?
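A rough sketch of the replace-by-unique-id idea, again as a toy model in Java rather than anything in Lucene: UidSource, Segment, Index, replace() and openView() are all made-up names. It assumes all writers live in one JVM, so an AtomicLong is enough to hand out ids without duplicates. Writers in separate processes would need some other scheme, which this does not attempt.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Toy model only: hypothetical classes, not Lucene's code.
class ReplaceByUidSketch {

    /** Hands out ids that never repeat within one JVM, even across threads. */
    static final class UidSource {
        private final AtomicLong next = new AtomicLong();
        long newUid() { return next.getAndIncrement(); }
    }

    /** A segment: each entry pairs a stable uid with the document text. */
    static final class Segment {
        final List<Long> uids = new ArrayList<>();
        final List<String> texts = new ArrayList<>();
    }

    /** Append-only index; a replacement is a newer entry that reuses an old uid. */
    static final class Index {
        private final List<Segment> segments = new ArrayList<>();
        private final UidSource uidSource = new UidSource();

        /** Adds a new document and returns the unique id assigned to it. */
        long add(String text) {
            long uid = uidSource.newUid();
            append(uid, text);
            return uid;
        }

        /** Adds a document as the replacement for the one with this uid. */
        void replace(long uid, String text) {
            append(uid, text);
        }

        // Existing segments are never modified; we only create a new one.
        private synchronized void append(long uid, String text) {
            Segment s = new Segment();
            s.uids.add(uid);
            s.texts.add(text);
            segments.add(s);
        }

        /** What a newly opened reader would see: the newest entry per uid wins. */
        synchronized Map<Long, String> openView() {
            Map<Long, String> view = new LinkedHashMap<>();
            for (Segment s : segments)
                for (int i = 0; i < s.uids.size(); i++)
                    view.put(s.uids.get(i), s.texts.get(i));
            return view;
        }
    }
}
```

A view opened before replace() keeps returning the old text for that uid, while a view opened afterwards resolves the same uid to the newer copy, which is the behaviour described above for existing versus new IndexReaders. The sketch leaves out how a merge would drop the superseded entries.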
