[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654269#action_12654269 ]
Marvin Humphrey commented on LUCENE-1476:
-----------------------------------------

> One approach would be to use a "segmented" model.

That would improve the average performance of deleting a document, at the cost of some added complexity. Worst-case performance -- which you'd hit when you consolidated those sub-segment deletions files -- would actually degrade a bit.

To manage consolidation, you'd need a deletions merge policy that operated independently from the primary merge policy. Aside from the complexity penalty, having two un-coordinated merge policies would be bad for real-time search, because you want to be able to control exactly when you pay for a big merge.

I'm also bothered by the proliferation of small deletions files. Probably you'd want automatic consolidation of files under 4k, but you could still end up with a lot of files in a big index.

So... what if we wrote, merged, and removed deletions files on the same schedule as ordinary segment files? Instead of going back and quasi-modifying an existing segment by associating a next-generation .del file with it, we write deletions to a NEW segment and have them reference older segments. In other words, we add "tombstones" rather than "delete" documents.

Logically speaking, each tombstone segment file would consist of an array of segment identifiers, each of which would point to a "tombstone row" array of vbyte-encoded doc nums:

{code}
// _6.tombstone
_2: [3, 4, 25]
_3: [13]

// _7.tombstone
_2: [5]

// _8.tombstone
_1: [94]
_2: [7, 8]
_5: [54, 55]
{code}

The thing that makes this possible is that the dead docs marked by tombstones never get their doc nums shuffled during segment merging -- they just disappear. If deleted docs lived to be consolidated into new segments and acquired new doc nums, tombstones wouldn't work. However, we can associate tombstone rows with segment names, and they need only remain valid as long as the segments they reference survive. Some tombstone rows will become obsolete once the segments they reference go away, but we never arrive at a scenario where we are forced to discard valid tombstones.

Merging tombstone files simply involves dropping obsolete tombstone rows and collating valid ones.

At search time, we'd use an iterator with an internal priority queue to collate tombstone rows into a stream -- so there's still no need to slurp the files at IndexReader startup.

> BitVector implement DocIdSet
> ----------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet. This is for making SegmentReader.deletedDocs pluggable.
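
Purely as illustration of the collation idea above, here is a minimal sketch in Java. None of these names (TombstoneRow, TombstoneCollator) exist in Lucene; they are hypothetical, and the vbyte decoding is assumed to have happened already. The point is just that a priority queue keyed on each row's current doc num yields an ascending stream of deleted doc nums per target segment, and that obsolete rows can simply be dropped up front.

{code}
// A minimal sketch, NOT Lucene code: TombstoneRow and TombstoneCollator are
// hypothetical names used only to illustrate the collation described above.
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** One tombstone row: the deleted doc nums for a single target segment,
 *  already decoded from vbytes into an ascending int array. */
final class TombstoneRow {
    final String targetSegment; // e.g. "_2"
    final int[] docNums;        // ascending, e.g. {3, 4, 25}
    int pos = 0;                // cursor advanced during collation

    TombstoneRow(String targetSegment, int[] docNums) {
        this.targetSegment = targetSegment;
        this.docNums = docNums;
    }

    boolean exhausted() { return pos >= docNums.length; }
    int current()       { return docNums[pos]; }
}

/** Collates tombstone rows from many .tombstone files into one ascending
 *  stream of deleted doc nums for a single target segment. Rows whose
 *  target segment no longer exists are dropped up front -- which is also
 *  all that merging tombstone files has to do before re-collating. */
final class TombstoneCollator {
    private final PriorityQueue<TombstoneRow> queue =
        new PriorityQueue<>((a, b) -> Integer.compare(a.current(), b.current()));

    TombstoneCollator(List<TombstoneRow> rows, String targetSegment,
                      Set<String> liveSegments) {
        for (TombstoneRow row : rows) {
            if (liveSegments.contains(row.targetSegment)   // skip obsolete rows
                    && row.targetSegment.equals(targetSegment)
                    && !row.exhausted()) {
                queue.add(row);
            }
        }
    }

    /** Next deleted doc num in ascending order, or -1 when exhausted. */
    int next() {
        if (queue.isEmpty()) {
            return -1;
        }
        TombstoneRow top = queue.poll();
        int doc = top.current();
        top.pos++;
        if (!top.exhausted()) {
            queue.add(top); // re-enter the queue, keyed on its next doc num
        }
        return doc;
    }
}
{code}

A tombstone-file merge would walk the same stream and re-encode the surviving doc nums as vbytes, while a reader would consume the stream lazily, so nothing needs to be slurped into memory at IndexReader startup.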