[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

Marvin Humphrey (JIRA) Sun, 07 Dec 2008 16:58:08 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654269#action_12654269
 ]


Marvin Humphrey commented on LUCENE-1476:
-----------------------------------------

> One approach would be to use a "segmented" model. 

That would improve the average performance of deleting a document, at the cost
of some added complexity.  Worst-case performance -- which you'd hit when you
consolidated those sub-segment deletions files -- would actually degrade a
bit.

To manage consolidation, you'd need a deletions merge policy that operated
independently from the primary merge policy.  Aside from the complexity
penalty, having two un-coordinated merge policies would be bad for real-time
search, because you want to be able to control exactly when you pay for a big
merge.

I'm also bothered by the proliferation of small deletions files.  Probably
you'd want automatic consolidation of files under 4k, but you still could end
up with a lot of files in a big index.

So... what if we wrote, merged, and removed deletions files on the same
schedule as ordinary segment files?  Instead of going back and quasi-modifying
an existing segment by associating a next-generation .del file with it, we write
deletions to a NEW segment and have them reference older segments.  

In other words, we add "tombstones" rather than "delete" documents.

Logically speaking, each tombstone segment file would consist of an array of
segment identifiers, each of which would point to a "tombstone row" array of
vbyte-encoded doc nums:

{code}
// _6.tombstone
   _2: [3, 4, 25]
   _3: [13]

// _7.tombstone
   _2: [5]

// _8.tombstone
   _1: [94]
   _2: [7, 8]
   _5: [54, 55]
{code}
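
To make the "vbyte-encoded doc nums" idea concrete, here is a minimal sketch of
encoding and decoding one tombstone row.  It assumes the usual variable-byte
layout (7 data bits per byte, low-order bits first, high bit set when more
bytes follow) and non-negative doc nums.  The class name TombstoneRowCodec and
its methods are hypothetical, not part of any patch on this issue.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: encodes one tombstone row (a list of doc nums).
class TombstoneRowCodec {

    // Write each doc num as a variable-length sequence of bytes:
    // 7 data bits per byte, high bit set when more bytes follow.
    static byte[] encode(int[] docNums) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int doc : docNums) {
            int v = doc;
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.write(v);
        }
        return out.toByteArray();
    }

    // Read doc nums back until the byte array is exhausted.
    static int[] decode(byte[] bytes) {
        List<Integer> docs = new ArrayList<>();
        int pos = 0;
        while (pos < bytes.length) {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            docs.add(value);
        }
        int[] result = new int[docs.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = docs.get(i);
        }
        return result;
    }
}
```

With this layout, doc nums below 128 cost one byte each, so the row
_2: [3, 4, 25] would take three bytes.  Delta-encoding a sorted row before
applying vbyte would likely shrink large doc nums further, but that is a guess
at a refinement and not something proposed above.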

The thing that makes this possible is that the dead docs marked by tombstones
never get their doc nums shuffled during segment merging -- they just
disappear.  If deleted docs survived to be consolidated into new segments and
acquired new doc nums, tombstones wouldn't work.  However, we can associate
tombstone rows with segment names, and they need only remain valid as long
as the segments they reference survive.

Some tombstone rows will become obsolete once the segments they reference go
away, but we never arrive at a scenario where we are forced to discard valid
tombstones.  Merging tombstone files simply involves dropping obsolete
tombstone rows and collating valid ones.
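
The merge described above can be sketched roughly as follows.  This is an
in-memory illustration only: each tombstone file is modeled as a map from
segment name to an array of doc nums, and the set of surviving segment names
is passed in.  TombstoneMergeSketch, merge(), and the liveSegments parameter
are hypothetical names chosen for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: merge several tombstone files into one.
class TombstoneMergeSketch {

    static SortedMap<String, SortedSet<Integer>> merge(
            List<Map<String, int[]>> tombstoneFiles,
            Set<String> liveSegments) {
        SortedMap<String, SortedSet<Integer>> merged = new TreeMap<>();
        for (Map<String, int[]> file : tombstoneFiles) {
            for (Map.Entry<String, int[]> row : file.entrySet()) {
                // Drop obsolete rows: the segment they reference is gone.
                if (!liveSegments.contains(row.getKey())) {
                    continue;
                }
                // Collate valid rows: one sorted, duplicate-free row per segment.
                SortedSet<Integer> docs =
                    merged.computeIfAbsent(row.getKey(), k -> new TreeSet<>());
                for (int doc : row.getValue()) {
                    docs.add(doc);
                }
            }
        }
        return merged;
    }
}
```

Taking the three example files above and supposing, purely for illustration,
that segment _1 has already been merged away, the result would hold a single
row for _2 containing 3, 4, 5, 7, 8 and 25, the rows for _3 and _5 unchanged,
and nothing for _1.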

At search time, we'd use an iterator with an internal priority queue to
collate tombstone rows into a stream -- so there's still no need to slurp the
files at IndexReader startup.
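
A minimal sketch of such an iterator, for the rows that target one segment,
might look like this.  It assumes each row is already sorted by doc num and,
for brevity, holds the rows in memory as int arrays; a real implementation
would presumably decode them lazily from the tombstone files.  The class name
TombstoneIterator and the Cursor helper are hypothetical.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Hypothetical sketch: collate sorted tombstone rows into one ascending stream.
class TombstoneIterator implements Iterator<Integer> {

    // Position within one sorted tombstone row.
    private static final class Cursor {
        final int[] row;
        int pos;
        Cursor(int[] row) { this.row = row; }
        int current() { return row[pos]; }
    }

    // The cursor with the smallest current doc num sits at the head.
    private final PriorityQueue<Cursor> queue =
        new PriorityQueue<>((a, b) -> Integer.compare(a.current(), b.current()));

    TombstoneIterator(List<int[]> sortedRows) {
        for (int[] row : sortedRows) {
            if (row.length > 0) {
                queue.add(new Cursor(row));
            }
        }
    }

    @Override
    public boolean hasNext() {
        return !queue.isEmpty();
    }

    @Override
    public Integer next() {
        if (queue.isEmpty()) {
            throw new NoSuchElementException();
        }
        int doc = queue.peek().current();
        // Advance every cursor positioned on this doc num, so a doc that is
        // tombstoned in more than one file is reported only once.
        while (!queue.isEmpty() && queue.peek().current() == doc) {
            Cursor c = queue.poll();
            if (++c.pos < c.row.length) {
                queue.add(c);
            }
        }
        return doc;
    }
}
```

Only one cursor per row is kept in the queue, so memory use grows with the
number of tombstone files that mention the segment, not with the number of
deleted docs.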

> BitVector implement DocIdSet
> ----------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

