[jira] Commented: (LUCENE-1526) Tombstone deletions in IndexReader

Michael McCandless (JIRA) Wed, 21 Jan 2009 13:30:24 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665969#action_12665969
 ]


Michael McCandless commented on LUCENE-1526:
--------------------------------------------


{quote}
For Lucene, I think the SegmentReader should lazily create an internal
structure to hold the deleted doc IDs on the first search.
{quote}

This is basically doing copy-on-write, which we want to avoid for
realtime search.  But as long as this is a sparse structure (a sorted
list of deleted docIDs, assuming not many deletes accumulate in RAM)
it should be OK.
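
To make the sparse structure concrete, here is a minimal sketch of a
sorted list of deleted docIDs.  The class name SortedDeletes and its
methods are hypothetical, for illustration only; they are not existing
Lucene API.

```java
import java.util.Arrays;

/** Hypothetical sparse holder for deleted docIDs; not part of Lucene. */
final class SortedDeletes {
    // Deleted docIDs in ascending order; only the first 'size' slots are used.
    private int[] docs = new int[4];
    private int size = 0;

    /** Records a deletion. Returns false if the doc was already deleted. */
    boolean delete(int docID) {
        int pos = Arrays.binarySearch(docs, 0, size, docID);
        if (pos >= 0) {
            return false; // already present
        }
        int insertAt = -(pos + 1); // binarySearch encodes the insertion point
        if (size == docs.length) {
            docs = Arrays.copyOf(docs, size * 2);
        }
        // Shift the tail right by one slot to keep the array sorted.
        System.arraycopy(docs, insertAt, docs, insertAt + 1, size - insertAt);
        docs[insertAt] = docID;
        size++;
        return true;
    }

    /** O(log n) membership test, n = number of deletes held in RAM. */
    boolean isDeleted(int docID) {
        return Arrays.binarySearch(docs, 0, size, docID) >= 0;
    }

    int count() {
        return size;
    }
}
```

Memory here is proportional to the number of deletes rather than to
maxDoc, which is the point of keeping it sparse; the cost is an O(n)
insert, which should only be acceptable while few deletes accumulate.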

I also think for Lucene we could leave the index format unchanged
(which means commit() is still more costly than it need be, but I'm
not sure that's too serious), and use tombstones/list-of-sorted-docIDs
representation only in RAM.

For realtime search, I think we can accept some slowdown of search
performance in exchange for very low latency turnaround when
adding/deleting docs.

But I think these decisions (the approach we take here) are very much
dependent on what we learn from the performance tests in
LUCENE-1476.


> Tombstone deletions in IndexReader
> ----------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 2.9
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> deleted docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
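
The merge policy described above could be sketched roughly as follows.
Everything here is an assumption for illustration: java.util.BitSet
stands in for Lucene's BitVector, and the class name TombstoneDeletes
and the maxTombstoneRatio knob are hypothetical.  In this sketch a doc
counts as deleted if either structure marks it.

```java
import java.util.BitSet;
import java.util.TreeSet;

/** Hypothetical sketch of tombstones plus a merge policy; not Lucene API. */
final class TombstoneDeletes {
    private BitSet merged;                 // stand-in for the BitVector
    private final TreeSet<Integer> tombstones = new TreeSet<>();
    private final int maxDoc;
    private final double maxTombstoneRatio; // hypothetical policy setting

    TombstoneDeletes(int maxDoc, double maxTombstoneRatio) {
        this.maxDoc = maxDoc;
        this.maxTombstoneRatio = maxTombstoneRatio;
        this.merged = new BitSet(maxDoc);
    }

    /** Records a delete as a tombstone; merges once the policy trips. */
    void delete(int docID) {
        if (merged.get(docID)) {
            return; // already deleted in the merged bits
        }
        tombstones.add(docID);
        // Policy: merge when tombstones exceed a fraction of all docs.
        if (tombstones.size() > maxTombstoneRatio * maxDoc) {
            mergeTombstones();
        }
    }

    boolean isDeleted(int docID) {
        return merged.get(docID) || tombstones.contains(docID);
    }

    int pendingTombstones() {
        return tombstones.size();
    }

    /** The one full copy: fold all tombstones into a fresh bit set. */
    private void mergeTombstones() {
        BitSet next = (BitSet) merged.clone();
        for (int docID : tombstones) {
            next.set(docID);
        }
        merged = next;
        tombstones.clear();
    }
}
```

The full copy of the bit set happens only when the policy trips, not on
every delete, which is the cost the tombstones are meant to defer.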
