[jira] Commented: (LUCENE-1526) Tombstone deletions in IndexReader

Marvin Humphrey (JIRA) Wed, 21 Jan 2009 12:20:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665955#action_12665955
 ]


Marvin Humphrey commented on LUCENE-1526:
-----------------------------------------

> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. 

For Lucene, I think the SegmentReader should lazily create an internal
structure to hold the deleted doc IDs on the first search.  It would either be
an integer array (if there are few deletions) or a BitVector (if there are
many), and it would be created by recording the output of a priority queue
merging multiple tombstone streams.

Subsequent calls would not require the priority queue, but would use an
iterator wrapper around the shared int array / BitVector.

For Lucy/KS, if we are to stick with the "cheap IndexReader" model, we'll want
to keep using the priority queue.  Thus, it will be important to keep the
deletion rate for the whole index down, in order to minimize priority queue
iteration costs.  One possibility is to have the default merge policy
automatically consolidate any segment as soon as its deletion rate climbs over
10%.  That's pretty aggressive, but it keeps the search-time deletions
iteration costs reasonably low, and it's in the spirit of "few writes, many
reads".

> A probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The merge
> policy may be set in the clone/reopen methods or on the IndexReader. 

Would it make sense to realize a new integer array / BitVector on the first
search after any new tombstone rows are added which affect the segment in
question?

> Tombstone deletions in IndexReader
> ----------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 2.9
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-1526) Tombstone deletions in IndexReader

Reply via email to