[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

Michael McCandless (JIRA) Sat, 07 Nov 2009 07:43:59 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774631#action_12774631
 ]


Michael McCandless commented on LUCENE-1526:
--------------------------------------------

bq. We did this in Zoie for a while, and it turned out to be a bottleneck - not
as much of a bottleneck as continually cloning a bitvector (that was even
worse), but still not good. We currently use a Bloom filter on top of an
openintset, which performs pretty fantastically: constant-time adds and
even-faster constant-time contains() checks, with small size (necessary for the
new-Reader-per-query scenario, since this requires lots of deep-cloning of this
structure).

Good, real-world feedback -- thanks!  This sounds like a compelling
approach.

So the SegmentReader still has its full BitVector, but your OpenIntSet
(what exactly is that?) plus the Bloom filter is then also checked when
you enumerate the TermDocs?  It's impressive that this is fast enough.
Do you expect this approach to be faster than the paged "copy on write"
bit vector approach?
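
To make the question concrete, here is a minimal sketch of what "a Bloom
filter on top of an open-addressed int set" could look like. This is not
Zoie's code: the class name BloomedIntSet, the two hash functions, the
0.5 load factor and the use of -1 as the empty-slot marker are all
assumptions made for illustration. The point is only that most
contains() calls for non-deleted docs can be answered from two bit tests
without probing the hash table.

```java
import java.util.Arrays;

/** Hypothetical sketch: a small Bloom filter in front of an open-addressed int set. */
final class BloomedIntSet {
    private static final int EMPTY = -1; // doc IDs are non-negative, so -1 marks a free slot
    private final long[] bloom;          // Bloom filter bits
    private final int bloomMask;         // number of Bloom bits minus one (power of two minus one)
    private int[] table = new int[16];   // open-addressed table, length is always a power of two
    private int size;

    BloomedIntSet(int bloomBitsLog2) {
        if (bloomBitsLog2 < 6 || bloomBitsLog2 > 30) {
            throw new IllegalArgumentException("bloomBitsLog2 must be in [6, 30]");
        }
        int bits = 1 << bloomBitsLog2;
        bloom = new long[bits >>> 6];
        bloomMask = bits - 1;
        Arrays.fill(table, EMPTY);
    }

    // Integer mixer (the murmur3 finalizer), used for both the filter and the table.
    private static int mix(int x) {
        x ^= x >>> 16; x *= 0x85ebca6b;
        x ^= x >>> 13; x *= 0xc2b2ae35;
        x ^= x >>> 16;
        return x;
    }

    private int h1(int docId) { return mix(docId) & bloomMask; }
    private int h2(int docId) { return mix(docId ^ 0x9e3779b9) & bloomMask; }

    void add(int docId) {
        if (docId < 0) throw new IllegalArgumentException("docId must be >= 0");
        int a = h1(docId), b = h2(docId);
        bloom[a >>> 6] |= 1L << a; // long shifts use only the low 6 bits of the count
        bloom[b >>> 6] |= 1L << b;
        if ((size + 1) * 2 > table.length) grow(); // keep the load factor at or below 0.5
        if (insert(table, docId)) size++;
    }

    boolean contains(int docId) {
        int a = h1(docId), b = h2(docId);
        if ((bloom[a >>> 6] & (1L << a)) == 0 || (bloom[b >>> 6] & (1L << b)) == 0) {
            return false; // definitely absent: no table probe needed
        }
        int mask = table.length - 1;
        for (int i = mix(docId) & mask; table[i] != EMPTY; i = (i + 1) & mask) {
            if (table[i] == docId) return true;
        }
        return false; // Bloom filter false positive
    }

    int size() { return size; }

    // Linear probing insert; returns false if the value was already present.
    private static boolean insert(int[] t, int docId) {
        int mask = t.length - 1;
        int i = mix(docId) & mask;
        while (t[i] != EMPTY) {
            if (t[i] == docId) return false;
            i = (i + 1) & mask;
        }
        t[i] = docId;
        return true;
    }

    private void grow() {
        int[] bigger = new int[table.length * 2];
        Arrays.fill(bigger, EMPTY);
        for (int v : table) {
            if (v != EMPTY) insert(bigger, v);
        }
        table = bigger;
    }
}
```

In this sketch a deep clone copies one long[] and one int[], both sized
by the number of recent deletions rather than by maxDoc, which is
presumably where the "small size" benefit comes from.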

bq. It also helped to not produce a DocIdSet iterator using these bits, but
instead override TermDocs to be returned on the disk reader, and keep track of
it directly there.

The flex API should make this possible, without overriding TermDocs
(just expose the Bits interface).

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> deleted docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
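
For comparison with the tombstone proposal above, the approach named in
the issue title can be sketched as follows. The description notes that
copy on write is costly because the whole byte array is copied; a paged
bit vector copies only the page being written. This is a rough sketch
under my own assumptions, not the attached patch: the class name
PagedBitVector, the 8192-bit page size and the cowClone() method are all
hypothetical, and the class is not thread-safe.

```java
import java.util.Arrays;

/** Hypothetical sketch of a paged copy-on-write bit vector; not Lucene's BitVector. */
final class PagedBitVector {
    private static final int PAGE_BITS = 1 << 13;          // 8192 bits (1 KB) per page, an arbitrary choice
    private static final int WORDS_PER_PAGE = PAGE_BITS >>> 6;

    private final long[][] pages;  // page contents, possibly shared with clones
    private final boolean[] owned; // true if this instance may write the page in place
    private final int numBits;

    PagedBitVector(int numBits) {
        if (numBits < 0) throw new IllegalArgumentException("numBits must be >= 0");
        this.numBits = numBits;
        int pageCount = (numBits + PAGE_BITS - 1) / PAGE_BITS;
        pages = new long[pageCount][WORDS_PER_PAGE];
        owned = new boolean[pageCount];
        Arrays.fill(owned, true);
    }

    private PagedBitVector(PagedBitVector other) {
        numBits = other.numBits;
        pages = other.pages.clone();       // copies page references only, not page contents
        owned = new boolean[pages.length]; // the clone owns nothing yet
    }

    /** Cheap clone: afterwards both instances must copy a page before writing to it. */
    PagedBitVector cowClone() {
        Arrays.fill(owned, false);
        return new PagedBitVector(this);
    }

    boolean get(int bit) {
        checkIndex(bit);
        // PAGE_BITS is a multiple of 64, so (1L << bit) selects the right bit in the word.
        return (pages[bit / PAGE_BITS][(bit % PAGE_BITS) >>> 6] & (1L << bit)) != 0;
    }

    void set(int bit) {
        checkIndex(bit);
        int p = bit / PAGE_BITS;
        if (!owned[p]) {
            pages[p] = pages[p].clone(); // copy on write: only this one page is duplicated
            owned[p] = true;
        }
        pages[p][(bit % PAGE_BITS) >>> 6] |= 1L << bit;
    }

    private void checkIndex(int bit) {
        if (bit < 0 || bit >= numBits) throw new IndexOutOfBoundsException("bit " + bit);
    }
}
```

With this layout a clone costs one reference array copy, and each later
delete costs at most one 1 KB page copy, independent of maxDoc. Whether
that beats tombstones or the Bloom filter approach in practice would
need to be measured.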

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
