[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776296#action_12776296 ]

Jason Rutherglen commented on LUCENE-1526:
------------------------------------------

bq. 300 documents a second

Whoa, pretty insane volume. 

bq. how many of these BitVectors are you going to end up making? 

A handful, by pooling the BitVector's fixed-size byte arrays (see
LUCENE-1574). I'm not sure whether synchronization on the pool
will matter. If it does, we can use a ConcurrentHashMap, as
Solr's LRUCache does. Granted, JVMs are supposed to handle
rapid allocation efficiently; even so, I can't see the overhead
of pooling being too significant. If it is, there's always the
default of allocating new BitVectors.
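
For illustration, a minimal sketch of what such a pool might look like (class and method names here are hypothetical; LUCENE-1574 tracks the real work):

{code:java}
import java.util.ArrayDeque;

// Hypothetical pool of fixed-size byte[] arrays backing BitVectors.
// One pool per segment size, since every array must be the same length.
public class ByteArrayPool {
  private final int arrayLength;
  private final ArrayDeque<byte[]> free = new ArrayDeque<byte[]>();

  public ByteArrayPool(int arrayLength) {
    this.arrayLength = arrayLength;
  }

  // Synchronized handout: falls back to plain allocation when the
  // pool is empty, so the worst case is what we'd do anyway.
  public synchronized byte[] acquire() {
    byte[] bytes = free.poll();
    return bytes != null ? bytes : new byte[arrayLength];
  }

  // Return an array once no reader references its BitVector anymore.
  public synchronized void release(byte[] bytes) {
    if (bytes.length == arrayLength) {
      free.push(bytes);
    }
  }
}
{code}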

I really need a solution that absolutely will not affect query
performance from what it is today. Personally, pooling is the
safest route for me to use in production, since there are no
worries about slowing down queries with alternative deleted-docs
mechanisms, and the memory allocation is kept within scope. The
overhead is a System.arraycopy, which will no doubt be
insignificant for my use case.

http://java.sun.com/performance/reference/whitepapers/6_performance.html#2.1.5
http://www.javapractices.com/topic/TopicAction.do?Id=3
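
The copy-on-write step itself then reduces to a single arraycopy into a pooled array, roughly like this (again hypothetical names, building on the pool sketch above):

{code:java}
// Hypothetical copy-on-write step: grab a pooled array and copy the
// current deleted-docs bytes into it. Because the arraycopy fills the
// whole array, stale bits from a previous use need no clearing.
public static byte[] copyDeletedDocs(byte[] current, ByteArrayPool pool) {
  byte[] copy = pool.acquire();
  System.arraycopy(current, 0, copy, 0, current.length);
  // The cloned reader mutates the copy; the old reader's bits stay intact.
  return copy;
}
{code}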

I suppose that if one has fairly simple queries and is willing
to sacrifice query performance for update rate, then other
deleted-docs mechanisms may be a desirable solution. I need to
take a more conservative approach.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> a copy-on-write of the BitVector can become costly because the
> entire underlying byte array must be created and copied. A way to
> make this clone/delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
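
To make the tombstone idea above concrete, here is a minimal, hypothetical sketch of the combined deleted-docs test the description refers to, with java.util.BitSet standing in for Lucene's BitVector and a sorted int[] standing in for the tombstone DocIdSet:

{code:java}
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical wrapper: a doc is deleted if either the base bit set
// (deletions shared with the parent reader) or the tombstone set
// (new deletions since the last reopen) marks it. No copy of the
// base bit set is needed until tombstones are merged into it.
public class CombinedDeletedDocs {
  private final BitSet base;       // stands in for the segment's BitVector
  private final int[] tombstones;  // new deletions as sorted docIDs

  public CombinedDeletedDocs(BitSet base, int[] tombstones) {
    this.base = base;
    this.tombstones = tombstones;
  }

  public boolean isDeleted(int doc) {
    return base.get(doc) || Arrays.binarySearch(tombstones, doc) >= 0;
  }
}
{code}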
