[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775151#action_12775151 ]

Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. But how many msec does this clone add in practice?  Note that it's only
done if there is a new deletion against that segment.  I do agree it's silly
wasteful, but searching should then be faster than using AcceleratedIntSet
or MultiBitSet.  It's a tradeoff of turnaround time for search perf.

I don't know for sure whether the clone is the majority of the time, as I 
haven't run either the AcceleratedIntSet path or 2.9 NRT through a profiler. 
But if you're indexing at high speed (which is what our load/perf tests do), 
you're going to be cloning these things hundreds of times per second (look at 
the indexing throughput we're forcing the system through), and even if each 
clone is fast, that adds up.
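For context on the clone cost: a straight copy-on-write deleted-docs bit vector has to copy its entire backing array for every batch of new deletions. A minimal sketch of that pattern (hypothetical class, not Lucene's actual BitVector):

```java
// Minimal sketch of a copy-on-write deleted-docs bit vector.
// Every clone copies the whole backing array, so cloning per deletion
// batch costs O(maxDoc/8) bytes of allocation and copying each time.
final class CowBitVector {
    private final byte[] bits;   // one bit per docID

    CowBitVector(int maxDoc) {
        this.bits = new byte[(maxDoc + 7) >> 3];
    }

    private CowBitVector(byte[] bits) {
        this.bits = bits;
    }

    boolean get(int docId) {
        return (bits[docId >> 3] & (1 << (docId & 7))) != 0;
    }

    // Copy-on-write: returns a new vector with docId marked deleted,
    // leaving this instance untouched for readers still using it.
    CowBitVector withDelete(int docId) {
        byte[] copy = bits.clone();            // full-array copy: the costly part
        copy[docId >> 3] |= (1 << (docId & 7));
        return new CowBitVector(copy);
    }

    int sizeInBytes() {
        return bits.length;
    }
}
```

At hundreds of clones per second on a multi-million-doc segment, that full-array copy is exactly the allocation pressure being discussed.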

bq. I'd love to see how the worst-case queries (matching millions of hits)
perform with each of these three options.

It's pretty easy to change the index and query files in our test to do that; 
that's a good idea.  Feel free to check out our load-testing framework too - 
it lets you tweak various parameters, monitor the whole thing via JMX, and so 
forth, both for the full zoie-based stack and where the zoie API is wrapped 
purely around Lucene 2.9 NRT.  The setup instructions are on the zoie wiki.

bq. When a doc needs to be updated, you index it immediately into the
RAMDir, and reopen the RAMDir's IndexReader. You add its UID to the
AcceleratedIntSet, and all searches are "and NOT"'d against that set. You
don't tell Lucene to delete the old doc, yet.

Yep, basically.  The IntSetAccellerator (of UIDs) is set on the (long-lived) 
IndexReader for the disk index - this is why it's done as a ThreadLocal: 
everybody shares that IndexReader, but different threads have different 
point-in-time views of how much of it has been deleted.
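The scheme described above could be sketched roughly like this (the names are hypothetical, not zoie's actual API):

```java
import java.util.HashSet;
import java.util.Set;

// Rough sketch of the zoie-style scheme described above: one shared,
// long-lived disk reader, with each search thread carrying its own
// point-in-time set of UIDs deleted/updated since the last flush.
// Names are hypothetical, not zoie's real classes.
final class SharedDiskReader {
    // Per-thread snapshot of "deleted since last flush" UIDs.
    private final ThreadLocal<Set<Long>> modSet =
            ThreadLocal.withInitial(HashSet::new);

    // Called when a thread takes a new point-in-time view.
    void setModifiedSet(Set<Long> uids) {
        modSet.set(uids);
    }

    // Search-time filter: a hit survives only if its UID is NOT in
    // this thread's mod set (the "and NOT" against the set).
    boolean isLive(long uid) {
        return !modSet.get().contains(uid);
    }
}
```

The point of the ThreadLocal is that the shared reader itself is never mutated; each thread only consults its own snapshot.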

bq. These are great results! If I'm reading them right, it looks like
generally you get faster query throughput, and roughly equal indexing
throughput, on upgrading from 2.4 to 2.9?

That's about right.  Of course, the comparison of zoie (with either 2.4 or 
2.9) against Lucene 2.9 NRT is the important one to look at: zoie is pushing 
about 7-9x better throughput than NRT for both queries and indexing.

I'm sure the performance numbers would change if we relaxed the real-time 
requirement - yes, that's one of the many dimensions to consider here (along 
with the percentage of indexing events that are deletes, how many of those 
hit really old segments vs. newer ones, how big the queries are, etc.).

bq. One optimization you could make with Zoie is, if a real-time deletion
(from the AcceleratedIntSet) is in fact hit, it could mark the
corresponding docID, to make subsequent searches a bit faster (and
save the bg CPU when flushing the deletes to Lucene).

That sounds interesting - how would that work?  We don't really touch the 
disk IndexReader, other than to set this modSet on it in the ThreadLocal; 
where would this mark live?
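One way the suggested optimization might work, purely as a guess (hypothetical names, not an actual zoie or Lucene API):

```java
import java.util.BitSet;
import java.util.Set;

// Speculative sketch of the suggested optimization: when a search hit's
// UID turns out to be in the real-time delete set, remember its segment
// docID so later searches do a cheap bit test instead of a set lookup.
// All names here are hypothetical.
final class MemoizingDeleteFilter {
    private final Set<Long> deletedUids;   // real-time UID delete set
    private final BitSet markedDocs;       // docIDs already seen deleted

    MemoizingDeleteFilter(Set<Long> deletedUids, int maxDoc) {
        this.deletedUids = deletedUids;
        this.markedDocs = new BitSet(maxDoc);
    }

    // Returns true if docId is deleted; marks it on first discovery so
    // subsequent searches skip the UID set lookup for that doc.
    boolean isDeleted(int docId, long uid) {
        if (markedDocs.get(docId)) {
            return true;
        }
        if (deletedUids.contains(uid)) {
            markedDocs.set(docId);
            return true;
        }
        return false;
    }
}
```

The open question from the comment still stands: where such a mark would live, given that the shared disk reader itself is never mutated.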


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
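For illustration, the tombstone scheme in the description above might be sketched like this (hypothetical names; not the patch's actual implementation):

```java
import java.util.Arrays;

// Sketch of the tombstone idea: the base bit vector holds older
// deletions, while new deletions accumulate in a sorted int array
// ("tombstones") instead of cloning the vector. A doc is deleted if it
// is marked in either structure. When the tombstone array grows past a
// threshold, it is merged into a fresh bit vector - the threshold here
// stands in for the proposed tombstone merge policy.
final class TombstonedDeletes {
    private final byte[] baseBits;     // older deletions, one bit per doc
    private final int[] tombstones;    // sorted docIDs deleted since last merge

    TombstonedDeletes(byte[] baseBits, int[] tombstones) {
        this.baseBits = baseBits;
        this.tombstones = tombstones;
    }

    boolean isDeleted(int docId) {
        if ((baseBits[docId >> 3] & (1 << (docId & 7))) != 0) {
            return true;
        }
        return Arrays.binarySearch(tombstones, docId) >= 0;
    }

    // Merge tombstones into a new bit vector once there are "too many".
    TombstonedDeletes maybeMerge(int maxTombstones) {
        if (tombstones.length <= maxTombstones) {
            return this;
        }
        byte[] merged = baseBits.clone();
        for (int doc : tombstones) {
            merged[doc >> 3] |= (1 << (doc & 7));
        }
        return new TombstonedDeletes(merged, new int[0]);
    }
}
```

This keeps reopen cheap (append an int instead of cloning the array) at the cost of a binary search per deleted-docs check until the next merge.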

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

