[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

Michael McCandless (JIRA) Sat, 14 Nov 2009 03:06:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777881#action_12777881
 ]


Michael McCandless commented on LUCENE-1526:
--------------------------------------------

bq. One of the nice things that we can do in Zoie by using this kind of 
index-latency backoff is that, because we have an in-memory two-way mapping of 
zoie-specific UID to docId, if we actually have time (in the background, since 
we're caching these readers now) we can zip through and update the real delete 
BitVectors on the segments and lose the extra check at query time, only using 
that check if the index-latency time is set below some threshold (determined by 
how long it takes the system to do this resolution - mapping docId to UID is an 
array lookup, the reverse is a little slower).

Right -- I think such a hybrid approach would have the best tradeoffs
of all.  You'd get insanely fast reopen, and then searching would only
take the performance hit until the BG resolution of deleted UID ->
Lucene docID completed.  Similar to the JRE's BG hotspot compiler.
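
As a rough illustration of the hybrid idea, here is a minimal Java sketch. It
is not Zoie's or Lucene's actual code: the class HybridDeletes and all of its
members are hypothetical names, and java.util.BitSet stands in for Lucene's
BitVector. Deletes arrive as UIDs and are checked per hit at query time until
a background pass resolves them into the docID bit set.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of UID deletes that are resolved to docIDs later. */
final class HybridDeletes {
    // docID -> UID is a plain array lookup.
    private final long[] docIdToUid;
    // UID -> docID is the slower reverse mapping.
    private final Map<Long, Integer> uidToDocId = new HashMap<>();
    // Deleted UIDs that have not yet been resolved to docIDs.
    private final Set<Long> pendingUids = new HashSet<>();
    // Deletes that have already been resolved to docIDs.
    private final BitSet deletedDocs = new BitSet();

    HybridDeletes(long[] docIdToUid) {
        this.docIdToUid = docIdToUid.clone();
        for (int docId = 0; docId < this.docIdToUid.length; docId++) {
            uidToDocId.put(this.docIdToUid[docId], docId);
        }
    }

    /** Fast path for a reopen: just remember the UID. */
    synchronized void deleteByUid(long uid) {
        pendingUids.add(uid);
    }

    /** Query-time check; the UID lookup is only paid while deletes are pending. */
    synchronized boolean isDeleted(int docId) {
        if (deletedDocs.get(docId)) {
            return true;
        }
        return !pendingUids.isEmpty() && pendingUids.contains(docIdToUid[docId]);
    }

    /** Background pass: fold pending UIDs into the docID bit set. */
    synchronized void resolvePending() {
        for (long uid : pendingUids) {
            Integer docId = uidToDocId.get(uid);
            if (docId != null) {
                deletedDocs.set(docId);
            }
        }
        pendingUids.clear();
    }

    synchronized int pendingCount() {
        return pendingUids.size();
    }
}
```

Once resolvePending() has run, isDeleted() falls back to a single bit test,
which is the point at which searching stops paying the extra cost. The coarse
synchronization is only there to keep the sketch short.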

{quote} 
bq. Right, Zoie is making determined tradeoffs. I would expect that most apps 
are fine with controlled reopen frequency, ie, they would choose to not lose 
indexing and searching performance if it means they can "only" reopen, eg, 2X 
per second.

In theory Zoie is making tradeoffs - in practice, at least against what is on 
trunk, Zoie's just going way faster in both indexing and querying in the 
redline perf test. I agree that in principle, once LUCENE-1313 and other 
improvements and bugs have been worked out of NRT, that query performance 
should be faster, and if zoie's default BalancedMergePolicy (nee 
ZoieMergePolicy) is in use for NRT, the indexing performance should be faster 
too - it's just not quite there yet at this point.
{quote}

Well.. unfortunately, we can't conclude much from the current test,
besides that Zoie's reopen time is much faster than Lucene's (until/if
we add the "reopen frequency" as a dimension, and see those results).

Also the test is rather synthetic, in that most apps don't really need
to reopen 100s of times per second.  We really should try to test more
realistic cases.

One question: what is the CPU utilization when you run the Lucene test?
Presumably, if you block an incoming query until the reopen completes,
and because only one reopen can happen at once, it seems like CPU must
not be saturated?

But, I agree, there are a lot of moving parts here still -- Zoie has
far faster add-only throughput than Lucene (could simply be due to
lack of LUCENE-1313), Lucene may have a correctness issue (still can't
repro), Lucene has some pending optimizations (LUCENE-2047), etc.

In LUCENE-2061 I'm working on a standard benchmark we can use to test
improvements to Lucene's NRT; it'll let us assess potential
improvements and spot weird problems.

{quote}
One thing that Zoie benefited from, from an API standpoint, which might be nice 
in Lucene, now that 1.5 is in place, is that the IndexReaderWarmer could 
replace the raw SegmentReader with a warmed user-specified subclass of 
SegmentReader:

{code} 
public abstract class IndexReaderWarmer<R extends IndexReader> {
  public abstract R warm(IndexReader rawReader);
}
{code} 
Which could replace the reader in the readerPool with the 
possibly-user-overridden subclass of SegmentReader (now that SegmentReader is 
as public as IndexReader itself is) which has now been warmed. For users who 
like to decorate their readers to keep additional state, instead of using them as 
keys into WeakHashMaps kept separate, this could be extremely useful (I know 
that the people I talked to at Apple's iTunes store do this, as well as in 
bobo, and zoie, to name a few examples off the top of my head).
{quote}

This is a good idea, and it's been suggested several times now,
including eg notification when segment merging starts/commits, but I
think we should take it up in the larger context of how to centralize
reader pooling?  This pool is just the pool used by IndexWriter, when
it's in NRT mode; I think IndexReader.open should somehow share the
same infrastructure.  And maybe LUCENE-2026 (refactoring IW) is the
vehicle for "centralizing" this?  Can you carry this suggestion over
there?

{quote} 
bq. I think Lucene could handle this well, if we made an IndexReader impl that 
directly searches DocumentWriter's RAM buffer. But that's somewhat challenging

Jason mentioned this approach in his talk at ApacheCon, but I'm not at all 
convinced it's necessary - if a single box can handle indexing a couple hundred 
smallish documents a second (into a RAMDirectory), and could be sped up by 
using multiple IndexWriters (writing into multiple RAMDirectories in parallel, 
if you were willing to give up some CPU cores to indexing), and you can search 
against them without having to do any deduplication / bloomfilter check 
against the disk, then I'd be surprised if searching the pre-indexed RAM buffer 
would really be much of a speedup in comparison to just doing it the simple 
way. But I could be wrong, as I'm not sure how much faster such a search could 
be.
{quote}

Right, we should clearly only take such a big step if performance
shows it's justified.  From the initial results I just posted in
LUCENE-2061, it looks like Lucene does in fact handle the add-only
case very well (ie degradation to QPS is fairly contained), even into
an FSDir.  I need to retest with LUCENE-1313.


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
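
To make the copy cost named in the issue title and description concrete, here
is a minimal Java sketch of a paged copy-on-write bit vector. It is an
illustration only, not the patch attached to this issue: the class
PagedBitVector, its method names and the 1024-bit page size are all
assumptions. A clone shares every page with its source, and a write copies
only the one page it touches rather than the entire underlying array.

```java
import java.util.Arrays;

/** Hypothetical paged copy-on-write bit vector; not thread-safe. */
final class PagedBitVector {
    private static final int PAGE_BITS = 1024;            // assumed page size
    private static final int WORDS_PER_PAGE = PAGE_BITS / 64;

    private final int size;
    private final long[][] pages;
    // owned[p] is true when this instance may write to page p in place.
    private final boolean[] owned;

    PagedBitVector(int size) {
        this.size = size;
        int pageCount = (size + PAGE_BITS - 1) / PAGE_BITS;
        this.pages = new long[pageCount][WORDS_PER_PAGE];
        this.owned = new boolean[pageCount];
        Arrays.fill(owned, true);
    }

    private PagedBitVector(PagedBitVector source) {
        this.size = source.size;
        this.pages = source.pages.clone();               // copies page references only
        this.owned = new boolean[pages.length];          // all pages start out shared
    }

    /** Cheap clone: both instances must now copy a page before writing to it. */
    PagedBitVector cowClone() {
        Arrays.fill(owned, false);
        return new PagedBitVector(this);
    }

    boolean get(int bit) {
        checkIndex(bit);
        long word = pages[bit / PAGE_BITS][(bit % PAGE_BITS) / 64];
        return (word & (1L << (bit % 64))) != 0;
    }

    void set(int bit) {
        checkIndex(bit);
        int page = bit / PAGE_BITS;
        if (!owned[page]) {
            pages[page] = pages[page].clone();           // copy just this one page
            owned[page] = true;
        }
        pages[page][(bit % PAGE_BITS) / 64] |= 1L << (bit % 64);
    }

    /** True when both instances still reference the same page array. */
    boolean sharesPageWith(PagedBitVector other, int page) {
        return pages[page] == other.pages[page];
    }

    private void checkIndex(int bit) {
        if (bit < 0 || bit >= size) {
            throw new IndexOutOfBoundsException("bit " + bit + " out of range 0.." + (size - 1));
        }
    }
}
```

With these assumed numbers, marking one document deleted on a clone of a
4096-bit vector copies a single 128-byte page, while the untouched pages stay
shared with the source.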

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


