[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

Jake Mannix (JIRA) Thu, 12 Nov 2009 09:21:17 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777068#action_12777068
 ]


Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. OK. It's clear Zoie's design is optimized for insanely fast reopen.

That, and maxing out QPS and indexing rate while keeping query latency 
degradation to a minimum.  From trying to turn off the extra deleted check, the 
latency overhead on a 5M doc index is a difference of queries taking 12-13ms 
with the extra check turned on, and 10ms without it, and you only really start 
to notice on the extreme edges (queries hitting all 5 million docs by way of 
an actual query, not MatchAllDocs), when your performance goes from maybe 
100ms to 140-150ms.

bq. EG what I'd love to see is, as a function of reopen rate, the "curve" of 
QPS vs docs per sec. Ie, if you reopen 1X per second, that consumes some of 
your machine's resources. What's left can be spent indexing or searching or 
both, so, it's a curve/line. So we should set up fixed rate indexing, and then 
redline the QPS to see what's possible, and do this for multiple indexing 
rates, and for multiple reopen rates.

Yes, that curve would be a very useful benchmark.  Now that I think of it, it 
wouldn't be too hard to just sneak some reader caching into the ZoieSystem with 
a tunable parameter for how long you hang onto it, so that we could see how 
much that can help.  One of the nice things that we can do in Zoie by using 
this kind of index-latency backoff is this: because we have an in-memory 
two-way mapping of zoie-specific UID to docId, we would actually have time (in 
the background, since we're caching these readers now) to zip through and 
update the real delete BitVectors on the segments, and lose the extra check at 
query time.  The extra check would then only be used if you have the 
index-latency time set below some threshold (determined by how long it takes 
the system to do this resolution - mapping docId to UID is an array lookup, 
the reverse is a little slower).
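
To make the UID/docId resolution above concrete, here is a minimal sketch of 
what such a two-way mapping and background delete resolution could look like.  
It is an illustration only, not Zoie's actual code: the class name 
UidDocIdMapper, its methods, and the use of java.util.BitSet in place of 
Lucene's BitVector are all assumptions made for the example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not Zoie's actual classes or API.
final class UidDocIdMapper {
    // docId -> UID is a plain array lookup.
    private final long[] uidByDocId;
    // UID -> docId goes through a hash map, so it is a little slower.
    private final Map<Long, Integer> docIdByUid = new HashMap<>();

    UidDocIdMapper(long[] uidByDocId) {
        this.uidByDocId = uidByDocId.clone();
        for (int docId = 0; docId < this.uidByDocId.length; docId++) {
            docIdByUid.put(this.uidByDocId[docId], docId);
        }
    }

    long uid(int docId) {
        return uidByDocId[docId];
    }

    // Returns -1 when the UID is not present in this segment.
    int docId(long uid) {
        Integer docId = docIdByUid.get(uid);
        return docId == null ? -1 : docId;
    }

    // Background step: fold pending deleted UIDs into a copy of the segment's
    // deleted-docs bits, so the per-hit UID check can be skipped at query time.
    BitSet resolveDeletes(BitSet currentDeletes, long[] deletedUids) {
        BitSet updated = (BitSet) currentDeletes.clone();
        for (long uid : deletedUids) {
            int docId = docId(uid);
            if (docId >= 0) {
                updated.set(docId);
            }
        }
        return updated;
    }
}
```

The point of the sketch is only that the resolution cost is dominated by the 
UID -> docId direction, which is why it makes sense to do it in the background 
while a cached reader is being served.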

bq. Right, Zoie is making determined tradeoffs. I would expect that most apps 
are fine with controlled reopen frequency, ie, they would choose to not lose 
indexing and searching performance if it means they can "only" reopen, eg, 2X 
per second.

In theory Zoie is making tradeoffs - in practice, at least against what is on 
trunk, Zoie's just going way faster in both indexing and querying in the 
redline perf test.  I agree that in principle, once LUCENE-1313 and other 
improvements have landed and the bugs have been worked out of NRT, query 
performance should be faster, and if Zoie's default BalancedMergePolicy (nee 
ZoieMergePolicy) is in use for NRT, the indexing performance should be faster 
too - it's just not quite there yet at this point.

bq. I agree - having such well defined API semantics ("once updateDoc returns, 
searches can see it") is wonderful. But I think they can be cleanly built on 
top of Lucene NRT as it is today, with a pre-determined (reopen) latency.

Of course!  These API semantics are already built up on top of plain-old Lucene 
- even without NRT, so I can't imagine how NRT would *remove* this ability! :)

bq. I think the "large merge just finished" case is the most costly for such 
apps (which the "merged segment warmer" on IW should take care of)? (Because 
otherwise the segments are tiny, assuming everything is cutover to "per 
segment").

Definitely.  One thing that Zoie benefited from, from an API standpoint, which 
might be nice in Lucene, now that Java 1.5 is in place, is that the 
IndexReaderWarmer could replace the raw SegmentReader with a warmed 
user-specified subclass of SegmentReader:

{code}
public abstract class IndexReaderWarmer<R extends IndexReader> {
  // Returns the warmed (possibly user-subclassed) reader of type R.
  public abstract R warm(IndexReader rawReader);
}
{code}

This could replace the reader in the readerPool with the 
possibly-user-overridden subclass of SegmentReader (now that SegmentReader is 
as public as IndexReader itself is) which has now been warmed.  For users who 
like to decorate their readers to keep additional state, instead of using them 
as keys into separately kept WeakHashMaps, this could be extremely useful (I 
know that the people I talked to at Apple's iTunes store do this, as well as 
in bobo, and zoie, to name a few examples off the top of my head).
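
As a rough sketch of the "decorated reader" idea, the following shows a 
generic warmer whose result replaces the raw reader in a pool, so the extra 
state travels with the reader instead of living in a side map.  All types here 
(RawReader, DecoratedReader, ReaderWarmer, ReaderPool) are hypothetical 
stand-ins written for this example; they are not Lucene's IndexReader, 
SegmentReader, or readerPool.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a raw segment reader (hypothetical, not a Lucene class).
class RawReader {
    final String segmentName;

    RawReader(String segmentName) {
        this.segmentName = segmentName;
    }
}

// A user subclass that carries extra per-segment state directly on the reader.
class DecoratedReader extends RawReader {
    final int[] facetCache;

    DecoratedReader(RawReader raw, int[] facetCache) {
        super(raw.segmentName);
        this.facetCache = facetCache;
    }
}

// Generic warmer: takes the raw reader, returns the warmed subclass instance.
abstract class ReaderWarmer<R extends RawReader> {
    abstract R warm(RawReader rawReader);
}

// Pool that stores the warmed reader, so callers get the decorated type back.
class ReaderPool<R extends RawReader> {
    private final Map<String, R> readers = new HashMap<>();
    private final ReaderWarmer<R> warmer;

    ReaderPool(ReaderWarmer<R> warmer) {
        this.warmer = warmer;
    }

    R addWarmed(RawReader raw) {
        R warmed = warmer.warm(raw);
        readers.put(raw.segmentName, warmed);
        return warmed;
    }

    R get(String segmentName) {
        return readers.get(segmentName);
    }
}
```

With this shape, code that pulls a reader out of the pool can reach the cached 
state (facetCache here) without a cast and without a WeakHashMap lookup keyed 
on the reader.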

bq.  I think Lucene could handle this well, if we made an IndexReader impl that 
directly searches DocumentWriter's RAM buffer. But that's somewhat challenging

Jason mentioned this approach in his talk at ApacheCon, but I'm not at all 
convinced it's necessary - if a single box can handle indexing a couple hundred 
smallish documents a second (into a RAMDirectory), and could be sped up by 
using multiple IndexWriters (writing into multiple RAMDirectories in parallel, 
if you were willing to give up some CPU cores to indexing), and you can search 
against them without having to do any deduplication / Bloom filter check 
against the disk, then I'd be surprised if searching the pre-indexed RAM buffer 
would really be much of a speedup in comparison to just doing it the simple 
way.  But I could be wrong, as I'm not sure how much faster such a search could 
be.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
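
For readers skimming the issue description above, here is a minimal sketch of 
the tombstone idea: a shared base bit set plus a small sorted int array of 
newer deletions, merged into a fresh bit set once the array grows past a 
threshold.  This is not the attached patch.  The class name TombstoneDeletes, 
the use of java.util.BitSet instead of BitVector, the count-only merge 
trigger, and the choice to treat a doc as deleted when either structure marks 
it are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical illustration of tombstones; not the LUCENE-1526 patch.
final class TombstoneDeletes {
    private final BitSet base;        // shared between clones, never mutated
    private final int[] tombstones;   // sorted docIds deleted since base was built
    private final int mergeThreshold; // simplistic merge policy: count only

    TombstoneDeletes(BitSet base, int mergeThreshold) {
        this((BitSet) base.clone(), new int[0], mergeThreshold);
    }

    private TombstoneDeletes(BitSet base, int[] tombstones, int mergeThreshold) {
        this.base = base;
        this.tombstones = tombstones;
        this.mergeThreshold = mergeThreshold;
    }

    // Assumption: a doc counts as deleted if either structure marks it.
    boolean isDeleted(int docId) {
        return base.get(docId) || Arrays.binarySearch(tombstones, docId) >= 0;
    }

    int tombstoneCount() {
        return tombstones.length;
    }

    // Returns a new instance; the receiver stays valid for readers still using it.
    TombstoneDeletes delete(int docId) {
        if (isDeleted(docId)) {
            return this;
        }
        // Insert docId into a copy of the small sorted array (cheap), rather
        // than copying the whole bit set on every delete.
        int pos = -(Arrays.binarySearch(tombstones, docId) + 1);
        int[] next = new int[tombstones.length + 1];
        System.arraycopy(tombstones, 0, next, 0, pos);
        next[pos] = docId;
        System.arraycopy(tombstones, pos, next, pos + 1, tombstones.length - pos);

        if (next.length >= mergeThreshold) {
            // Too many tombstones: pay for one full copy and start over.
            BitSet merged = (BitSet) base.clone();
            for (int d : next) {
                merged.set(d);
            }
            return new TombstoneDeletes(merged, new int[0], mergeThreshold);
        }
        return new TombstoneDeletes(base, next, mergeThreshold);
    }
}
```

A real merge policy would, per the description, also weigh the total number 
of documents and not just the tombstone count.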

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


