[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776294#action_12776294 ]
Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. Zoie must do the IntSet check plus the BitVector check (done by Lucene), right?

Yes, so how does Lucene NRT deal with new deletes? The disk-backed IndexReader still does its internal check for deletions, right? I haven't played with the latest patches on LUCENE-1313, so I'm not sure what has changed, but if you're leaving the disk index alone (to preserve the point-in-time status of the index without writing to disk all the time), you've got your in-memory BitVector of newly uncommitted deletes, and then the SegmentReaders from the disk have their own internal deletedDocs BitVector. Are these two OR'ed with each other somewhere? What is done in NRT to minimize the cost of checking both of these without modifying the read-only SegmentReader? In the current 2.9.0 code, the segment is reloaded completely on getReader() if there are new adds/deletes, right?

bq. Ie comparing IntSet lookup vs BitVector lookup isn't the comparison you want to do. You should compare the IntSet lookup (Zoie's added cost) to 0.

If you've got a technique for resolving new deletes against the disk-based ones while maintaining the point-in-time nature of the index, and you can completely amortize the reopen cost so that it doesn't affect performance, then yes, that would be the right comparison. I'm not sure I understand how the NRT implementation is doing this currently - I tried to step through the debugger while running the TestIndexWriterReader test, but I'm still not quite sure what is going on during the reopen.

bq. So, for a query that hits 5M docs, Zoie will take 64 msec longer than Lucene, due to the extra check. What I'd like to know is what pctg. slowdown that works out to be, eg for a simple TermQuery that hits those 5M results - that's Zoie's worst case search slowdown.
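For concreteness, the layered check being discussed can be sketched as below. This is a hypothetical illustration, not Zoie's or Lucene's actual classes: `java.util.BitSet` stands in for the segment's deletedDocs BitVector, and a `HashSet<Integer>` stands in for the in-memory IntSet of deletes applied since the last flush/reopen.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a reader-side deletion check that consults both the
// segment's on-disk deletion bits and an in-memory set of newly applied
// deletes, leaving the read-only segment data untouched.
class LayeredDeletes {
    private final BitSet diskDeletes;      // stands in for SegmentReader's deletedDocs
    private final Set<Integer> newDeletes; // stands in for the in-memory IntSet

    LayeredDeletes(BitSet diskDeletes, Set<Integer> newDeletes) {
        this.diskDeletes = diskDeletes;
        this.newDeletes = newDeletes;
    }

    // Every hit pays the bit-vector check; the set probe is the extra,
    // per-hit cost the comment asks to quantify against a 0-cost baseline.
    boolean isDeleted(int docId) {
        return diskDeletes.get(docId) || newDeletes.contains(docId);
    }
}

public class LayeredDeletesDemo {
    public static void main(String[] args) {
        BitSet disk = new BitSet();
        disk.set(3);                        // doc 3 deleted on disk
        Set<Integer> mem = new HashSet<>();
        mem.add(7);                         // doc 7 deleted since last reopen
        LayeredDeletes del = new LayeredDeletes(disk, mem);
        System.out.println(del.isDeleted(3)); // true
        System.out.println(del.isDeleted(7)); // true
        System.out.println(del.isDeleted(5)); // false
    }
}
```

The per-hit overhead is exactly one extra `contains` probe on the in-memory set, which is what the 64 msec / 5M docs figure above is pricing.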
Yes, this would be a good check to see. It is still a micro-benchmark, really, since it would be run in isolation, with none of the other production tasks going on - rapid indexing and the consequent flushes to disk and reader reopening - but it would be useful to see. What would be even better, however, would be a running system in which the index is continually updated while many concurrent requests that hit all 5M documents come in, measuring zoie's mean latency in that case, both in comparison to NRT and in comparison to lucene when you *don't* reopen the index (ie. you do things the pre-lucene-2.9 way, where the CPU is still being consumed by indexing, but the reader is out of date until the next time the application schedules a reopen). This would measure the effective latency and throughput costs of zoie and NRT vs non-NRT lucene. I'm not really sure it's terribly helpful to see "what is zoie's latency when you're not indexing at all" - why on earth would you use either NRT or zoie if you're not doing lots of indexing?

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy-on-write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone/delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey.
> Tombstones represent new deletions plus the incremental deletions
> from previously reopened readers in the current reader.
>
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> deleted docs.
>
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted-docs BitVector, as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
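The tombstone scheme quoted in the issue description can be sketched as follows. This is an illustrative sketch only - class names, the copy-on-write strategy, and the merge threshold are assumptions, not the proposed Lucene API: new deletions accumulate in a small sorted int array, a doc is deleted if it is in the base bit set (standing in for the BitVector) or among the tombstones, and a merge-policy ratio decides when to fold tombstones into a fresh bit set.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch of the tombstone idea from LUCENE-1526.
class TombstoneDeletes {
    private BitSet baseDeletes;            // deletions known at the last merge
    private int[] tombstones = new int[0]; // sorted doc ids deleted since then
    private final int maxDoc;
    private final double mergeRatio;       // merge when tombstones exceed this fraction of maxDoc

    TombstoneDeletes(BitSet baseDeletes, int maxDoc, double mergeRatio) {
        this.baseDeletes = baseDeletes;
        this.maxDoc = maxDoc;
        this.mergeRatio = mergeRatio;
    }

    void delete(int docId) {
        int[] next = Arrays.copyOf(tombstones, tombstones.length + 1);
        next[tombstones.length] = docId;
        Arrays.sort(next);                 // keep tombstones sorted for binary search
        tombstones = next;
        if (tombstones.length > mergeRatio * maxDoc) {
            merge();                       // merge policy: too many tombstones hurts lookups
        }
    }

    boolean isDeleted(int docId) {
        return baseDeletes.get(docId)
            || Arrays.binarySearch(tombstones, docId) >= 0;
    }

    int tombstoneCount() { return tombstones.length; }

    // Fold tombstones into a cloned bit set; a reader holding the old
    // BitSet keeps its point-in-time view (copy-on-write), and only the
    // much smaller tombstone array is copied on each incremental delete.
    private void merge() {
        BitSet merged = (BitSet) baseDeletes.clone();
        for (int doc : tombstones) merged.set(doc);
        baseDeletes = merged;
        tombstones = new int[0];
    }
}
```

With, say, maxDoc = 1000 and mergeRatio = 0.01, the eleventh pending tombstone triggers a merge; until then, each delete copies only the tiny int array rather than the full byte array backing the bit set, which is the cost the issue is trying to avoid.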