[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776294#action_12776294 ]
Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. Zoie must do the IntSet check plus the BitVector check (done by Lucene), right?

Yes, so how does Lucene NRT deal with new deletes? The disk-backed IndexReader still does its internal check for deletions, right? I haven't played with the latest patches on LUCENE-1313, so I'm not sure what has changed, but if you're leaving the disk index alone (to preserve the point-in-time status of the index without writing to disk all the time), you've got your in-memory BitVector of newly uncommitted deletes, and then the SegmentReaders from the disk have their own internal deletedDocs BitVector. Are these two OR'ed with each other somewhere? What is done in NRT to minimize the cost of checking both of these without modifying the read-only SegmentReader? In the current 2.9.0 code, the segment is reloaded completely on getReader() if there are new adds/deletes, right?

bq. Ie comparing IntSet lookup vs BitVector lookup isn't the comparison you want to do. You should compare the IntSet lookup (Zoie's added cost) to 0.

If you've got a technique for resolving new deletes against the disk-based ones while maintaining the point-in-time nature of the index, and you can completely amortize the reopen cost so that it doesn't affect performance, then yes, that would be the right comparison. I'm not sure I understand how the NRT implementation is doing this currently - I tried to step through the debugger while running the TestIndexWriterReader test, but I'm still not quite sure what is going on during the reopen.

bq. So, for a query that hits 5M docs, Zoie will take 64 msec longer than Lucene, due to the extra check. What I'd like to know is what pctg. slowdown that works out to be, eg for a simple TermQuery that hits those 5M results - that's Zoie's worst case search slowdown.
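For concreteness, the layered check being discussed can be sketched as below. This is a hypothetical illustration, not Zoie's or Lucene's actual classes: `java.util.BitSet` stands in for the segment's deletedDocs BitVector, and a `HashSet<Integer>` stands in for the in-memory IntSet of deletes applied since the last flush/reopen.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a reader-side deletion check that consults both the
// segment's on-disk deletion bits and an in-memory set of newly applied
// deletes, leaving the read-only segment data untouched.
class LayeredDeletes {
    private final BitSet diskDeletes;      // stands in for SegmentReader's deletedDocs
    private final Set<Integer> newDeletes; // stands in for the in-memory IntSet

    LayeredDeletes(BitSet diskDeletes, Set<Integer> newDeletes) {
        this.diskDeletes = diskDeletes;
        this.newDeletes = newDeletes;
    }

    // Every hit pays the bit-vector check; the set probe is the extra,
    // per-hit cost the comment asks to quantify against a 0-cost baseline.
    boolean isDeleted(int docId) {
        return diskDeletes.get(docId) || newDeletes.contains(docId);
    }
}

public class LayeredDeletesDemo {
    public static void main(String[] args) {
        BitSet disk = new BitSet();
        disk.set(3);                        // doc 3 deleted on disk
        Set<Integer> mem = new HashSet<>();
        mem.add(7);                         // doc 7 deleted since last reopen
        LayeredDeletes del = new LayeredDeletes(disk, mem);
        System.out.println(del.isDeleted(3)); // true
        System.out.println(del.isDeleted(7)); // true
        System.out.println(del.isDeleted(5)); // false
    }
}
```

The per-hit overhead is exactly one extra `contains` probe on the in-memory set, which is what the 64 msec / 5M docs figure above is pricing.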
Yes, this would be a good check to see. It is still a micro-benchmark, really, since it would be run in isolation, with none of the other production tasks going on - rapid indexing and the consequent flushes to disk and reader reopening - but it would be useful to see. What would be even better, however, would be a running system in which the index is continually updated while many concurrent requests that hit all 5M documents come in, measuring zoie's mean latency in that case, both in comparison to NRT and in comparison to lucene when you *don't* reopen the index (ie. you do things the pre-lucene-2.9 way, where the CPU is still being consumed by indexing, but the reader is out of date until the next time the application schedules a reopen). This would measure the effective latency and throughput costs of zoie and NRT vs non-NRT lucene. I'm not really sure it's terribly helpful to see "what is zoie's latency when you're not indexing at all" - why on earth would you use either NRT or zoie if you're not doing lots of indexing?

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy-on-write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone/delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey.
> Tombstones represent new deletions plus the incremental deletions
> from previously reopened readers in the current reader.
>
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> deleted docs.
>
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted-docs BitVector, as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
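The tombstone scheme quoted in the issue description can be sketched as follows. This is an illustrative sketch only - class names, the copy-on-write strategy, and the merge threshold are assumptions, not the proposed Lucene API: new deletions accumulate in a small sorted int array, a doc is deleted if it is in the base bit set (standing in for the BitVector) or among the tombstones, and a merge-policy ratio decides when to fold tombstones into a fresh bit set.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch of the tombstone idea from LUCENE-1526.
class TombstoneDeletes {
    private BitSet baseDeletes;            // deletions known at the last merge
    private int[] tombstones = new int[0]; // sorted doc ids deleted since then
    private final int maxDoc;
    private final double mergeRatio;       // merge when tombstones exceed this fraction of maxDoc

    TombstoneDeletes(BitSet baseDeletes, int maxDoc, double mergeRatio) {
        this.baseDeletes = baseDeletes;
        this.maxDoc = maxDoc;
        this.mergeRatio = mergeRatio;
    }

    void delete(int docId) {
        int[] next = Arrays.copyOf(tombstones, tombstones.length + 1);
        next[tombstones.length] = docId;
        Arrays.sort(next);                 // keep tombstones sorted for binary search
        tombstones = next;
        if (tombstones.length > mergeRatio * maxDoc) {
            merge();                       // merge policy: too many tombstones hurts lookups
        }
    }

    boolean isDeleted(int docId) {
        return baseDeletes.get(docId)
            || Arrays.binarySearch(tombstones, docId) >= 0;
    }

    int tombstoneCount() { return tombstones.length; }

    // Fold tombstones into a cloned bit set; a reader holding the old
    // BitSet keeps its point-in-time view (copy-on-write), and only the
    // much smaller tombstone array is copied on each incremental delete.
    private void merge() {
        BitSet merged = (BitSet) baseDeletes.clone();
        for (int doc : tombstones) merged.set(doc);
        baseDeletes = merged;
        tombstones = new int[0];
    }
}
```

With, say, maxDoc = 1000 and mergeRatio = 0.01, the eleventh pending tombstone triggers a merge; until then, each delete copies only the tiny int array rather than the full byte array backing the bit set, which is the cost the issue is trying to avoid.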