[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661934#action_12661934 ]
Michael McCandless commented on LUCENE-1476:
--------------------------------------------

{quote}
> PostingList would be completely ignorant of deletions, as would classes like
> NOTScorer and MatchAllScorer:
{quote}

This is a neat idea! Deletions are then applied just like a Filter.

For a TermQuery (one term) the cost of the two approaches should be the same.

For OR'd Term queries, it actually seems like your proposed approach may be
lower cost? I.e., rather than each TermEnum doing the "AND NOT deleted"
intersection, you only do it once at the top. There is added cost in that
each TermEnum is now returning more docIDs than before, but the deleted ones
are eliminated before scoring.

For AND (and other) queries I'm not sure. In theory, having to process more
docIDs is more costly, e.g., a PhraseQuery or SpanXXXQuery may see much higher
net cost. We should test.

Conceivably, a future "search optimization phase" could pick & choose the
best point to inject the "AND NOT deleted" filter. In fact, it could also
pick when to inject a Filter... a costly per-docID search with a very
restrictive filter could be far more efficient if you applied the Filter
earlier in the chain.

I'm also curious what cost you see of doing the merge sort for every search;
I think it could be uncomfortably high since it's so branch-intensive, and
the branches are hard for the CPU to predict. We could take the first search
that doesn't use skipTo and save the result of the merge sort, essentially
doing an in-RAM-only "merge" of those deletes, and let subsequent searches
use that single merged stream. (This is not MMAP friendly, though.)

In my initial rough testing, I switched to an iterator API for
SegmentTermEnum and found that if the percentage of deleted docs is < 10%,
the search was a bit faster using an iterator vs random access, but above
that it was slower. This was with an already "merged" list of in-order
docIDs.
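
To make the OR'd-terms comparison above concrete, here is a rough,
self-contained sketch. It does not use any Lucene classes: the class and
method names (DeletesAsFilterSketch, filterPerList, filterAtTop) are
hypothetical, posting lists are plain sorted int arrays, deletions are a
java.util.BitSet, and everything is materialized in memory rather than
streamed. It only counts deleted-bit lookups; it says nothing about real
search cost, which still needs to be measured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; none of these names are Lucene APIs.
final class DeletesAsFilterSketch {

    // Number of deleted-bit lookups performed, kept only for comparison.
    static long checks = 0;

    static boolean isLive(BitSet deleted, int doc) {
        checks++;
        return !deleted.get(doc);
    }

    // OR of several posting lists, ignorant of deletions.
    // Assumes each list is sorted ascending, has no duplicates, and
    // never contains Integer.MAX_VALUE (used here as an "exhausted" marker).
    static List<Integer> union(int[][] postings) {
        int[] pos = new int[postings.length];
        List<Integer> out = new ArrayList<>();
        while (true) {
            int min = Integer.MAX_VALUE;
            for (int i = 0; i < postings.length; i++) {
                if (pos[i] < postings[i].length) {
                    min = Math.min(min, postings[i][pos[i]]);
                }
            }
            if (min == Integer.MAX_VALUE) {
                return out; // every list is exhausted
            }
            out.add(min);
            // Advance every list currently positioned on the emitted doc.
            for (int i = 0; i < postings.length; i++) {
                if (pos[i] < postings[i].length && postings[i][pos[i]] == min) {
                    pos[i]++;
                }
            }
        }
    }

    // Approach 1: every posting list applies "AND NOT deleted" itself.
    static List<Integer> filterPerList(int[][] postings, BitSet deleted) {
        int[][] live = new int[postings.length][];
        for (int i = 0; i < postings.length; i++) {
            live[i] = Arrays.stream(postings[i])
                            .filter(doc -> isLive(deleted, doc))
                            .toArray();
        }
        return union(live);
    }

    // Approach 2: posting lists ignore deletions; filter once at the top.
    static List<Integer> filterAtTop(int[][] postings, BitSet deleted) {
        List<Integer> out = new ArrayList<>();
        for (int doc : union(postings)) {
            if (isLive(deleted, doc)) {
                out.add(doc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] postings = { {1, 3, 5, 7}, {3, 5, 8}, {1, 5, 9} };
        BitSet deleted = new BitSet();
        deleted.set(5);

        checks = 0;
        System.out.println(filterPerList(postings, deleted) + " lookups=" + checks);
        checks = 0;
        System.out.println(filterAtTop(postings, deleted) + " lookups=" + checks);
    }
}
```

On this toy input both variants return the same live docIDs, but the
per-list variant does one lookup per posting entry (10) while the top-level
variant does one per distinct docID in the union (6). The flip side, as
noted above, is that the union itself has to process the deleted docID.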

Switching to an iterator API for accessing field values for many docs
(LUCENE-831 -- new FieldCache API, LUCENE-1231 -- column stride fields)
shouldn't have this same problem since it's the "top level" that's accessing
the values (i.e., one iterator per field X query).

> BitVector implement DocIdSet
> ----------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet. This is for making
> SegmentReader.deletedDocs pluggable.