[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

Michael McCandless (JIRA) Thu, 08 Jan 2009 02:59:28 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661934#action_12661934
 ]


Michael McCandless commented on LUCENE-1476:
--------------------------------------------


{quote}
> PostingList would be completely ignorant of deletions, as would classes like 
> NOTScorer and MatchAllScorer:
{quote}

This is a neat idea! Deletions are then applied just like a Filter.

For a TermQuery (one term) the cost of the two approaches should be
the same.

For OR'd Term queries, it actually seems like your proposed approach
may be lower cost?  Ie rather than each TermEnum doing the "AND NOT
deleted" intersection, you only do it once at the top.  There is added
cost in that each TermEnum is now returning more docIDs than before,
but the deleted ones are eliminated before scoring.
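
To make the OR case concrete, here is a rough sketch, not Lucene code.  It
assumes each term's postings are a plain sorted int[] that knows nothing
about deletions and that deletes live in a java.util.BitSet; the class and
method names are made up for illustration.  The "AND NOT deleted" check
runs once per distinct candidate docID at the top, rather than inside each
term's enumeration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names exist in Lucene.
final class TopLevelDeletionFilter {

    /** ORs sorted, deletion-unaware postings and applies deletions once at the top. */
    static List<Integer> orThenFilter(int[][] postings, BitSet deleted) {
        // Queue entries are {docID, listIndex, offsetInList}, ordered by docID.
        PriorityQueue<int[]> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < postings.length; i++) {
            if (postings[i].length > 0) {
                queue.add(new int[] {postings[i][0], i, 0});
            }
        }

        List<Integer> hits = new ArrayList<>();
        int last = -1; // docIDs are assumed to be non-negative
        while (!queue.isEmpty()) {
            int[] top = queue.poll();
            int doc = top[0];
            if (doc != last) {
                last = doc;
                // The single "AND NOT deleted" test for this candidate.
                if (!deleted.get(doc)) {
                    hits.add(doc);
                }
            }
            // Advance the list that produced this docID.
            int next = top[2] + 1;
            if (next < postings[top[1]].length) {
                queue.add(new int[] {postings[top[1]][next], top[1], next});
            }
        }
        return hits;
    }
}
```

Each term's list still hands over its deleted docIDs, which is the added
cost mentioned above, but none of them reaches the hit list.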

For AND (and other) queries I'm not sure.  In theory, having to
process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery
may see much higher net cost.  We should test.

Conceivably, a future "search optimization phase" could pick & choose
the best point to inject the "AND NOT deleted" filter.  In fact, it
could also pick when to inject a Filter... a costly per-docID search
with a very restrictive filter could be far more efficient if you
applied the Filter earlier in the chain.

I'm also curious what cost you see of doing the merge sort for every
search; I think it could be uncomfortably high, since it's dense with
branches that are hard for the CPU to predict.  We could take the first
search that doesn't use skipTo and save the result of the merge sort,
essentially doing an in-RAM-only "merge" of those deletes, and let
subsequent searches use that single merged stream.  (This is not MMAP
friendly, though.)
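
As a sketch of that caching idea, assuming the deletes arrive as several
sorted int[] streams with no duplicates inside any one stream: the class
below (a hypothetical name, not an existing Lucene class) merges them the
first time they are asked for and keeps the result in RAM for later
searches.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Lucene.
final class MergedDeletesCache {
    private final int[][] deleteStreams; // each stream sorted ascending
    private int[] merged;                // null until the first request

    MergedDeletesCache(int[][] deleteStreams) {
        this.deleteStreams = deleteStreams;
    }

    /** Returns the merged deletes, doing the merge only on the first call. */
    synchronized int[] merged() {
        if (merged == null) {
            merged = mergeAll(deleteStreams);
        }
        return merged;
    }

    static int[] mergeAll(int[][] streams) {
        int[] result = new int[0];
        for (int[] stream : streams) {
            result = mergeTwo(result, stream);
        }
        return result;
    }

    /** Classic two-way merge of sorted arrays, dropping docIDs present in both. */
    static int[] mergeTwo(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                out[n++] = a[i++];
            } else if (a[i] > b[j]) {
                out[n++] = b[j++];
            } else {
                out[n++] = a[i++];
                j++;
            }
        }
        while (i < a.length) out[n++] = a[i++];
        while (j < b.length) out[n++] = b[j++];
        return Arrays.copyOf(out, n);
    }
}
```

The merged array lives on the heap, which is why this approach does not
play well with MMAP.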

In my initial rough testing, I switched to an iterator API for
SegmentTermEnum and found that if the percentage of deletes is < 10%
the search was a bit faster using an iterator vs random access, but
above that it was slower.  This was with an already "merged" list of
in-order docIDs.
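
For reference, the two access styles being compared look roughly like this.
It is only a sketch: it assumes postings and deletes are sorted int[]
arrays (plus a java.util.BitSet for the random-access case), the names are
invented, and it is not the code that was benchmarked.

```java
import java.util.BitSet;

// Hypothetical illustration only; not the benchmarked code.
final class DeletionCheckStyles {

    /** Random access: one bit lookup per posting. */
    static int countLiveRandomAccess(int[] postings, BitSet deleted) {
        int live = 0;
        for (int doc : postings) {
            if (!deleted.get(doc)) {
                live++;
            }
        }
        return live;
    }

    /** Iterator style: walk the sorted deletes in lockstep with the postings. */
    static int countLiveIterator(int[] postings, int[] sortedDeleted) {
        int live = 0;
        int d = 0; // cursor into sortedDeleted
        for (int doc : postings) {
            // Skip deletes that fall before the current posting.
            while (d < sortedDeleted.length && sortedDeleted[d] < doc) {
                d++;
            }
            if (d == sortedDeleted.length || sortedDeleted[d] != doc) {
                live++;
            }
        }
        return live;
    }
}
```

With few deletes the iterator's inner loop rarely advances; as the share
of deletes grows it does more work per posting, which would be consistent
with the crossover seen above.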

Switching to an iterator API for accessing field values for many docs
(LUCENE-831 -- new FieldCache API, LUCENE-1231 -- column stride
fields) shouldn't have this same problem since it's the "top level"
that's accessing the values (ie, one iterator per field X query).



> BitVector implement DocIdSet
> ----------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
