[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

Marvin Humphrey (JIRA) Fri, 05 Dec 2008 10:23:36 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653883#action_12653883
 ]


Marvin Humphrey commented on LUCENE-1476:
-----------------------------------------

> Marvin, in KS/Lucy are you using random-access or iterator to access 
> deletedDocs & norms?

Both. There's a DelEnum class which is used by NOTScorer and MatchAllScorer, 
but it's implemented using BitVectors which get the next deleted doc num by 
calling nextSetBit() internally. 

I happened to be coding up those classes this spring when there was the big 
brouhaha about IndexReader.isDeleted().  It seemed wrong to pay the method call 
overhead for IndexReader.isDeleted() on each iteration in NOTScorer.next() or 
MatchAllScorer.next(), when we could just store the next deletion:

{code}
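i32_t
MatchAllScorer_next(MatchAllScorer* self) 
{
    do {
        /* Advance; past the last doc, undo the increment and return the
         * 0 sentinel so that repeated calls keep returning 0. */
        if (++self->doc_num > self->max_docs) {
            self->doc_num--;
            return 0;
        }
        /* The cached deletion is behind us: ask the enum for the next
         * deletion at or after the current doc. */
        if (self->doc_num > self->next_deletion) {
            self->next_deletion 
                = DelEnum_Skip_To(self->del_enum, self->doc_num);
        }
    /* Keep going while the current doc is the cached deleted doc. */
    } while (self->doc_num == self->next_deletion);
    return self->doc_num;
}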
{code}

(Note: Scorer.next() in KS returns the document number; doc nums start at 1, 
and 0 is the sentinel signaling iterator termination. I expect that Lucy will 
be the same.)
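
For anyone who wants to run the loop above in isolation, here is a minimal, 
self-contained sketch. Everything prefixed with Toy/toy_ is a hypothetical 
stand-in, not the KS/Lucy code: the "enum" is collapsed into a naive 
next-set-bit scan over a byte array, and INT32_MAX is my own choice of value 
for "no more deletions" (the real DelEnum may well signal that differently).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a deletions bit vector: bit i set means
 * doc i is deleted.  Doc nums start at 1, so bit 0 is unused. */
typedef struct {
    const uint8_t *bits;
    int32_t        cap;     /* number of addressable bits */
} ToyBitVec;

/* Return the first set bit at or after `start`, or INT32_MAX if there
 * is none (INT32_MAX is an assumption of this sketch). */
static int32_t
toy_next_set_bit(const ToyBitVec *bv, int32_t start)
{
    assert(start >= 0);
    for (int32_t i = start; i < bv->cap; i++) {
        if (bv->bits[i >> 3] & (1u << (i & 7))) {
            return i;
        }
    }
    return INT32_MAX;
}

typedef struct {
    const ToyBitVec *dels;
    int32_t          doc_num;        /* 0 before the first call */
    int32_t          max_docs;
    int32_t          next_deletion;  /* cached next deleted doc */
} ToyMatchAll;

static void
toy_match_all_init(ToyMatchAll *self, const ToyBitVec *dels,
                   int32_t max_docs)
{
    self->dels          = dels;
    self->doc_num       = 0;
    self->max_docs      = max_docs;
    self->next_deletion = 0;  /* forces a lookup on the first call */
}

/* Same control flow as MatchAllScorer_next: returns the next live doc
 * num, or 0 once the iterator is exhausted. */
static int32_t
toy_match_all_next(ToyMatchAll *self)
{
    do {
        if (++self->doc_num > self->max_docs) {
            self->doc_num--;
            return 0;
        }
        if (self->doc_num > self->next_deletion) {
            self->next_deletion
                = toy_next_set_bit(self->dels, self->doc_num);
        }
    } while (self->doc_num == self->next_deletion);
    return self->doc_num;
}
```

With 8 docs and docs 2, 3 and 7 deleted, successive calls yield 1, 4, 5, 6, 8 
and then 0. The deletions lookup happens only when the scorer moves past the 
cached deletion, not once per doc.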

Perhaps we could get away without needing the random access, but that's because 
IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int 
docNum) returns the doc even if it's deleted -- unlike Lucene, which throws an 
exception. Also, you can't delete documents against an IndexReader, so Robert's 
objection doesn't apply to us.

I had always assumed we were going to have to expose isDeleted() eventually, 
but maybe we can get away with zapping it. Interesting!

I've actually been trying to figure out a new design for deletions because 
writing them out for big segments is our last big write bottleneck, now that 
we've theoretically solved the sort cache warming issue.  I figured we would 
continue to need bit-vector files because they're straightforward to mmap, but 
if we only need iterator access, we can use vbyte encoding instead... Hmm, we 
still face the problem of outsized write cost when a segment has a large number 
of deletions and you add one more...
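
To make the vbyte idea concrete, here is a rough sketch of iterator-only 
deletions stored as vbyte-encoded gaps between ascending doc nums. All names 
are hypothetical, and the byte layout (7 data bits per byte, low-order group 
first, high bit set on every byte but the last) is just one common variant, 
not necessarily what KS/Lucy would use. The decoder assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one value as a vbyte.  Returns the number of bytes written
 * (at most 5 for a 32-bit value). */
static size_t
vbyte_encode_u32(uint32_t value, uint8_t *out)
{
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (uint8_t)((value & 0x7F) | 0x80);
        value >>= 7;
    }
    out[n++] = (uint8_t)value;
    return n;
}

/* Decode one vbyte into *value.  Returns the number of bytes consumed. */
static size_t
vbyte_decode_u32(const uint8_t *in, uint32_t *value)
{
    uint32_t result = 0;
    int      shift  = 0;
    size_t   n      = 0;
    uint8_t  byte;
    do {
        byte    = in[n++];
        result |= (uint32_t)(byte & 0x7F) << shift;
        shift  += 7;
    } while (byte & 0x80);
    *value = result;
    return n;
}

/* Write strictly ascending doc nums (starting at 1) as vbyte deltas.
 * `out` must have room for 5 bytes per doc.  Returns bytes written. */
static size_t
encode_deletions(const uint32_t *docs, size_t count, uint8_t *out)
{
    size_t   n    = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < count; i++) {
        assert(docs[i] > prev);
        n += vbyte_encode_u32(docs[i] - prev, out + n);
        prev = docs[i];
    }
    return n;
}

/* Forward-only iterator over the encoded deletions. */
typedef struct {
    const uint8_t *buf;
    size_t         len;
    size_t         pos;
    uint32_t       doc;   /* last doc num returned; 0 initially */
} ToyDelIter;

/* Return the next deleted doc num, or 0 when exhausted. */
static uint32_t
toy_del_iter_next(ToyDelIter *it)
{
    uint32_t delta;
    if (it->pos >= it->len) {
        return 0;
    }
    it->pos += vbyte_decode_u32(it->buf + it->pos, &delta);
    it->doc += delta;
    return it->doc;
}
```

Encoding the deletions 3, 130, 131 and 20000 this way takes 6 bytes. Note that 
the sketch does nothing about the rewrite cost mentioned above: adding one 
deletion in the middle still shifts every byte after it.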

> BitVector implement DocIdSet
> ----------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.
