[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653883#action_12653883 ]
Marvin Humphrey commented on LUCENE-1476:
-----------------------------------------

> Marvin, in KS/Lucy are you using random-access or iterator to access
> deletedDocs & norms?

Both. There's a DelEnum class which is used by NOTScorer and MatchAllScorer, but it's implemented using BitVectors which get the next deleted doc num by calling nextSetBit() internally.

I happened to be coding up those classes this spring when there was the big brouhaha about IndexReader.isDeleted(). It seemed wrong to pay the method call overhead for IndexReader.isDeleted() on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just store the next deletion:

{code}
i32_t
MatchAllScorer_next(MatchAllScorer* self)
{
    do {
        if (++self->doc_num > self->max_docs) {
            self->doc_num--;
            return 0;
        }
        if (self->doc_num > self->next_deletion) {
            self->next_deletion
                = DelEnum_Skip_To(self->del_enum, self->doc_num);
        }
    } while (self->doc_num == self->next_deletion);
    return self->doc_num;
}
{code}

(Note: Scorer.next() in KS returns the document number; doc nums start at 1, and 0 is the sentinel signaling iterator termination. I expect that Lucy will be the same.)

Perhaps we could get away without needing the random access, but that's because IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted -- unlike Lucene which throws an exception. Also, you can't delete documents against an IndexReader, so Robert's objection doesn't apply to us.

I had always assumed we were going to have to expose isDeleted() eventually, but maybe we can get away with zapping it. Interesting!

I've actually been trying to figure out a new design for deletions because writing them out for big segments is our last big write bottleneck, now that we've theoretically solved the sort cache warming issue.
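For readers who want to run the cached-next-deletion loop in isolation, here is a minimal sketch under stated assumptions. The names DelEnumSketch, MatchAllSketch, del_enum_skip_to, match_all_init, match_all_next and NO_MORE_DELETIONS are hypothetical stand-ins, not the KS/Lucy API, and the bit layout (one bit per doc num, bit 0 unused) is assumed purely for illustration. The linear scan in del_enum_skip_to plays the role that nextSetBit() plays in the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: no deleted doc at or after the target. */
#define NO_MORE_DELETIONS INT32_MAX

/* Hypothetical bit-vector-backed deletions enumerator.
 * Assumed layout: bit i set means doc num i is deleted; bit 0 is unused
 * because doc nums start at 1. */
typedef struct {
    const uint8_t *bits;
    int32_t        max_docs;
} DelEnumSketch;

/* Return the first deleted doc num >= target, or NO_MORE_DELETIONS. */
static int32_t
del_enum_skip_to(const DelEnumSketch *de, int32_t target)
{
    assert(target >= 1);
    for (int32_t i = target; i <= de->max_docs; i++) {
        if (de->bits[i >> 3] & (1u << (i & 7))) {
            return i;
        }
    }
    return NO_MORE_DELETIONS;
}

typedef struct {
    DelEnumSketch del_enum;
    int32_t       doc_num;        /* 0 before the first call to next */
    int32_t       next_deletion;  /* cached next deleted doc num */
    int32_t       max_docs;
} MatchAllSketch;

static void
match_all_init(MatchAllSketch *s, const uint8_t *bits, int32_t max_docs)
{
    s->del_enum.bits     = bits;
    s->del_enum.max_docs = max_docs;
    s->doc_num           = 0;
    s->next_deletion     = 0;  /* below every doc num: forces a first lookup */
    s->max_docs          = max_docs;
}

/* Return the next live doc num, or 0 once the iterator is exhausted. */
static int32_t
match_all_next(MatchAllSketch *s)
{
    do {
        if (s->doc_num >= s->max_docs) {
            return 0;
        }
        s->doc_num++;
        /* Consult the enumerator only after passing the cached deletion. */
        if (s->doc_num > s->next_deletion) {
            s->next_deletion = del_enum_skip_to(&s->del_enum, s->doc_num);
        }
    } while (s->doc_num == s->next_deletion);
    return s->doc_num;
}
```

In this sketch the enumerator is consulted only when the scorer moves past the cached deletion, so a long run of live documents costs one integer comparison per document rather than one lookup per document.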
I figured we would continue to need bit-vector files because they're straightforward to mmap, but if we only need iterator access, we can use vbyte encoding instead... Hmm, we still face the problem of outsized write cost when a segment has a large number of deletions and you add one more...

> BitVector implement DocIdSet
> ----------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet. This is for making
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
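The vbyte idea raised in the comment can be illustrated with a small sketch: store the sorted deleted doc nums as gaps and write each gap as a variable-length byte sequence. Everything here is an assumption made for illustration. The function and type names are hypothetical, and the byte order (low-order 7-bit group first, high bit set on every byte except the last) is one common convention; the thread does not say which format KS/Lucy would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one value, 7 payload bits per byte, low-order group first.
 * The high bit marks "more bytes follow". Returns bytes written (1 to 5). */
static size_t
vbyte_encode(uint32_t value, uint8_t *out)
{
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (uint8_t)((value & 0x7F) | 0x80);
        value >>= 7;
    }
    out[n++] = (uint8_t)value;
    return n;
}

/* Decode one value written by vbyte_encode. Returns bytes consumed. */
static size_t
vbyte_decode(const uint8_t *in, uint32_t *value)
{
    uint32_t result = 0;
    int      shift  = 0;
    size_t   n      = 0;
    uint8_t  byte;
    do {
        byte    = in[n++];
        result |= (uint32_t)(byte & 0x7F) << shift;
        shift  += 7;
    } while (byte & 0x80);
    *value = result;
    return n;
}

/* Write strictly ascending doc nums (starting at 1) as vbyte-encoded gaps.
 * The caller must provide up to 5 bytes of output space per doc num. */
static size_t
encode_deletions(const uint32_t *docs, size_t count, uint8_t *out)
{
    size_t   len  = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < count; i++) {
        assert(docs[i] > prev);
        len += vbyte_encode(docs[i] - prev, out + len);
        prev = docs[i];
    }
    return len;
}

/* Iterator-only access: no random lookup, just "next deleted doc num". */
typedef struct {
    const uint8_t *buf;
    size_t         len;
    size_t         pos;
    uint32_t       current;
} VByteDelIter;

/* Return the next deleted doc num, or 0 when the list is exhausted. */
static uint32_t
vbyte_del_next(VByteDelIter *it)
{
    uint32_t gap;
    if (it->pos >= it->len) {
        return 0;
    }
    it->pos     += vbyte_decode(it->buf + it->pos, &gap);
    it->current += gap;
    return it->current;
}
```

With this layout the encoded size tracks the number of deletions rather than the size of the segment. It does not address the concern voiced above, though: adding one deletion to a segment that already has many still means writing the whole list out again.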