[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662143#action_12662143 ]
Marvin Humphrey commented on LUCENE-1476: ----------------------------------------- Mike McCandless: > So, net/net it seems like "deletes-as-a-filter" approach is compelling? In terms of CPU-cycles, maybe. My gut tells me that it's all but mandatory if we use merged-on-the-fly tombstone streams, but if Lucene goes that route it should cache a BitVector and use a shared pseudo-iterator -- in which case the costs will no longer be significantly more than the current system. Under the current system, I'm not certain that the deletions checks are that excessive. Consider this loop from TermDocs.read(): {code} while (i < length && count < df) { // manually inlined call to next() for speed final int docCode = freqStream.readVInt(); doc += docCode >>> 1; // shift off low bit if ((docCode & 1) != 0) // if low bit is set freq = 1; // freq is one else freq = freqStream.readVInt(); // else read freq count++; if (deletedDocs == null || !deletedDocs.get(doc)) { docs[i] = doc; freqs[i] = freq; ++i; } } {code} The CPU probably does a good job of predicting the result of the null check on deletedDocs. The readVInt() method call is already a pipeline killer. Here's how that loop looks after I patch the deletions check for pseudo-iteration. {code} while (i < length && count < df) { // manually inlined call to next() for speed final int docCode = freqStream.readVInt(); doc += docCode >>> 1; // shift off low bit if ((docCode & 1) != 0) // if low bit is set freq = 1; // freq is one else freq = freqStream.readVInt(); // else read freq count++; if (docNum >= nextDeletion) { if (docNum > nextDeletion) { nextDeletion = deletedDocs.nextSetBit(docNum); } if (docNum == nextDeletion) { continue; } } docs[i] = doc; freqs[i] = freq; ++i; } return i; } {code} Again, the CPU is probably going to do a pretty good job of predicting the results of the deletion check. And even then, we're accessing the same shared BitVector across all TermDocs, and its bits are hopefully a cache hit. To really tighten this loop, you have to do what Nate and I want with Lucy/KS: * Remove all function/method call overhead. * Operate directly on the memory mapped postings file. {code} SegPList_bulk_read(SegPostingList *self, i32_t *doc_nums, i32_t *freqs, u32_t request) { i32_t doc_num = self->doc_num; const u32_t remaining = self->doc_freq - self->count; const u32_t num_got = request < remaining ? request : remaining; char *buf = InStream_Buf(instream, C32_MAX_BYTES * num_got); u32_t i; for (i = 0; i < num_got; i++) { u32_t doc_code = Math_decode_c32(&buf); /* static inline function */ u32_t freq = (doc_code & 1) ? 1 : Math_decode_c32(&buf); doc_num += doc_code >> 1; doc_nums[i] = doc_num; freqs[i] = freq; ++i; } InStream_Advance_Buf(instream, buf); self->doc_num = doc_num; self->count += num_got; return num_got; } {code} (That loop would be even better using PFOR instead of vbyte.) In terms of public API, I don't think it's reasonable to change Lucene's Scorer and TermDocs classes so that their iterators start returning deleted docs. We could potentially make that choice with Lucy/KS, thus allowing us to remove the deletions check in the PostingList iterator (as above) and getting a potential speedup. But even then I hesitate to push the deletions API upwards into a space where users of raw Scorer and TermDocs classes have to deal with it -- especially since iterator-style deletions aren't very user-friendly. > BitVector implement DocIdSet > ---------------------------- > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Affects Versions: 2.4 > Reporter: Jason Rutherglen > Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org