[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

Marvin Humphrey (JIRA) Thu, 08 Jan 2009 14:13:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662143#action_12662143
 ]


Marvin Humphrey commented on LUCENE-1476:
-----------------------------------------

Mike McCandless:

> So, net/net it seems like "deletes-as-a-filter" approach is compelling?

In terms of CPU-cycles, maybe.  

My gut tells me that it's all but mandatory if we use merged-on-the-fly
tombstone streams, but if Lucene goes that route it should cache a BitVector
and use a shared pseudo-iterator -- in which case the costs will no longer be
significantly more than the current system.

Under the current system, I'm not certain that the deletions checks are that
excessive.  Consider this loop from TermDocs.read():

{code}
    while (i < length && count < df) {
      // manually inlined call to next() for speed
      final int docCode = freqStream.readVInt();
      doc += docCode >>> 1;       // shift off low bit
      if ((docCode & 1) != 0)       // if low bit is set
        freq = 1;         // freq is one
      else
        freq = freqStream.readVInt();     // else read freq
      count++;

      if (deletedDocs == null || !deletedDocs.get(doc)) {
        docs[i] = doc;
        freqs[i] = freq;
        ++i;
      }
    }
{code}

The CPU probably does a good job of predicting the result of the null check on
deletedDocs.  The readVInt() method call is already a pipeline killer.

Here's how that loop looks after I patch the deletions check for 
pseudo-iteration.

{code}
      while (i < length && count < df) {
        // manually inlined call to next() for speed
        final int docCode = freqStream.readVInt();
        doc += docCode >>> 1;       // shift off low bit
        if ((docCode & 1) != 0)       // if low bit is set
          freq = 1;         // freq is one
        else
          freq = freqStream.readVInt();     // else read freq
        count++;

        if (docNum >= nextDeletion) {
            if (docNum > nextDeletion) {
              nextDeletion = deletedDocs.nextSetBit(docNum);
            }
            if (docNum == nextDeletion) {
              continue;
            }
        }

        docs[i] = doc;
        freqs[i] = freq;
        ++i;
      }
      return i;
    }
{code}

Again, the CPU is probably going to do a pretty good job of predicting the
results of the deletion check.  And even then, we're accessing the same shared
BitVector across all TermDocs, and its bits are hopefully a cache hit.

To really tighten this loop, you have to do what Nate and I want with Lucy/KS:

  * Remove all function/method call overhead.
  * Operate directly on the memory mapped postings file.

{code}
SegPList_bulk_read(SegPostingList *self, i32_t *doc_nums, i32_t *freqs,
                   u32_t request)
{
    i32_t       doc_num   = self->doc_num;
    const u32_t remaining = self->doc_freq - self->count;
    const u32_t num_got   = request < remaining ? request : remaining;
    char       *buf       = InStream_Buf(instream, C32_MAX_BYTES * num_got);
    u32_t       i;

    for (i = 0; i < num_got; i++) {
        u32_t doc_code = Math_decode_c32(&buf); /* static inline function */
        u32_t freq     = (doc_code & 1) ? 1 : Math_decode_c32(&buf);
        doc_num        += doc_code >> 1; 
        doc_nums[i]    = doc_num;
        freqs[i]       = freq;
        ++i;
    }

    InStream_Advance_Buf(instream, buf);
    self->doc_num = doc_num;
    self->count += num_got;

    return num_got;
}
{code}

(That loop would be even better using PFOR instead of vbyte.)

In terms of public API, I don't think it's reasonable to change Lucene's
Scorer and TermDocs classes so that their iterators start returning deleted
docs.  

We could potentially make that choice with Lucy/KS, thus allowing us to remove
the deletions check in the PostingList iterator (as above) and getting a
potential speedup.  But even then I hesitate to push the deletions API upwards
into a space where users of raw Scorer and TermDocs classes have to deal with
it -- especially since iterator-style deletions aren't very user-friendly.


> BitVector implement DocIdSet
> ----------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

Reply via email to