On Tue, Jan 15, 2013 at 12:08 AM, Nathan Kurz <[email protected]> wrote:
> On Sun, Jan 13, 2013 at 7:15 PM, Marvin Humphrey <[email protected]> wrote:
>> The problem with our current system is that even if we migrate to a block
>> encoding for our inverted lists, we can only use a fraction of the block on
>> each call to Matcher_Next().  Not only would we have to buffer the decoded
>> integers in memory between matches, but we would still continue to make a
>> large number of method calls for each match.
>
> I think I understand you, but I'm not sure how much of an issue the
> buffering is.   Maybe this has to do with what you mean by "in memory"
> --- I think it would be very rare that the decoded integers actually
> get written to RAM.  The buffer would be at most a few K.  It would
> depend on how much data we access while scoring the match, but for an i7
> unless we flush through the 8MB L3 cache, the decoded data will stay
> on the CPU.  I even think it will usually stay within our 64K L1.

That's probably a fair assessment.

For the record, I think we need to allocate the buffers dynamically
per-posting-list, rather than use a "static" buffer as you specified
originally -- but the query would still have to get pretty big before the
decoded posting list data would spill out of the CPU cache.
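To make the sizing concrete, here's a minimal sketch of a per-posting-list
decode buffer -- the struct name, BLOCK_SIZE, and the functions are all
assumptions for illustration, not actual Lucy code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_SIZE 128  /* hypothetical number of postings decoded per bulk call */

/* Each posting list gets its own decode buffer, allocated when the
 * matcher is created rather than shared statically.  A query with N
 * terms then holds N * BLOCK_SIZE * sizeof(uint32_t) bytes of hot
 * decode space -- small enough to stay cache-resident until the query
 * gets quite large. */
typedef struct {
    uint32_t *doc_ids;  /* decoded doc IDs for the current block */
    size_t    count;    /* number of valid entries in the block  */
} PostingBuffer;

static PostingBuffer*
PostingBuffer_new(void) {
    PostingBuffer *self = malloc(sizeof(PostingBuffer));
    self->doc_ids = malloc(BLOCK_SIZE * sizeof(uint32_t));
    self->count   = 0;
    return self;
}

static void
PostingBuffer_destroy(PostingBuffer *self) {
    free(self->doc_ids);
    free(self);
}
```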

> For the second part, are you distinguishing between method calls and
> function overhead here?

More like distinguishing between a loop which calls a method 100 times, vs. a
single invocation of a bulk-processing method which performs equivalent work,
but over 100 iterations of a branchless inner loop.
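The contrast can be sketched like this -- the PostingList struct and both
functions below are hypothetical illustrations of the two calling styles,
not Lucy's actual API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy posting list over delta-encoded doc IDs. */
typedef struct {
    const uint32_t *deltas;  /* delta-encoded doc IDs     */
    size_t          size;    /* total number of postings  */
    size_t          tick;    /* read position             */
    uint32_t        doc_id;  /* last decoded doc ID       */
} PostingList;

/* Per-match style: one call per posting, so the caller pays call
 * overhead (and in the real code, virtual dispatch) for every match. */
static int
PList_Next(PostingList *self) {
    if (self->tick >= self->size) { return 0; }
    self->doc_id += self->deltas[self->tick++];
    return 1;
}

/* Bulk style: one call decodes up to `max` postings, with all the work
 * done inside a tight, easily pipelined inner loop. */
static size_t
PList_Bulk_Next(PostingList *self, uint32_t *doc_ids, size_t max) {
    size_t remaining = self->size - self->tick;
    size_t count     = remaining < max ? remaining : max;
    for (size_t i = 0; i < count; i++) {
        self->doc_id += self->deltas[self->tick++];
        doc_ids[i]    = self->doc_id;
    }
    return count;
}
```

Both produce the same doc IDs; the difference is how many function
boundaries get crossed per match.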

> While we should avoid overhead when we can, I
> think the cost of the scoring is going to be dominated by the memory
> access to the position data.  While waiting for each random access to
> memory (and unless we are crafty each doc-term scored will at least
> have one) we could execute almost a thousand (pipelined) instructions.

I dunno about that.  I think we're still a long way from being
memory-bandwidth-bound, especially for complex queries with nested logical
constructs.  Right now, even for a simple TermQuery, we make numerous method
calls for each match.  At a minimum, a single iteration through the
Matcher#Collect loop for a TermMatcher (actually a ScorePostingMatcher)
sorting by score triggers these invocations:

  METHOD                     CALL SITE                IMPLEMENTATION
  ---------------------------------------------------------------------------
  TermMatcher#Advance        http://s.apache.org/qR   http://s.apache.org/wj1
  SegPostingList#Advance     http://s.apache.org/lKp  http://s.apache.org/DGr
  SegPostingList#Next        http://s.apache.org/FCF  http://s.apache.org/xz
  ScorePosting#Read_Record   http://s.apache.org/P9z  http://s.apache.org/4JG
  InStream#Buf               http://s.apache.org/v6   http://s.apache.org/DdG
  InStream#Advance_Buf       http://s.apache.org/KiH  http://s.apache.org/ZFl
  InStream#Buf               http://s.apache.org/Xzb  http://s.apache.org/DdG
  InStream#Advance_Buf       http://s.apache.org/y6   http://s.apache.org/ZFl
  SegPostingList#GetPosting  http://s.apache.org/Ri6  http://s.apache.org/vi
  SortCollector#Collect      http://s.apache.org/r0   http://s.apache.org/k7
  ScorePostingMatcher#Score  http://s.apache.org/bgo  http://s.apache.org/NVU

(That's assuming that a deleted doc was not encountered and that the score was
low enough that the SortCollector discarded the hit.)

>> The layout of BucketTable could take many forms.  In Lucene, it's a fixed 
>> size
>> object with set fields -- but we might choose to make it a block of opaque
>> memory with private space allocated for every subscorer... or in some
>> configurations, it could even be a bit vector[2]. :)
>
> Could you explain further, perhaps with a more detailed example?

ORMatcher's Bulk_Score() method might look something like this:

    int32_t
    ORMatcher_bulk_score(ORMatcher *self, MatchAccumulator *accumulator,
                         int32_t start) {
        const int32_t num_kids = VA_Get_Size(self->children);
        int32_t count = 0;

        // Accumulate doc IDs and scores of child matchers into accumulator.
        for (int32_t i = 0; i < num_kids; i++) {
            Matcher *child = (Matcher*)VA_Fetch(self->children, i);
            Matcher_Bulk_Score(child, accumulator, start);
        }

        // Follow linked list of matches, add up scores of child matchers.
        for (int32_t tick = accumulator->first; tick != 0; ) {
            ScoringSlot *const slot = &accumulator->slots[tick];
            for (int32_t i = 0; i < num_kids; i++) {
                slot->score += slot->subscores[i];
            }
            count++;
            tick = slot->next;
        }

        return count;
    }
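For concreteness, here's one way the MatchAccumulator and ScoringSlot used
above might be laid out.  All of the names, fields, and the MAX_KIDS cap are
assumptions for illustration, not actual Lucy code; slot 0 is reserved as
the list terminator, which is why the loop above tests `tick != 0`:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_KIDS 4  /* assumed fixed cap on child matchers, for illustration */

/* One slot per doc ID in the current block of candidate matches. */
typedef struct {
    float   score;                /* combined score for this doc         */
    float   subscores[MAX_KIDS];  /* one partial score per child matcher */
    int32_t next;                 /* next occupied slot; 0 terminates    */
} ScoringSlot;

/* Private scoring workspace handed down to the child matchers, which
 * fill in their subscores and thread occupied slots into a linked list. */
typedef struct {
    int32_t      first;  /* head of the linked list of occupied slots */
    ScoringSlot *slots;  /* slot 0 reserved as the list terminator    */
} MatchAccumulator;
```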

Does that make sense, or should I sketch out some other bulk methods?

> Is the gain that each subscorer can score multiple subqueries in a loop,
> rather than being called by a loop?

Yes, exactly.

Marvin Humphrey
