On Tue, Jan 15, 2013 at 12:08 AM, Nathan Kurz <[email protected]> wrote:
> On Sun, Jan 13, 2013 at 7:15 PM, Marvin Humphrey <[email protected]> wrote:
>> The problem with our current system is that even if we migrate to a block
>> encoding for our inverted lists, we can only use a fraction of the block
>> on each call to Matcher_Next().  Not only would we have to buffer the
>> decoded integers in memory between matches, but we would still continue
>> to make a large number of method calls for each match.
>
> I think I understand you, but I'm not sure how much of an issue the
> buffering is.  Maybe this has to do with what you mean by "in memory" --
> I think it would be very rare that the decoded integers actually get
> written to RAM.  The buffer would be at most a few K.  It would depend on
> how much data we access while scoring the match, but for an i7, unless we
> flush through the 8MB L3 cache, the decoded data will stay on the CPU.  I
> even think it will usually stay within our 64K L1.
That's probably a fair assessment.  For the record, I think we need to
allocate the buffers dynamically per-posting-list, rather than use a
"static" buffer as you specified originally -- but the query would still
have to get pretty big before the decoded posting list data would spill
out of the CPU cache.

> For the second part, are you distinguishing between method calls and
> function overhead here?

More like distinguishing between a loop which calls a method 100 times,
vs. a single invocation of a bulk-processing method which performs
equivalent work, but over 100 iterations of a branchless inner loop.

> While we should avoid overhead when we can, I think the cost of the
> scoring is going to be dominated by the memory access to the position
> data.  While waiting for each random access to memory (and unless we are
> crafty, each doc-term scored will have at least one) we could execute
> almost a thousand (pipelined) instructions.

I dunno about that.  I think we're still a long way from being
memory-bandwidth-bound, especially for complex queries with nested
logical constructs.  Right now, even for a simple TermQuery, we make
numerous method calls for each match.
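To make the contrast concrete, here's a toy sketch in C.  The types and
names (ToyMatcher etc.) are hypothetical stand-ins, not the real Lucy
classes -- the real Matchers dispatch through method calls, which is
exactly the overhead a bulk interface amortizes:

```c
#include <stdint.h>

/* Hypothetical stand-in for a posting-list matcher: a flat array of
 * pre-decoded doc IDs plus a cursor. */
typedef struct {
    const int32_t *docs;
    int32_t        size;
    int32_t        pos;
} ToyMatcher;

/* Per-match interface: one call (and, in real code, one method
 * dispatch) per document. */
static int32_t
ToyMatcher_Next(ToyMatcher *self) {
    if (self->pos >= self->size) { return 0; }
    return self->docs[self->pos++];
}

/* Bulk interface: a single call fills a caller-supplied buffer via a
 * tight inner loop, amortizing the call overhead over the whole block. */
static int32_t
ToyMatcher_Bulk_Next(ToyMatcher *self, int32_t *buf, int32_t max) {
    int32_t count = 0;
    while (count < max && self->pos < self->size) {
        buf[count++] = self->docs[self->pos++];
    }
    return count;
}
```

Driving ToyMatcher_Next() from a loop does the same work as one call to
ToyMatcher_Bulk_Next(), but pays the call overhead once per doc rather
than once per block.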
At a minimum, a single iteration through the Matcher#Collect loop for a
TermMatcher (actually a ScorePostingMatcher) sorting by score triggers
these invocations:

    METHOD                      CALL SITE               IMPLEMENTATION
    ---------------------------------------------------------------------
    TermMatcher#Advance         http://s.apache.org/qR  http://s.apache.org/wj1
    SegPostingList#Advance      http://s.apache.org/lKp http://s.apache.org/DGr
    SegPostingList#Next         http://s.apache.org/FCF http://s.apache.org/xz
    ScorePosting#Read_Record    http://s.apache.org/P9z http://s.apache.org/4JG
    InStream#Buf                http://s.apache.org/v6  http://s.apache.org/DdG
    InStream#Advance_Buf        http://s.apache.org/KiH http://s.apache.org/ZFl
    InStream#Buf                http://s.apache.org/Xzb http://s.apache.org/DdG
    InStream#Advance_Buf        http://s.apache.org/y6  http://s.apache.org/ZFl
    SegPostingList#GetPosting   http://s.apache.org/Ri6 http://s.apache.org/vi
    SortCollector#Collect       http://s.apache.org/r0  http://s.apache.org/k7
    ScorePostingMatcher#Score   http://s.apache.org/bgo http://s.apache.org/NVU

(That's assuming that a deleted doc was not encountered and that the
score was low enough that the SortCollector discarded the hit.)

>> The layout of BucketTable could take many forms.  In Lucene, it's a
>> fixed-size object with set fields -- but we might choose to make it a
>> block of opaque memory with private space allocated for every
>> subscorer... or in some configurations, it could even be a bit
>> vector[2]. :)
>
> Could you explain further, perhaps with a more detailed example?

ORMatcher's Bulk_Score() method might look something like this:

    int32_t
    ORMatcher_bulk_score(ORMatcher *self, MatchAccumulator *accumulator,
                         int32_t start) {
        const int32_t num_kids = VA_Get_Size(self->children);
        int32_t count = 0;

        // Accumulate doc IDs and scores of child matchers into accumulator.
        for (int32_t i = 0; i < num_kids; i++) {
            Matcher *child = (Matcher*)VA_Fetch(self->children, i);
            Matcher_Bulk_Score(child, accumulator, start);
        }

        // Follow linked list of matches, add up scores of child matchers.
        for (int32_t tick = accumulator->first; tick != 0; ) {
            ScoringSlot *const slot = accumulator->slots[tick];
            for (int32_t i = 0; i < num_kids; i++) {
                slot->score += slot->subscores[i];
            }
            count++;
            tick = slot->next;
        }

        return count;
    }

Does that make sense, or should I sketch out some other bulk methods?

> Is the gain that each subscorer can score multiple subqueries in a loop,
> rather than being called by a loop?

Yes, exactly.

Marvin Humphrey
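For reference, the Bulk_Score() sketch above assumes an accumulator laid
out roughly as follows.  Every name here (MatchAccumulator, ScoringSlot,
MAX_KIDS, count_matches) is hypothetical -- the real BucketTable layout
is an open question, per the discussion above -- but the linked-list
walk mirrors the second loop in the sketch:

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_KIDS 8  /* hypothetical cap on the number of child matchers */

/* One slot per candidate doc in the current block.  A `next` of 0
 * terminates the linked list, so tick 0 acts as a sentinel. */
typedef struct ScoringSlot {
    float   score;                /* combined score, summed during the walk */
    float   subscores[MAX_KIDS];  /* one partial score per child matcher    */
    int32_t next;                 /* tick of the next matching slot; 0 = end */
} ScoringSlot;

typedef struct MatchAccumulator {
    ScoringSlot **slots;  /* indexed by tick, as in the sketch above */
    int32_t       first;  /* tick of the first matching slot */
} MatchAccumulator;

/* Walk the linked list of matches the same way the second loop in
 * ORMatcher_bulk_score() does, folding subscores into each slot. */
static int32_t
count_matches(MatchAccumulator *acc, int32_t num_kids) {
    int32_t count = 0;
    for (int32_t tick = acc->first; tick != 0; ) {
        ScoringSlot *const slot = acc->slots[tick];
        for (int32_t i = 0; i < num_kids; i++) {
            slot->score += slot->subscores[i];
        }
        count++;
        tick = slot->next;
    }
    return count;
}
```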
