On Sat, Jan 12, 2013 at 12:01 PM, Dan <[email protected]> wrote:
> Even more challenging than understanding the papers
> is how best to incorporate them into Lucy as efficiently as we can.
The defining stratagem of the "patched" family of "lightweight compression"
algorithms[1] is to maximize instruction-level parallelism by avoiding control
hazards and minimizing data hazards in inner loops. To make full use of this
device, Lucy's doc-at-a-time matching mechanism needs to be modified along
these lines:
http://oai.cwi.nl/oai/asset/15564/15564B.pdf [section 2.3]
Contrary to other implementations, the next() method yields not one new
tuple, but a vector -- an array typically consisting of a few hundreds of
values. Consequently, the primitive functions, called by the relational
operators to perform computations, are tight loops that perform very
simple operations (e.g. addition) on arrays of values, producing arrays of
results. The benefits are: (i) the compiler can use loop-pipelining, and
(ii) function call cost (in general around 20 cycles) is only paid once
per vector.
The problem with our current system is that even if we migrate to a block
encoding for our inverted lists, we can only use a fraction of the block on
each call to Matcher_Next(). Not only would we have to buffer the decoded
integers in memory between matches, but we would also still make a large
number of method calls for each match.
A couple months back, Nate proposed an "interchange format" which posting list
decoders could write into at high speed:
http://markmail.org/message/gckwc5wcdy7gqo5s
I still think that there is a fine solution where we define an interchange
data format that each plugin decodes to, and each scorer knows how to read
from. Paying particular attention to cache lines, the plugin fetches a
compressed format from memory, and decodes it to a static buffer with as
few mispredicted branches as possible. The scorer then operates on this
decoded block directly, no layers involved. So long as this static buffer
lives in L1, the matching runs at CPU speed rather than RAM speed, and by
having a fixed format, the scorers can be optimized in advance. If we can
sufficiently index into our format so as to avoid unnecessary
reads/decodes, I think we can fly -- no HotSpot or MatchEngine
required.
This is helpful to minimize the cost of decoding posting lists, but it does
not address the cost of doc-at-a-time Matcher method-call overhead.
How about we move the interchange format to a later stage, so that Matchers
can pass around doc id and scoring data in bulk? That's apparently what
MonetDB does, with its next() call that returns a "vector" instead of a
single tuple.
The old Lucene BooleanScorer has multiple serious limitations, but it uses a
multi-slot accumulator called a _BucketTable_ which might serve as a point of
departure.
http://s.apache.org/t5d
/* Description from Doug Cutting (excerpted from LUCENE-1483):
*
* BooleanScorer uses an array to score windows of 2K docs. So it scores
* docs 0-2K first, then docs 2K-4K, etc. For each window it iterates
* through all query terms and accumulates a score in table[doc%2K]. It
* also stores in the table a bitmask representing which terms contributed
* to the score. Non-zero scores are chained in a linked list. At the end
* of scoring each window it then iterates through the linked list and, if
* the bitmask matches the boolean constraints, collects a hit. For
* boolean queries with lots of frequent terms this can be much faster,
* since it does not need to update a priority queue for each posting,
* instead performing constant-time operations per posting. The only
* downside is that it results in hits being delivered out-of-order within
* the window, which means it cannot be nested within other scorers. But
* it works well as a top-level scorer.
*
* The new BooleanScorer2 implementation instead works by merging priority
* queues of postings, albeit with some clever tricks. For example, a pure
* conjunction (all terms required) does not require a priority queue.
* Instead it sorts the posting streams at the start, then repeatedly
* skips the first to the last. If the first ever equals the last, then
* there's a hit. When some terms are required and some terms are
* optional, the conjunction can be evaluated first, then the optional
* terms can all skip to the match and be added to the score. Thus the
* conjunction can reduce the number of priority queue updates for the
* optional terms. */
The BucketTable has 2048 slots; BooleanScorer runs each subscorer over a
2048-doc-id range before moving on to the next. To cut down on the number of
method calls, we would need to pass the BucketTable to each subscorer to be
written into directly.
The layout of BucketTable could take many forms. In Lucene, it's a fixed-size
object with set fields -- but we might choose to make it a block of opaque
memory with private space allocated for every subscorer... or in some
configurations, it could even be a bit vector[2]. :)
Marvin Humphrey
[1] "Patched" family of codecs:
PFOR => Patched Frame Of Reference.
PFOR-DELTA => Patched Frame Of Reference with delta encoding.
PDICT => Patched dictionary compression.
[2] Dan suggested this possibility during a discussion in the office.