On Fri, Aug 24, 2012 at 11:33 AM, Peter Karman <[email protected]> wrote:
> Just ran across this:
>
> http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
>
> comments on whether there is anything to be learned here for Lucy?

FWIW: it's not public, but we have a pluggable postings API in Lucy which
would allow us to prototype a block posting format such as PFORdelta.

As far as what we might have learned from the Lucene implementation, I think
there are two things:

1.  Block posting formats are more suitable for modern pipelining processors
    than the bytewise compression we use now.
2.  Pluggable posting formats are the wrong level of abstraction.

If we transition to a block posting format like PFORdelta, we will save on
decoding posting lists, but gains will be limited because we still have to
access the decoded integers one at a time inside TermMatchers wrapped in
expensive ORMatchers, etc.  (If you look at profiling data for our
scoring/matching methods, the bottlenecks are not in InStream, but in higher
level Matchers like OrMatcher etc.)

Java has an advantage here because hotspot compilers enable inlining of method
calls in a way we can't duplicate from C -- so those higher level calls may
get streamlined away to an extent.  However, we can change our architecture so
that it becomes possible to inline procedures manually.

If we really want to tear it up, I believe we need to implement the
"MatchEngine" level of abstraction, as proposed on this list a while back.

     my $query    = $queryparser->parse($query_string);
-    my $compiler = $query->make_compiler(searcher => $searcher);
+    my $compiler = $match_engine->make_compiler(
+        query    => $query,
+        searcher => $searcher,
+    );

MatchEngine abstracts the entire matching and scoring engine.  If we make a
MatchEngine implementation a closed system, then all potential Matcher classes
become known, and we can build specialized loop code.

    if (Matcher_Get_VTable(sub_matcher) == PHRASEMATCHER) {
        /* tight loop optimized for PhraseMatcher goes here */
    }
    else if (Matcher_Get_VTable(sub_matcher) == TERMMATCHER) {
        /* tight loop optimized for PhraseMatcher goes here */
    }
    ...

If we are able to e.g. discover that an ORMatcher consists of only two
TermMatcher children, we can reach in and grab their posting lists and access
the integer blocks directly rather than having to go through an OO interface
that is flexible, but comparatively expensive.

At that point, the gains yielded by block posting formats would loom larger.

Marvin Humphrey

Reply via email to