On Tue, Aug 28, 2012 at 7:47 AM, Peter Karman <[email protected]> wrote:
> On 8/28/12 1:00 AM, Marvin Humphrey wrote:
>> FWIW, speed only matters for Matchers.
>
> even when iterating over a PostingList? That's where I was expecting the
> biggest perf hit. But that's with no evidence at all.

Oh, sheesh, you're right about that.  Correction:

Speed generally doesn't matter for any activity that scales with the number of
segments, rather than the number of documents.

The make_matcher() method gets invoked once per segment, so it ordinarily is
not a performance bottleneck.  But in your case, you're iterating over a
posting list within make_matcher(), an activity which scales with the number
of documents.

FWIW, if we make SortReader and SortCache public, there's a more efficient way
to do what you're doing: wrap a child Matcher and modify or override the score
depending on what we find is the value in the SortCache for the doc id being
scored.

    package MyMatcher;
    use base qw( Lucy::Search::Matcher );

    # Inside-out member variables, keyed by object address ($$self).
    my %child;
    my %sort_cache;

    sub new {
        my ($class, %args) = @_;
        my $sort_cache = delete $args{sort_cache};
        my $child      = delete $args{child};
        my $self       = $class->SUPER::new(%args);
        $sort_cache{$$self} = $sort_cache;
        $child{$$self}      = $child;
        return $self;
    }

    sub DESTROY {
        my $self = shift;
        delete $child{$$self};
        delete $sort_cache{$$self};
        $self->SUPER::DESTROY;
    }

    # Delegate next() and get_doc_id() to the child Matcher.
    sub next       { $child{ ${+shift} }->next }
    sub get_doc_id { $child{ ${+shift} }->get_doc_id }

    my %magic_scores = (
        a => 100,
        b => 200,
        c => 300,
        d => 400,
    );

    sub score {
        my $self = shift;

        # Try for special score.
        my $doc_id = $self->get_doc_id;
        my $ord    = $sort_cache{$$self}->ordinal($doc_id);
        my $value  = $sort_cache{$$self}->value($ord);
        if ($value) {
            my $magic_score = $magic_scores{$value};
            return $magic_score if $magic_score;
        }

        # Fall back to child Matcher's score.
        return $child{$$self}->score;
    }

Wrapping a child Query would also allow you to eliminate all that TFIDF stuff
in your subclass.

The same basic approach can work when fetching values with a DocReader instead
of a SortCache.  Deserializing the entire document is slower than fetching a
single value from a sort cache, but DocReader is already public.

Marvin Humphrey
