On Tue, Aug 28, 2012 at 7:47 AM, Peter Karman <[email protected]> wrote:
> On 8/28/12 1:00 AM, Marvin Humphrey wrote:
>> FWIW, speed only matters for Matchers.
>
> even when iterating over a PostingList? That's where I was expecting the
> biggest perf hit. But that's with no evidence at all.
Oh, sheesh, you're right about that. Correction:
Speed generally doesn't matter for any activity that scales with the number of
segments, rather than the number of documents.
The make_matcher() method gets invoked once per segment, so it ordinarily is
not a performance bottleneck. But in your case, you're iterating over a
posting list within make_matcher(), an activity which scales with the number
of documents.
FWIW, if we make SortReader and SortCache public, there's a more efficient way
to do what you're doing: wrap a child Matcher and modify or override the score
depending on what we find is the value in the SortCache for the doc id being
scored.

    package MyMatcher;
    use base qw( Lucy::Search::Matcher );

    # Inside-out member variables, keyed by object address.
    my %child;
    my %sort_cache;

    sub new {
        my ( $class, %args ) = @_;
        my $sort_cache = delete $args{sort_cache};
        my $child      = delete $args{child};
        my $self       = $class->SUPER::new(%args);
        $sort_cache{$$self} = $sort_cache;
        $child{$$self}      = $child;
        return $self;
    }

    sub DESTROY {
        my $self = shift;
        delete $child{$$self};
        delete $sort_cache{$$self};
        $self->SUPER::DESTROY;
    }

    # Delegate next() and get_doc_id() to the child Matcher.
    sub next       { $child{ ${ +shift } }->next }
    sub get_doc_id { $child{ ${ +shift } }->get_doc_id }

    my %magic_scores = (
        a => 100,
        b => 200,
        c => 300,
        d => 400,
    );

    sub score {
        my $self = shift;

        # Try for special score.
        my $doc_id = $self->get_doc_id;
        my $ord    = $sort_cache{$$self}->ordinal($doc_id);
        my $value  = $sort_cache{$$self}->value($ord);
        if ($value) {
            my $magic_score = $magic_scores{$value};
            return $magic_score if $magic_score;
        }

        # Fall back to child Matcher's score.
        return $child{$$self}->score;
    }

Wrapping a child Query would also allow you to eliminate all that TFIDF stuff
in your subclass.
The same basic approach can work when fetching values with a DocReader instead
of a SortCache. Deserializing the entire document is slower than fetching a
single value from a sort cache, but DocReader is already public.
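
Here's a rough, untested-against-Lucy sketch of the DocReader variant of
score(). To keep it runnable without Lucy installed, it is a plain hash-based
class with stub objects rather than a Lucy::Search::Matcher subclass. The
class name DocReaderScorer, the StubChild and StubDocReader packages, and the
"category" field are all made up for illustration. The only thing it assumes
about the real API is that the doc reader's fetch_doc($doc_id) returns a
hash-like document.

```perl
use strict;
use warnings;

my %magic_scores = ( a => 100, b => 200, c => 300, d => 400 );

package DocReaderScorer;    # hypothetical name

sub new {
    my ( $class, %args ) = @_;
    return bless {
        child      => $args{child},         # wrapped Matcher (or stand-in)
        doc_reader => $args{doc_reader},    # anything with fetch_doc($doc_id)
        field      => $args{field},         # field holding the magic value
    }, $class;
}

# Delegate iteration to the child.
sub next       { $_[0]{child}->next }
sub get_doc_id { $_[0]{child}->get_doc_id }

sub score {
    my $self = shift;

    # Deserialize the whole document just to look at one field.
    my $doc   = $self->{doc_reader}->fetch_doc( $self->get_doc_id );
    my $value = $doc->{ $self->{field} };
    if ( defined $value && exists $magic_scores{$value} ) {
        return $magic_scores{$value};
    }

    # Fall back to the child's score.
    return $self->{child}->score;
}

# Stand-ins for a real child Matcher and DocReader (illustration only).
package StubChild;

sub new { my ( $class, @ids ) = @_; bless { ids => [@ids], cur => 0 }, $class }

sub next {
    my $self = shift;
    $self->{cur} = shift( @{ $self->{ids} } ) // 0;    # 0 means exhausted
    return $self->{cur};
}
sub get_doc_id { $_[0]{cur} }
sub score      { 1.5 }

package StubDocReader;

sub new       { my ( $class, %docs ) = @_; bless {%docs}, $class }
sub fetch_doc { $_[0]{ $_[1] } }

package main;

my $matcher = DocReaderScorer->new(
    child      => StubChild->new( 1, 2, 3 ),
    doc_reader => StubDocReader->new(
        1 => { category => 'b' },
        2 => { category => 'z' },
        3 => {},
    ),
    field => 'category',
);

my @scores;
while ( $matcher->next ) {
    push @scores, $matcher->score;
}
print "@scores\n";
```

Only doc 1 carries a magic value, so it alone gets the override; the other two
fall through to the child's score. The fetch_doc() call happens once per
scored document, which is where the extra cost relative to a sort cache comes
from.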
Marvin Humphrey