Marvin Humphrey wrote on 8/28/12 11:14 AM:
> On Tue, Aug 28, 2012 at 7:47 AM, Peter Karman <[email protected]> wrote:
>> On 8/28/12 1:00 AM, Marvin Humphrey wrote:
>>> FWIW, speed only matters for Matchers.
>>
>> even when iterating over a PostingList? That's where I was expecting the
>> biggest perf hit. But that's with no evidence at all.
>
> Oh, sheesh, you're right about that. Correction:
>
> Speed generally doesn't matter for any activity that scales with the number of
> segments, rather than the number of documents.
>
> The make_matcher() method gets invoked once per segment, so it ordinarily is
> not a performance bottleneck. But in your case, you're iterating over a
> posting list within make_matcher(), an activity which scales with the number
> of documents.
/me nods
>
> FWIW, if we make SortReader and SortCache public, there's a more efficient way
> to do what you're doing: wrap a child Matcher and modify or override the score
> depending on what we find is the value in the SortCache for the doc id being
> scored.
>
>     package MyMatcher;
>     use base qw( Lucy::Search::Matcher );
>
>     my %child;
>     my %sort_cache;
>
>     sub new {
>         my ($class, %args) = @_;
>         my $sort_cache = delete $args{sort_cache};
>         my $child      = delete $args{child};

in the current class setup, I would need something like this in make_matcher()
in MyCompiler then?

    sub make_matcher {
        my $self = shift;
        my $child_matcher = $self->SUPER::make_matcher(@_);
        return MyMatcher->new( child => $child_matcher );
    }

or am I misunderstanding?

I don't think that SUPER::make_matcher() call works unless I've inherited from
TermCompiler instead of just Compiler.

>         my $self = $class->SUPER::new(%args);
>         $sort_cache{$$self} = $sort_cache;
>         $child{$$self}      = $child;
>         return $self;
>     }
>
>     sub DESTROY {
>         my $self = shift;
>         delete $child{$$self};
>         delete $sort_cache{$$self};
>         $self->SUPER::DESTROY;
>     }
>
>     # Delegate next() and get_doc_id() to the child Matcher.
>     sub next       { $child{ ${+shift} }->next }
>     sub get_doc_id { $child{ ${+shift} }->get_doc_id }
>
>     my %magic_scores = (
>         a => 100,
>         b => 200,
>         c => 300,
>         d => 400,
>     );
>
>     sub score {
>         my $self = shift;
>
>         # Try for special score.
>         my $doc_id = $self->get_doc_id;
>         my $ord    = $sort_cache{$$self}->ordinal($doc_id);
>         my $value  = $sort_cache{$$self}->value($ord);
>         if ($value) {
>             my $magic_score = $magic_scores{$value};
>             return $magic_score if $magic_score;
>         }
>
>         # Fall back to child Matcher's score.
>         return $child{$$self}->score;
>     }
>
> Wrapping a child Query would also allow you to eliminate all that TFIDF stuff
> in your subclass.
>
> The same basic approach can work when fetching values with a DocReader instead
> of a SortCache. There would be some kind of speed loss for deserializing the
> entire document rather than fetching a value from a sort cache, but DocReader
> is already public.
>
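
To check my understanding of the DocReader variant, here is a rough sketch of
the wrapping idea in plain Perl. It does not load Lucy at all: StubMatcher and
StubDocReader are hypothetical stand-ins I made up so the thing runs on its
own, and the "category" field name is likewise an assumption. The only part
meant to carry over is the shape of score(): fetch the doc, look up the field
value, and fall back to the child's score when there is no magic score.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a child Matcher: walks a fixed list of
# [ doc_id, score ] pairs. next() returns 0 once the list is exhausted.
package StubMatcher;
sub new {
    my ( $class, @hits ) = @_;
    return bless { hits => [@hits], tick => -1 }, $class;
}
sub next {
    my $self = shift;
    $self->{tick}++;
    return 0 if $self->{tick} >= @{ $self->{hits} };
    return $self->{hits}[ $self->{tick} ][0];
}
sub get_doc_id { $_[0]{hits}[ $_[0]{tick} ][0] }
sub score      { $_[0]{hits}[ $_[0]{tick} ][1] }

# Hypothetical stand-in for a doc reader: maps doc id => hashref of fields.
package StubDocReader;
sub new       { my ( $class, %docs ) = @_; return bless {%docs}, $class }
sub fetch_doc { $_[0]{ $_[1] } }

# The wrapper: delegates iteration to the child, overrides the score.
package WrappingMatcher;
my %magic_scores = ( a => 100, b => 200, c => 300, d => 400 );

sub new        { my ( $class, %args ) = @_; return bless {%args}, $class }
sub next       { $_[0]{child}->next }
sub get_doc_id { $_[0]{child}->get_doc_id }

sub score {
    my $self = shift;

    # Try for a special score based on the stored field value.
    my $doc   = $self->{doc_reader}->fetch_doc( $self->get_doc_id );
    my $value = $doc ? $doc->{ $self->{field} } : undef;
    if ( defined $value ) {
        my $magic_score = $magic_scores{$value};
        return $magic_score if $magic_score;
    }

    # Fall back to the child's score.
    return $self->{child}->score;
}

package main;

# Drain a matcher into a list of [ doc_id, score ] pairs.
sub collect_scores {
    my $matcher = shift;
    my @out;
    while ( my $doc_id = $matcher->next ) {
        push @out, [ $doc_id, $matcher->score ];
    }
    return @out;
}

my $matcher = WrappingMatcher->new(
    child      => StubMatcher->new( [ 1, 0.5 ], [ 2, 0.7 ], [ 3, 0.9 ] ),
    doc_reader => StubDocReader->new(
        1 => { category => 'b' },
        2 => { category => 'z' },
        3 => {},
    ),
    field => 'category',
);
printf "doc %d => %s\n", @$_ for collect_scores($matcher);
```

Here doc 1 picks up the magic score for 'b', while docs 2 and 3 keep the
child's score because 'z' has no entry and doc 3 has no value for the field.
The real thing would subclass Lucy::Search::Matcher and use inside-out hashes
as in your example rather than a blessed hash.
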
I like the idea of making SortReader and SortCache public, just because I
imagine it would be nice to abuse them like this for other reasons. But I
understand the once-public-always-public threshold.
--
Peter Karman . http://peknet.com/ . [email protected]