Marvin Humphrey wrote on 8/28/12 11:14 AM:
> On Tue, Aug 28, 2012 at 7:47 AM, Peter Karman <[email protected]> wrote:
>> On 8/28/12 1:00 AM, Marvin Humphrey wrote:
>>> FWIW, speed only matters for Matchers.
>>
>> even when iterating over a PostingList? That's where I was expecting the
>> biggest perf hit. But that's with no evidence at all.
> 
> Oh, sheesh, you're right about that.  Correction:
> 
> Speed generally doesn't matter for any activity that scales with the number of
> segments, rather than the number of documents.
> 
> The make_matcher() method gets invoked once per segment, so it ordinarily is
> not a performance bottleneck.  But in your case, you're iterating over a
> posting list within make_matcher(), an activity which scales with the number
> of documents.


/me nods


> 
> FWIW, if we make SortReader and SortCache public, there's a more efficient way
> to do what you're doing: wrap a child Matcher and modify or override the score
> depending on what we find is the value in the SortCache for the doc id being
> scored.
> 
>     package MyMatcher;
>     use base qw( Lucy::Search::Matcher );
> 
>     my %child;
>     my %sort_cache;
> 
>     sub new {
>         my ($class, %args) = @_;
>         my $sort_cache = delete $args{sort_cache};
>         my $child      = delete $args{child};

In the current class setup, would I then need something like this in
make_matcher() in MyCompiler?

 sub make_matcher {
     my $self = shift;
     my $child_matcher = $self->SUPER::make_matcher(@_);
     return MyMatcher->new( child => $child_matcher );
 }

or am I misunderstanding?

I don't think that SUPER::make_matcher() call works unless I've inherited from
TermCompiler instead of just Compiler.


>         my $self       = $class->SUPER::new(%args);
>         $sort_cache{$$self} = $sort_cache;
>         $child{$$self}      = $child;
>         return $self;
>     }
> 
>     sub DESTROY {
>         my $self = shift;
>         delete $child{$$self};
>         delete $sort_cache{$$self};
>         $self->SUPER::DESTROY;
>     }
> 
>     # Delegate next() and get_doc_id() to the child Matcher.
>     sub next       { $child{ ${+shift} }->next }
>     sub get_doc_id { $child{ ${+shift} }->get_doc_id }
> 
>     my %magic_scores = (
>         a => 100,
>         b => 200,
>         c => 300,
>         d => 400,
>     );
> 
>     sub score {
>         my $self = shift;
> 
>         # Try for special score.
>         my $doc_id = $self->get_doc_id;
>         my $ord    = $sort_cache{$$self}->ordinal($doc_id);
>         my $value  = $sort_cache{$$self}->value($ord);
>         if ($value) {
>             my $magic_score = $magic_scores{$value};
>             return $magic_score if $magic_score;
>         }
> 
>         # Fall back to child Matcher's score.
>         return $child{$$self}->score;
>     }
> 
> Wrapping a child Query would also allow you to eliminate all that TFIDF stuff
> in your subclass.
> 
> The same basic approach can work when fetching values with a DocReader instead
> of a SortCache.  There would be some kind of speed loss for deserializing the
> entire document rather than fetching a value from a sort cache, but DocReader
> is already public.
> 


I like the idea of making SortReader and SortCache public, just because I
imagine it would be nice to abuse them like this for other reasons. But I
understand the once-public-always-public threshold.
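
For reference, the wrap-and-delegate idea can be tried without Lucy at all.
What follows is only a rough sketch under stated assumptions: StubMatcher,
StubSortCache, WrappingMatcher and collect_scores are hypothetical names made
up for illustration, not Lucy classes, and the stubs only mimic the next(),
get_doc_id(), score(), ordinal() and value() calls used in the quoted code.
It also uses plain hash-based objects instead of the inside-out hashes keyed
on $$self, which real Matcher subclasses would still need.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a child Matcher (not a Lucy class).
package StubMatcher;
sub new {
    my ($class, %args) = @_;
    return bless { docs => $args{docs}, pos => -1 }, $class;
}
# Advance to the next doc; return its id, or 0 when exhausted.
sub next {
    my $self = shift;
    $self->{pos}++;
    return $self->get_doc_id;
}
sub get_doc_id {
    my $self = shift;
    return 0 if $self->{pos} < 0;
    my $doc = $self->{docs}[ $self->{pos} ];
    return $doc ? $doc->{id} : 0;
}
sub score {
    my $self = shift;
    return $self->{docs}[ $self->{pos} ]{score};
}

# Hypothetical stand-in for a SortCache: doc id -> ordinal -> value.
package StubSortCache;
sub new {
    my ($class, %args) = @_;
    return bless { ords => $args{ords}, values => $args{values} }, $class;
}
sub ordinal { $_[0]{ords}{ $_[1] } }
sub value   { $_[0]{values}[ $_[1] ] }

# Wraps a child matcher and overrides the score for "magic" values.
package WrappingMatcher;
my %magic_scores = ( a => 100, b => 200 );
sub new {
    my ($class, %args) = @_;
    return bless { child => $args{child}, sort_cache => $args{sort_cache} },
        $class;
}
# Delegate iteration to the child.
sub next       { $_[0]{child}->next }
sub get_doc_id { $_[0]{child}->get_doc_id }
sub score {
    my $self = shift;
    my $ord  = $self->{sort_cache}->ordinal( $self->get_doc_id );
    if ( defined $ord ) {
        my $value = $self->{sort_cache}->value($ord);
        my $magic = defined $value ? $magic_scores{$value} : undef;
        return $magic if $magic;
    }
    # Fall back to the child's score.
    return $self->{child}->score;
}

package main;

# Drain a matcher, returning { doc_id => score }.
sub collect_scores {
    my $matcher = shift;
    my %scores;
    while ( my $doc_id = $matcher->next ) {
        $scores{$doc_id} = $matcher->score;
    }
    return \%scores;
}

my $child = StubMatcher->new(
    docs => [
        { id => 1, score => 0.5 },
        { id => 2, score => 0.25 },
        { id => 3, score => 0.75 },
    ],
);
my $cache = StubSortCache->new(
    ords   => { 1 => 0, 2 => 2, 3 => 1 },
    values => [ 'a', 'b', 'z' ],
);
my $scores = collect_scores(
    WrappingMatcher->new( child => $child, sort_cache => $cache ) );
printf "%d => %s\n", $_, $scores->{$_} for sort { $a <=> $b } keys %$scores;
```

The per-document work in score() is two lookups, which is the part that
scales with the number of documents; building the wrapper happens once.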


-- 
Peter Karman  .  http://peknet.com/  .  [email protected]
