I came up with several different subject lines for this email; I settled on the
one I would tend to search for later.

I spent last weekend working on a custom scoring project, involving subclasses
of Query, Compiler and Matcher[0]. It was a fateful trip, a real 3-hour tour.

My goal was to write the least amount of code possible to extend the default
TF/IDF weighting scheme, so that I could do something like this:

 package MyMatcher;
 use base qw( Lucy::Search::Matcher );

 sub score {
    my $self       = shift;
    my $score      = $self->SUPER::score(@_);
    my $doc_reader = $self->get_doc_reader();
    my $doc_id     = $self->get_doc_id();
    my $doc        = $doc_reader->fetch_doc($doc_id);
    return $score * $doc->{my_field_value};
 }

If you grok the code above, you'll see that all I wanted to do was to affect the
score of a Doc at search time based on the value of a field in the Doc. That's
different from sorting by 'my_field_value' because I want the TF/IDF weighting
to still play a part in the score.
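
To make that distinction concrete, here is a minimal sketch in plain Perl with
no Lucy involved. The hits, the relevance scores and the my_field_value numbers
are all made up for illustration, and boosted() is a hypothetical helper; the
only point is that scaling the score and sorting by the field can order the
same hits differently.

```perl
use strict;
use warnings;

# Hypothetical hits: 'score' stands in for the TF/IDF relevance score,
# 'my_field_value' for a numeric field stored in the Doc.
my @hits = (
    { id => 'a', score => 0.9, my_field_value => 1 },
    { id => 'b', score => 0.2, my_field_value => 3 },
    { id => 'c', score => 0.5, my_field_value => 2 },
);

# Hypothetical helper: relevance scaled by the field value.
sub boosted {
    my $hit = shift;
    return $hit->{score} * $hit->{my_field_value};
}

# Sorting by the field alone ignores relevance entirely.
my @by_field = map { $_->{id} }
    sort { $b->{my_field_value} <=> $a->{my_field_value} } @hits;

# Scaling the score keeps relevance in play.
my @by_boosted = map { $_->{id} }
    sort { boosted($b) <=> boosted($a) } @hits;

print "by field:   @by_field\n";
print "by boosted: @by_boosted\n";
```

With this toy data, 'b' comes first when sorting by the field, but falls to last
once its weak relevance score is taken into account.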

If you follow [0] you'll see that the "least amount of code possible" is
actually quite a lot of code. (Although it's quite possible I wrote more than I
needed to -- help very welcome.)

It was a good learning experience for me. However, I don't wish to impose that
learning experience on anyone else. I think it's time we take seriously Marvin's
long-standing desire to refactor how Query/Compiler/Matcher intersect. Reading
over [1] again I now understand more of where Nate was coming from, and I am
grateful for the ongoing dialog Marvin and Nate have had on this subject, as it
kept me company while I spelunked this weekend.

Here are some thoughts, in no particular order:

* I am intrigued by MatchEngine. Would it make what I'm trying to do above any
easier?

* It would be nice to have TermCompiler, TermMatcher (or whatever they end up
being called) made public so that it is easier to extend all the basic Query
types: PhraseQuery, RangeQuery, ProximityQuery. I ended up re-implementing all
the C logic in Perl which I just know is going to be much slower in tight loops
at search time.

* I actually like the original names (Weight, Scorer) more than Compiler and
Matcher. I understand the rationale for the change; the original names just have
more connotative meaning for me. Oh wait. I've been here before.[2] I have
changed my mind.

I started thinking of Compiler as WeightedQuery, and its relationship to Query
as similar to the relationship between Doc and HitDoc. I imagined code like:

 my $query = $queryparser->parse('foo');       # $query isa Query
 $query->apply_weight(searcher => $searcher);  # $query isa WeightedQuery
 if (!$query->weighted) {
     die "can't score a search without weighting the query";
 }

apply_weight() is like make_compiler(), I think. It can/should be called
internally in search(), but it would be helpful to call directly when rolling
your own.

* I was surprised to find that Compiler isa Query. That didn't really jibe for
me with how Compiler keeps accessing parent(). I.e., a Query has-a Compiler. Why
must a Compiler also be a Query?

* The Matcher used by a TermCompiler is not a TermMatcher. It's a
PostlistSomeThingOrOther. Huh? I'm sure that's an optimization, but it still
caught me off guard and I had to hunt for a while.

I want to help make custom scoring easier to do, for me and for others. Is
MatchEngine the way forward? Help me understand.



[0] https://github.com/karpet/search-query-dialect-lucy-perl/blob/master/t/03-compiler-matcher.t
[1] http://www.mail-archive.com/[email protected]/msg00342.html
[2] http://www.mail-archive.com/[email protected]/msg00310.html

-- 
Peter Karman  .  http://peknet.com/  .  [email protected]
