I came up with several different subject lines for this email; I settled on the
one I would tend to search for later.
I spent last weekend working on a custom scoring project, involving subclasses
of Query, Compiler and Matcher[0]. It was a fateful trip, a real 3-hour tour.
My goal was to write as little code as possible to extend the default TF/IDF
weighting scheme and achieve something like this:
    package MyMatcher;
    use base qw( Lucy::Search::Matcher );
    sub score {
        my $self = shift;
        my $score = $self->SUPER::score(@_);
        my $doc_reader = $self->get_doc_reader();
        my $doc_id = $self->get_doc_id();
        my $doc = $doc_reader->fetch_doc($doc_id);
        return $score * $doc->{my_field_value};
    }
If you grok the code above, you'll see that all I wanted to do was affect the
score of a Doc at search time based on the value of a field in the Doc. That's
different from sorting by 'my_field_value' because I want the TF/IDF weighting
to still play a part in the score.
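To make the difference concrete, here is a minimal plain-Perl sketch that does
not use Lucy at all. The relevance numbers and field values are made up for
illustration; 'relevance' just stands in for whatever TF/IDF score the default
Matcher would return. It compares ordering by the field alone with ordering by
relevance multiplied by the field:

```perl
use strict;
use warnings;

# Made-up relevance scores and field values for three documents.
my @hits = (
    { id => 'a', relevance => 0.9, my_field_value => 1 },
    { id => 'b', relevance => 0.5, my_field_value => 3 },
    { id => 'c', relevance => 0.1, my_field_value => 5 },
);

# Sorting by the field alone ignores relevance entirely.
my @by_field = map { $_->{id} }
    sort { $b->{my_field_value} <=> $a->{my_field_value} } @hits;

# Multiplying relevance by the field value lets both contribute.
my @by_product = map { $_->{id} }
    sort {
        $b->{relevance} * $b->{my_field_value}
            <=> $a->{relevance} * $a->{my_field_value}
    } @hits;

print "by field:   @by_field\n";
print "by product: @by_product\n";
```

With these numbers, the field-only sort puts the barely relevant doc 'c' first,
while the product ordering puts 'b' first, because 'b' has both a decent
relevance score and a decent field value.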
If you follow [0] you'll see that the "least amount of code possible" is
actually quite a lot of code. (Although it's quite possible I wrote more than I
needed to -- help very welcome.)
It was a good learning experience for me. However, I don't wish to impose that
learning experience on anyone else. I think it's time we take seriously Marvin's
long-standing desire to refactor how Query/Compiler/Matcher intersect. Reading
over [1] again I now understand more of where Nate was coming from, and I am
grateful for the ongoing dialog Marvin and Nate have had on this subject, as it
kept me company while I spelunked this weekend.
Here are some thoughts, in no particular order:
* I am intrigued by MatchEngine. Would it make what I'm trying to do above any
easier?
* It would be nice to have TermCompiler, TermMatcher (or whatever they end up
being called) made public so that it is easier to extend all the basic Query
types: PhraseQuery, RangeQuery, ProximityQuery. I ended up re-implementing all
the C logic in Perl which I just know is going to be much slower in tight loops
at search time.
* I actually like the original names (Weight, Scorer) more than Compiler and
Matcher. I understand the rationale for the change; the original names just have
more connotative meaning for me. Oh wait. I've been here before.[2] I have
changed my mind.
I started thinking of Compiler as WeightedQuery, and its relationship to Query
as similar to the relationship between Doc and HitDoc. I imagined code like:
    my $query = $queryparser->parse('foo'); # $query isa Query
    $query->apply_weight(searcher => $searcher); # $query isa WeightedQuery
    if (!$query->weighted) {
        die "can't score a search without weighting the query";
    }
apply_weight() is like make_compiler(), I think. It can/should be called
internally in search(), but it would be helpful to have when rolling your own.
* I was surprised to find that Compiler isa Query. That didn't really jibe for
me with how Compiler keeps accessing parent(). I.e., a Query has-a Compiler. Why
must a Compiler also be a Query?
* The Matcher used by a TermCompiler is not a TermMatcher. It's a
PostlistSomeThingOrOther. Huh? I'm sure that's an optimization, but it still
caught me off guard and I had to hunt for a while.
I want to help make custom scoring easier to do, for me and for others. Is
MatchEngine the way forward? Help me understand.
[0]
https://github.com/karpet/search-query-dialect-lucy-perl/blob/master/t/03-compiler-matcher.t
[1] http://www.mail-archive.com/[email protected]/msg00342.html
[2] http://www.mail-archive.com/[email protected]/msg00310.html
--
Peter Karman . http://peknet.com/ . [email protected]