On Wed, Apr 13, 2011 at 12:21 PM, Marvin Humphrey <[email protected]> wrote:
> Moving weighting out of the library and into application space would increase
> the complexity of user code, ipso facto:
>
>     my $query = $query_parser->parse($query_string);
>   + $query = $query->weight(searcher => $searcher);
>     my $hits = $searcher->hits(query => $query);

It doesn't have to be at the application level --- I'd be perfectly happy to
have it happen in the query parser, so long as the query parser was clearly
written and self-contained, so that one could confidently rewrite it to use a
different weighting scheme without full knowledge of everything that happens
afterward. Gravy would be if that query parser was contained in a subclass
specific to TFIDF:

    my $query = new Lucy::TFIDF::Query($query_string);

> For TF/IDF, queries should *always* be weighted, so if we made this change the
> user would simply become responsible for manually executing a step that Lucy
> performs automatically right now.

Sure, but so long as the rules are clear it isn't that onerous. The reality
is that most new users are going to cut and paste from your sample program,
and so long as the sample includes this line they are unlikely to go out of
their way to remove it.

> I think many users would be surprised and confused if we started requiring
> them to take charge of query weighting.

I might even have to start counting on my toes! ;)

> The proposal makes perfect sense, though, if scoring isn't important to you.

Or if scoring is very important to you. It makes less sense if what you want
is an out-of-the-box, no-configuration search box for your text-based web
site.

> What if Lucy was a boolean matching engine, which you could hack to augment
> with TF/IDF scores? What if TF/IDF was an add-on, and all TF/IDF weighting
> code lived outside of core? What if only a tiny fraction of Lucy's users
> needed to weight their queries?

There's of course the question of what Core means here. I think TF/IDF
should certainly be part of the core distribution, but it would be great if
it could be compartmentalized.

> If all that were true, Lucy's internals could be simplified considerably. All
> of the weighting code would be gone -- we wouldn't have to think about it in
> either single-node or search-cluster context. Lucy::Search::Compiler would be
> gone and we would all just pass around Query objects. Only the TF/IDF weirdos
> would stuff those bizarre calls to $query->weight into their application
> code...

I can't quite tell how much I'm being mocked here. I'm guessing you're trying
your best to express a point of view that you don't quite share. No offense
in either case, though, as I'm sure many things I suggest are quite deserving
of considerable mockery.

Everyone needs their queries to be weighted in some way, even if that
weighting is constant. TF/IDF is a fine and venerable default weighting if
you happen to be indexing books, blog posts, or magazine articles. But if
you are indexing something like names, titles, or lists of properties,
inverse document frequency doesn't have the same resonance.

And although it may be largely semantic, I really do like the idea of passing
around a Query rather than a Compiler, especially if we could keep the Query
as simply a canonical representation of a search request, and split all the
other duties off into their own well-contained classes.

> If you are browsing through the Lucy code base trying to
> understand how everything fits together -- or trying to implement your own
> matching framework on top of those Query classes -- that's going to make
> things a lot easier.

I do think that simplifying the structure would go a long way toward making
modifications more accessible. Compiler really feels like a catch-all, and
yet it's not even in its own hierarchy. Pop quiz: how many people on this
list know that the code for the TF/IDF-specific TermCompiler can be found in
the file TermQuery.c? And how many of those think it belongs there?

>> I know this doesn't currently exist, but your MatchEngine and
>> Lucy::Score::TFIDF* hierarchy feels like a good direction to explore.
>
> Groovy. Though I'm not sure where the TF/IDF code will end up yet, I think
> simplifying the *Query.c files ought to be one of the goals of this
> refactoring round.

Sounds like a great goal to me!

--nate
