I'm going to try to chip off some small pieces and deal with them individually. As a result, I may have a number of threads going at once. Sorry for the profusion, but I'll try to get back to the big picture by the end.
On Thu, Apr 7, 2011 at 4:29 PM, Marvin Humphrey <[email protected]> wrote:
> Exactly. Lucy::Search::QueryParser happens to implement one particular query
> language, but that language is not the canonical interface to Lucy -- there
> are other ways to specify search criteria.

Yes. What I like about this is that we provide the user a tool (QueryParser) to convert text into a Query, but we don't require them to use it. If they want to create a conforming Query by some other means, they are welcome to do so. Equally, if they want to start with a QueryParser-generated Query and adjust it, for example by adding an optimization pass, they can do so. Rather than passing in plain text and having the Query hidden in the innards of Lucy, we expose the Query.

What I'm suggesting is that we do the same for scoring: rather than giving the user lots of knobs to tweak that affect how the scoring will happen inside a monolithic method, I want this to happen out in the open. We provide tools to do it easily, but other tools can be used.

>> TF/IDF (I'm not actually against it, just having it define our
>> architecture) requires access to full collection statistics
>> (Searcher), but can't this be done at Query creation or just after?
>>
>> query = new Lucy::Query("this AND that");
>> Lucy::TFIDF::Boost(query, Searcher);
>>
>> query = new Custom::Query("este & ese");
>> Custom::Boost(query, Searcher, IP, flags, whatever);
>>
>> query = new One::Stop::Boosted::Query("user input", flags, boost_parameters);
>
> Query objects are often created directly by a user. We should not modify such
> Queries by overwriting the user-supplied boost with a derived, corpus-weighted
> boost.

Perhaps surprisingly, I agree with this: the Query object should not be changed as a side effect of running a search. But in the examples I'm giving above, "we" are not changing the Query, the user is.
We're simply providing tools to let them do so efficiently, and demonstrating the pattern by which other such tools can be written.

> Query objects may also be weighted in a PolySearcher and then passed down into
> a child Searcher. It is essential that the child Searcher know that weighting
> has already been performed and must not be performed again.

I feel this is an architectural flaw, and that the correct solution is that weighting should never be performed automatically. It should be an explicit step that happens under the control of the user, with Lucy the library providing the tools to do so. No flags, no checks, just run it how it comes in.

I think the parallel with query optimization is accurate. Query optimization is a great thing, but it should not happen behind the scenes. It's OK if the default QueryParser does the optimization, but the engine should run exactly the Query it's passed. In the same way, the weighting needs to be independent of the "engine".

Viewing everything as happening on a child Searcher on another physical machine seems like a good approach. Assume there is a machine with a known index schema and a net connection: exactly what do we need to specify over the wire to get the results we want? This is the degree of isolation we want when splitting up the phases.

I'm suggesting that we should be able to just serialize the Query and specify which results we want returned. Because the corpus statistics are only known by the parent, to me it makes no sense to do the weighting on the child. I think you essentially want to take the Query and the Scorer and combine them into a single entity (Compiler), whereas I want to keep them distinct.

But rather than discussing the abstract, I think we can focus on the specific: what information needs to be sent as part of the search request for a specific case? We want to search ["this" AND "that"], weighted TF/IDF, returning top 10 scores.
What bytes form the Request that we need to send to the child?

Presuming we know the full corpus statistics on the parent, I think we can just serialize a pre-weighted Query, specify the name of a Scorer (one that adds subqueries), and say that we want only the top 10 results. I don't think the child needs to know whether we are using TF/IDF, TF/IFC, or BM25. What am I missing?

Probably lots. I think I'm presuming that the weighting method can be independent of the scoring method. The methods you've mentioned blend these two, but I think they can be separated. Maybe they can't be separated in general: what if you wanted to specify that words close to the head of a document should be more valuable? But I'm hoping that this can be solved by adding some configuration options to the Scorer name.

--nate
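
P.S. To make the "what bytes" question concrete, here is a rough Python sketch of the kind of request I have in mind. Everything in it is made up for illustration and none of it is Lucy's API: the TermQuery/AndQuery classes, the tfidf_boost and build_request helpers, the "sum_subqueries" Scorer name, the JSON wire format, and the corpus numbers are all assumptions, and the IDF formula is just one common variant. The only point is that the parent weights a copy of the Query, and the child sees boosts, a Scorer name, and a result count.

```python
import json
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:          # hypothetical stand-in for a term query
    term: str
    boost: float = 1.0


@dataclass(frozen=True)
class AndQuery:           # hypothetical stand-in for a conjunction
    children: tuple


def idf(doc_count, doc_freq):
    # One common IDF variant; not necessarily what Lucy computes.
    return 1.0 + math.log(doc_count / (doc_freq + 1))


def tfidf_boost(query, doc_count, doc_freqs):
    """Return a new, weighted copy of the query; the input is untouched."""
    if isinstance(query, TermQuery):
        weight = idf(doc_count, doc_freqs.get(query.term, 0))
        return TermQuery(query.term, query.boost * weight)
    return AndQuery(tuple(tfidf_boost(c, doc_count, doc_freqs)
                          for c in query.children))


def to_wire(query):
    # Plain dicts so the query can be serialized for the child.
    if isinstance(query, TermQuery):
        return {"type": "term", "term": query.term, "boost": query.boost}
    return {"type": "and", "children": [to_wire(c) for c in query.children]}


def build_request(query, scorer, num_wanted):
    # The child sees boosts and a Scorer name, not the weighting scheme.
    return json.dumps({"query": to_wire(query),
                       "scorer": scorer,
                       "num_wanted": num_wanted}, sort_keys=True)


# Parent side: the corpus statistics live here (numbers are invented).
user_query = AndQuery((TermQuery("this"), TermQuery("that")))
weighted = tfidf_boost(user_query, doc_count=1000,
                       doc_freqs={"this": 99, "that": 9})
request = build_request(weighted, scorer="sum_subqueries", num_wanted=10)
print(request)
```

Note that user_query still carries its original boosts afterwards, and nothing in the request tells the child whether the boosts came from TF/IDF or from somewhere else.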
