Great response, though I haven't had enough time to fully digest it yet. A couple of preliminary queries, though:

> So long as we're going to support TF/IDF, its complexity can only be hidden,
> not eliminated. Many alternative weighting and matching schemes (BM25,
> TF/ICF, LSA, etc.) also require corpus-wide statistics.

BM25 is pretty clear as such things go:
http://en.wikipedia.org/wiki/Okapi_BM25

I hadn't seen TF/ICF before:
http://aser.ornl.gov/publications/ICMLA06.pdf
I don't yet understand what it's doing differently than TF/IDF. Is it that it's counting the number of documents that use a term rather than the number of term occurrences?

I think I understand Latent Semantic Analysis, and how it could be used for search in place of an inverted index. I'm not sure how it could be used for scoring, though.

Are there other scoring methods that you anticipate being useful? What other corpus-wide data would they require? What other corpus-wide data exists?

> When weighting an arbitrarily complex query, we have to allow the scoring
> model the option of having member variables and methods which perform the
> weighting, and we have to allow for the possibility that it will proceed in an
> arbitrary number of stages, requiring gradual modifications to complex
> internal states before collapsing down to a final "weight" -- if it ever does.

Does your "if it ever does" imply that we should indeed try to support scorers that might return additional information beyond a single float, such as field name, position data, or matched string? I'd like to be able to do this, but I don't see an easy framework for it.

Also, do you feel a Scorer needs to be able to do "incremental" scoring, or is it OK if scoring is only possible after a Matcher has finished? Essentially, will it ever be necessary to score a subquery so that a Matcher can decide whether to skip to the next document?

More coherent replies to follow in a few days,

--nate
