I've been thinking about, and discussing with Robert, how to allow Lucene to support other scoring models, e.g. lnu.ltc, BM25, etc. I think a relatively contained set of changes can give us a solid step forward. Something like this:

* Store additional per-doc stats in the index, e.g. in a custom posting list, including the length in tokens of the field, avg tf, and boost (boost can be stored efficiently, i.e. only when it differs from the default). Do not compute or store norms in the index. Merging would just concatenate these values (removing deleted docs).

* Change IR so that on open it generates norms dynamically, i.e. by walking the stats, computing avgs (e.g. avg field length in tokens), computing the final per-field boost, and casting it to a 1-byte quantized float. We may want to store aggregates in e.g. SegmentInfo to save the extra pass on IR open...

* Change Similarity to allow field-specific Similarity (I think we have an issue open for this already). I think, also, lengthNorm (which would no longer be invoked during indexing) would no longer be used.

I think we'd make the class that computes norms from these per-doc stats on IR open pluggable. And, someday, we could make what stats are gathered/stored during indexing pluggable, but for starters I think we should simply support the field length in tokens and avg tf per field.

Thoughts?

Mike
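
For concreteness, here is a rough sketch of what "generate norms on IR open from the per-doc stats" could look like. It is only an illustration under stated assumptions, not a proposed API: the class name StatsNorms, the sqrt(avgLen / len) formula, the MAX_NORM bound and the linear 1-byte quantization are all hypothetical stand-ins, and it deliberately uses no Lucene classes.

```java
// Hypothetical sketch: none of these names exist in Lucene.
final class StatsNorms {
    // Assumed upper bound on a norm before quantization (illustrative only).
    static final float MAX_NORM = 16f;

    private StatsNorms() {}

    // First pass over the per-doc stats: average field length in tokens.
    static double averageLength(int[] lengths) {
        if (lengths.length == 0) {
            return 0.0;
        }
        long sum = 0;
        for (int len : lengths) {
            sum += len;
        }
        return (double) sum / lengths.length;
    }

    // Second pass: one norm per doc, here boost * sqrt(avgLen / len).
    // The formula is an assumption; a pluggable class could use any other.
    static float[] computeNorms(int[] lengths, float[] boosts) {
        if (lengths.length != boosts.length) {
            throw new IllegalArgumentException("lengths and boosts must have the same size");
        }
        double avg = averageLength(lengths);
        float[] norms = new float[lengths.length];
        for (int doc = 0; doc < lengths.length; doc++) {
            if (lengths[doc] <= 0) {
                // Empty field: fall back to the boost alone.
                norms[doc] = boosts[doc];
            } else {
                norms[doc] = (float) (boosts[doc] * Math.sqrt(avg / lengths[doc]));
            }
        }
        return norms;
    }

    // Illustrative linear quantization to one byte (clamped to [0, MAX_NORM]).
    static byte quantize(float norm) {
        float clamped = Math.max(0f, Math.min(MAX_NORM, norm));
        return (byte) Math.round(clamped / MAX_NORM * 255f);
    }

    // Inverse of quantize, up to rounding error.
    static float dequantize(byte b) {
        return (b & 0xFF) / 255f * MAX_NORM;
    }
}
```

For example, with field lengths {1, 4, 7} and default boosts the average length is 4, so the 4-token doc gets a norm of 1.0 and the 1-token doc gets 2.0 before quantization. Storing the sum of lengths and the doc count as aggregates (e.g. in SegmentInfo) would let the first pass be skipped.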