On Mon, Mar 08, 2010 at 02:23:47PM -0500, Michael McCandless wrote:
> For a large index the stats will be stable after re-indexing only a
> few more docs.

Well, not if there's been huge churn on other nodes in the interim.

> No... the stat is avg tf within the doc.

Don't you need the *total* field length -- not just the average tf -- for
the docXfield in question to perform length normalization? Or is average
term frequency within the docXfield a BM25-specific precursor that you are
using as an example stat?

> So if I index this doc:
>
>   a a a a b b b c c d
>
> The avg(tf) = average(4 3 2 1) = 2.5.
>
> So we'd store 2.5 for that docXfield in a fixed-width dense postings
> list (like column stride fields -- every doc has a value).

Like column-stride fields, but also analogous to the current "norms" --
only with 4x the space requirements. That is, unless you compress that
float down to a byte, as is currently done with the norm (3 bit mantissa,
5 bit exponent).

The generation of a "norm" byte involves some pretty intense lossy
data-reduction. If you're going to store the pre-data-reduction raw
materials, you're going to incur a space penalty unless you can eke out
similar savings somewhere.

The coarse quantization is justified because we only care about big
differences at search-time. If two documents are judged as reasonably
close to each other in relevance, the order in which they rank isn't
important. It's only when docs are judged as far apart in relevance that
their relative rank order matters.

I don't know that compressing the raw materials is going to work as well
as compressing the final product. Early quantization errors get compounded
when used in later calculations.

BTW, I think we should refer to these bytes as "boost bytes" rather than
"norms". Their purpose is not simply to convey length normalization; they
also include document boost and field boost. And the length normalization
multiplier is a kind of boost... so "boost byte" has everything covered,
and avoids the overloading of the term "norm".
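
To make the space-versus-precision trade-off concrete, here is a rough
sketch in Python of squeezing a per-docXfield float into one byte with a
3 bit mantissa and a 5 bit exponent. It is an illustration only and not
Lucene's actual encoding: the names avg_tf, encode and decode, the
exponent bias of 15, the truncating (rather than rounding) quantization,
and the 1/sqrt(length) length-normalization multiplier are all assumptions
made for the example.

```python
import math
from collections import Counter

MANTISSA_BITS = 3   # from the "3 bit mantissa, 5 bit exponent" description
EXPONENT_BITS = 5
BIAS = 15           # assumption: puts values near 1.0 in the middle of the range


def avg_tf(tokens):
    """Average within-document term frequency for one docXfield."""
    counts = Counter(tokens)
    return sum(counts.values()) / len(counts)


def encode(x):
    """Quantize a non-negative float to one byte (truncating, lossy)."""
    if x <= 0:
        return 0                      # byte 0 is reserved for zero
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= m < 1
    m, e = m * 2, e - 1               # renormalize so that 1 <= m < 2
    biased = e + BIAS
    if biased < 0:
        return 1                      # underflow: smallest positive code
    if biased >= 1 << EXPONENT_BITS:
        return 255                    # overflow: largest code
    frac = int((m - 1) * (1 << MANTISSA_BITS))   # keep the top 3 fraction bits
    # max(..., 1) keeps the smallest representable value off the zero code
    return max((biased << MANTISSA_BITS) | frac, 1)


def decode(b):
    """Expand a byte produced by encode() back into a float."""
    if b == 0:
        return 0.0
    biased = b >> MANTISSA_BITS
    frac = b & ((1 << MANTISSA_BITS) - 1)
    return (1 + frac / (1 << MANTISSA_BITS)) * 2.0 ** (biased - BIAS)


tokens = "a a a a b b b c c d".split()
stat = avg_tf(tokens)                       # average(4, 3, 2, 1)
length_norm = 1 / math.sqrt(len(tokens))    # assumed 1/sqrt(length) multiplier

for name, value in [("avg_tf", stat), ("length_norm", length_norm)]:
    code = encode(value)
    print(name, value, "->", code, "->", decode(code))
```

In this sketch 2.5 happens to survive the round trip exactly, but 2.7 lands
on the same byte as 2.5, and the length multiplier for the ten-token doc
comes back as 0.3125 instead of roughly 0.3162. That is the kind of coarse
step that is harmless in a final score but gets carried into anything
computed from the stored value afterwards.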

Marvin Humphrey