On Tue, Mar 9, 2010 at 2:28 AM, Marvin Humphrey <mar...@rectangular.com> wrote:
> On Mon, Mar 08, 2010 at 02:23:47PM -0500, Michael McCandless wrote:
>> For a large index the stats will be stable after re-indexing only a
>> few more docs.
>
> Well, not if there's been huge churn on other nodes in the interim.

Right.

>> No... the stat is avg tf within the doc.
>
> Don't you need the *total* field length -- not just the average tf -- for the
> docXfield in question to perform length normalization?

Yes, I'm proposing Lucene track both stats.

> Or is average term frequency within the docXfield a BM25-specific precursor
> that you are using as an example stat?

BM25 needs the field length in tokens. lnu.ltc needs avg(tf). These
2 stats seem to be the "common" ones (according to Robert). So I want
to start with them.

>> So if I index this doc:
>>
>>   a a a a b b b c c d
>>
>> The avg(tf) = average(4 3 2 1) = 2.5.
>>
>> So we'd store 2.5 for that docXfield in a fixed-width dense postings
>> list (like column stride fields -- every doc has a value).
>
> Like column-stride fields, but also analogous to the current "norms" -- only
> with 4x the space requirements. That is, unless you compress that float down
> to a byte, as is currently done with the norm (3 bit mantissa, 5 bit
> exponent).
>
> The generation of a "norm" byte involves some pretty intense lossy
> data-reduction. If you're going to store the pre-data-reduction raw
> materials, you're going to incur a space penalty unless you can eke out
> similar savings somewhere.
>
> The coarse quantization is justified because we only care about big
> differences at search-time. If two documents are judged as reasonably close
> to each other in relevance, the order in which they rank isn't important.
> It's only when docs are judged as far apart in relevance that their relative
> rank order matters.
>
> I don't know that compressing the raw materials is going to work as well as
> compressing the final product. Early quantization errors get compounded when
> used in later calculations.

I would not compress for starters...

> BTW, I think we should refer to these bytes as "boost bytes" rather than
> "norms". Their purpose is not simply to convey length normalization; they
> also include document boost and field boost. And the length normalization
> multiplier is a kind of boost... so "boost byte" has everything covered, and
> avoids the overloading of the term "norm".

+1 -- I like that name.

Though, I want to devalue them in importance... i.e. they are a private
impl "trick" that the default Sim impl does to save RAM.

Mike
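
For readers following along, here is a minimal Python sketch of the two per-docXfield stats discussed above (field length in tokens and avg(tf)), plus a lossy one-byte float encoding with a 3 bit mantissa and 5 bit exponent. The function names are hypothetical. The exponent offset of 15 is an assumption modeled on Lucene's SmallFloat as I understand it, so treat the encoder as an illustration of the tradeoff and not as the actual norm code.

```python
import struct
from collections import Counter


def field_stats(tokens):
    """Return (field length in tokens, average term frequency) for one docXfield."""
    tf = Counter(tokens)                 # term -> frequency within this field
    length = len(tokens)                 # the stat BM25 needs
    avg_tf = length / len(tf) if tf else 0.0   # the stat lnu.ltc needs
    return length, avg_tf


# Assumed layout: 3 mantissa bits, 5 exponent bits, exponent offset 15.
_MANTISSA_BITS = 3
_ZERO_EXP = 15
_FZERO = (63 - _ZERO_EXP) << _MANTISSA_BITS


def float_to_byte315(f):
    """Lossily squeeze a non-negative float32 into one byte (0..255)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]   # raw float32 bits, signed
    small = bits >> (24 - _MANTISSA_BITS)                 # keep exponent + top 3 mantissa bits
    if small <= _FZERO:
        return 0 if bits <= 0 else 1     # zero/negative -> 0, tiny positive -> smallest code
    if small >= _FZERO + 0x100:
        return 255                       # overflow saturates at the largest code
    return small - _FZERO


def byte315_to_float(b):
    """Decode a byte produced by float_to_byte315 back to a float."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - _MANTISSA_BITS)) + ((63 - _ZERO_EXP) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


if __name__ == "__main__":
    length, avg_tf = field_stats("a a a a b b b c c d".split())
    print(length, avg_tf)                # 10 2.5

    # 2.5 happens to survive the round trip, but 2.7 collapses to the same byte.
    for raw in (2.5, 2.7):
        code = float_to_byte315(raw)
        print(raw, "->", code, "->", byte315_to_float(code))
```

Storing the raw float costs 4 bytes per docXfield against 1 byte for the encoded form, which is the 4x figure quoted above. The price of the byte is that nearby values such as 2.5 and 2.7 become indistinguishable before any later calculation sees them.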