Re: Multi-node stats within individual nodes (was "Baby steps...")

Michael McCandless Tue, 09 Mar 2010 10:04:50 -0800

On Tue, Mar 9, 2010 at 2:28 AM, Marvin Humphrey <[email protected]> wrote:
> On Mon, Mar 08, 2010 at 02:23:47PM -0500, Michael McCandless wrote:
>> For a large index the stats will be stable after re-indexing only a
>> few more docs.
>
> Well, not if there's been huge churn on other nodes in the interim.


Right.

>> No... the stat is avg tf within the doc.
>
> Don't you need the *total* field length -- not just the average tf -- for the
> docXfield in question to perform length normalization?

Yes, I'm proposing Lucene track both stats.

> Or is average term frequency within the docXfield a BM25-specific precursor
> that you are using as an example stat?

BM25 needs the field length in tokens.  lnu.ltc needs avg(tf).  These
2 stats seem to be the "common" ones (according to Robert).  So I want to
start with them.
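
To make the two stats concrete, here is a minimal sketch of computing
them for one docXfield.  The class and method names are made up for
illustration (they are not Lucene APIs), and it assumes the field has
already been analyzed into a list of tokens.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: per-docXfield stats.
final class DocFieldStats {

    // Field length in tokens: the stat BM25 needs.
    static int fieldLength(List<String> tokens) {
        return tokens.size();
    }

    // Average within-doc term frequency: the stat lnu.ltc needs.
    // Computed as the mean of the per-term tf values.
    static double avgTf(List<String> tokens) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);   // count occurrences per term
        }
        long sum = 0;
        for (int count : tf.values()) {
            sum += count;
        }
        return (double) sum / tf.size();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a a a a b b b c c d".split(" "));
        System.out.println("length  = " + fieldLength(tokens));
        System.out.println("avg(tf) = " + avgTf(tokens));
    }
}
```

Since the tf values sum to the field length, avg(tf) is just the field
length divided by the number of unique terms, so tracking the unique
term count alongside the length would be enough to derive it.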

>> So if I index this doc:
>>
>>   a a a a b b b c c d
>>
>> The avg(tf) = average(4 3 2 1) = 2.5.
>>
>> So we'd store 2.5 for that docXfield in a fixed-width dense postings
>> list (like column stride fields -- every doc has a value).
>
> Like column-stride fields, but also analogous to the current "norms" -- only
> with 4x the space requirements.  That is, unless you compress that float down
> to a byte, as is currently done with the norm (3 bit mantissa, 5 bit
> exponent).
>
> The generation of a "norm" byte involves some pretty intense lossy
> data-reduction.  If you're going to store the pre-data-reduction raw
> materials, you're going to incur a space penalty unless you can eke out
> similar savings somewhere.
>
> The coarse quantization is justified because we only care about big
> differences at search-time.  If two documents are judged as reasonably close
> to each other in relevance, the order in which they rank isn't important.
> It's only when docs are judged as far apart in relevance that their relative
> rank order matters.
>
> I don't know that compressing the raw materials is going to work as well as
> compressing the final product.  Early quantization errors get compounded when
> used in later calculations.

I would not compress for starters...
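
For a feel of how coarse a 3 bit mantissa is, here is a rough sketch
that truncates a float to its top 3 mantissa bits.  It is only an
illustration, not the actual norm encoder: it leaves the full 8 bit
exponent alone, so the result does not fit in a byte, and the class
and method names are made up.

```java
// Hypothetical illustration, not Lucene's norm encoding.
final class CoarseFloat {

    // Keep the sign bit, the 8 exponent bits and the top 3 mantissa
    // bits of an IEEE 754 float; zero the remaining 20 mantissa bits.
    static float truncateMantissa(float f) {
        int bits = Float.floatToRawIntBits(f);
        return Float.intBitsToFloat(bits & 0xFFF00000);
    }

    public static void main(String[] args) {
        float[] samples = {2.4f, 2.5f, 2.6f, 2.7f};
        for (float s : samples) {
            System.out.println(s + " -> " + truncateMantissa(s));
        }
    }
}
```

With this truncation 2.5, 2.6 and 2.7 all come back as 2.5, while 2.4
drops to 2.25.  Nearby raw values collapse into one bucket, and the
loss is baked in before any later calculation sees the number.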

> BTW, I think we should refer to these bytes as "boost bytes" rather than
> "norms".  Their purpose is not simply to convey length normalization; they
> also include document boost and field boost.  And the length normalization
> multiplier is a kind of boost... so "boost byte" has everything covered, and
> avoids the overloading of the term "norm".

+1 -- I like that name.  Though, I want to devalue them in
importance... i.e. they are a private impl "trick" that the default Sim
impl uses to save RAM.

Mike

