On Tue, Mar 9, 2010 at 2:11 PM, Marvin Humphrey <mar...@rectangular.com> wrote:

>> > I don't know that compressing the raw materials is going to work as well as
>> > compressing the final product. Early quantization errors get compounded
>> > when used in later calculations.
>>
>> I would not compress for starters...
>
> How about lossless compression, then? Do you need random access into this
> specialized posting list? For the use cases you've described so far I don't
> think so, since you're just iterating it top to bottom on segment open.
Don't need random access -- just a full scan (or 2, if the average needs to
be regenerated) on startup.

> You could store the total length of the field in tokens and the number of
> unique terms as integers, compressing with vbyte, PFOR or whatever... then
> divide at search time to get average term frequency. That way, you also
> avoid committing to a float encoding, which I don't think Lucene has
> standardized yet.

Yeah, I think that's a great starting approach...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
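[Editorial sketch of the quoted idea: store the two per-field stats as vbyte-compressed ints and divide at search time to get average term frequency, so no float encoding is ever committed to disk. The variable-byte convention below (low 7 bits per byte, high bit set when more bytes follow) matches Lucene's VInt format; the class and method names are hypothetical, not Lucene's actual API.]

```java
import java.io.ByteArrayOutputStream;

public class FieldStats {

    // Variable-byte encode: 7 data bits per byte, high bit set on all
    // but the final byte (same convention as Lucene's VInt).
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit: more bytes follow
            value >>>= 7;
        }
        out.write(value);
    }

    // Decode one vbyte int starting at pos; returns {value, newPos}.
    static int[] readVInt(byte[] buf, int pos) {
        int b = buf[pos++] & 0xFF;
        int value = b & 0x7F;
        int shift = 7;
        while ((b & 0x80) != 0) {
            b = buf[pos++] & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return new int[] { value, pos };
    }

    public static void main(String[] args) {
        // Index time: write the two ints for one field.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int fieldLength = 1000;  // total tokens in the field (example value)
        int uniqueTerms = 250;   // distinct terms in the field (example value)
        writeVInt(out, fieldLength);
        writeVInt(out, uniqueTerms);

        // Search time: read both ints back and divide, deriving the
        // average term frequency as a float only in memory.
        byte[] buf = out.toByteArray();
        int[] r1 = readVInt(buf, 0);
        int[] r2 = readVInt(buf, r1[1]);
        float avgTermFreq = (float) r1[0] / r2[0];
        System.out.println(avgTermFreq); // prints 4.0
    }
}
```

Because only integers hit disk, the on-disk format stays stable even if the float computed at search time later changes precision or formula.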