On Tue, Mar 9, 2010 at 2:11 PM, Marvin Humphrey <mar...@rectangular.com> wrote:

>> > I don't know that compressing the raw materials is going to work as well as
>> > compressing the final product. Early quantization errors get compounded
>> > when used in later calculations.
>>
>> I would not compress for starters...
>
> How about lossless compression, then? Do you need random access into this
> specialized posting list? For the use cases you've described so far I don't
> think so, since you're just iterating it top to bottom on segment open.
Don't need random access -- just a full scan (or 2, if the average needs to
be regenerated) on startup.

> You could store the total length of the field in tokens and the number of
> unique terms as integers, compressing with vbyte, PFOR or whatever... then
> divide at search time to get average term frequency. That way, you also
> avoid committing to a float encoding, which I don't think Lucene has
> standardized yet.

Yeah, I think that's a great starting approach...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
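[Editorial sketch of the quoted idea: store the two per-field stats as vbyte-compressed ints and divide at search time to get average term frequency, so no float encoding is ever committed to disk. The variable-byte convention below (low 7 bits per byte, high bit set when more bytes follow) matches Lucene's VInt format; the class and method names are hypothetical, not Lucene's actual API.]

```java
import java.io.ByteArrayOutputStream;

public class FieldStats {

    // Variable-byte encode: 7 data bits per byte, high bit set on all
    // but the final byte (same convention as Lucene's VInt).
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit: more bytes follow
            value >>>= 7;
        }
        out.write(value);
    }

    // Decode one vbyte int starting at pos; returns {value, newPos}.
    static int[] readVInt(byte[] buf, int pos) {
        int b = buf[pos++] & 0xFF;
        int value = b & 0x7F;
        int shift = 7;
        while ((b & 0x80) != 0) {
            b = buf[pos++] & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return new int[] { value, pos };
    }

    public static void main(String[] args) {
        // Index time: write the two ints for one field.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int fieldLength = 1000;  // total tokens in the field (example value)
        int uniqueTerms = 250;   // distinct terms in the field (example value)
        writeVInt(out, fieldLength);
        writeVInt(out, uniqueTerms);

        // Search time: read both ints back and divide, deriving the
        // average term frequency as a float only in memory.
        byte[] buf = out.toByteArray();
        int[] r1 = readVInt(buf, 0);
        int[] r2 = readVInt(buf, r1[1]);
        float avgTermFreq = (float) r1[0] / r2[0];
        System.out.println(avgTermFreq); // prints 4.0
    }
}
```

Because only integers hit disk, the on-disk format stays stable even if the float computed at search time later changes precision or formula.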