To some extent that already happens in a rough way in things like BlockPackedWriter (and also postings lists).
For example, these things encode blocks (e.g. 128 values in the postings, maybe 1024 in doc values, I forget), and if they encounter a block where all values are the same, they just write a bit marking that and encode the value once (see the sketch below the quoted thread).

On Mon, Sep 16, 2013 at 1:18 PM, Otis Gospodnetic <[email protected]> wrote:
> You guys got it, of course. :)
>
> I liked the sound of being able to detect how to pack things at run time and
> switch between multiple approaches over time... or at least that's how I
> interpreted the announcement.
>
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
> On Mon, Sep 16, 2013 at 4:29 AM, Adrien Grand <[email protected]> wrote:
>> Thanks for pointing this out, Otis!
>>
>> I think the columnar nature of Parquet makes it more similar to doc values
>> than to stored fields, and indeed, if you look at the Parquet file-format
>> specification [1], it is very similar to what we have for doc values [2].
>> In both cases, we have:
>> - dictionary encoding (PLAIN_DICTIONARY in Parquet, TABLE_COMPRESSED in
>>   Lucene45DVF),
>> - bit-packing (BIT_PACKED(/RLE) in Parquet, DELTA_COMPRESSED in Lucene45DVF).
>>
>> Parquet also uses run-length encoding (RLE), which is unfortunately not
>> doable for doc values since they need to support random access. Parquet's
>> RLE compression is actually closer to what we have for postings lists (a
>> postings list of X values is encoded as X/128 blocks of 128 packed values
>> and X%128 VInt-encoded values). On the other hand, doc values have
>> GCD_COMPRESSED (which efficiently compresses any sequence of longs where
>> all values can be expressed as a * x + b), which is typically useful for
>> storing dates that don't have millisecond precision.
>>
>> About stored fields, it would indeed be possible to store all values of a
>> given field in a column-stride fashion per chunk. However, I think Parquet
>> doesn't optimize for the same thing as stored fields: Parquet needs to run
>> computations on the values of a few fields of many documents (like doc
>> values), while with stored fields we usually need to get all values of a
>> single document. This makes columnar storage a bit inconvenient for stored
>> fields, although I think we could try it on our chunks of stored documents
>> given that it may improve the compression ratio.
>>
>> I only have a very superficial understanding of Parquet, so if I said
>> something wrong about it, please tell me!
>>
>> [1] https://github.com/parquet/parquet-format
>> [2] https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesConsumer.java
>>
>> --
>> Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
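
Here is a minimal, hypothetical sketch of that constant-block shortcut (the class and markers are made up for illustration; this is not the actual BlockPackedWriter on-disk format): buffer a fixed-size block of longs, and if every value in the block is identical, write a one-byte marker plus the value once instead of encoding the whole block.

    import java.io.DataOutput;
    import java.io.IOException;

    /**
     * Toy block writer illustrating the "all values equal" shortcut.
     * Buffer BLOCK_SIZE longs; if a block is constant, write a marker
     * and the value once instead of encoding every value.
     */
    class ToyBlockWriter {
      static final int BLOCK_SIZE = 128;   // postings use 128; doc values use larger blocks

      private final long[] buffer = new long[BLOCK_SIZE];
      private int count;
      private final DataOutput out;

      ToyBlockWriter(DataOutput out) {
        this.out = out;
      }

      void add(long value) throws IOException {
        buffer[count++] = value;
        if (count == BLOCK_SIZE) {
          flushBlock();
        }
      }

      void finish() throws IOException {
        if (count > 0) {
          flushBlock();            // flush the trailing partial block
        }
      }

      private void flushBlock() throws IOException {
        long first = buffer[0];
        boolean allEqual = true;
        for (int i = 1; i < count; i++) {
          if (buffer[i] != first) {
            allEqual = false;
            break;
          }
        }
        if (allEqual) {
          out.writeByte(0);        // marker: constant block
          out.writeLong(first);    // value written once for the whole block
        } else {
          out.writeByte(1);        // marker: explicit block
          for (int i = 0; i < count; i++) {
            out.writeLong(buffer[i]);
          }
        }
        count = 0;
      }
    }

The real writers bit-pack non-constant blocks with the minimal number of bits per value; the raw writeLong calls here just keep the sketch short.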
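
And a small sketch of the GCD_COMPRESSED idea Adrien describes above, assuming each value is stored as base + gcd * quotient (class and method names are invented for illustration, and long overflow on extreme ranges is ignored). Unlike RLE, each value decodes independently from its quotient, so random access is preserved.

    import java.math.BigInteger;

    /**
     * Sketch of GCD compression: if every value can be written as
     * base + gcd * x, store only base, gcd and the small multipliers x.
     * E.g. second-precision dates stored as milliseconds share a GCD of 1000.
     */
    class GcdCompressionSketch {
      /** Fills header with {base, gcd} and returns the per-value quotients. */
      static long[] encode(long[] values, long[] header) {
        long min = values[0];
        for (long v : values) min = Math.min(min, v);

        long gcd = 0;
        for (long v : values) {
          gcd = BigInteger.valueOf(gcd).gcd(BigInteger.valueOf(v - min)).longValue();
        }
        if (gcd == 0) gcd = 1;     // all values identical

        header[0] = min;
        header[1] = gcd;

        long[] quotients = new long[values.length];
        for (int i = 0; i < values.length; i++) {
          quotients[i] = (values[i] - min) / gcd;   // small numbers, cheap to bit-pack
        }
        return quotients;
      }

      /** Random access: any value decodes from its quotient alone. */
      static long decode(long quotient, long base, long gcd) {
        return base + quotient * gcd;
      }
    }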
