Thanks for pointing this out, Otis!

I think the columnar nature of Parquet makes it more similar to doc values than to stored fields, and indeed, if you look at the parquet file-format specification [1], it is very similar to what we have for doc values [2]. In both cases, we have
 - dictionary encoding (PLAIN_DICTIONARY in parquet, TABLE_COMPRESSED in Lucene45DVF),
 - bit-packing (BIT_PACKED(/RLE) in parquet, DELTA_COMPRESSED in Lucene45DVF).
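
To make these two shared techniques concrete, here is a minimal Python sketch. It is illustrative only: the function names are made up, and it mirrors neither Lucene's nor parquet's actual on-disk layout. It dictionary-encodes a column into ordinals, bit-packs the ordinals, and reads one value back by index without decoding the others.

```python
# Sketch only: hypothetical helper names, not Lucene or parquet APIs.

def bits_required(max_value):
    """Number of bits needed to represent any value in [0, max_value]."""
    return max(1, max_value.bit_length())

def pack(values, bits_per_value):
    """Pack non-negative ints into one big int, bits_per_value bits each."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * bits_per_value)
    return packed

def get(packed, bits_per_value, index):
    """Random access: read the value at `index` without touching the rest."""
    mask = (1 << bits_per_value) - 1
    return (packed >> (index * bits_per_value)) & mask

def dictionary_encode(values):
    """Replace each value by its ordinal in a sorted table of unique values."""
    table = sorted(set(values))
    ordinals = {v: i for i, v in enumerate(table)}
    return table, [ordinals[v] for v in values]

values = [1000000, 42, 42, 1000000, 7, 42]
table, ords = dictionary_encode(values)   # few unique values -> small ordinals
bpv = bits_required(len(table) - 1)       # bits per ordinal
packed = pack(ords, bpv)
decoded = [table[get(packed, bpv, i)] for i in range(len(values))]
```

Because every ordinal occupies the same number of bits, the position of value i is known up front, which is what keeps random access cheap.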
Parquet also uses run-length encoding (RLE), which is unfortunately not doable for doc values since they need to support random access. Parquet's RLE compression is actually closer to what we have for postings lists (a postings list of X values is encoded as X/128 blocks of 128 packed values and X%128 RLE-encoded (VInt) values). On the other hand, doc values have GCD_COMPRESSED (which efficiently compresses any sequence of longs where all values can be expressed as a * x + b), which is typically useful for storing dates that don't have millisecond precision.

As for stored fields, it would indeed be possible to store all values of a given field in a column-stride fashion per chunk. However, I think parquet doesn't optimize for the same thing as stored fields: parquet needs to run computations on the values of a few fields of many documents (like doc values), while with stored fields we usually need to get all values of a single document. This makes columnar storage a bit inconvenient for stored fields, although I think we could try it on our chunks of stored documents, given that it may improve the compression ratio.

I only have a very superficial understanding of parquet, so if I said something wrong about it, please tell me!

[1] https://github.com/parquet/parquet-format
[2] https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesConsumer.java

-- 
Adrien
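
P.S. To illustrate the a * x + b idea behind GCD_COMPRESSED mentioned above, here is a minimal Python sketch. It assumes b is the minimum value and a is the GCD of the deltas from that minimum; the function names are hypothetical and this is not the actual Lucene code. The example data is a handful of day-precision timestamps expressed in milliseconds.

```python
from functools import reduce
from math import gcd

def gcd_encode(values):
    """Return (b, a, quotients) such that values[i] == a * quotients[i] + b."""
    b = min(values)
    a = reduce(gcd, (v - b for v in values), 0)
    if a == 0:  # all values are equal; any multiplier works
        a = 1
    return b, a, [(v - b) // a for v in values]

def gcd_decode(b, a, quotients, index):
    """Random access: rebuild a single value from its small quotient."""
    return a * quotients[index] + b

DAY_MS = 86_400_000  # milliseconds per day
dates = [1_380_585_600_000 + d * DAY_MS for d in (0, 3, 1, 10)]
b, a, q = gcd_encode(dates)
```

Only the small quotients need to be stored per document (and can then be bit-packed), while a and b are stored once.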
