Thanks for pointing this out, Otis!

I think the columnar nature of Parquet makes it more similar to doc
values than to stored fields, and indeed, if you look at the parquet
file-format specification [1], it is very similar to what we have for
doc values [2]. In both cases, we have
 - dictionary encoding (PLAIN_DICTIONARY in parquet, TABLE_COMPRESSED
in Lucene45DVF),
 - bit-packing (BIT_PACKED(/RLE) in parquet, DELTA_COMPRESSED in Lucene45DVF).
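
To make these two shared techniques concrete, here is a rough sketch
of dictionary encoding followed by bit-packing. It is not the actual
Lucene45DocValuesConsumer or Parquet code: the class name, the
in-memory layout and the method names are made up for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch: dictionary (table) encoding followed by bit-packing. */
final class TableBitPackingSketch {
    private final long[] table;   // sorted distinct values (the dictionary)
    private final long[] packed;  // ordinals, bitsPerValue bits each
    private final int bitsPerValue;

    TableBitPackingSketch(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        table = Arrays.stream(values).distinct().sorted().toArray();
        // Bits needed to represent the largest ordinal (at least 1).
        bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(table.length - 1));
        packed = new long[(int) (((long) values.length * bitsPerValue + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            long ord = Arrays.binarySearch(table, values[i]);
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packed[word] |= ord << shift;
            if (shift + bitsPerValue > 64) {
                // The ordinal straddles two 64-bit words.
                packed[word + 1] |= ord >>> (64 - shift);
            }
        }
    }

    int bitsPerValue() {
        return bitsPerValue;
    }

    /** Random access: the position of value i is computed, not scanned for. */
    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int word = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long ord = packed[word] >>> shift;
        if (shift + bitsPerValue > 64) {
            ord |= packed[word + 1] << (64 - shift);
        }
        long mask = (1L << bitsPerValue) - 1;
        return table[(int) (ord & mask)];
    }

    public static void main(String[] args) {
        long[] values = {100L, 7L, 7L, 100L, 42L, 7L};
        TableBitPackingSketch sketch = new TableBitPackingSketch(values);
        System.out.println("bits per value: " + sketch.bitsPerValue());
        for (int i = 0; i < values.length; i++) {
            System.out.println(i + " -> " + sketch.get(i));
        }
    }
}
```

Because every ordinal has the same width, get(i) only needs a
multiplication and a shift, which is what keeps random access cheap.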

Parquet also uses run-length encoding (RLE) which is unfortunately not
doable for doc values since they need to support random access.
Parquet's RLE compression is actually closer to what we have for
postings lists (a postings list of X values is encoded as X/128 blocks
of 128 packed values and X%128 RLE-encoded (VInt) values). On the
other hand, doc values have GCD_COMPRESSED (which efficiently
compresses any sequence of longs where all values can be expressed as
a * x + b) which is typically useful for storing dates that don't have
millisecond precision.
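
As an illustration of the a * x + b idea, here is a minimal sketch,
assuming a = the GCD of the deltas from the minimum and b = the
minimum value. The class and method names are hypothetical, this is
not the Lucene45DVF implementation, and it ignores the overflow that
could occur when the range of values exceeds Long.MAX_VALUE.

```java
/** Hypothetical sketch of GCD-based compression: value = gcd * x + min. */
final class GcdCompressionSketch {
    final long min;          // "b" in a * x + b
    final long gcd;          // "a" in a * x + b
    final long[] quotients;  // "x": small numbers, cheap to bit-pack

    GcdCompressionSketch(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        long mn = values[0];
        for (long v : values) {
            mn = Math.min(mn, v);
        }
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - mn); // assumes v - mn does not overflow
        }
        min = mn;
        gcd = (g == 0) ? 1 : g; // all values equal: any divisor works
        quotients = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            quotients[i] = (values[i] - min) / gcd;
        }
    }

    private static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /** Random access stays cheap: one multiplication and one addition. */
    long get(int index) {
        return min + gcd * quotients[index];
    }

    /** Bits needed per quotient if they were bit-packed afterwards. */
    int bitsPerValue() {
        long max = 0;
        for (long q : quotients) {
            max = Math.max(max, q);
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    }

    public static void main(String[] args) {
        long millisPerDay = 86_400_000L;
        // Dates with day precision, stored as milliseconds since the epoch.
        long[] dates = {15979 * millisPerDay, 15980 * millisPerDay, 15984 * millisPerDay};
        GcdCompressionSketch sketch = new GcdCompressionSketch(dates);
        System.out.println("gcd=" + sketch.gcd + " bitsPerValue=" + sketch.bitsPerValue());
    }
}
```

With day-precision dates, the stored quotients are day offsets rather
than 41-bit millisecond timestamps, so far fewer bits per value are
needed.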

About stored fields, it would indeed be possible to store all values
of a given field in a column-stride fashion per chunk. However, I
think parquet doesn't optimize for the same thing as stored fields:
parquet needs to run computations on the values of a few fields of
many documents (like doc values) while with stored fields, we usually
need to get all values of a single document. This makes columnar
storage a bit inconvenient for stored fields, although I think we
could try it on our chunks of stored documents given that it may
improve the compression ratio.
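
A toy sketch of the trade-off, assuming for simplicity that every
document in a chunk has the same fields in the same order (real
stored fields do not guarantee this). All names here are
hypothetical and nothing below mirrors an existing Lucene class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: row-oriented vs. column-stride layout of one chunk. */
final class ColumnStrideChunkSketch {

    /** Row layout: doc0.f0, doc0.f1, doc1.f0, doc1.f1, ... */
    static List<String> rowLayout(String[][] docs) {
        List<String> out = new ArrayList<>();
        for (String[] doc : docs) {
            for (String value : doc) {
                out.add(value);
            }
        }
        return out;
    }

    /** Column layout: doc0.f0, doc1.f0, ..., doc0.f1, doc1.f1, ... */
    static List<String> columnLayout(String[][] docs) {
        List<String> out = new ArrayList<>();
        int numFields = docs[0].length;
        for (int f = 0; f < numFields; f++) {
            for (String[] doc : docs) {
                out.add(doc[f]);
            }
        }
        return out;
    }

    /** Reading one document from the column layout: one jump per field. */
    static String[] readDoc(List<String> columns, int numDocs, int numFields, int doc) {
        String[] out = new String[numFields];
        for (int f = 0; f < numFields; f++) {
            out[f] = columns.get(f * numDocs + doc);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] docs = {{"title0", "body0"}, {"title1", "body1"}, {"title2", "body2"}};
        System.out.println(rowLayout(docs));
        System.out.println(columnLayout(docs));
        System.out.println(String.join(", ", readDoc(columnLayout(docs), 3, 2, 1)));
    }
}
```

In the column layout, similar values sit next to each other, which
may help compression, but fetching a single document touches as many
positions as it has fields instead of one contiguous range.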

I only have a very superficial understanding of Parquet, so if I said
something wrong about it, please tell me!

[1] https://github.com/parquet/parquet-format
[2] https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesConsumer.java

-- 
Adrien
