You guys got it, of course. :)

I liked the sound of being able to detect how to pack things at run
time and switch between multiple approaches over time... or at least
that's how I interpreted the announcement.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Sep 16, 2013 at 4:29 AM, Adrien Grand <[email protected]> wrote:
> Thanks for pointing this out, Otis!
>
> I think the columnar nature of Parquet makes it more similar to doc
> values than to stored fields, and indeed, if you look at the parquet
> file-format specification [1], it is very similar to what we have for
> doc values [2]. In both cases, we have
>  - dictionary encoding (PLAIN_DICTIONARY in parquet, TABLE_COMPRESSED
> in Lucene45DVF),
>  - bit-packing (BIT_PACKED(/RLE) in parquet, DELTA_COMPRESSED in Lucene45DVF).
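
To make those two shared techniques concrete, here is a minimal sketch of dictionary encoding followed by bit-packing of the ordinals, with random access on read. It is not the code from Lucene45DocValuesConsumer or from Parquet; the class name TableCompressionSketch and its methods are made up for illustration, and it assumes a non-empty array of longs.

```java
import java.util.Arrays;

// Illustrative only: NOT Lucene's or Parquet's implementation.
final class TableCompressionSketch {
    private final long[] table;    // sorted distinct values (the dictionary)
    private final long[] packed;   // ordinals, bitsPerValue bits each, back to back
    private final int bitsPerValue;

    TableCompressionSketch(long[] values) {
        table = Arrays.stream(values).distinct().sorted().toArray();
        long maxOrd = Math.max(0, table.length - 1);
        // Bits needed to represent the largest ordinal (at least 1).
        bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(maxOrd));
        packed = new long[(int) (((long) values.length * bitsPerValue + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            long ord = Arrays.binarySearch(table, values[i]);
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packed[word] |= ord << shift;
            if (shift + bitsPerValue > 64) {
                // The ordinal straddles two longs: spill the high bits.
                packed[word + 1] |= ord >>> (64 - shift);
            }
        }
    }

    int bitsPerValue() {
        return bitsPerValue;
    }

    // Random access: one packed-ordinal read plus one table lookup.
    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int word = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long ord = packed[word] >>> shift;
        if (shift + bitsPerValue > 64) {
            ord |= packed[word + 1] << (64 - shift);
        }
        return table[(int) (ord & ((1L << bitsPerValue) - 1))];
    }
}
```

With three distinct values, each entry costs 2 bits instead of 64, and get(i) still works without decoding any neighbours.
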
>
> Parquet also uses run-length encoding (RLE) which is unfortunately not
> doable for doc values since they need to support random access.
> Parquet's RLE compression is actually closer to what we have for
> postings lists (a postings list of X values is encoded as X/128 blocks
> of 128 packed values and X%128 RLE-encoded (VInt) values). On the
> other hand, doc values have GCD_COMPRESSED (which efficiently
> compresses any sequence of longs where all values can be expressed as
> a * x + b) which is typically useful for storing dates that don't have
> millisecond precision.
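
The a * x + b idea can be sketched roughly as follows: take b as the minimum value, a as the GCD of all (value - b), and store only the small quotients x. This is a toy version under my own assumptions, not Lucene's code: the name GcdCompressionSketch is hypothetical, the input is assumed non-empty, and value - min is assumed not to overflow a long.

```java
// Illustrative only: NOT the Lucene45 doc values implementation.
final class GcdCompressionSketch {
    private final long min;          // "b" in a * x + b
    private final long gcd;          // "a" in a * x + b
    private final long[] quotients;  // the "x" values, usually small

    GcdCompressionSketch(long[] values) {
        long mn = Long.MAX_VALUE;
        for (long v : values) {
            mn = Math.min(mn, v);
        }
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - mn);      // assumes v - mn does not overflow
        }
        min = mn;
        gcd = (g == 0) ? 1 : g;      // all values equal: any divisor works
        quotients = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            quotients[i] = (values[i] - min) / gcd;
        }
    }

    private static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    long commonDivisor() {
        return gcd;
    }

    // Bits a packed encoding would need per quotient (at least 1).
    int bitsPerValue() {
        long max = 0;
        for (long q : quotients) {
            max = Math.max(max, q);
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    }

    // Random access: a multiply and an add on top of the quotient read.
    long get(int index) {
        return min + gcd * quotients[index];
    }
}
```

For timestamps in milliseconds that only have day precision, the GCD comes out as 86400000, so a handful of nearby dates needs only a few bits per value once the quotients are bit-packed.
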
>
> About stored fields, it would indeed be possible to store all values
> of a given field in a column-stride fashion per chunk. However, I
> think parquet doesn't optimize for the same thing as stored fields:
> parquet needs to run computations on the values of a few fields of
> many documents (like doc values) while with stored fields, we usually
> need to get all values of a single document. This makes columnar
> storage a bit inconvenient for stored fields, although I think we
> could try it on our chunks of stored documents given that it may
> improve the compression ratio.
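
A small sketch of that trade-off, as I understand it: within one chunk, a column-stride layout puts values of the same field next to each other (which tends to help compression), but rebuilding a single document then means reading one offset per field. Everything here is hypothetical, including the name ChunkLayoutSketch and the simplifying assumption that every document in the chunk has the same fields in the same order.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not how Lucene stored fields are actually laid out.
final class ChunkLayoutSketch {
    // Row layout: all fields of doc 0, then all fields of doc 1, ...
    static List<String> rowLayout(String[][] docs) {
        List<String> out = new ArrayList<>();
        for (String[] doc : docs) {
            for (String value : doc) {
                out.add(value);
            }
        }
        return out;
    }

    // Column-stride layout: field 0 of every doc, then field 1 of every doc, ...
    static List<String> columnLayout(String[][] docs) {
        List<String> out = new ArrayList<>();
        int numFields = docs.length == 0 ? 0 : docs[0].length;
        for (int f = 0; f < numFields; f++) {
            for (String[] doc : docs) {
                out.add(doc[f]);
            }
        }
        return out;
    }

    // Offsets that must be read to rebuild one document from the column layout.
    static int[] offsetsOfDoc(int doc, int numDocs, int numFields) {
        int[] offsets = new int[numFields];
        for (int f = 0; f < numFields; f++) {
            offsets[f] = f * numDocs + doc;
        }
        return offsets;
    }
}
```

In the row layout a document's values are contiguous; in the column layout they are numDocs entries apart, which is the inconvenience for the "fetch one whole document" access pattern.
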
>
> I only have a very superficial understanding of parquet, so if I said
> something wrong about it, please tell me!
>
> [1] https://github.com/parquet/parquet-format
> [2] https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesConsumer.java
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

