To some extent that already happens in a rough way in things like BlockPackedWriter (and also postings lists).
For example, these things encode blocks (e.g. 128 values in the postings, maybe 1024 in doc values, I forget), and if they encounter a block where all values are the same, they just write a bit marking that and encode the value once (see the sketch below the quoted thread).

On Mon, Sep 16, 2013 at 1:18 PM, Otis Gospodnetic <[email protected]> wrote:
> You guys got it, of course. :)
>
> I liked the sound of being able to detect how to pack things at run time and
> switch between multiple approaches over time... or at least that's how I
> interpreted the announcement.
>
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
> On Mon, Sep 16, 2013 at 4:29 AM, Adrien Grand <[email protected]> wrote:
>> Thanks for pointing this out, Otis!
>>
>> I think the columnar nature of Parquet makes it more similar to doc values
>> than to stored fields, and indeed, if you look at the Parquet file-format
>> specification [1], it is very similar to what we have for doc values [2].
>> In both cases, we have:
>> - dictionary encoding (PLAIN_DICTIONARY in Parquet, TABLE_COMPRESSED in
>>   Lucene45DVF),
>> - bit-packing (BIT_PACKED(/RLE) in Parquet, DELTA_COMPRESSED in Lucene45DVF).
>>
>> Parquet also uses run-length encoding (RLE), which is unfortunately not
>> doable for doc values since they need to support random access. Parquet's
>> RLE compression is actually closer to what we have for postings lists (a
>> postings list of X values is encoded as X/128 blocks of 128 packed values
>> and X%128 VInt-encoded values). On the other hand, doc values have
>> GCD_COMPRESSED (which efficiently compresses any sequence of longs where
>> all values can be expressed as a * x + b), which is typically useful for
>> storing dates that don't have millisecond precision.
>>
>> About stored fields, it would indeed be possible to store all values of a
>> given field in a column-stride fashion per chunk. However, I think Parquet
>> doesn't optimize for the same thing as stored fields: Parquet needs to run
>> computations on the values of a few fields of many documents (like doc
>> values), while with stored fields we usually need to get all values of a
>> single document. This makes columnar storage a bit inconvenient for stored
>> fields, although I think we could try it on our chunks of stored documents
>> given that it may improve the compression ratio.
>>
>> I only have a very superficial understanding of Parquet, so if I said
>> something wrong about it, please tell me!
>>
>> [1] https://github.com/parquet/parquet-format
>> [2] https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesConsumer.java
>>
>> --
>> Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
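
Here is a minimal, hypothetical sketch of that constant-block shortcut (the class and markers are made up for illustration; this is not the actual BlockPackedWriter on-disk format): buffer a fixed-size block of longs, and if every value in the block is identical, write a one-byte marker plus the value once instead of encoding the whole block.

    import java.io.DataOutput;
    import java.io.IOException;

    /**
     * Toy block writer illustrating the "all values equal" shortcut.
     * Buffer BLOCK_SIZE longs; if a block is constant, write a marker
     * and the value once instead of encoding every value.
     */
    class ToyBlockWriter {
      static final int BLOCK_SIZE = 128;   // postings use 128; doc values use larger blocks

      private final long[] buffer = new long[BLOCK_SIZE];
      private int count;
      private final DataOutput out;

      ToyBlockWriter(DataOutput out) {
        this.out = out;
      }

      void add(long value) throws IOException {
        buffer[count++] = value;
        if (count == BLOCK_SIZE) {
          flushBlock();
        }
      }

      void finish() throws IOException {
        if (count > 0) {
          flushBlock();            // flush the trailing partial block
        }
      }

      private void flushBlock() throws IOException {
        long first = buffer[0];
        boolean allEqual = true;
        for (int i = 1; i < count; i++) {
          if (buffer[i] != first) {
            allEqual = false;
            break;
          }
        }
        if (allEqual) {
          out.writeByte(0);        // marker: constant block
          out.writeLong(first);    // value written once for the whole block
        } else {
          out.writeByte(1);        // marker: explicit block
          for (int i = 0; i < count; i++) {
            out.writeLong(buffer[i]);
          }
        }
        count = 0;
      }
    }

The real writers bit-pack non-constant blocks with the minimal number of bits per value; the raw writeLong calls here just keep the sketch short.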
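
And a small sketch of the GCD_COMPRESSED idea Adrien describes above, assuming each value is stored as base + gcd * quotient (class and method names are invented for illustration, and long overflow on extreme ranges is ignored). Unlike RLE, each value decodes independently from its quotient, so random access is preserved.

    import java.math.BigInteger;

    /**
     * Sketch of GCD compression: if every value can be written as
     * base + gcd * x, store only base, gcd and the small multipliers x.
     * E.g. second-precision dates stored as milliseconds share a GCD of 1000.
     */
    class GcdCompressionSketch {
      /** Fills header with {base, gcd} and returns the per-value quotients. */
      static long[] encode(long[] values, long[] header) {
        long min = values[0];
        for (long v : values) min = Math.min(min, v);

        long gcd = 0;
        for (long v : values) {
          gcd = BigInteger.valueOf(gcd).gcd(BigInteger.valueOf(v - min)).longValue();
        }
        if (gcd == 0) gcd = 1;     // all values identical

        header[0] = min;
        header[1] = gcd;

        long[] quotients = new long[values.length];
        for (int i = 0; i < values.length; i++) {
          quotients[i] = (values[i] - min) / gcd;   // small numbers, cheap to bit-pack
        }
        return quotients;
      }

      /** Random access: any value decodes from its quotient alone. */
      static long decode(long quotient, long base, long gcd) {
        return base + quotient * gcd;
      }
    }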
