Re: Encoding decisions

Sergio Pena Wed, 06 Jan 2016 12:18:18 -0800

> 1. To clarify, in the PARQUET_1_0 encoding version _all_ columns in a
> schema must use the same encoding?


Each column has its own encoding; however, most of the columns in
PARQUET_1_0 use the same encoding. When dictionary encoding
is enabled, it is used on each page only as long as the
dictionary page (per row group) hasn't grown bigger than
ParquetProperties.DEFAULT_DICTIONARY_PAGE_SIZE.
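
As a rough illustration of that fallback, here is a small self-contained Java
sketch that models the pages of a single long column. It is a toy model under
stated assumptions, not parquet-mr's implementation: the class name
DictionaryFallbackSketch, the method encodePages, and the 32-byte limit are
made up for the example, and the real writer may apply further conditions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of dictionary fallback; not the parquet-mr code.
final class DictionaryFallbackSketch {

    /**
     * Returns the encoding picked for each page of one column.
     * The dictionary is shared by all pages; once its size exceeds
     * maxDictionaryBytes, that page and every later page fall back to PLAIN.
     */
    static List<String> encodePages(List<long[]> pages, int maxDictionaryBytes) {
        Set<Long> dictionary = new LinkedHashSet<>();
        List<String> encodings = new ArrayList<>();
        boolean fellBack = false;

        for (long[] page : pages) {
            if (!fellBack) {
                for (long value : page) {
                    dictionary.add(value);
                }
                // Assumption: each distinct long costs 8 bytes in the dictionary page.
                long dictionaryBytes = (long) dictionary.size() * Long.BYTES;
                if (dictionaryBytes > maxDictionaryBytes) {
                    fellBack = true;
                }
            }
            encodings.add(fellBack ? "PLAIN" : "PLAIN_DICTIONARY");
        }
        return encodings;
    }

    public static void main(String[] args) {
        List<long[]> pages = Arrays.asList(
                new long[] {1, 1, 2},   // dictionary {1, 2}: 16 bytes
                new long[] {2, 3, 3},   // dictionary {1, 2, 3}: 24 bytes
                new long[] {4, 5, 6},   // dictionary would need 48 bytes: over the limit
                new long[] {1, 1, 1});  // already fell back, stays PLAIN
        // Prints [PLAIN_DICTIONARY, PLAIN_DICTIONARY, PLAIN, PLAIN]
        System.out.println(encodePages(pages, 32));
    }
}
```

This mirrors the mix Benjamin saw in his log output: columns with few distinct
values stay dictionary-encoded, while high-cardinality columns such as epoch
timestamps tend to outgrow the dictionary and end up PLAIN.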

> 2. How can I enable the PARQUET_2_0 encoding version? Or
> alternatively, is there a maven repo with 2.x artifacts floating
> around?

You can use PARQUET_2_0 when creating the ParquetWriter. Just pass
WriterVersion.PARQUET_2_0 either as a constructor parameter:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L220

or as a builder parameter:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L482

On Wed, Jan 6, 2016 at 12:52 PM, Benjamin Anderson <[email protected]> wrote:

> Hi Sergio - I'm writing my own application using the AvroParquetWriter
> with [email protected]. A gist of my application is at [1].
>
> Two questions for you:
>
> 1. To clarify, in the PARQUET_1_0 encoding version _all_ columns in a
> schema must use the same encoding?
> 2. How can I enable the PARQUET_2_0 encoding version? Or
> alternatively, is there a maven repo with 2.x artifacts floating
> around?
>
> Cheers,
> --
> b
>
> [1]: https://gist.github.com/banjiewen/c6a5d4af0854764d54d2
>
> On Wed, Jan 6, 2016 at 9:34 AM, Sergio Pena <[email protected]>
> wrote:
> > Hi Benjamin, Several people were on vacation due to the holidays, which is
> > why you got a slow response on the dev@ list. The issue you're reporting
> > is not a bug, but you might be using a different encoding version of
> > Parquet.
> >
> > Currently, Parquet has two encoding versions, PARQUET_1_0 and PARQUET_2_0.
> > PARQUET_2_0 is an experimental feature where different types of encodings
> > are applied per column type, such as the ones you are mentioning and those
> > described in
> > https://github.com/apache/parquet-format/blob/master/Encodings.md. Only
> > parquet 2.x versions have PARQUET_2_0 enabled by default. Parquet 1.x
> > versions have PARQUET_1_0 enabled by default, but I think PARQUET_2_0
> > should be supported.
> >
> > How are you writing your data to Parquet? Did you write your own
> > application, or using Hive, Impala, or anything else?
> >
> > On Tue, Jan 5, 2016 at 4:39 PM, Nong Li <[email protected]> wrote:
> >
> >> Have we enabled the 2.0 encodings?
> >>
> >> On Wed, Dec 30, 2015 at 5:34 PM, Benjamin Anderson <[email protected]>
> >> wrote:
> >>
> >> > Hi there - I'm working on a small Parquet project and encountering
> >> > some surprising results with regard to encoding decisions.
> >> >
> >> > My dataset consists of ~1.5MM log lines parsed to an Avro schema and
> >> > written to a Parquet file via AvroParquetWriter. According to its log
> >> > output, Parquet is writing all int/long columns out with either
> >> > [BIT_PACKED, PLAIN] or [BIT_PACKED, PLAIN_DICTIONARY]. This surprised
> >> > me - at least one of those columns is an epoch value that should be
> >> > quite amenable to the DELTA_BINARY_PACKED. What's the best way to
> >> > understand Parquet's encoding choices?
> >> >
> >> > Secondary question: Is DELTA_BINARY_PACKED supported for INT64
> >> > columns? The documentation[1] says it is, but the code[2] suggests
> >> > otherwise.
> >> >
> >> > Cheers,
> >> > --
> >> > b
> >> >
> >> > [1]: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5
> >> > [2]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L166-L168
> >> >
> >>
>
