Hi Sergio - I'm writing my own application using the AvroParquetWriter
with [email protected]. A gist of my application is at [1].

Two questions for you:

1. To clarify, in the PARQUET_1_0 encoding version _all_ columns in a
schema must use the same encoding?
2. How can I enable the PARQUET_2_0 encoding version? Alternatively,
is there a Maven repo with 2.x artifacts floating around?

Cheers,
--
b

[1]: https://gist.github.com/banjiewen/c6a5d4af0854764d54d2

On Wed, Jan 6, 2016 at 9:34 AM, Sergio Pena <[email protected]> wrote:
> Hi Benjamin. Several people were on vacation over the holidays, which is
> why you got a slow response on the dev@ email. The issue you're reporting
> is not a bug, but you might be using a different encoding version of Parquet.
>
> Currently, Parquet has two encoding versions, PARQUET_1_0 and PARQUET_2_0.
> PARQUET_2_0 is an experimental feature in which different encodings are
> applied per column type, such as the ones you mention and those
> described in
> https://github.com/apache/parquet-format/blob/master/Encodings.md. Only
> parquet 2.x versions have PARQUET_2_0 enabled by default. Parquet 1.x
> versions have PARQUET_1_0 enabled by default, but PARQUET_2_0 should be
> supported I think.
>
> How are you writing your data to Parquet? Did you write your own
> application, or are you using Hive, Impala, or something else?
>
> On Tue, Jan 5, 2016 at 4:39 PM, Nong Li <[email protected]> wrote:
>
>> Have we enabled the 2.0 encodings?
>>
>> On Wed, Dec 30, 2015 at 5:34 PM, Benjamin Anderson <[email protected]>
>> wrote:
>>
>> > Hi there - I'm working on a small Parquet project and encountering
>> > some surprising results with regard to encoding decisions.
>> >
>> > My dataset consists of ~1.5MM log lines parsed to an Avro schema and
>> > written to a Parquet file via AvroParquetWriter. According to its log
>> > output, Parquet is writing all int/long columns out with either
>> > [BIT_PACKED, PLAIN] or [BIT_PACKED, PLAIN_DICTIONARY]. This surprised
>> > me - at least one of those columns is an epoch value that should be
>> > quite amenable to DELTA_BINARY_PACKED encoding. What's the best way
>> > to understand Parquet's encoding choices?
>> >
>> > Secondary question: Is DELTA_BINARY_PACKED supported for INT64
>> > columns? The documentation[1] says it is, but the code[2] suggests
>> > otherwise.
>> >
>> > Cheers,
>> > --
>> > b
>> >
>> > [1]: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5
>> > [2]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L166-L168
>> >
>>
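
The intuition in the original message, that an epoch column should be
amenable to DELTA_BINARY_PACKED, can be illustrated with a small
standalone sketch. This is not Parquet's implementation and does not
reproduce the on-disk format (which works in blocks and miniblocks); it
only compares the bit width needed to store timestamps directly with the
width needed for their consecutive differences once the smallest
difference is subtracted. The class name DeltaSketch, its methods, and
the sample timestamps are all hypothetical.

```java
import java.util.Arrays;

// Illustrative only: NOT the Parquet DELTA_BINARY_PACKED format.
public class DeltaSketch {

    // Number of bits needed to hold a non-negative value (minimum 1).
    static int bitWidth(long value) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(value));
    }

    // Differences between consecutive values; empty for fewer than 2 values.
    static long[] deltas(long[] values) {
        if (values.length < 2) {
            return new long[0];
        }
        long[] result = new long[values.length - 1];
        for (int i = 1; i < values.length; i++) {
            result[i - 1] = values[i] - values[i - 1];
        }
        return result;
    }

    // Bits per delta once the minimum delta has been subtracted from each,
    // which is the basic idea behind delta-plus-bit-packing schemes.
    static int packedDeltaWidth(long[] values) {
        long[] d = deltas(values);
        long min = Arrays.stream(d).min().orElse(0L);
        long maxAdjusted = Arrays.stream(d).map(x -> x - min).max().orElse(0L);
        return bitWidth(maxAdjusted);
    }

    public static void main(String[] args) {
        // Hypothetical epoch-second timestamps from consecutive log lines.
        long[] epochs = {
            1451500440L, 1451500441L, 1451500443L, 1451500448L, 1451500450L
        };
        System.out.println("bits per raw value: " + bitWidth(epochs[epochs.length - 1]));
        System.out.println("bits per delta:     " + packedDeltaWidth(epochs));
    }
}
```

For these sample values each raw timestamp needs 31 bits, while each
adjusted delta fits in 3 bits, which is why slowly increasing values
such as log timestamps tend to suit a delta encoding.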
