Hi Sergio - I'm writing my own application using the AvroParquetWriter with [email protected]. A gist of my application is at [1].
Two questions for you:

1. To clarify, in the PARQUET_1_0 encoding version, must _all_ columns
   in a schema use the same encoding?
2. How can I enable the PARQUET_2_0 encoding version? Or alternatively,
   is there a Maven repo with 2.x artifacts floating around?

Cheers,
--
b

[1]: https://gist.github.com/banjiewen/c6a5d4af0854764d54d2

On Wed, Jan 6, 2016 at 9:34 AM, Sergio Pena <[email protected]> wrote:

> Hi Benjamin,
>
> Several people were on vacation due to the holidays, which is why
> you got a slow response on the dev@ email. The issue you're reporting
> is not a bug, but you might be using a different encoding version of
> Parquet.
>
> Currently, Parquet has two encoding versions, PARQUET_1_0 and PARQUET_2_0.
> PARQUET_2_0 is an experimental feature where different types of encodings
> are applied per column type, such as the ones you mention, which are also
> described in
> https://github.com/apache/parquet-format/blob/master/Encodings.md. Only
> Parquet 2.x versions have PARQUET_2_0 enabled by default. Parquet 1.x
> versions have PARQUET_1_0 enabled by default, but I think PARQUET_2_0
> should be supported.
>
> How are you writing your data to Parquet? Did you write your own
> application, or are you using Hive, Impala, or anything else?
>
> On Tue, Jan 5, 2016 at 4:39 PM, Nong Li <[email protected]> wrote:
>
>> Have we enabled the 2.0 encodings?
>>
>> On Wed, Dec 30, 2015 at 5:34 PM, Benjamin Anderson <[email protected]>
>> wrote:
>>
>> > Hi there - I'm working on a small Parquet project and encountering
>> > some surprising results with regard to encoding decisions.
>> >
>> > My dataset consists of ~1.5MM log lines parsed to an Avro schema and
>> > written to a Parquet file via AvroParquetWriter. According to its log
>> > output, Parquet is writing all int/long columns out with either
>> > [BIT_PACKED, PLAIN] or [BIT_PACKED, PLAIN_DICTIONARY]. This surprised
>> > me - at least one of those columns is an epoch value that should be
>> > quite amenable to DELTA_BINARY_PACKED. What's the best way to
>> > understand Parquet's encoding choices?
>> >
>> > Secondary question: Is DELTA_BINARY_PACKED supported for INT64
>> > columns? The documentation[1] says it is, but the code[2] suggests
>> > otherwise.
>> >
>> > Cheers,
>> > --
>> > b
>> >
>> > [1]: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5
>> > [2]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L166-L168
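
For reference, here is a rough illustration of why an epoch column looks like a good fit for DELTA_BINARY_PACKED. This is only a simplified sketch of the delta idea in Python, not Parquet's actual on-disk format (which adds blocks, miniblocks, and headers). The function name `delta_bit_width` and the sample timestamps are hypothetical, made up for this example.

```python
def delta_bit_width(values):
    """Bits per delta needed after subtracting the minimum delta.

    Simplified model: real DELTA_BINARY_PACKED works per block and
    miniblock and also stores headers, which are ignored here.
    """
    if len(values) < 2:
        return 0
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    # Subtracting the minimum makes every adjusted delta non-negative.
    max_adjusted = max(d - min_delta for d in deltas)
    return max_adjusted.bit_length()


# Hypothetical epoch-millisecond timestamps from consecutive log lines.
timestamps = [
    1451520000000,
    1451520001200,
    1451520001900,
    1451520004700,
    1451520005000,
]

width = delta_bit_width(timestamps)
plain_bits = 64 * len(timestamps)                 # PLAIN: 64 bits per INT64
delta_bits = 64 + width * (len(timestamps) - 1)   # first value in full, then packed deltas

print(width, plain_bits, delta_bits)
```

Under these assumptions the deltas fit in far fewer bits than the raw 64-bit values, which is why slowly increasing timestamps seem like a natural candidate for delta encoding.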
