Hi All,

This looks like it got dropped. Is there anything else I should do before
this can be considered for review?

Thanks for the help, Lars

On Thu, Jan 5, 2017 at 8:08 PM, Lars Volker <l...@cloudera.com> wrote:

> I created PARQUET-826 <https://issues.apache.org/jira/browse/PARQUET-826>
> to track this and submitted PR #48
> <https://github.com/apache/parquet-format/pull/48> to address it.
>
> On Fri, Dec 16, 2016 at 8:06 PM, Lars Volker <l...@cloudera.com> wrote:
>
>> Hi All,
>>
>> I'm currently working on adding support for writing min/max statistics to
>> Parquet files to Impala (IMPALA-3909
>> <https://issues.cloudera.org/browse/IMPALA-3909>). I noticed that the
>> comments in parquet.thrift#L201
>> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L201>
>> don't seem to match the implementations in parquet-mr and Hive.
>>
>> The comments ask for min/max statistics to be "*encoded in PLAIN
>> encoding*". For strings (BYTE_ARRAY), this should be "*4 byte length
>> stored as little endian, followed by bytes*".
>>
>> Looking at BinaryStatistics.java#L61
>> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L61>,
>> it seems to return the bytes without a length-prefix. Writing a parquet
>> file with Hive also shows this behavior.
>>
>> Is this the intended behavior? If so, we might want to add a description
>> to the Statistics struct in parquet.thrift to elaborate on the details
>> of storing string values there.
>>
>> Similarly, but less ambiguously, PLAIN encoding for booleans uses
>> bit-packing. It seems to be implied that for a single bit (min/max of a
>> boolean column) it means setting the least significant bit of a single
>> byte. This could be made clearer in the parquet.thrift file, too.
>>
>> I'm curious to hear your feedback. Let me know if you think we should
>> change the parquet.thrift file and I'll happily send a PR.
>>
>> Cheers, Lars
>>
>
>
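
To make the BYTE_ARRAY question in the quoted message concrete, here is a minimal sketch of the two candidate byte layouts for a string min/max value. The function names are hypothetical and exist only for illustration; this is not code from parquet-mr, Hive, or Impala. The first function follows the wording quoted from parquet.thrift (4-byte little-endian length, then the bytes); the second mirrors the behavior observed in BinaryStatistics.java (bytes only, no length prefix).

```python
import struct


def plain_encode_byte_array(value: bytes) -> bytes:
    """Layout described by the quoted comment: 4-byte little-endian length, then the bytes."""
    return struct.pack("<I", len(value)) + value


def raw_bytes(value: bytes) -> bytes:
    """Layout observed in the existing writers: the bytes alone, no length prefix."""
    return value


if __name__ == "__main__":
    value = "abc".encode("utf-8")
    print(plain_encode_byte_array(value).hex())  # 03000000616263
    print(raw_bytes(value).hex())                # 616263
```

A reader that expects one layout and receives the other would misinterpret the first four bytes, which is why the comment and the implementations need to agree.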

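The boolean case from the quoted message can be sketched the same way. PLAIN encoding bit-packs booleans starting at the least significant bit, so a single min or max value would, under the reading described above, occupy the lowest bit of one byte. The helper names below are hypothetical, and the single-byte interpretation is the assumption raised in the thread, not something the thrift comments state explicitly.

```python
from typing import Sequence


def pack_booleans(values: Sequence[bool]) -> bytes:
    """Bit-pack booleans LSB first, eight values per byte."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def encode_boolean_stat(value: bool) -> bytes:
    """Assumed encoding of a single min/max boolean: one byte, value in the LSB."""
    return pack_booleans([value])


def decode_boolean_stat(data: bytes) -> bool:
    """Read the value back from the least significant bit of the first byte."""
    return bool(data[0] & 0x01)


if __name__ == "__main__":
    print(encode_boolean_stat(True).hex())   # 01
    print(encode_boolean_stat(False).hex())  # 00
```

Treating the single value as a one-element bit-packed run keeps the statistic consistent with how a data page of booleans would be written.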