Hi All, This looks like it got dropped. Is there anything else I should do before this can be considered for review?
Thanks for the help, Lars On Thu, Jan 5, 2017 at 8:08 PM, Lars Volker <l...@cloudera.com> wrote: > I created PARQUET-826 <https://issues.apache.org/jira/browse/PARQUET-826> > to track this and submitted PR #48 > <https://github.com/apache/parquet-format/pull/48> to address it. > > On Fri, Dec 16, 2016 at 8:06 PM, Lars Volker <l...@cloudera.com> wrote: > >> Hi All, >> >> I'm currently working on adding support for writing min/max statistics to >> Parquet files to Impala (IMPALA-3909 >> <https://issues.cloudera.org/browse/IMPALA-3909>). I noticed, that the >> comments in parquet.thrift#L201 >> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L201> >> don't >> seem to match the implementations in parquet-mr and Hive. >> >> The comments ask for min/max statistics to be "*encoded in PLAIN >> encoding*". For strings (BYTE_ARRAY), this should be "*4 byte length >> stored as little endian, followed by bytes*". >> >> Looking at BinaryStatistics.java#L61 >> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L61>, >> it seems to return the bytes without a length-prefix. Writing a parquet >> file with Hive also shows this behavior. >> >> Is this the intended behavior? If so, we might want to add a description >> to the Statistics struct in parquet.thrift to elaborate on the intrinsics >> of storing string values there. >> >> Similarly, but less ambiguous, PLAIN encoding for booleans uses >> bit-packing. It seems to be implied that for a single bit (min/max of a >> boolean column) it means setting the least significant bit of a single >> byte. This could be made more clear in the parquet.thrift file, too. >> >> I'm curious to hear your feedback. Let me know if you think we should >> change the parquet.thrift file and I'll happily send a PR. >> >> Cheers, Lars >> > >