[ 
https://issues.apache.org/jira/browse/PARQUET-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Volker updated PARQUET-826:
--------------------------------
    Fix Version/s: 1.9.0

> parquet.thrift comments for Statistics are not consistent with parquet-mr and 
> Hive implementations
> --------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-826
>                 URL: https://issues.apache.org/jira/browse/PARQUET-826
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Lars Volker
>            Assignee: Lars Volker
>             Fix For: 1.9.0
>
>
> I'm currently adding support to Impala for writing min/max statistics to 
> Parquet files 
> ([IMPALA-3909|https://issues.cloudera.org/browse/IMPALA-3909]). I noticed 
> that the comments in 
> [parquet.thrift#L201|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L201]
>  don't seem to match the implementations in parquet-mr and Hive.
> The comments ask for min/max statistics to be "encoded in PLAIN encoding". 
> For strings (BYTE_ARRAY), PLAIN encoding means "4 byte length stored as 
> little endian, followed by bytes".
> Looking at 
> [BinaryStatistics.java#L61|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L61],
>  it seems to return the bytes without a length prefix. Writing a Parquet file 
> with Hive also shows this behavior.
> Similarly, but less ambiguously, PLAIN encoding for booleans uses bit-packing. 
> It seems to be implied that for a single bit (min/max of a boolean column) it 
> means setting the least significant bit of a single byte. This could be made 
> clearer in the parquet.thrift file, too.
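
To illustrate the discrepancy described above, here is a minimal Python sketch
of the two byte layouts for a BYTE_ARRAY min/max value, plus the single-byte
reading of a boolean min/max. The function names are hypothetical and are not
part of parquet-mr, Hive, or Impala. The layouts follow only the wording quoted
in the issue, and the boolean layout is the implied interpretation, not a
confirmed one.

```python
import struct


def plain_encode_byte_array(value: bytes) -> bytes:
    """What the parquet.thrift comment asks for, read literally:
    a 4-byte little-endian length followed by the bytes themselves."""
    return struct.pack("<I", len(value)) + value


def raw_stat_bytes(value: bytes) -> bytes:
    """What the issue reports observing from parquet-mr and Hive:
    the bytes alone, with no length prefix."""
    return value


def single_boolean_byte(value: bool) -> bytes:
    """Assumed layout for a boolean min/max: one byte whose least
    significant bit carries the value."""
    return bytes([1 if value else 0])


if __name__ == "__main__":
    min_value = b"apple"
    print(plain_encode_byte_array(min_value).hex())  # 050000006170706c65
    print(raw_stat_bytes(min_value).hex())           # 6170706c65
    print(single_boolean_byte(True).hex())           # 01
```

A reader that expects the length-prefixed form would misread the first four
bytes of an unprefixed value as a length, which is why the comment and the
implementations need to agree.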



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
