[ 
https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527414#comment-14527414
 ] 

Ashish K Singh commented on PARQUET-251:
----------------------------------------

Hey guys, I am planning to take a stab on this. After following the detailed 
discussion so far, I think (feel free to correct me) following needs to be done.

1. We need a clone() or copyToByteArray() method in Binary. Having the method 
should make it obvious to the user that the byte[] should not be mutated and if 
it plans on doing it clone() or copyToByteArray() should be used. However, this 
should be explicitly mentioned in the docs as well.

2. Update BinaryStatistics to use clone() or copyToByteArray() on the Binary 
value passed to it.

3. Ignore min/max byte arrays for data written with 1.6.0 and earlier.

If you guys agree with it, I should be able to submit a PR soon.

> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
>                 Key: PARQUET-251
>                 URL: https://issues.apache.org/jira/browse/PARQUET-251
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Yijie Shen
>            Assignee: Ashish K Singh
>            Priority: Blocker
>
> I think it is a common practice when inserting table data as parquet file, 
> one would always reuse the same object among rows, and if a column is byte[] 
> of fixed length, the byte[] would also be reused. 
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: All of the row 
> groups created by a single task would have the same max & min binary value, 
> just as the last row's binary content.
> The reason is BinaryStatistic just keep max & min as parquet.io.api.Binary 
> references, since I use ByteArrayBackedBinary for byte[], the real content of 
> max & min would always point to the reused byte[], therefore the latest row's 
> content.
> Does parquet declare somewhere that the user shouldn't reuse byte[] for 
> Binary type?  If it doesn't, I think it's a bug and can be reproduced by 
> [Spark SQL's RowWriteSupport 
> |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket: 
> [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to