[ https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493386#comment-14493386 ]

Jason Altekruse commented on PARQUET-251:
-----------------------------------------

[~yijieshen] It looks like the PR only solves the issue for the case where the 
new ByteBuffer-backed implementation returned by the fromByteBuffer() method in 
Binary.java is used (or other implementations that make a copy themselves). 
Most implementations of Binary appear to make a copy when returning a byte[] 
from the getBytes() method. A more general approach will be needed if a Binary 
object may return an array it plans on re-using. That being said, since there 
is no way to resize an array, it would seem best to define getBytes() as 
required to return a copy, unless the Binary object is immutable or you are 
only sharing the byte[] among values of equal length. Reuse seems to be the 
entire purpose of having the interface in the first place: it allows a buffer 
of a large length to stay allocated for all values, with the length set 
appropriately for each new value.

If this were enforced at the Binary interface level, then I think the PR would 
solve the functional issue.
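
For illustration, here is a minimal sketch of that "getBytes() returns a copy" 
contract. The class and method names are simplified stand-ins, not the actual 
parquet.io.api.Binary API:

{code:java}
import java.util.Arrays;

// Hypothetical, simplified stand-in for a reused-buffer Binary implementation.
// It only illustrates the contract discussed above: even though the backing
// buffer is overwritten for every value, getBytes() hands out a private copy.
final class ReusedBufferBinary {
  private final byte[] buffer;  // large buffer kept allocated across all values
  private int offset;
  private int length;

  ReusedBufferBinary(byte[] buffer) {
    this.buffer = buffer;
  }

  // Point this Binary at the bytes of the next value without reallocating.
  void reset(int offset, int length) {
    this.offset = offset;
    this.length = length;
  }

  // Returning a defensive copy keeps callers such as the statistics code safe,
  // because the returned array is never aliased to the reused buffer.
  byte[] getBytes() {
    return Arrays.copyOfRange(buffer, offset, offset + length);
  }
}
{code}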

It does, however, create the unfortunate overhead of making a ton of copies if 
a column is full of increasing or decreasing values. It might be worth 
processing a small group of values at a time, checking whether the group 
contains a new min/max, and making a single defensive copy for that group.
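
A rough sketch of that idea follows. This is not the parquet-mr statistics 
API; it assumes unsigned lexicographic ordering and that the values in a group 
stay valid while the group is scanned (e.g. because they sit at distinct 
offsets in a staging buffer):

{code:java}
import java.util.Arrays;
import java.util.List;

// Hypothetical "scan a small group, copy at most once" accumulator.
final class BatchedBinaryMinMax {
  private byte[] min;
  private byte[] max;

  void updateGroup(List<byte[]> group) {
    byte[] candMin = null;
    byte[] candMax = null;
    // Find the extremes of the group without copying anything yet.
    for (byte[] v : group) {
      if (candMin == null || compare(v, candMin) < 0) candMin = v;
      if (candMax == null || compare(v, candMax) > 0) candMax = v;
    }
    // Copy only if the group's extremes beat the stored ones, so an
    // increasing or decreasing column costs one copy per group rather
    // than one copy per value.
    if (candMin != null && (min == null || compare(candMin, min) < 0)) {
      min = Arrays.copyOf(candMin, candMin.length);
    }
    if (candMax != null && (max == null || compare(candMax, max) > 0)) {
      max = Arrays.copyOf(candMax, candMax.length);
    }
  }

  // Unsigned lexicographic comparison of two byte arrays.
  private static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return a.length - b.length;
  }
}
{code}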

> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
>                 Key: PARQUET-251
>                 URL: https://issues.apache.org/jira/browse/PARQUET-251
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Yijie Shen
>            Priority: Blocker
>
> I think it is common practice, when inserting table data as a Parquet file, 
> to reuse the same object across rows; and if a column is a byte[] of fixed 
> length, the byte[] would also be reused. 
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row 
> groups created by a single task end up with the same max & min binary value, 
> which is just the last row's binary content.
> The reason is that BinaryStatistics keeps max & min only as 
> parquet.io.api.Binary references; since I use ByteArrayBackedBinary for the 
> byte[], the real content of max & min always points to the reused byte[], and 
> therefore to the latest row's content.
> Does Parquet declare anywhere that the user shouldn't reuse a byte[] for the 
> Binary type? If it doesn't, I think it's a bug, and it can be reproduced with 
> [Spark SQL's RowWriteSupport|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354].
> The related Spark JIRA ticket: 
> [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
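
For anyone skimming, here is a self-contained illustration (plain Java, not 
parquet-mr code) of the aliasing the report above describes: a stored "max" 
that is only a reference into the reused buffer ends up reporting whatever the 
buffer last held.

{code:java}
import java.nio.charset.StandardCharsets;

// Minimal illustration of the aliasing problem: a "statistics" holder that
// keeps a reference into the reused byte[] reports the last row written,
// not the true maximum.
public class ReusedBufferAliasing {
  public static void main(String[] args) {
    byte[] reused = new byte[3];   // one buffer reused for every row
    byte[] storedMax = null;       // the "statistics" keep a reference, not a copy

    for (String row : new String[] {"aaa", "zzz", "mmm"}) {
      System.arraycopy(row.getBytes(StandardCharsets.US_ASCII), 0, reused, 0, 3);
      String current = new String(reused, StandardCharsets.US_ASCII);
      if (storedMax == null
          || current.compareTo(new String(storedMax, StandardCharsets.US_ASCII)) > 0) {
        storedMax = reused;        // BUG: aliases the reused buffer
      }
    }
    // Prints "mmm" (the last row written), not the true max "zzz".
    System.out.println(new String(storedMax, StandardCharsets.US_ASCII));
  }
}
{code}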


