[
https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493386#comment-14493386
]
Jason Altekruse commented on PARQUET-251:
-----------------------------------------
[~yijieshen] It looks like the PR only solves the issue for the case where
the new ByteBuffer-backed implementation returned by the fromByteBuffer()
method in Binary.java is used (or other implementations that make a copy
themselves). Most implementations of Binary appear to already make a copy
when returning a byte[] from the getBytes() method. A more general approach
will be needed if a Binary object may return an array it plans on reusing.
That said, since there is no way to resize an array, it would seem best to
define getBytes() as required to return a copy, unless the Binary object is
immutable or the byte[] is only shared among values of equal length. Buffer
reuse seems to be the entire purpose of having the interface in the first
place: it allows a buffer of a large length to stay allocated for all
values, with the length set appropriately for each new value.
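To make that contract concrete, here is a minimal sketch, assuming a
hypothetical ReusableBinary class (not the actual parquet.io.api.Binary
API): a single large buffer is reused for every value, and getBytes()
returns a copy trimmed to the current length so callers can safely hold on
to the result.

{code:java}
// Hypothetical sketch of a reusable, buffer-backed value; not the real
// parquet.io.api.Binary implementation.
public class ReusableBinary {
  private final byte[] buffer; // one large buffer kept allocated for all values
  private int length;          // logical length of the current value

  public ReusableBinary(int capacity) {
    this.buffer = new byte[capacity];
  }

  // Each new value overwrites the same buffer and resets the length.
  public void set(byte[] value) {
    System.arraycopy(value, 0, buffer, 0, value.length);
    this.length = value.length;
  }

  // If getBytes() is required to return a copy, callers (such as the
  // statistics code) can safely retain the result even after the buffer
  // is overwritten by the next value.
  public byte[] getBytes() {
    return java.util.Arrays.copyOf(buffer, length);
  }
}
{code}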
If this were enforced at the Binary interface level, then I think the PR
would solve the functional issue.
It does, however, create the unfortunate overhead of making a ton of copies
if a column is full of increasing or decreasing values. It might be worth
processing a small batch of values, checking whether that batch contains a
new min/max, and making one defensive copy for the whole group; a sketch of
this idea follows below.
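A rough sketch of that batching idea, with hypothetical names (this is not
actual parquet-mr code, and it assumes the entries in a batch are distinct
arrays rather than one reused buffer mutated between entries): scan the
batch by reference, then make at most one defensive copy per extreme.

{code:java}
import java.util.List;

// Hypothetical sketch of batched min/max tracking; not the actual
// parquet-mr statistics code.
public class BatchedBinaryStats {
  private byte[] min; // defensive copies, owned by this class
  private byte[] max;

  // Find the batch-local extremes without copying, then make at most one
  // defensive copy per extreme for the whole batch.
  public void update(List<byte[]> batch) {
    byte[] batchMin = null, batchMax = null;
    for (byte[] v : batch) {
      if (batchMin == null || compare(v, batchMin) < 0) batchMin = v;
      if (batchMax == null || compare(v, batchMax) > 0) batchMax = v;
    }
    if (batchMin != null && (min == null || compare(batchMin, min) < 0)) {
      min = java.util.Arrays.copyOf(batchMin, batchMin.length);
    }
    if (batchMax != null && (max == null || compare(batchMax, max) > 0)) {
      max = java.util.Arrays.copyOf(batchMax, batchMax.length);
    }
  }

  // Unsigned lexicographic comparison; the real statistics ordering may
  // differ, this is just a stand-in for illustration.
  private static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }
}
{code}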
> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
> Key: PARQUET-251
> URL: https://issues.apache.org/jira/browse/PARQUET-251
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.6.0
> Reporter: Yijie Shen
> Priority: Blocker
>
> I think it is a common practice, when inserting table data as a Parquet
> file, to reuse the same object across rows, and if a column is a byte[] of
> fixed length, the byte[] would also be reused.
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row
> groups created by a single task end up with the same max & min binary value,
> namely the last row's binary content.
> The reason is that BinaryStatistics just keeps max & min as
> parquet.io.api.Binary references; since I use ByteArrayBackedBinary for my
> byte[], the real content of max & min always points to the reused byte[],
> and therefore to the latest row's content.
> Does Parquet declare anywhere that the user shouldn't reuse a byte[] for the
> Binary type? If it doesn't, I think this is a bug, and it can be reproduced by
> [Spark SQL's RowWriteSupport
> |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket:
> [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
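For reference, a minimal sketch of the failure mode described in the quoted
report, with hypothetical names (NaiveStats stands in for the statistics
class; this is not the actual parquet-mr code): because update() stores
references rather than copies, mutating the reused buffer retroactively
changes the recorded min and max.

{code:java}
// Hypothetical reproduction sketch of the reported bug; names are
// illustrative, not the actual parquet-mr classes.
public class StaleStatsDemo {
  // Statistics that keep references instead of copies (the bug).
  static class NaiveStats {
    byte[] min, max;
    void update(byte[] v) {
      if (min == null || compare(v, min) < 0) min = v; // reference, no copy
      if (max == null || compare(v, max) > 0) max = v; // reference, no copy
    }
    static int compare(byte[] a, byte[] b) {
      int n = Math.min(a.length, b.length);
      for (int i = 0; i < n; i++) {
        int d = (a[i] & 0xff) - (b[i] & 0xff);
        if (d != 0) return d;
      }
      return a.length - b.length;
    }
  }

  public static void main(String[] args) {
    NaiveStats stats = new NaiveStats();
    byte[] reused = new byte[1]; // one fixed-length buffer reused across rows
    for (byte b : new byte[] {5, 1, 9}) {
      reused[0] = b; // "write" the next row into the same buffer
      stats.update(reused);
    }
    // Both min and max still point at the reused buffer, so both print 9:
    // the last row's content, exactly as the report describes.
    System.out.println("min=" + stats.min[0] + " max=" + stats.max[0]);
  }
}
{code}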