[
https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529875#comment-14529875
]
Alex Levenson commented on PARQUET-251:
---------------------------------------
The only thing I want to check is whether we've got this backwards or not.
Binary already has a getBytes() method, and I think that is exactly what we
need; I don't think we need a clone() method, do we? All the getBytes()
implementations do a copy except ByteArrayBackedBinary's. Maybe it should do a
copy too. Or maybe a copy should happen when ByteArrayBackedBinary is
constructed.
The real question is: who "owns" the byte array passed to
ByteArrayBackedBinary? It seems a little odd to "give" this array to
ByteArrayBackedBinary and then continue to mutate it, right?
But again, one way to think of Binary is as a *view* onto some other byte
source. In that case the real bug is in the statistics collection code path,
which should be making a defensive copy. But do we need Binary to manage this
or not?
Another approach would be to make getBytes() always do a copy (which it
currently does for everything except ByteArrayBackedBinary). Most code paths
should be using the .writeTo methods anyway, right?
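To make the ownership question concrete, here is a minimal sketch (not the
real Parquet classes; BinaryView, getBytesUnsafe, and getBytesCopy are
hypothetical names) of how a view-style Binary behaves when the caller reuses
the backing byte[], and how a defensive copy avoids the stale-statistics
problem:

```java
import java.util.Arrays;

// Hypothetical stand-in for ByteArrayBackedBinary: a "view" that holds a
// reference to a caller-owned byte[] without copying it.
final class BinaryView {
    private final byte[] bytes;

    BinaryView(byte[] bytes) {
        this.bytes = bytes; // no copy: just a view over the caller's array
    }

    // Returns the shared array itself (what a copy-free getBytes() would do).
    byte[] getBytesUnsafe() {
        return bytes;
    }

    // Returns a defensive copy (what a copying getBytes() would do).
    byte[] getBytesCopy() {
        return Arrays.copyOf(bytes, bytes.length);
    }
}

public class ReuseDemo {
    public static void main(String[] args) {
        byte[] reused = {'r', 'o', 'w', '1'};
        BinaryView view = new BinaryView(reused);

        // Simulate statistics collection keeping "max" two different ways.
        byte[] maxByReference = view.getBytesUnsafe(); // aliases the buffer
        byte[] maxByCopy = view.getBytesCopy();        // independent copy

        // The writer reuses the buffer for the next row, as described above.
        reused[3] = '2';

        System.out.println(new String(maxByReference)); // mutated under us
        System.out.println(new String(maxByCopy));      // still the old value
    }
}
```

The reference-held value silently tracks every later mutation of the reused
buffer, which is exactly the min/max corruption reported in this issue; the
copy-on-read variant pays an allocation per call but stays correct.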
> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
> Key: PARQUET-251
> URL: https://issues.apache.org/jira/browse/PARQUET-251
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.6.0
> Reporter: Yijie Shen
> Assignee: Ashish K Singh
> Priority: Blocker
>
> I think it is common practice, when inserting table data as a Parquet file,
> to reuse the same object among rows; if a column is a byte[] of fixed
> length, the byte[] would also be reused.
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row
> groups created by a single task have the same max & min binary value, equal
> to the last row's binary content.
> The reason is that BinaryStatistics just keeps max & min as
> parquet.io.api.Binary references; since I use ByteArrayBackedBinary for my
> byte[], the real content of max & min always points to the reused byte[],
> and therefore to the latest row's content.
> Does Parquet declare anywhere that the user shouldn't reuse a byte[] for a
> Binary-typed column? If it doesn't, I think this is a bug; it can be
> reproduced by
> [Spark SQL's RowWriteSupport
> |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket:
> [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)