[
https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530324#comment-14530324
]
Konstantin Shaposhnikov commented on PARQUET-251:
-------------------------------------------------
I am still learning the Parquet code base, so please take my comments with a
grain of salt.
Doing the copy in BinaryStatistics is probably better, as it potentially reduces
the number of copies (and garbage). In the worst case, though, when the binary
field is sorted, it will still create a copy for each update, because every new
value replaces the current min or max.
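To make the trade-off concrete, here is a minimal sketch of what copying inside the statistics class could look like. The class and method names loosely mirror the Statistics API, but the sketch itself is an assumption, not Parquet's actual implementation:

```java
// Hypothetical sketch: defensive copies inside a BinaryStatistics-like class,
// so min/max never alias a caller-owned (possibly reused) byte[].
public class BinaryStatisticsSketch {
  private byte[] min;
  private byte[] max;
  private boolean hasValues = false;

  public void updateStats(byte[] value) {
    if (!hasValues) {
      // First value: copy once for both bounds.
      min = value.clone();
      max = value.clone();
      hasValues = true;
      return;
    }
    if (compareUnsigned(value, min) < 0) {
      min = value.clone(); // copy only when the bound actually changes
    }
    if (compareUnsigned(value, max) > 0) {
      max = value.clone(); // on sorted input this fires on every update
    }
  }

  public byte[] getMin() { return min; }
  public byte[] getMax() { return max; }

  // Unsigned lexicographic comparison of byte arrays.
  private static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = (a[i] & 0xff) - (b[i] & 0xff);
      if (cmp != 0) return cmp;
    }
    return a.length - b.length;
  }
}
```

Note the worst case mentioned above: with ascending input, the max branch copies on every single update.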
I think that on the write path Binary.getBytes() almost always needs to return a
copy; e.g. PlainBinaryDictionaryValuesWriter already calls copy() before storing
a Binary in its internal dictionary.
However, I don't think this is needed on the read path, which suggests that two
methods are required.
One more possible cleanup would be to remove public methods from the Statistics
classes that are unused or used only in tests (e.g. genericGetMin/Max, getMax,
getMin, etc.). That would break backward compatibility, though, so I am not sure
it can be done.
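For reference, the aliasing bug this issue describes can be reduced to a few lines. The class and field names below are invented for illustration and are not Parquet's actual code; the point is that storing the incoming byte[] by reference lets the caller's buffer reuse silently rewrite the recorded statistic:

```java
// Minimal reproduction sketch (assumed names, not Parquet's code) of the
// aliasing bug: a statistic stored by reference changes when the caller
// reuses the buffer for the next row.
public class ReuseBugDemo {
  static byte[] max; // stored by reference, no defensive copy -- the bug

  static void updateMax(byte[] value) {
    if (max == null || compareUnsigned(value, max) > 0) {
      max = value; // aliases the caller's buffer
    }
  }

  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int c = (a[i] & 0xff) - (b[i] & 0xff);
      if (c != 0) return c;
    }
    return a.length - b.length;
  }

  public static void main(String[] args) {
    byte[] reused = new byte[1];
    reused[0] = 7;
    updateMax(reused);    // max now aliases `reused`
    reused[0] = 3;        // next "row" reuses the same array
    updateMax(reused);    // 3 < 7, so max is not replaced...
    System.out.println(max[0]); // ...yet prints 3: max points at `reused`
  }
}
```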
> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
> Key: PARQUET-251
> URL: https://issues.apache.org/jira/browse/PARQUET-251
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.6.0
> Reporter: Yijie Shen
> Assignee: Ashish K Singh
> Priority: Blocker
>
> I think it is common practice, when inserting table data as a Parquet file,
> to reuse the same object across rows; and if a column is a byte[] of fixed
> length, the byte[] is also reused.
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row
> groups created by a single task end up with the same max & min binary value,
> namely the last row's binary content.
> The reason is that BinaryStatistics keeps max & min only as
> parquet.io.api.Binary references; since I use ByteArrayBackedBinary for my
> byte[], the real content of max & min always points to the reused byte[],
> and therefore to the latest row's content.
> Does Parquet declare anywhere that the user shouldn't reuse a byte[] for the
> Binary type? If it doesn't, I think this is a bug, and it can be reproduced by
> [Spark SQL's RowWriteSupport
> |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket:
> [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)