[https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491548#comment-14491548]
Cheng Lian commented on SPARK-6859:
-----------------------------------
[~yijieshen] Thanks for reporting! And yes, please open a JIRA ticket for
Parquet as well and link it to this one so that it's easier to track.
[~marmbrus] I guess we should disable pushing down filters involving binary
type until this bug is fixed in Parquet.
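A minimal sketch of such a guard (hypothetical helper, not Spark's actual
ParquetFilters code; only the public data source {{Filter}} case classes are
assumed): decline to translate any predicate that touches a binary column, so
the corrupt row-group statistics are never consulted.
{code:scala}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types.{BinaryType, StructType}

// Hypothetical guard: return None for predicates on binary columns so the
// caller falls back to scanning row groups instead of trusting statistics.
def pushableFilter(schema: StructType, filter: Filter): Option[Filter] = {
  def isBinary(col: String): Boolean =
    schema.fields.exists(f => f.name == col && f.dataType == BinaryType)

  filter match {
    case EqualTo(col, _) if isBinary(col)            => None
    case GreaterThan(col, _) if isBinary(col)        => None
    case GreaterThanOrEqual(col, _) if isBinary(col) => None
    case LessThan(col, _) if isBinary(col)           => None
    case LessThanOrEqual(col, _) if isBinary(col)    => None
    case other                                       => Some(other)
  }
}
{code}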
> Parquet File Binary column statistics error when reuse byte[] among rows
> ------------------------------------------------------------------------
>
> Key: SPARK-6859
> URL: https://issues.apache.org/jira/browse/SPARK-6859
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0, 1.3.0, 1.4.0
> Reporter: Yijie Shen
> Priority: Minor
>
> Suppose I create a dataRDD that extends RDD[Row], where each row is a
> GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is
> reused among rows but holds different content each time. When I convert it to
> a DataFrame and save it as a Parquet file, the file's row group statistics
> (max & min) for the Binary column are wrong.
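> A minimal sketch of this reproduction (hypothetical path and variable names;
> assumes the {{sc}} and {{sqlContext}} bindings of a Spark 1.3-style shell):
> {code:scala}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
>
> val buf = new Array[Byte](4)                // one shared buffer for all rows
> val rdd = sc.parallelize(1 to 3, 1).map { i =>
>   java.util.Arrays.fill(buf, i.toByte)      // new content, same reference
>   Row(i, buf)
> }
> val schema = StructType(Seq(
>   StructField("id", IntegerType),
>   StructField("payload", BinaryType)))
> sqlContext.createDataFrame(rdd, schema).saveAsParquetFile("/tmp/binary-stats-bug")
> // Inspecting the written row-group statistics for "payload" shows
> // min == max == the last row's bytes, instead of the true extremes.
> {code}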
>
> Here is the reason: in Parquet, BinaryStatistics keeps max & min as
> parquet.io.api.Binary references, and Spark SQL generates a new Binary backed
> by the same Array[Byte] passed in from the row:
>
> max: Binary --reference--> ByteArrayBackedBinary --backed--> Array[Byte]
> Therefore, each time Parquet updates the row group's statistics, max & min
> always refer to the same Array[Byte], which has new content each time. When
> Parquet writes them to the file, the last row's content is saved as both max
> and min.
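> The effect can be shown without Parquet at all; a sketch with hypothetical
> stand-ins for Binary and BinaryStatistics that keep references, never copies:
> {code:scala}
> class FakeBinary(val bytes: Array[Byte])     // backed by the caller's array
>
> class FakeBinaryStats {
>   var min: FakeBinary = _
>   var max: FakeBinary = _
>   // lexicographic comparison on unsigned byte values (illustrative only)
>   private def lt(a: FakeBinary, b: FakeBinary): Boolean = {
>     val n = math.min(a.bytes.length, b.bytes.length)
>     var i = 0
>     while (i < n) {
>       val d = (a.bytes(i) & 0xff) - (b.bytes(i) & 0xff)
>       if (d != 0) return d < 0
>       i += 1
>     }
>     a.bytes.length < b.bytes.length
>   }
>   def update(v: FakeBinary): Unit = {
>     if (min == null || lt(v, min)) min = v   // keeps the reference
>     if (max == null || lt(max, v)) max = v   // keeps the reference
>   }
> }
>
> val shared = new Array[Byte](1)
> val stats = new FakeBinaryStats
> for (b <- Seq[Byte](5, 9, 2)) {
>   shared(0) = b                              // reuse the buffer per "row"
>   stats.update(new FakeBinary(shared))
> }
> // stats.min and stats.max both point at `shared`, whose content is now
> // Array(2): the last row masquerades as both extremes.
> {code}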
>
> It seems to be a Parquet bug, because it is Parquet's responsibility to
> update statistics correctly, but I'm not quite sure. Should I report it as a
> bug in the Parquet JIRA?
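> If the fix belongs on the Parquet side, its natural shape is a defensive
> copy on update (hypothetical, mirroring the FakeBinaryStats sketch above):
> {code:scala}
> // Copy the bytes before retaining them, so later mutation of the caller's
> // buffer cannot rewrite the recorded min/max.
> def updateCopying(stats: FakeBinaryStats, v: FakeBinary): Unit =
>   stats.update(new FakeBinary(v.bytes.clone()))
> {code}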