Yijie Shen created SPARK-6859:
---------------------------------
Summary: Parquet File Binary column statistics error when reusing
byte[] among rows
Key: SPARK-6859
URL: https://issues.apache.org/jira/browse/SPARK-6859
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0, 1.2.0, 1.4.0
Reporter: Yijie Shen
Priority: Minor
Suppose I create a dataRDD which extends RDD[Row], where each row is a
GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is
reused across rows but holds different content each time. When I convert it to
a DataFrame and save it as a Parquet file, the file's row group statistics
(max & min) for the Binary column are wrong.
Here is the reason: In Parquet, BinaryStatistics keeps max & min as
parquet.io.api.Binary references, and Spark SQL generates a new Binary backed
by the same Array[Byte] passed in from the row:

    max: Binary --(reference)--> ByteArrayBackedBinary --(backed by)--> Array[Byte]

Therefore, each time Parquet updates the row group's statistics, max & min
still refer to the same Array[Byte], whose content changes with every row. When
Parquet finally writes them to the file, the last row's content is saved as
both max & min.
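The aliasing can be reproduced without Spark or Parquet at all. The following
minimal Java sketch (class and method names are mine, not Parquet's) tracks
max & min by storing references to a single reused buffer, the same way
BinaryStatistics ends up holding a Binary backed by the reused Array[Byte]:

```java
public class SharedBufferStats {

    // Returns {max, min} as observed by a tracker that keeps REFERENCES
    // to the caller's buffer instead of copying its content (the bug).
    static byte[] statsWithSharedBuffer(byte[] values) {
        byte[] buf = new byte[1];          // one buffer reused for every "row"
        byte[] max = null, min = null;
        for (byte v : values) {
            buf[0] = v;                    // overwrite in place, like a reused row
            if (max == null || buf[0] > max[0]) max = buf;  // reference, not copy
            if (min == null || buf[0] < min[0]) min = buf;  // reference, not copy
        }
        // max and min both point at buf, so both report its FINAL content.
        return new byte[] { max[0], min[0] };
    }

    public static void main(String[] args) {
        byte[] r = statsWithSharedBuffer(new byte[] {5, 1, 9, 3});
        // prints "max = 3, min = 3" instead of the expected 9 and 1
        System.out.println("max = " + r[0] + ", min = " + r[1]);
    }
}
```

Copying the bytes on each update (instead of keeping the reference) would give
the correct 9 and 1, which is why the fix belongs wherever the reference is
retained.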
This looks like a Parquet bug, since it is Parquet's responsibility to update
statistics correctly, but I'm not quite sure. Should I report it as a bug in
the Parquet JIRA?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)