Yijie Shen created SPARK-6859:
---------------------------------

             Summary: Parquet File Binary column statistics error when reuse 
byte[] among rows
                 Key: SPARK-6859
                 URL: https://issues.apache.org/jira/browse/SPARK-6859
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.0, 1.2.0, 1.4.0
            Reporter: Yijie Shen
            Priority: Minor


Suppose I create a dataRDD which extends RDD[Row], where each row is a 
GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is 
reused across rows, but its content changes each time. When I convert it to a 
DataFrame and save it as a Parquet file, the file's row-group statistics (max & 
min) for the Binary column are wrong.

Here is the reason: in Parquet, BinaryStatistics just keeps max & min as 
parquet.io.api.Binary references, and Spark SQL generates a new Binary backed 
by the same Array[Byte] passed from the row.

max: Binary --reference--> ByteArrayBackedBinary --backed by--> Array[Byte]

Therefore, each time Parquet updates the row group's statistics, max & min 
still refer to the same Array[Byte], whose content has changed with each row. 
When Parquet finally writes them to the file, the last row's content is saved 
as both max and min.
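The aliasing described above can be reproduced without Spark or Parquet at all. The sketch below (plain Java; the class and method names are illustrative, not Parquet's actual API) contrasts a statistics tracker that keeps a reference to the caller's buffer, as BinaryStatistics effectively does here, with one that defensively copies the bytes:

```java
import java.util.Arrays;

public class StatsAliasingDemo {
    // Buggy tracker: stores a reference to the caller's array, mirroring how
    // the statistics end up backed by the row's reused byte[].
    static class RefStats {
        byte[] max;
        void update(byte[] v) {
            if (max == null || compare(v, max) > 0) max = v; // keeps reference
        }
    }

    // Correct tracker: defensively copies, so later mutation of the buffer
    // cannot change the recorded maximum.
    static class CopyStats {
        byte[] max;
        void update(byte[] v) {
            if (max == null || compare(v, max) > 0) max = v.clone();
        }
    }

    // Lexicographic unsigned byte comparison.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (c != 0) return c;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[1];           // one buffer reused for every "row"
        RefStats ref = new RefStats();
        CopyStats copy = new CopyStats();
        for (int v : new int[] {9, 5, 1}) { // descending: max should stay 9
            buf[0] = (byte) v;              // same array, new content each row
            ref.update(buf);
            copy.update(buf);
        }
        System.out.println(ref.max[0]);     // 1: aliased max tracks the last row
        System.out.println(copy.max[0]);    // 9: copied max is correct
    }
}
```

With the reference-holding tracker, the stored "max" silently becomes whatever the buffer last contained, which is exactly the behavior observed in the written row-group statistics.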

This looks like a Parquet bug, since it is Parquet's responsibility to update 
statistics correctly, but I'm not quite sure. Should I report it as a bug in 
the Parquet JIRA?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
