[
https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000222#comment-15000222
]
Cheng Lian commented on SPARK-10954:
------------------------------------
Figured out the reason why {{created_by}} is wrong in Spark while investigating
another issue.
It turns out that parquet-mr 1.7.0 still looks for
{{META-INF/maven/com.twitter/parquet-column/pom.properties}} in the jar file
from which class {{o.a.parquet.Version}} is loaded to obtain version
information. This is a bug, but it doesn't affect normal Parquet users because
the properties file doesn't exist under that path ({{com.twitter}} is the old
group ID; parquet-mr moved to {{org.apache.parquet}} in 1.7.0). However, Spark
1.5 added a dependency on parquet-hadoop-bundle 1.6.0 to fix a Hive
compatibility issue ([PR #7867|https://github.com/apache/spark/pull/7867]),
and that artifact happens to contain the old properties file. Because Spark
bundles all dependencies into a single uber assembly jar, the stale 1.6.0
properties file ends up on the same classpath as the 1.7.0 classes, so Parquet
reads the wrong version information and writes it into generated Parquet
files.
I haven't figured out a workaround for this issue yet, though.
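To illustrate the mechanism described above, here is a minimal sketch of this kind of classpath-based version lookup. It is an illustrative reconstruction, not parquet-mr's actual {{Version}} class; the class and method names are hypothetical, but the resource path is the one parquet-mr 1.7.0 probes:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class VersionLookup {
    // Sketch of the lookup parquet-mr 1.7.0 performs. The path still uses the
    // pre-rename "com.twitter" Maven coordinates, so normally nothing is found
    // and the version stays unknown. But in an uber jar that also bundles
    // parquet-hadoop-bundle 1.6.0, this exact file exists and answers "1.6.0".
    static String lookupVersion() {
        String path = "META-INF/maven/com.twitter/parquet-column/pom.properties";
        try (InputStream in =
                 VersionLookup.class.getClassLoader().getResourceAsStream(path)) {
            if (in == null) {
                return null; // properties file absent: no version information
            }
            Properties props = new Properties();
            props.load(in);
            return props.getProperty("version");
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("resolved version: " + lookupVersion());
    }
}
```

On a plain classpath this prints a null version; with the Spark 1.5 assembly jar on the classpath, the stale 1.6.0 properties file shadows reality, which matches the wrong {{created_by}} observed in the issue.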
> Parquet version in the "created_by" metadata field of Parquet files written
> by Spark 1.5 and 1.6 is wrong
> ---------------------------------------------------------------------------------------------------------
>
> Key: SPARK-10954
> URL: https://issues.apache.org/jira/browse/SPARK-10954
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0, 1.5.1, 1.6.0
> Reporter: Cheng Lian
> Assignee: Gayathri Murali
> Priority: Minor
>
> We've upgraded to parquet-mr 1.7.0 in Spark 1.5, but the {{created_by}} field
> still says 1.6.0. This issue can be reproduced by generating any Parquet file
> with Spark 1.5 and then checking the metadata with the {{parquet-meta}} CLI tool:
> {noformat}
> $ parquet-meta /tmp/parquet/dec
> file:
> file:/tmp/parquet/dec/part-r-00000-f210e968-1be5-40bc-bcbc-007f935e6dc7.gz.parquet
> creator: parquet-mr version 1.6.0
> extra: org.apache.spark.sql.parquet.row.metadata =
> {"type":"struct","fields":[{"name":"dec","type":"decimal(20,2)","nullable":true,"metadata":{}}]}
> file schema: spark_schema
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
> dec: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1: RC:10 TS:140 OFFSET:4
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
> dec: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:4 SZ:99/140/1.41 VC:10
> ENC:PLAIN,BIT_PACKED,RLE
> {noformat}
> Note that this field is written by parquet-mr rather than Spark. However,
> files written with parquet-mr 1.7.0 directly (without Spark 1.5) only show
> {{parquet-mr}} with no version number at all, while files written by
> parquet-mr 1.8.1 without Spark look fine. Currently this isn't a big issue,
> but parquet-mr 1.8 checks this field to work around PARQUET-251.
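To show why a wrong {{created_by}} matters for the PARQUET-251 workaround mentioned above, here is a hedged sketch of the kind of check a reader could perform: parse the writer version out of {{created_by}} and decide whether the file's statistics can be trusted. The pattern, class, and method names are illustrative assumptions, not parquet-mr's actual code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CreatedByCheck {
    // Illustrative pattern for created_by strings like
    // "parquet-mr version 1.8.1 (build ...)".
    private static final Pattern CREATED_BY =
        Pattern.compile("parquet-mr version (\\d+)\\.(\\d+)\\.(\\d+).*");

    // Hypothetical version gate: treat statistics from writers before the
    // 1.8 line as potentially corrupt (the PARQUET-251 scenario).
    static boolean statisticsTrustworthy(String createdBy) {
        Matcher m = CREATED_BY.matcher(createdBy);
        if (!m.matches()) {
            return false; // unknown writer: be conservative
        }
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        return major > 1 || (major == 1 && minor >= 8);
    }

    public static void main(String[] args) {
        // A file mislabeled "1.6.0" (as Spark 1.5 produces) would have its
        // statistics distrusted even though parquet-mr 1.7.0 wrote them.
        System.out.println(statisticsTrustworthy("parquet-mr version 1.6.0"));
        System.out.println(statisticsTrustworthy("parquet-mr version 1.8.1 (build abc)"));
    }
}
```

Under this kind of check, the mislabeled files written by Spark 1.5/1.6 would be treated as if an older, buggier writer had produced them.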
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)