[ 
https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000222#comment-15000222
 ] 

Cheng Lian commented on SPARK-10954:
------------------------------------

While investigating another issue, I figured out why {{created_by}} is wrong in 
Spark.

It turns out that parquet-mr 1.7.0 still looks for 
{{META-INF/maven/com.twitter/parquet-column/pom.properties}} in the jar file 
from which class {{o.a.parquet.Version}} is loaded to obtain version 
information. This is a bug, but it doesn't affect normal Parquet users because 
the properties file doesn't exist there ({{com.twitter}} is the wrong package 
name, since parquet-mr 1.7.0 has moved to {{org.apache.parquet}}). However, 
Spark 1.5 included parquet-hadoop-bundle 1.6.0 to fix a Hive compatibility 
issue ([PR #7867|https://github.com/apache/spark/pull/7867]), and this 
dependency happens to contain the missing properties file. Because Spark 
bundles all dependencies into a single uber assembly jar, Parquet reads the 
wrong version information from that file and writes it into the generated 
Parquet files.
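
To check whether a given jar carries the stale properties file, something along these lines might help. This is only a sketch: the entry path is the one quoted above, while {{read_pom_version}} and {{make_jar}} are hypothetical helper names, and the parsing assumes the usual {{key=value}} layout of a Maven {{pom.properties}} file.

```python
import io
import zipfile

# Entry that parquet-mr 1.7.0 still looks up, as quoted in the comment above.
STALE_PROPS = "META-INF/maven/com.twitter/parquet-column/pom.properties"


def read_pom_version(jar, entry=STALE_PROPS):
    """Return the `version` value stored in `entry` inside `jar`.

    `jar` may be a filesystem path or a binary file-like object.
    Returns None when the entry is absent or has no `version` key.
    (Hypothetical helper, not part of Spark or parquet-mr.)
    """
    with zipfile.ZipFile(jar) as zf:
        try:
            raw = zf.read(entry)
        except KeyError:
            # The jar has no such entry, which is the "normal" Parquet case.
            return None
    # Assumption: plain key=value lines, with '#' starting a comment line.
    for line in raw.decode("latin-1").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "version":
            return value.strip()
    return None


def make_jar(entries):
    """Build an in-memory jar (zip) from a {entry name: text} mapping, for demos."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in entries.items():
            zf.writestr(name, text)
    buf.seek(0)
    return buf
```

Calling {{read_pom_version}} on the path of a Spark assembly jar would then return the version string that Parquet picks up, or {{None}} if the entry is not bundled.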

I haven't found a workaround for this issue yet.

> Parquet version in the "created_by" metadata field of Parquet files written 
> by Spark 1.5 and 1.6 is wrong
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10954
>                 URL: https://issues.apache.org/jira/browse/SPARK-10954
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1, 1.6.0
>            Reporter: Cheng Lian
>            Assignee: Gayathri Murali
>            Priority: Minor
>
> We've upgraded to parquet-mr 1.7.0 in Spark 1.5, but the {{created_by}} field 
> still says 1.6.0. This issue can be reproduced by generating any Parquet file 
> with Spark 1.5 and then checking the metadata with the {{parquet-meta}} CLI tool:
> {noformat}
> $ parquet-meta /tmp/parquet/dec
> file:        file:/tmp/parquet/dec/part-r-00000-f210e968-1be5-40bc-bcbc-007f935e6dc7.gz.parquet
> creator:     parquet-mr version 1.6.0
> extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"dec","type":"decimal(20,2)","nullable":true,"metadata":{}}]}
> file schema: spark_schema
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
> dec:         OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1: RC:10 TS:140 OFFSET:4
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
> dec:          FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:4 SZ:99/140/1.41 VC:10 ENC:PLAIN,BIT_PACKED,RLE
> {noformat}
> Note that this field is written by parquet-mr rather than Spark. However, 
> writing Parquet files using parquet-mr 1.7.0 directly, without Spark 1.5, only 
> shows {{parquet-mr}} without any version number. Files written by parquet-mr 
> 1.8.1 without Spark look fine, though.
> Currently this isn't a big issue, but parquet-mr 1.8 checks this field to 
> work around PARQUET-251.
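
For anyone who needs to tell these cases apart in a script, the following sketch classifies a {{created_by}} string. It is not the check that parquet-mr 1.8 performs; the function name {{parse_created_by}} is hypothetical, and the pattern only assumes the two shapes shown in the description ({{parquet-mr version X}} and a bare {{parquet-mr}}), optionally followed by more text.

```python
import re

# Assumed shapes: "parquet-mr version <X>" or a bare "parquet-mr",
# optionally followed by further whitespace-separated text.
_CREATED_BY = re.compile(r"^parquet-mr(?:\s+version\s+(?P<version>\S+))?(?:\s|$)")


def parse_created_by(created_by):
    """Return (is_parquet_mr, version) for a created_by string.

    `version` is None when the writer is parquet-mr but no version number
    is recorded, or when the writer is not parquet-mr at all.
    (Hypothetical helper for illustration only.)
    """
    match = _CREATED_BY.match(created_by.strip())
    if match is None:
        return (False, None)
    return (True, match.group("version"))
```

With the output shown in the description, {{parse_created_by("parquet-mr version 1.6.0")}} yields a recorded version of {{1.6.0}}, while the bare {{parquet-mr}} written by parquet-mr 1.7.0 without Spark yields no version.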
