[
https://issues.apache.org/jira/browse/PARQUET-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ryan Blue resolved PARQUET-359.
-------------------------------
Resolution: Fixed
Assignee: Cheng Lian
> Existing _common_metadata should be deleted when ParquetOutputCommitter fails
> to write summary files
> ----------------------------------------------------------------------------------------------------
>
> Key: PARQUET-359
> URL: https://issues.apache.org/jira/browse/PARQUET-359
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.6.0, 1.7.0, 1.8.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
>
> {{ParquetOutputCommitter}} only deletes {{_metadata}} when it fails to write
> summary files. This may leave an inconsistent existing {{_common_metadata}}
> file behind.
> This issue can be reproduced via the following Spark shell snippet:
> {noformat}
> import sqlContext.implicits._
> val path = "file:///tmp/foo"
> (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
> (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
> {noformat}
> The 2nd write job fails to write the summary files because the two written
> Parquet files contain different user-defined metadata (Spark SQL schemas).
> Afterwards we can see that a {{_common_metadata}} file is left behind:
> {noformat}
> $ tree /tmp/foo
> /tmp/foo
> ├── _SUCCESS
> ├── _common_metadata
> ├── part-r-00000-1c8bcb7f-84cf-43e3-9cd6-04d371322d95.gz.parquet
> └── part-r-00000-d759c53f-d12f-4555-9b27-8b03a8343b17.gz.parquet
> {noformat}
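> The stale summary file is not merely cosmetic: depending on the Spark version
> and whether schema merging is enabled, Spark may take the result schema from
> {{_common_metadata}} and silently drop the column added by the append. A quick
> check from the same shell (a hedged illustration; exact behavior is
> version-dependent):
> {noformat}
> // May print the stale 2-field schema taken from _common_metadata
> // instead of the expected 3-field one (behavior is version-dependent).
> sqlContext.read.parquet(path).printSchema()
> {noformat}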
> Checking its schema, the nested group contains only two fields, which is wrong
> since the appended data has three:
> {noformat}
> $ parquet-schema /tmp/foo/_common_metadata
> message root {
>   optional group _1 {
>     optional binary _1 (UTF8);
>     optional binary _2 (UTF8);
>   }
> }
> {noformat}
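> For illustration, a minimal sketch of the intended fix, assuming the Hadoop
> {{FileSystem}} API and the standard summary file names (the helper name is
> hypothetical, not the actual parquet-mr code):
> {noformat}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> // Hypothetical cleanup helper: when writing summary files fails,
> // delete BOTH summary files rather than only _metadata, so no
> // inconsistent _common_metadata is left behind.
> def deleteStaleSummaryFiles(outputPath: Path, conf: Configuration): Unit = {
>   val fs = outputPath.getFileSystem(conf)
>   Seq("_metadata", "_common_metadata").foreach { name =>
>     val summary = new Path(outputPath, name)
>     if (fs.exists(summary)) {
>       fs.delete(summary, false) // non-recursive: summary files are plain files
>     }
>   }
> }
> {noformat}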
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)