[ https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-8121.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 1.4.1

Issue resolved by pull request 6705
[https://github.com/apache/spark/pull/6705]

> When used with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is 
> overridden by "spark.sql.sources.outputCommitterClass"
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-8121
>                 URL: https://issues.apache.org/jira/browse/SPARK-8121
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>             Fix For: 1.4.1
>
>
> When using Spark with Hadoop 1.x (the version I tested is 1.2.0), if 
> {{spark.sql.sources.outputCommitterClass}} is configured, it overrides 
> {{spark.sql.parquet.output.committer.class}}. 
> For example, if {{spark.sql.parquet.output.committer.class}} is set to 
> {{DirectParquetOutputCommitter}} while {{spark.sql.sources.outputCommitterClass}} 
> is set to {{FileOutputCommitter}}, neither {{_metadata}} nor 
> {{_common_metadata}} will be written, because {{FileOutputCommitter}} 
> overrides {{DirectParquetOutputCommitter}}.
> The reason is that {{InsertIntoHadoopFsRelation}} initializes the 
> {{TaskAttemptContext}} before calling 
> {{ParquetRelation2.prepareForWriteJob()}}, which is what sets up the Parquet 
> output committer class. Meanwhile, in Hadoop 1.x the {{TaskAttemptContext}} 
> constructor clones the job configuration, so the context doesn't share the 
> job configuration passed to {{ParquetRelation2.prepareForWriteJob()}} and 
> never picks up the committer class set there.
> This issue can be fixed by simply [swapping these two 
> lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].
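> To illustrate why the ordering matters under Hadoop 1.x, here is a minimal, 
> self-contained Scala sketch. {{ClonedContext}} below is a hypothetical 
> stand-in for the Hadoop 1.x {{TaskAttemptContext}} (it is not Spark or Hadoop 
> code); it only shows that a configuration cloned in a constructor never sees 
> settings applied to the original afterwards:
> {code}
> import org.apache.hadoop.conf.Configuration
> 
> // Hypothetical stand-in for the Hadoop 1.x TaskAttemptContext: its
> // constructor takes a copy of the job configuration instead of sharing it.
> class ClonedContext(jobConf: Configuration) {
>   val conf = new Configuration(jobConf)
> }
> 
> val jobConf = new Configuration()
> 
> // Buggy order: the context is created *before* the committer class is set.
> val ctx = new ClonedContext(jobConf)
> jobConf.set(
>   "spark.sql.parquet.output.committer.class",
>   "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
> 
> // The cloned configuration never sees the later change:
> println(ctx.conf.get("spark.sql.parquet.output.committer.class"))  // prints null
> {code}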
> Here is a Spark shell snippet for reproducing this issue:
> {code}
> import sqlContext._
> 
> // Default output committer for data sources:
> sc.hadoopConfiguration.set(
>   "spark.sql.sources.outputCommitterClass",
>   "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")
> 
> // Parquet-specific committer, expected to write the summary files:
> sc.hadoopConfiguration.set(
>   "spark.sql.parquet.output.committer.class",
>   "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
> 
> range(0, 1).write.mode("overwrite").parquet("file:///tmp/foo")
> {code}
> Then check {{/tmp/foo}}, Parquet summary files are missing:
> {noformat}
> /tmp/foo
> ├── _SUCCESS
> ├── part-r-00001.gz.parquet
> ├── part-r-00002.gz.parquet
> ├── part-r-00003.gz.parquet
> ├── part-r-00004.gz.parquet
> ├── part-r-00005.gz.parquet
> ├── part-r-00006.gz.parquet
> ├── part-r-00007.gz.parquet
> └── part-r-00008.gz.parquet
> {noformat}
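> For comparison, once the Parquet committer takes effect as configured, the 
> directory should also contain the {{_metadata}} and {{_common_metadata}} 
> summary files alongside the part files.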



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
