[https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
Yin Huai resolved SPARK-8121.
-----------------------------
Resolution: Fixed
Fix Version/s: 1.4.1
Issue resolved by pull request 6705
[https://github.com/apache/spark/pull/6705]
> When used with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is
> overridden by "spark.sql.sources.outputCommitterClass"
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-8121
> URL: https://issues.apache.org/jira/browse/SPARK-8121
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Fix For: 1.4.1
>
>
> When using Spark with Hadoop 1.x (the version I tested is 1.2.0), if
> {{spark.sql.sources.outputCommitterClass}} is configured,
> {{spark.sql.parquet.output.committer.class}} is overridden.
> For example, if {{spark.sql.parquet.output.committer.class}} is set to
> {{DirectParquetOutputCommitter}} while {{spark.sql.sources.outputCommitterClass}} is
> set to {{FileOutputCommitter}}, neither {{_metadata}} nor
> {{_common_metadata}} is written, because {{FileOutputCommitter}}
> overrides {{DirectParquetOutputCommitter}}.
> The reason is that {{InsertIntoHadoopFsRelation}} initializes the
> {{TaskAttemptContext}} before calling
> {{ParquetRelation2.prepareForWriteJob()}}, which is what sets up the Parquet output
> committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}}
> constructor clones the job configuration, so it does not share the job
> configuration instance later passed to {{ParquetRelation2.prepareForWriteJob()}}.
> This issue can be fixed by simply [switching these two
> lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].
> Here is a Spark shell snippet for reproducing this issue:
> {code}
> import sqlContext._
>
> // Generic output committer used by the HadoopFsRelation-based write path.
> sc.hadoopConfiguration.set(
>   "spark.sql.sources.outputCommitterClass",
>   "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")
>
> // Parquet committer that writes the _metadata / _common_metadata summary files.
> sc.hadoopConfiguration.set(
>   "spark.sql.parquet.output.committer.class",
>   "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
>
> range(0, 1).write.mode("overwrite").parquet("file:///tmp/foo")
> {code}
> Then check {{/tmp/foo}}: the Parquet summary files ({{_metadata}} and
> {{_common_metadata}}) are missing:
> {noformat}
> /tmp/foo
> ├── _SUCCESS
> ├── part-r-00001.gz.parquet
> ├── part-r-00002.gz.parquet
> ├── part-r-00003.gz.parquet
> ├── part-r-00004.gz.parquet
> ├── part-r-00005.gz.parquet
> ├── part-r-00006.gz.parquet
> ├── part-r-00007.gz.parquet
> └── part-r-00008.gz.parquet
> {noformat}
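> For comparison, when {{DirectParquetOutputCommitter}} actually takes effect, the
> summary files are expected to appear next to the part files (illustrative listing,
> not captured output):
> {noformat}
> /tmp/foo
> ├── _SUCCESS
> ├── _common_metadata
> ├── _metadata
> ├── part-r-00001.gz.parquet
> ├── ...
> └── part-r-00008.gz.parquet
> {noformat}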