[ https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yin Huai resolved SPARK-8121.
-----------------------------
    Resolution: Fixed
 Fix Version/s: 1.4.1

Issue resolved by pull request 6705
[https://github.com/apache/spark/pull/6705]

> When used with Hadoop 1.x, "spark.sql.parquet.output.committer.class" is
> overridden by "spark.sql.sources.outputCommitterClass"
> -------------------------------------------------------------------------
>
>                 Key: SPARK-8121
>                 URL: https://issues.apache.org/jira/browse/SPARK-8121
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>             Fix For: 1.4.1
>
>
> When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and
> {{spark.sql.sources.outputCommitterClass}} is configured,
> {{spark.sql.parquet.output.committer.class}} will be overridden.
> For example, if {{spark.sql.sources.outputCommitterClass}} is set to
> {{FileOutputCommitter}} while {{spark.sql.parquet.output.committer.class}}
> is set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor
> {{_common_metadata}} will be written, because {{FileOutputCommitter}}
> overrides {{DirectParquetOutputCommitter}}.
> The reason is that {{InsertIntoHadoopFsRelation}} initializes the
> {{TaskAttemptContext}} before calling
> {{ParquetRelation2.prepareForWriteJob()}}, which sets up the Parquet output
> committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}}
> constructor clones the job configuration, so the context does not share the
> job configuration later passed to {{ParquetRelation2.prepareForWriteJob()}}
> (see the first sketch below the quoted description).
> This issue can be fixed by simply [switching these two
> lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286]
> (see the second sketch below).
> Here is a Spark shell snippet for reproducing this issue:
> {code}
> import sqlContext._
> sc.hadoopConfiguration.set(
>   "spark.sql.sources.outputCommitterClass",
>   "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")
> sc.hadoopConfiguration.set(
>   "spark.sql.parquet.output.committer.class",
>   "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
> range(0, 1).write.mode("overwrite").parquet("file:///tmp/foo")
> {code}
> Then check {{/tmp/foo}}; the Parquet summary files are missing:
> {noformat}
> /tmp/foo
> ├── _SUCCESS
> ├── part-r-00001.gz.parquet
> ├── part-r-00002.gz.parquet
> ├── part-r-00003.gz.parquet
> ├── part-r-00004.gz.parquet
> ├── part-r-00005.gz.parquet
> ├── part-r-00006.gz.parquet
> ├── part-r-00007.gz.parquet
> └── part-r-00008.gz.parquet
> {noformat}
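First sketch: a minimal illustration of the Hadoop 1.x behavior the description refers to. A plain {{Configuration}} copy stands in for the clone made by the Hadoop 1.x {{TaskAttemptContext}} constructor; this is illustrative, not Spark's actual code.

{code}
import org.apache.hadoop.conf.Configuration

val jobConf = new Configuration()

// In Hadoop 1.x, the TaskAttemptContext constructor clones the job
// configuration; this copy constructor stands in for that clone.
val contextConf = new Configuration(jobConf)

// A key set on jobConf *after* the clone -- e.g. the committer class set by
// ParquetRelation2.prepareForWriteJob() -- never reaches the cloned
// configuration that the task attempt actually sees.
jobConf.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

assert(contextConf.get("spark.sql.parquet.output.committer.class") == null)
{code}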
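Second sketch: the fix itself, assuming Hadoop 1.x APIs (where {{TaskAttemptContext}} is a concrete class). Prepare the configuration before constructing the context, so that the clone already carries the committer class. {{prepareForWrite}} is a hypothetical stand-in for {{ParquetRelation2.prepareForWriteJob()}}, not the exact code in commands.scala.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{TaskAttemptContext, TaskAttemptID}

// Hypothetical stand-in for ParquetRelation2.prepareForWriteJob(), which
// records the Parquet output committer class on the job configuration.
def prepareForWrite(conf: Configuration): Unit =
  conf.set(
    "spark.sql.parquet.output.committer.class",
    "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

val conf = new Configuration()

// Fixed ordering: configure first, then construct the context, so the
// configuration cloned by the Hadoop 1.x constructor already contains
// the committer class.
prepareForWrite(conf)
val context = new TaskAttemptContext(conf, new TaskAttemptID())

assert(context.getConfiguration
  .get("spark.sql.parquet.output.committer.class") != null)
{code}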