[
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055435#comment-17055435
]
Felix Kizhakkel Jose commented on SPARK-31072:
----------------------------------------------
Could you please provide some insights?
> Default to ParquetOutputCommitter even after configuring s3a committer as
> "partitioned"
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 2.4.5
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> My program logs say it uses ParquetOutputCommitter when I write _*Parquet*_,
> even after I configure it to use "PartitionedStagingCommitter" with the
> following settings:
> * sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
> * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
> * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
> * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
> * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", "false");
> Application log excerpt:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup
> _temporary folders under output directory:false, ignore cleanup failures:
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup
> _temporary folders under output directory:false, ignore cleanup failures:
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as
> above, it correctly picks "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup
> _temporary folders under output directory:false, ignore cleanup failures:
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer
> partitioned to output data to s3a:************
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter
> PartitionedStagingCommitter**********
> So I am wondering why Parquet and ORC have different behavior?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started looking into this because when I was saving data directly to S3
> with partitionBy() on two columns, I was intermittently getting
> FileNotFoundException.
> So how can I avoid this issue with *Parquet using Spark to S3 over s3a,
> without S3Guard?*
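For context on the question above: Parquet's write path requires a subclass of ParquetOutputCommitter, so the s3a committer factory setting alone only takes effect for formats like ORC that go through the generic file-commit path. A minimal sketch of the extra Spark settings that are typically needed, assuming the spark-hadoop-cloud module (which provides the PathOutputCommitProtocol and BindingParquetOutputCommitter classes referenced below) is on the classpath:

```
# spark-defaults.conf sketch -- assumes the spark-hadoop-cloud module is on
# the classpath; class names taken from the Spark cloud-integration docs.

# Select the partitioned staging committer for s3a, as in the report above:
spark.hadoop.fs.s3a.committer.name                    partitioned
spark.hadoop.fs.s3a.committer.staging.conflict-mode   append

# Route Spark SQL file commits through the path-output committer factory:
spark.sql.sources.commitProtocolClass   org.apache.spark.internal.io.cloud.PathOutputCommitProtocol

# Parquet insists on a ParquetOutputCommitter subclass; this binding class
# satisfies that check while delegating to the factory-chosen S3A committer:
spark.sql.parquet.output.committer.class   org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

With these in place, the log line "Using committer partitioned to output data to s3a://..." should appear for Parquet writes as well, not only for ORC.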
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]