steveloughran commented on a change in pull request #24970: [SPARK-23977][SQL] Support High Performance S3A committers [test-hadoop3.2]
URL: https://github.com/apache/spark/pull/24970#discussion_r309250745

##########
File path: docs/cloud-integration.md
##########
@@ -190,6 +213,42 @@ while they are still being written. Applications can write straight to the monit
 atomic `rename()` operation.
 Otherwise the checkpointing may be slow and potentially unreliable.
 
+## Committing work into cloud storage safely and fast.
+
+As covered earlier, commit-by-rename is dangerous on any object store which
+exhibits eventual consistency (example: S3), and often slower than classic
+filesystem renames.
+
+Some object store connectors provide custom committers to commit tasks and
+jobs without using rename. In versions of Spark built with Hadoop-3.2 or later,
+the S3A connector for AWS S3 is such a committer.
+
+Instead of writing data to a temporary directory on the store for renaming,
+these committers write the files to the final destination, but do not issue
+the final POST command to make a large "multi-part" upload visible. Those
+operations are postponed until the job commit itself. As a result, task and
+job commit are much faster, and task failures do not affect the result.
+
+To switch to the S3A committers, use a version of Spark which includes the
+Hadoop-3.1+ binaries, and switch the committers through the following
+options.
+
+```
+spark.hadoop.fs.s3a.committer.name directory
+spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
+spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
+```
+
+Normal dataframe write commands will then use this committer for any format
+which does not its own custom committer. Output formats which are known

Review comment:
added "have". Also capitalized ORC and CSV & the first letters of Parquet and Avro
