steveloughran commented on a change in pull request #24970: [SPARK-23977][SQL] Support High Performance S3A committers [test-hadoop3.2]
URL: https://github.com/apache/spark/pull/24970#discussion_r310122020
 
 

 ##########
 File path: docs/cloud-integration.md
 ##########
 @@ -190,15 +212,50 @@ while they are still being written. Applications can write straight to the monit
 atomic `rename()` operation.
 Otherwise the checkpointing may be slow and potentially unreliable.
 
+## Committing work into cloud storage safely and fast
+
+As covered earlier, commit-by-rename is dangerous on any object store which
+exhibits eventual consistency (example: S3), and is often slower than a classic
+filesystem rename.
+
+Some object store connectors provide custom committers to commit tasks and
+jobs without using rename. In versions of Spark built with Hadoop 3.1 or later,
+the S3A connector for AWS S3 provides such committers.
+
+Instead of writing data to a temporary directory on the store for renaming,
+these committers write the files to the final destination, but do not issue
+the final POST request which makes a large "multi-part" upload visible. That
+operation is postponed until the job commit itself. As a result, task and
+job commit are much faster, and task failures do not affect the result.
+
+To use the S3A committers, run a version of Spark built with the
+Hadoop 3.1+ binaries, and enable the committers through the following
+options.
+
+```
+spark.hadoop.fs.s3a.committer.name directory
+spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
+spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
+```
+
+These committers have been tested with the most common formats supported by Spark.
+
+```python
+mydataframe.write.format("parquet").save("s3a://bucket/destination")
+```
+
+More details on these committers can be found in the latest Hadoop documentation.
+
 ## Further Reading
 
 Here is the documentation on the standard connectors both from Apache and the cloud providers.
 
-* [OpenStack Swift](https://hadoop.apache.org/docs/current/hadoop-openstack/index.html). Hadoop 2.6+
-* [Azure Blob Storage](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). Since Hadoop 2.7
-* [Azure Data Lake](https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html). Since Hadoop 2.8
-* [Amazon S3 via S3A and S3N](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). Hadoop 2.6+
+* [OpenStack Swift](https://hadoop.apache.org/docs/current/hadoop-openstack/index.html).
+* [Azure Blob Storage and Azure Data Lake Gen 2](https://hadoop.apache.org/docs/current/hadoop-azure/index.html).
+* [Azure Data Lake Gen 1](https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html).
+* [Hadoop-AWS module (Hadoop 3.x)](https://hadoop.apache.org/docs/current3/hadoop-aws/tools/hadoop-aws/index.html).
 
 Review comment:
   done
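
To illustrate the deferred-commit idea described in the new docs text (files are uploaded to their final path but only become visible at job commit), here is a toy in-memory sketch. `ToyObjectStore` and `run_job` are hypothetical names invented for this illustration; nothing here is the S3 API or the S3A committer code, and it ignores how the real committers track pending uploads between task and job commit.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store with multipart uploads (not the S3 API)."""

    def __init__(self):
        self.visible = {}   # path -> data that readers can see
        self.pending = {}   # upload id -> (path, data), uploaded but not yet completed
        self._next_id = 0

    def start_upload(self, path, data):
        """Upload data addressed to its final path without making it visible."""
        self._next_id += 1
        self.pending[self._next_id] = (path, data)
        return self._next_id

    def complete_upload(self, upload_id):
        """Stand-in for the final request that makes a pending upload visible."""
        path, data = self.pending.pop(upload_id)
        self.visible[path] = data

    def abort_upload(self, upload_id):
        """Discard a pending upload; nothing ever appears at the destination."""
        self.pending.pop(upload_id, None)


def run_job(store, tasks):
    """Run tasks given as (path, data, succeeds) tuples; return the visible paths."""
    to_commit = []
    for path, data, succeeds in tasks:
        upload_id = store.start_upload(path, data)
        if succeeds:
            # Task commit: only remember the pending upload, no rename or copy.
            to_commit.append(upload_id)
        else:
            # Task failure: abort, so the output never becomes visible.
            store.abort_upload(upload_id)
    # Job commit: complete every upload recorded by a successful task.
    for upload_id in to_commit:
        store.complete_upload(upload_id)
    return sorted(store.visible)
```

In this model a failed task leaves no trace at the destination, which is the property the docs text attributes to these committers.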

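For anyone wanting to apply the three committer options from the hunk above programmatically rather than through `spark-defaults.conf`, a minimal sketch follows. The option names and values are copied from the docs text; the helper names (`to_spark_defaults`, `to_submit_args`, `build_session`) are hypothetical, and `build_session` assumes pyspark is installed and that the classes named in the options are on the classpath.

```python
# Option names and values are taken from the docs hunk; everything else is illustrative.
S3A_COMMITTER_OPTIONS = {
    "spark.hadoop.fs.s3a.committer.name": "directory",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def to_spark_defaults(options):
    """Render options as spark-defaults.conf lines: key, a space, then value."""
    return "\n".join(f"{key} {value}" for key, value in options.items())


def to_submit_args(options):
    """Render options as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in options.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


def build_session(app_name, options):
    """Apply the options to a SparkSession builder (needs pyspark at call time)."""
    # Imported lazily so the helpers above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session built this way, the write from the docs example, `mydataframe.write.format("parquet").save("s3a://bucket/destination")`, would be expected to go through the configured committer.
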
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
