vanzin commented on a change in pull request #24970: [SPARK-23977][SQL] Support High Performance S3A committers
URL: https://github.com/apache/spark/pull/24970#discussion_r297866372

##########
File path: docs/cloud-integration.md
##########
@@ -143,8 +144,34 @@ job failure:
 spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
 ```
+The original v1 commit algorithm renames the output of successful tasks
+to a job attempt directory, and then renames all the files in that directory
+into the final destination during the job commit phase
+
+```
+spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 1
+```
+
+The slow performance of mimicked renames on Amazon S3 makes this algorithm
+very, very slow. The recommended solution to this is switch to an S3 "Zero Rename"
+committer (see below).
+
+For reference, here are the performance and safety characteristics of
+different stores and connectors
+
+
+For the other object stores, their characteristics are
+
+| Store | Connector | directory rename safety | rename performance |
+|---------------|-----------|-------------------------|--------------------|
+| Amazon S3 | S3A | Unsafe | O(data) |
+| Azure Storage | wasb | Safe | O(files) |
+| Azure Datalake Gen 2 | abfs | Safe | O(1) |
+| Google GCS | gs | Saf e | O(1) |

Review comment:
Safe

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
