steveloughran commented on a change in pull request #30714:
URL: https://github.com/apache/spark/pull/30714#discussion_r577537838
##########
File path: docs/cloud-integration.md
##########
@@ -176,12 +181,19 @@ different stores and connectors when renaming directories:
| Amazon S3 | s3a | Unsafe | O(data) |
| Azure Storage | wasb | Safe | O(files) |
| Azure Datalake Gen 2 | abfs | Safe | O(1) |
-| Google Cloud Storage | gs | Safe | O(1) |
+| Google Cloud Storage | gs | Mixed | O(files) |
1. As storing temporary files can run up charges, delete
directories called `"_temporary"` on a regular basis.
1. For AWS S3, set a limit on how long multipart uploads can remain
outstanding.
This avoids incurring bills from incomplete uploads.
+1. For Google Cloud, directory rename is file-by-file. Consider using the v2
+committer and only write code which generates idempotent output, including
+filenames, as it is *no more unsafe* than the v1 committer, and faster.
+
+```
+spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
+```
Review comment:
done
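
For anyone landing on this thread later: besides `spark-defaults.conf`, the property above can also be set per-job at submit time. The invocation below is an illustrative sketch; the application jar name is a placeholder, not something from this PR.

```shell
# Enable the v2 FileOutputCommitter algorithm for this job only.
# On stores where directory rename is file-by-file (e.g. GCS via gs://),
# v2 commits task output directly, avoiding the O(files) rename in job
# commit -- safe provided the job's output (filenames included) is
# idempotent, as the doc change above notes.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  my-app.jar   # placeholder application jar
```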
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]