Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/20704

@megaserg: if you are writing to GCS or Azure, algorithm 2 is fine. If S3 is the target, then it's only safe to use with a consistent store (Hadoop 3.0 + S3Guard, Amazon Consistent EMR), and you still take a major perf hit from the copy. The S3A committers in Hadoop 3.1 deliver high-performance commit semantics, and the Netflix committers don't (directly) need a consistent store, though to chain work together you will.

> BTW, how to verify that the v2 algorithm version is being opted for?

Set the version to 3 and expect a stack trace from the version switch code. It's what I do to make sure that the FileOutputCommitter isn't actually being picked up.
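
As a rough illustration of the check described above, the fragment below shows how the algorithm version might be set from Spark. It assumes the job goes through the classic Hadoop `FileOutputCommitter`, that the relevant Hadoop property is `mapreduce.fileoutputcommitter.algorithm.version`, and that Spark forwards `spark.hadoop.*` entries into the Hadoop configuration; none of these names appear in the comment itself, so treat them as assumptions to check against your versions.

```properties
# spark-defaults.conf (or pass each line via --conf on spark-submit)

# Normal setting: ask FileOutputCommitter for the v2 commit algorithm.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2

# Diagnostic setting: an unsupported value. If FileOutputCommitter is
# really in use, the write is expected to fail with a stack trace from
# its version check; if the job succeeds, some other committer was used.
# spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 3
```

Only one of the two lines should be active at a time. The failure, when it happens, should surface when the committer is instantiated, so a small test write is enough to trigger it.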