Hello!

I noticed that the Spark documentation, starting with 2.2.0, states that the
parameter spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1
by default:
https://issues.apache.org/jira/browse/SPARK-20107

I don't see this being set explicitly anywhere in the Spark code, so the
documentation isn't entirely accurate if you run in an environment that has
MAPREDUCE-6406 implemented (starting with Hadoop 3.0).

With MAPREDUCE-6406, the default version is explicitly set to 2 in the
FileOutputCommitter class, so any output committer that inherits from this
class (ParquetOutputCommitter, for example) uses v2 in a Hadoop 3.0
environment and v1 in older Hadoop environments.
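
To illustrate the behavior I am describing, here is a minimal sketch in
Python. It is only a model, not Spark or Hadoop code: the function names are
made up for illustration, and the Hadoop defaults are hard-coded from the
description above (v1 before Hadoop 3.0, v2 from 3.0 on).

```python
from typing import Mapping

# Hadoop-side key; Spark forwards "spark.hadoop.*" settings to the Hadoop
# configuration with the "spark.hadoop." prefix removed.
HADOOP_KEY = "mapreduce.fileoutputcommitter.algorithm.version"
SPARK_KEY = "spark.hadoop." + HADOOP_KEY


def hadoop_builtin_default(hadoop_major: int) -> int:
    """Hypothetical helper: the committer's built-in default per Hadoop line."""
    # Assumption taken from this message: v2 from Hadoop 3.0, v1 before.
    return 2 if hadoop_major >= 3 else 1


def effective_version(spark_conf: Mapping[str, str], hadoop_major: int) -> int:
    """Version the committer ends up using when Spark sets no default itself."""
    value = spark_conf.get(SPARK_KEY)
    if value is not None:
        # An explicit user setting always wins.
        return int(value)
    # Otherwise the result depends on the Hadoop version on the classpath.
    return hadoop_builtin_default(hadoop_major)
```

With an empty configuration, this model gives v1 on Hadoop 2.x and v2 on
Hadoop 3.x, which is the mismatch with the documented default.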

Would it make sense for us to set v1 as the default in code when the
configuration has not been set by the user?
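
Roughly, the idea would look like the sketch below. This is Python used as
pseudocode for the proposal, not an actual patch: with_v1_default is a
hypothetical name, and a plain dict stands in for the Spark configuration.

```python
from typing import Dict, Mapping

SPARK_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"


def with_v1_default(spark_conf: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the configuration that falls back to v1."""
    conf = dict(spark_conf)
    # Only fill in the value when the user has not chosen one, so an
    # explicit "2" is preserved.
    conf.setdefault(SPARK_KEY, "1")
    return conf
```

That way the documented default of 1 would hold regardless of the Hadoop
version, while users could still opt in to v2 explicitly.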

Regards,

Waleed
