Github user megaserg commented on the issue:
https://github.com/apache/spark/pull/20704
Thank you @dongjoon-hyun! This was also affecting our Spark job performance!
We set `mapreduce.fileoutputcommitter.algorithm.version=2` in our
Spark job config, as recommended e.g. in
http://spark.apache.org/docs/latest/cloud-integration.html, and we run
with a user-provided Hadoop 2.9.0.
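For context, a minimal sketch of how that setting is typically passed to a job (the application JAR and class names here are hypothetical placeholders):

```shell
# Request the v2 (fast) output committer algorithm; with a pre-2.7.0
# hadoop-mapreduce-client-core on the classpath this setting is
# silently ignored and the v1 algorithm is used instead.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  --class com.example.MyJob \
  my-job.jar
```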
However, since this 2.6.5 JAR was in spark/jars, it took priority on
the classpath over the Hadoop-distributed 2.9.0 JAR. The 2.6.5 JAR
silently ignored the `mapreduce.fileoutputcommitter.algorithm.version`
setting and used the default, slow algorithm (I believe
hadoop-mapreduce-client-core had only the one, slow, algorithm until 2.7.0).
I believe this affects everyone who uses any `mapreduce.*` settings with
Spark 2.3.0. Great job!
Can we double-check that this JAR is not present in the "without-hadoop"
Spark distribution anymore?
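One quick way to check, assuming a distribution directory laid out like the official tarballs (the directory name below is a placeholder):

```shell
# List any hadoop-mapreduce-client-core JARs bundled in the distribution;
# the "without-hadoop" build should show no 2.6.5 copy here.
ls spark-2.3.0-bin-without-hadoop/jars/ | grep hadoop-mapreduce-client-core
```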
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]