Github user megaserg commented on the issue:
https://github.com/apache/spark/pull/20704
Thank you @dongjoon-hyun! This was also affecting our Spark job performance!
We set `mapreduce.fileoutputcommitter.algorithm.version=2` in our
Spark job config, as recommended e.g. in
http://spark.apache.org/docs/latest/cloud-integration.html, and we run
with a user-provided Hadoop 2.9.0.
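For context, a minimal sketch of how that setting is typically passed to a job (the application JAR and class names here are hypothetical placeholders):

```shell
# Request the v2 (fast) output committer algorithm; with a pre-2.7.0
# hadoop-mapreduce-client-core on the classpath this setting is
# silently ignored and the v1 algorithm is used instead.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  --class com.example.MyJob \
  my-job.jar
```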
However, since this 2.6.5 JAR was in spark/jars, it took priority on
the classpath over the Hadoop-distributed 2.9.0 JAR. The 2.6.5 JAR
silently ignored the `mapreduce.fileoutputcommitter.algorithm.version`
setting and used the default, slow algorithm (I believe
hadoop-mapreduce-client-core had only the one, slow, algorithm until 2.7.0).
I believe this affects everyone who uses any `mapreduce.*` settings with
Spark 2.3.0. Great job!
Can we double-check that this JAR is not present in the "without-hadoop"
Spark distribution anymore?
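One quick way to check, assuming a distribution directory laid out like the official tarballs (the directory name below is a placeholder):

```shell
# List any hadoop-mapreduce-client-core JARs bundled in the distribution;
# the "without-hadoop" build should show no 2.6.5 copy here.
ls spark-2.3.0-bin-without-hadoop/jars/ | grep hadoop-mapreduce-client-core
```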
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]