[spark] branch master updated: [SPARK-31582][YARN] Being able to not populate Hadoop classpath

dbtsai Wed, 29 Apr 2020 14:13:17 -0700

This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new ecfee82  [SPARK-31582][YARN] Being able to not populate Hadoop 
classpath
ecfee82 is described below

commit ecfee82fda5f0403024ff64f16bc767b8d1e3e3d
Author: DB Tsai <[email protected]>
AuthorDate: Wed Apr 29 21:10:40 2020 +0000

    [SPARK-31582][YARN] Being able to not populate Hadoop classpath
    
    ### What changes were proposed in this pull request?
    We are adding a new Spark Yarn configuration, 
`spark.yarn.populateHadoopClasspath` to not populate Hadoop classpath from 
`yarn.application.classpath` and `mapreduce.application.classpath`.
    
    ### Why are the changes needed?
    Spark Yarn client populates extra Hadoop classpath from 
`yarn.application.classpath` and `mapreduce.application.classpath` when a job 
is submitted to a Yarn Hadoop cluster.
    
    However, for `with-hadoop` Spark build that embeds Hadoop runtime, it can 
cause jar conflicts because Spark distribution can contain different version of 
Hadoop jars.
    
    One case we have is when a user uses an Apache Spark distribution with 
its-own embedded hadoop, and submits a job to Cloudera or Hortonworks Yarn 
clusters, because of two different incompatible Hadoop jars in the classpath, 
it runs into errors.
    
    By not populating the Hadoop classpath from the clusters can address this 
issue.
    
    ### Does this PR introduce any user-facing change?
    No.
    
    ### How was this patch tested?
    An UT is added, but very hard to add a new integration test since this 
requires using different incompatible versions of Hadoop.
    
    We also manually tested this PR, and we are able to submit a Spark job 
using Spark distribution built with Apache Hadoop 2.10 to CDH 5.6 without 
populating CDH classpath.
    
    Closes #28376 from dbtsai/yarn-classpath.
    
    Authored-by: DB Tsai <[email protected]>
    Signed-off-by: DB Tsai <[email protected]>
---
 docs/running-on-yarn.md                               | 11 +++++++++++
 .../scala/org/apache/spark/deploy/yarn/Client.scala   |  5 ++++-
 .../scala/org/apache/spark/deploy/yarn/config.scala   |  9 +++++++++
 .../org/apache/spark/deploy/yarn/ClientSuite.scala    | 19 +++++++++++++++++++
 4 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index eec73e8..166fb87 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -386,6 +386,17 @@ To use a custom metrics.properties for the application 
master and executors, upd
   <td>1.4.0</td>
 </tr>
 <tr>
+  <td><code>spark.yarn.populateHadoopClasspath</code></td>
+  <td>true</td>
+  <td>
+    Whether to populate Hadoop classpath from 
<code>yarn.application.classpath</code> and
+    <code>mapreduce.application.classpath</code> Note that if this is set to 
<code>false</code>, 
+    it requires a <code>with-Hadoop</code> Spark distribution that bundles 
Hadoop runtime or
+    user has to provide a Hadoop installation separately.
+  </td>
+  <td>2.4.6</td>
+</tr>
+<tr>
   <td><code>spark.yarn.maxAppAttempts</code></td>
   <td><code>yarn.resourcemanager.am.max-attempts</code> in YARN</td>
   <td>
diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index 696afaa..6da6a8d 100644
--- 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -1353,7 +1353,10 @@ private object Client extends Logging {
       }
     }
 
-    populateHadoopClasspath(conf, env)
+    if (sparkConf.get(POPULATE_HADOOP_CLASSPATH)) {
+      populateHadoopClasspath(conf, env)
+    }
+
     sys.env.get(ENV_DIST_CLASSPATH).foreach { cp =>
       addClasspathEntry(getClusterPath(sparkConf, cp), env)
     }
diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
index b3a3570..3797491 100644
--- 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
+++ 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
@@ -70,6 +70,15 @@ package object config {
     .booleanConf
     .createWithDefault(false)
 
+  private[spark] val POPULATE_HADOOP_CLASSPATH = 
ConfigBuilder("spark.yarn.populateHadoopClasspath")
+    .doc("Whether to populate Hadoop classpath from 
`yarn.application.classpath` and " +
+      "`mapreduce.application.classpath` Note that if this is set to `false`, 
it requires " +
+      "a `with-Hadoop` Spark distribution that bundles Hadoop runtime or user 
has to provide " +
+      "a Hadoop installation separately.")
+    .version("2.4.6")
+    .booleanConf
+    .createWithDefault(true)
+
   private[spark] val GATEWAY_ROOT_PATH = 
ConfigBuilder("spark.yarn.config.gatewayPath")
     .doc("Root of configuration paths that is present on gateway nodes, and 
will be replaced " +
       "with the corresponding path in cluster machines.")
diff --git 
a/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
 
b/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
index b42c8b9..680ff99 100644
--- 
a/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
+++ 
b/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
@@ -485,6 +485,25 @@ class ClientSuite extends SparkFunSuite with Matchers {
     }
   }
 
+  test("SPARK-31582 Being able to not populate Hadoop classpath") {
+    Seq(true, false).foreach { populateHadoopClassPath =>
+      withAppConf(Fixtures.mapAppConf) { conf =>
+        val sparkConf = new SparkConf()
+          .set(POPULATE_HADOOP_CLASSPATH, populateHadoopClassPath)
+        val env = new MutableHashMap[String, String]()
+        val args = new ClientArguments(Array("--jar", USER))
+        populateClasspath(args, conf, sparkConf, env)
+        if (populateHadoopClassPath) {
+          classpath(env) should
+            (contain (Fixtures.knownYARNAppCP) and contain 
(Fixtures.knownMRAppCP))
+        } else {
+          classpath(env) should
+            (not contain (Fixtures.knownYARNAppCP) and not contain 
(Fixtures.knownMRAppCP))
+        }
+      }
+    }
+  }
+
   private val matching = Seq(
     ("files URI match test1", "file:///file1", "file:///file2"),
     ("files URI match test2", "file:///c:file1", "file://c:file2"),


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[spark] branch master updated: [SPARK-31582][YARN] Being able to not populate Hadoop classpath

Reply via email to