spark git commit: [SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn
Repository: spark Updated Branches: refs/heads/master 31ca741ae -> 91575cac3 [SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn ## What changes were proposed in this pull request? Currently when running spark on yarn, jars specified with --jars, --packages will be added twice, one is Spark's own file server, another is yarn's distributed cache, this can be seen from log: for example: ``` ./bin/spark-shell --master yarn-client --jars examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar ``` If specified the jar to be added is scopt jar, it will added twice: ``` ... 16/07/14 15:06:48 INFO Server: Started 5603ms 16/07/14 15:06:48 INFO Utils: Successfully started service 'SparkUI' on port 4040. 16/07/14 15:06:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040 16/07/14 15:06:48 INFO SparkContext: Added JAR file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar at spark://192.168.0.102:63996/jars/scopt_2.11-3.3.0.jar with timestamp 1468480008637 16/07/14 15:06:49 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 16/07/14 15:06:49 INFO Client: Requesting a new application from cluster with 1 NodeManagers 16/07/14 15:06:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container) 16/07/14 15:06:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 16/07/14 15:06:49 INFO Client: Setting up container launch context for our AM 16/07/14 15:06:49 INFO Client: Setting up the launch environment for our AM container 16/07/14 15:06:49 INFO Client: Preparing resources for our AM container 16/07/14 15:06:49 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 16/07/14 15:06:50 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g4gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_libs__6486179704064718817.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_libs__6486179704064718817.zip 16/07/14 15:06:51 INFO Client: Uploading resource file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/scopt_2.11-3.3.0.jar 16/07/14 15:06:51 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g4gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_conf__326416236462420861.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_conf__.zip ... ``` So here try to avoid adding jars to Spark's fileserver unnecessarily. ## How was this patch tested? Manually verified both in yarn client and cluster mode, also in standalone mode. Author: jerryshaoCloses #14196 from jerryshao/SPARK-16540. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/91575cac Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/91575cac Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/91575cac Branch: refs/heads/master Commit: 91575cac32e470d7079a55fb86d66332aba599d0 Parents: 31ca741 Author: jerryshao Authored: Thu Jul 14 10:40:59 2016 -0700 Committer: Marcelo Vanzin Committed: Thu Jul 14 10:40:59 2016 -0700 -- core/src/main/scala/org/apache/spark/util/Utils.scala| 4 ++-- .../src/main/scala/org/apache/spark/repl/SparkILoop.scala| 2 +- repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/91575cac/core/src/main/scala/org/apache/spark/util/Utils.scala -- diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala index 2e4ec4c..6ab9e99 100644 --- a/core/src/main/scala/org/apache/spark/util/Utils.scala +++ b/core/src/main/scala/org/apache/spark/util/Utils.scala @@ -2409,9 +2409,9 @@ private[spark] object Utils extends Logging { * "spark.yarn.dist.jars" properties, while in other modes it returns the jar files pointed by * only the "spark.jars" property. */ - def getUserJars(conf: SparkConf): Seq[String] = { + def getUserJars(conf: SparkConf, isShell: Boolean = false): Seq[String] = { val sparkJars = conf.getOption("spark.jars") -if (conf.get("spark.master") == "yarn") { +if (conf.get("spark.master") == "yarn" && isShell) { val yarnJars = conf.getOption("spark.yarn.dist.jars")
spark git commit: [SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn
Repository: spark Updated Branches: refs/heads/branch-2.0 23e1ab9c7 -> 1fe0bcdd0 [SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn ## What changes were proposed in this pull request? Currently when running spark on yarn, jars specified with --jars, --packages will be added twice, one is Spark's own file server, another is yarn's distributed cache, this can be seen from log: for example: ``` ./bin/spark-shell --master yarn-client --jars examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar ``` If specified the jar to be added is scopt jar, it will added twice: ``` ... 16/07/14 15:06:48 INFO Server: Started 5603ms 16/07/14 15:06:48 INFO Utils: Successfully started service 'SparkUI' on port 4040. 16/07/14 15:06:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040 16/07/14 15:06:48 INFO SparkContext: Added JAR file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar at spark://192.168.0.102:63996/jars/scopt_2.11-3.3.0.jar with timestamp 1468480008637 16/07/14 15:06:49 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 16/07/14 15:06:49 INFO Client: Requesting a new application from cluster with 1 NodeManagers 16/07/14 15:06:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container) 16/07/14 15:06:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 16/07/14 15:06:49 INFO Client: Setting up container launch context for our AM 16/07/14 15:06:49 INFO Client: Setting up the launch environment for our AM container 16/07/14 15:06:49 INFO Client: Preparing resources for our AM container 16/07/14 15:06:49 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 16/07/14 15:06:50 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g4gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_libs__6486179704064718817.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_libs__6486179704064718817.zip 16/07/14 15:06:51 INFO Client: Uploading resource file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/scopt_2.11-3.3.0.jar 16/07/14 15:06:51 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g4gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_conf__326416236462420861.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_conf__.zip ... ``` So here try to avoid adding jars to Spark's fileserver unnecessarily. ## How was this patch tested? Manually verified both in yarn client and cluster mode, also in standalone mode. Author: jerryshaoCloses #14196 from jerryshao/SPARK-16540. (cherry picked from commit 91575cac32e470d7079a55fb86d66332aba599d0) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1fe0bcdd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1fe0bcdd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1fe0bcdd Branch: refs/heads/branch-2.0 Commit: 1fe0bcdd0bf39dd4993bf2ec35f66eec1b949f5b Parents: 23e1ab9 Author: jerryshao Authored: Thu Jul 14 10:40:59 2016 -0700 Committer: Marcelo Vanzin Committed: Thu Jul 14 10:41:17 2016 -0700 -- core/src/main/scala/org/apache/spark/util/Utils.scala| 4 ++-- .../src/main/scala/org/apache/spark/repl/SparkILoop.scala| 2 +- repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1fe0bcdd/core/src/main/scala/org/apache/spark/util/Utils.scala -- diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala index a79d195..be1ae40 100644 --- a/core/src/main/scala/org/apache/spark/util/Utils.scala +++ b/core/src/main/scala/org/apache/spark/util/Utils.scala @@ -2405,9 +2405,9 @@ private[spark] object Utils extends Logging { * "spark.yarn.dist.jars" properties, while in other modes it returns the jar files pointed by * only the "spark.jars" property. */ - def getUserJars(conf: SparkConf): Seq[String] = { + def getUserJars(conf: SparkConf, isShell: Boolean = false): Seq[String] = { val sparkJars = conf.getOption("spark.jars") -if (conf.get("spark.master") == "yarn") { +if