spark git commit: [SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn

2016-07-14 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master 31ca741ae -> 91575cac3


[SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn

## What changes were proposed in this pull request?

Currently, when running Spark on YARN, jars specified with --jars or --packages are added 
twice: once to Spark's own file server and once to YARN's distributed cache. This can be 
seen in the log, for example:

```
./bin/spark-shell --master yarn-client --jars examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar
```

If the jar specified is the scopt jar, it will be added twice:

```
...
16/07/14 15:06:48 INFO Server: Started 5603ms
16/07/14 15:06:48 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/07/14 15:06:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040
16/07/14 15:06:48 INFO SparkContext: Added JAR file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar at spark://192.168.0.102:63996/jars/scopt_2.11-3.3.0.jar with timestamp 1468480008637
16/07/14 15:06:49 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/07/14 15:06:49 INFO Client: Requesting a new application from cluster with 1 NodeManagers
16/07/14 15:06:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/07/14 15:06:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/07/14 15:06:49 INFO Client: Setting up container launch context for our AM
16/07/14 15:06:49 INFO Client: Setting up the launch environment for our AM container
16/07/14 15:06:49 INFO Client: Preparing resources for our AM container
16/07/14 15:06:49 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/07/14 15:06:50 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g4gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_libs__6486179704064718817.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_libs__6486179704064718817.zip
16/07/14 15:06:51 INFO Client: Uploading resource file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/scopt_2.11-3.3.0.jar
16/07/14 15:06:51 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g4gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_conf__326416236462420861.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_conf__.zip
...
```

This patch therefore avoids adding these jars to Spark's file server unnecessarily.
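
In outline, the change threads an isShell flag through Utils.getUserJars: only the shell asks 
for the union of spark.jars and spark.yarn.dist.jars (so user jars still end up on the REPL 
classpath), while other callers such as SparkContext keep the default and no longer push 
YARN-distributed jars through the driver's file server. A rough sketch of the two kinds of 
call sites, assuming names from the files touched by this patch (the surrounding code is 
paraphrased, not copied from the diff):

```scala
// Sketch only: paraphrased call sites, not the literal code from this patch.
// Utils is private[spark], so this would live inside the Spark source tree.
import java.io.File
import org.apache.spark.SparkConf
import org.apache.spark.util.Utils

val conf = new SparkConf()

// SparkContext-style caller: default isShell = false, so on YARN only spark.jars
// is returned and jars listed in spark.yarn.dist.jars are left to the
// distributed cache instead of being re-added to the driver's file server.
val driverJars: Seq[String] = Utils.getUserJars(conf)

// REPL-style caller (spark-shell): isShell = true, so the union of both
// properties is used to build the interpreter classpath.
val replClasspath: String =
  Utils.getUserJars(conf, isShell = true).mkString(File.pathSeparator)
```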

## How was this patch tested?

Manually verified in YARN client and cluster modes, as well as in standalone mode.

Author: jerryshao 

Closes #14196 from jerryshao/SPARK-16540.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/91575cac
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/91575cac
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/91575cac

Branch: refs/heads/master
Commit: 91575cac32e470d7079a55fb86d66332aba599d0
Parents: 31ca741
Author: jerryshao 
Authored: Thu Jul 14 10:40:59 2016 -0700
Committer: Marcelo Vanzin 
Committed: Thu Jul 14 10:40:59 2016 -0700

--
 core/src/main/scala/org/apache/spark/util/Utils.scala| 4 ++--
 .../src/main/scala/org/apache/spark/repl/SparkILoop.scala| 2 +-
 repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala  | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/91575cac/core/src/main/scala/org/apache/spark/util/Utils.scala
--
diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala 
b/core/src/main/scala/org/apache/spark/util/Utils.scala
index 2e4ec4c..6ab9e99 100644
--- a/core/src/main/scala/org/apache/spark/util/Utils.scala
+++ b/core/src/main/scala/org/apache/spark/util/Utils.scala
@@ -2409,9 +2409,9 @@ private[spark] object Utils extends Logging {
* "spark.yarn.dist.jars" properties, while in other modes it returns the 
jar files pointed by
* only the "spark.jars" property.
*/
-  def getUserJars(conf: SparkConf): Seq[String] = {
+  def getUserJars(conf: SparkConf, isShell: Boolean = false): Seq[String] = {
 val sparkJars = conf.getOption("spark.jars")
-if (conf.get("spark.master") == "yarn") {
+if (conf.get("spark.master") == "yarn" && isShell) {
   val yarnJars = conf.getOption("spark.yarn.dist.jars")
   
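
The hunk above is cut off in this archive. For readability, here is a minimal sketch of what 
the patched method plausibly looks like, reconstructed from the visible context; the else 
branch and the unionFileLists helper are assumptions about the surrounding Utils.scala code, 
not part of the diff shown in this mail:

```scala
// Sketch only: reconstructed from the truncated hunk above. The else branch and
// unionFileLists are assumed from surrounding code, not shown in this mail.
def getUserJars(conf: SparkConf, isShell: Boolean = false): Seq[String] = {
  val sparkJars = conf.getOption("spark.jars")
  if (conf.get("spark.master") == "yarn" && isShell) {
    // Only the shell needs the union: it places these jars on the REPL classpath.
    val yarnJars = conf.getOption("spark.yarn.dist.jars")
    unionFileLists(sparkJars, yarnJars).toSeq
  } else {
    // All other callers (including SparkContext) see only spark.jars, so jars in
    // spark.yarn.dist.jars are handled solely by YARN's distributed cache.
    sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
  }
}
```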

spark git commit: [SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn

2016-07-14 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 23e1ab9c7 -> 1fe0bcdd0


[SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn

## What changes were proposed in this pull request?

Currently, when running Spark on YARN, jars specified with --jars or --packages are added 
twice: once to Spark's own file server and once to YARN's distributed cache. This can be 
seen in the log, for example:

```
./bin/spark-shell --master yarn-client --jars examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar
```

If the jar specified is the scopt jar, it will be added twice:

```
...
16/07/14 15:06:48 INFO Server: Started 5603ms
16/07/14 15:06:48 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/07/14 15:06:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040
16/07/14 15:06:48 INFO SparkContext: Added JAR file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar at spark://192.168.0.102:63996/jars/scopt_2.11-3.3.0.jar with timestamp 1468480008637
16/07/14 15:06:49 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/07/14 15:06:49 INFO Client: Requesting a new application from cluster with 1 NodeManagers
16/07/14 15:06:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/07/14 15:06:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/07/14 15:06:49 INFO Client: Setting up container launch context for our AM
16/07/14 15:06:49 INFO Client: Setting up the launch environment for our AM container
16/07/14 15:06:49 INFO Client: Preparing resources for our AM container
16/07/14 15:06:49 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/07/14 15:06:50 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g4gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_libs__6486179704064718817.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_libs__6486179704064718817.zip
16/07/14 15:06:51 INFO Client: Uploading resource file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/scopt_2.11-3.3.0.jar
16/07/14 15:06:51 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g4gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_conf__326416236462420861.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_conf__.zip
...
```

This patch therefore avoids adding these jars to Spark's file server unnecessarily.
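
For context, on YARN the --jars value ends up in spark.yarn.dist.jars (the "Uploading 
resource ..." lines above come from the YARN Client handling that property), while 
spark.jars is what drives SparkContext.addJar and the driver's file server (the 
"Added JAR ... at spark://..." line). A small, purely illustrative snippet of which property 
feeds which path; the routing of --jars described here is an assumption about spark-submit's 
YARN handling, not something shown in this diff:

```scala
import org.apache.spark.SparkConf

// Illustration only: which config property feeds which distribution path on YARN.
val conf = new SparkConf()

// Jars uploaded to the application's staging directory by the YARN Client
// ("Uploading resource ..." log lines above).
val yarnDistJars: Option[String] = conf.getOption("spark.yarn.dist.jars")

// Jars served over the driver's file server via SparkContext.addJar
// ("Added JAR ... at spark://..." log line above). After this patch, non-shell
// callers of getUserJars no longer see spark.yarn.dist.jars folded into this list.
val sparkJars: Option[String] = conf.getOption("spark.jars")
```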

## How was this patch tested?

Manually verified in YARN client and cluster modes, as well as in standalone mode.

Author: jerryshao 

Closes #14196 from jerryshao/SPARK-16540.

(cherry picked from commit 91575cac32e470d7079a55fb86d66332aba599d0)
Signed-off-by: Marcelo Vanzin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1fe0bcdd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1fe0bcdd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1fe0bcdd

Branch: refs/heads/branch-2.0
Commit: 1fe0bcdd0bf39dd4993bf2ec35f66eec1b949f5b
Parents: 23e1ab9
Author: jerryshao 
Authored: Thu Jul 14 10:40:59 2016 -0700
Committer: Marcelo Vanzin 
Committed: Thu Jul 14 10:41:17 2016 -0700

--
 core/src/main/scala/org/apache/spark/util/Utils.scala| 4 ++--
 .../src/main/scala/org/apache/spark/repl/SparkILoop.scala| 2 +-
 repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala  | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1fe0bcdd/core/src/main/scala/org/apache/spark/util/Utils.scala
--
diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala 
b/core/src/main/scala/org/apache/spark/util/Utils.scala
index a79d195..be1ae40 100644
--- a/core/src/main/scala/org/apache/spark/util/Utils.scala
+++ b/core/src/main/scala/org/apache/spark/util/Utils.scala
@@ -2405,9 +2405,9 @@ private[spark] object Utils extends Logging {
* "spark.yarn.dist.jars" properties, while in other modes it returns the 
jar files pointed by
* only the "spark.jars" property.
*/
-  def getUserJars(conf: SparkConf): Seq[String] = {
+  def getUserJars(conf: SparkConf, isShell: Boolean = false): Seq[String] = {
 val sparkJars = conf.getOption("spark.jars")
-if (conf.get("spark.master") == "yarn") {
+if