GitHub user jerryshao opened a pull request:

    https://github.com/apache/spark/pull/14196

    [SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn

    ## What changes were proposed in this pull request?
    
    Currently, when running Spark on YARN, jars specified with `--jars` or `--packages` are added twice: once to Spark's own file server and once to YARN's distributed cache. This can be seen in the log, for example:
    
    ```
    ./bin/spark-shell --master yarn-client --jars examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar
    ```
    
    Here the jar to be added is the scopt jar, and the log shows it being added twice:
    
    ```
    ...
    16/07/14 15:06:48 INFO Server: Started @5603ms
    16/07/14 15:06:48 INFO Utils: Successfully started service 'SparkUI' on port 4040.
    16/07/14 15:06:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040
    16/07/14 15:06:48 INFO SparkContext: Added JAR file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar at spark://192.168.0.102:63996/jars/scopt_2.11-3.3.0.jar with timestamp 1468480008637
    16/07/14 15:06:49 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    16/07/14 15:06:49 INFO Client: Requesting a new application from cluster with 1 NodeManagers
    16/07/14 15:06:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
    16/07/14 15:06:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
    16/07/14 15:06:49 INFO Client: Setting up container launch context for our AM
    16/07/14 15:06:49 INFO Client: Setting up the launch environment for our AM container
    16/07/14 15:06:49 INFO Client: Preparing resources for our AM container
    16/07/14 15:06:49 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
    16/07/14 15:06:50 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_libs__6486179704064718817.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_libs__6486179704064718817.zip
    16/07/14 15:06:51 INFO Client: Uploading resource file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/scopt_2.11-3.3.0.jar
    16/07/14 15:06:51 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_conf__326416236462420861.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_conf__.zip
    ...
    ```
    
    In the log above, the `Added JAR ... at spark://...` line shows the jar being registered with Spark's own file server, while the `Uploading resource ... -> hdfs://...` line shows the same jar being shipped through YARN's distributed cache. This patch therefore avoids adding such jars to Spark's file server unnecessarily, as sketched below.
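    The following is an illustrative sketch only, not the actual change in this PR; the object and helper name are made up for illustration. It shows the general idea: skip registering user jars with the driver's file server when the YARN client will already distribute them via the distributed cache.
    
    ```scala
    // Illustrative sketch (hypothetical helper, not the real patch):
    // decide which user jars still need to go through Spark's file server.
    object JarDeduplication {
      // `master` is the --master value; `userJars` are the jars resolved
      // from --jars / --packages.
      def jarsForFileServer(master: String, userJars: Seq[String]): Seq[String] = {
        if (master.startsWith("yarn")) {
          // On YARN, yarn.Client uploads these jars to the staging directory
          // (distributed cache), so the driver need not serve them again.
          Seq.empty
        } else {
          // Other cluster managers still rely on the driver's file server.
          userJars
        }
      }
    
      // Example: on YARN nothing is registered with the file server.
      // jarsForFileServer("yarn-client", Seq("scopt_2.11-3.3.0.jar")) == Seq.empty
    }
    ```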
    
    ## How was this patch tested?
    
    Manually verified in both YARN client and cluster mode, and also in standalone mode.
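    
    If the change works as intended, rerunning the spark-shell command above should show the scopt jar only in the `Uploading resource ... -> hdfs://...` lines from the YARN client, with no corresponding `Added JAR ... at spark://...` line from SparkContext.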
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-16540

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14196.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14196
    
----
commit 86205fcef29515ba72809fc2541e5d6aacfa76a7
Author: jerryshao <[email protected]>
Date:   2016-07-14T06:56:22Z

    Avoid adding jars twice for Spark running on yarn

----

