[GitHub] spark pull request: [SPARK-2669] [yarn] Distribute client configur...

vanzin Wed, 21 Jan 2015 11:48:40 -0800

GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/4142


    [SPARK-2669] [yarn] Distribute client configuration to AM.

    Currently, when Spark launches the YARN AM, the AM process uses
    the local Hadoop configuration on the node where it starts, if
    one is present. It is more correct to use the same configuration
    that was used to launch the Spark job, since the user may have
    modified it (for example, by adding app-specific configs).
    
    The approach taken here is to use the distributed cache to make
    all files in the Hadoop configuration directory available to the
    AM. This is somewhat overkill, since only the AM needs them (the
    executors use the Hadoop configuration broadcast from the driver),
    but it is the easier approach.
    
    Even though only a few files in that directory may end up being
    used, all of them are uploaded. This supports use cases such as
    auxiliary configuration files for SSL, or uploading a Hive
    configuration directory. Not all of these files may be reflected
    in an o.a.h.conf.Configuration object, but they may be needed
    when a driver in cluster mode instantiates, for example, a
    HiveConf object instead.
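
    As a rough illustration of the "upload everything in the directory"
    idea described above, the sketch below gathers every regular file in
    a configuration directory and maps each one to the name it would be
    linked under in the container. It is not the code from this pull
    request: the function name collect_conf_files is hypothetical, the
    use of the HADOOP_CONF_DIR environment variable is an assumption,
    and the actual registration with the YARN distributed cache is left
    out entirely.

```python
import os
from pathlib import Path


def collect_conf_files(conf_dir):
    """Return {link_name: absolute_path} for each regular file in conf_dir.

    Hypothetical helper, for illustration only. Every file directly inside
    the directory is included, not just the well-known *-site.xml files, so
    that auxiliary files (for example SSL or Hive settings) travel with the
    job. Subdirectories are skipped.
    """
    if not conf_dir:
        # No directory configured: nothing to distribute.
        return {}
    root = Path(conf_dir)
    if not root.is_dir():
        return {}
    return {
        entry.name: str(entry.resolve())
        for entry in sorted(root.iterdir())
        if entry.is_file()
    }


if __name__ == "__main__":
    # Assumption: the client's configuration directory is named by
    # HADOOP_CONF_DIR. Prints one "name -> path" line per file found.
    for name, path in collect_conf_files(os.environ.get("HADOOP_CONF_DIR", "")).items():
        print(f"{name} -> {path}")
```

    Each entry in the returned mapping would then be handed to whatever
    mechanism ships files to the AM container. Including every file,
    rather than filtering by name, mirrors the trade-off described
    above: a few unused files are uploaded in exchange for not having
    to know in advance which ones a HiveConf or SSL setup will read.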

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-2669

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4142.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4142
    
----
commit 79221c712e9d8794a3e36f08fcd29f247bb9138f
Author: Marcelo Vanzin <[email protected]>
Date:   2015-01-21T19:40:49Z

    [SPARK-2669] [yarn] Distribute client configuration to AM.
    

----

