GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/12487

    [SPARK-14602][yarn] Use SparkConf to propagate the list of cached files.

    This change avoids using the environment to pass this information, since
    with many jars it's easy to hit environment size limits on certain OSes.
    Instead, it encodes the information into the Spark configuration propagated
    to the AM.
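    The rough shape of the approach is sketched below in Scala; the CacheEntry
    class, helper name, and "spark.yarn.cache.*" key names are illustrative
    assumptions for this sketch, not necessarily the exact ones in the patch:

        import org.apache.spark.SparkConf

        // Illustrative description of one distributed-cache entry.
        case class CacheEntry(uri: String, size: Long, modTime: Long, visibility: String)

        // Instead of exporting the manifest through environment variables,
        // serialize it into the SparkConf that is shipped to the AM.
        def propagateCache(conf: SparkConf, entries: Seq[CacheEntry]): Unit = {
          conf.set("spark.yarn.cache.filenames", entries.map(_.uri).mkString(","))
          conf.set("spark.yarn.cache.sizes", entries.map(_.size).mkString(","))
          conf.set("spark.yarn.cache.timestamps", entries.map(_.modTime).mkString(","))
          conf.set("spark.yarn.cache.visibilities", entries.map(_.visibility).mkString(","))
        }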
    
    The first problem that needed to be solved is a chicken-and-egg issue: the
    config file is distributed using the cache, and it needs to contain information
    about the files that are being distributed. To solve that, the code now treats
    the config archive specially, and uses slightly different code to distribute
    it, so that only its cache path needs to be saved to the config file.
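    A hedged sketch of that special case, assuming a hypothetical upload helper
    (only the archive's own cache path goes into the config file, never a full
    manifest entry for the archive itself):

        import org.apache.spark.SparkConf

        // Hypothetical helper standing in for the real upload code: copies a
        // local file to the YARN staging dir and returns its remote path.
        def uploadToCache(archive: java.io.File): String =
          s"hdfs:///user/spark/.sparkStaging/${archive.getName}"

        // The config archive is uploaded like any other cached file, but the
        // config file inside it only records the archive's own cache path.
        def distributeConfArchive(conf: SparkConf, archive: java.io.File): Unit = {
          val cachePath = uploadToCache(archive)
          conf.set("spark.yarn.cache.confArchive", cachePath)  // illustrative key
        }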
    
    The second problem is that the extra information would show up in the Web UI,
    which made the environment tab even noisier than it already is when lots
    of jars are listed. This is solved by two changes: the list of cached files
    is now read only once in the AM, and propagated down to the ExecutorRunnable
    code (which actually sends the list to the NMs when starting containers). The
    second change is to unset those config entries after the list is read, so that
    the SparkContext never sees them.
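    On the AM side, that flow could look roughly like this (again a sketch with
    illustrative key names; conf is the SparkConf loaded from the config archive):

        import org.apache.spark.SparkConf

        def readAndClearCacheManifest(conf: SparkConf): Map[String, String] = {
          // Read the manifest entries exactly once...
          val cacheKeys = conf.getAll.map(_._1).filter(_.startsWith("spark.yarn.cache."))
          val manifest = cacheKeys.map(k => k -> conf.get(k)).toMap
          // ...then unset them so the SparkContext (and the UI's environment
          // tab) never sees them. The map is handed down to ExecutorRunnable.
          cacheKeys.foreach(conf.remove)
          manifest
        }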
    
    Tested with both client and cluster mode by running "run-example SparkPi". This
    uploads a whole lot of files when run from a build dir (instead of a
    distribution, where the list is cleaned up), and I verified that the configs
    do not show up in the UI.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-14602

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12487.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12487
    
----
commit 2a2761a1a870c6a761bff230dec589b1c043bf39
Author: Marcelo Vanzin <[email protected]>
Date:   2016-04-18T18:38:20Z

    [SPARK-14602][yarn] Use SparkConf to propagate the list of cached files.
    
    This change avoids using the environment to pass this information, since
    with many jars it's easy to hit environment size limits on certain OSes.
    Instead, it encodes the information into the Spark configuration propagated
    to the AM.
    
    The first problem that needed to be solved is a chicken-and-egg issue: the
    config file is distributed using the cache, and it needs to contain information
    about the files that are being distributed. To solve that, the code now treats
    the config archive specially, and uses slightly different code to distribute
    it, so that only its cache path needs to be saved to the config file.
    
    The second problem is that the extra information would show up in the Web UI,
    which made the environment tab even noisier than it already is when lots
    of jars are listed. This is solved by two changes: the list of cached files
    is now read only once in the AM, and propagated down to the ExecutorRunnable
    code (which actually sends the list to the NMs when starting containers). The
    second change is to unset those config entries after the list is read, so that
    the SparkContext never sees them.
    
    Tested with both client and cluster mode by running "run-example SparkPi". This
    uploads a whole lot of files when run from a build dir (instead of a
    distribution, where the list is cleaned up), and I verified that the configs
    do not show up in the UI.

----

