[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...

andrewor14 Wed, 10 Sep 2014 16:55:49 -0700

GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/2350


    [SPARK-3477] Clean up code in Yarn Client / ClientBase

    This is part of a broader effort to clean up the Yarn integration code 
after #2020.
    
    The high-level changes in this PR include:
    - Removing duplicate code, especially across the alpha and stable APIs
    - Simplify unnecessarily complex method signatures and hierarchies
    - Rename unclear variable and method names
    - Organize logging output produced when the user runs Spark on Yarn
    - Extensively add documentation
    - Privatize classes where possible
    
    I have tested the stable API on a Hadoop 2.4 cluster. I tested submitting a 
jar that references classes in other jars in both client and cluster mode. I 
also made changes in the alpha API, though I do not have access to an alpha 
cluster. I have verified that it compiles, but it would be ideal if others can 
help test it.
    
    For those interested in some examples in detail, please read on.
    
    
--------------------------------------------------------------------------------------------------------
    
    ***Appendix***
    
    - The loop to `getApplicationReport` from the RM is duplicated in 4 places: 
in the stable `Client`, alpha `Client`, and twice in 
`YarnClientSchedulerBackend`. We should not have different loops for client and 
cluster deploy modes.
    - There are many fragmented small helper methods that are only used once 
and should just be inlined. For instance, `ClientBase#getLocalPath` returns 
`null` on certain conditions, and its only caller 
`ClientBase#addFileToClasspath` checks whether the value returned is `null`. We 
could just have the caller check on that same condition to avoid passing 
`null`s around.
    - In `YarnSparkHadoopUtil#addToEnvironment`, we take in an argument 
`classpathSeparator` that always has the same value upstream (i.e. 
`File.separator`, which by the way is simply wrong; it should be 
`File.pathSeparator`). This argument is now removed from the signature and all 
callers of this method upstream.
    - `ClientBase#copyRemoteFile` is now renamed to `copyFileToRemote`. It was 
unclear whether we are copying a remote file to our local file system, or 
copying a locally visible file to a remote file system. Also, even the content 
of the method has inaccurately named variables. We use `val remoteFs` to 
signify the file system of the locally visible file and `val fs` to signify the 
remote, destination file system. These are now renamed `srcFs` and `destFs` 
respectively.
    - We currently log the AM container's environment and resource mappings 
directly as Scala collections. This is incredibly hard to read and probably too 
verbose for the average Spark user. In other modes (e.g. standalone), we also 
don't log the launch commands by default, so the logging level of these 
information is now set to `DEBUG`.
    - None of these classes (`Client`, `ClientBase`, `YarnSparkHadoopUtil` 
etc.) is intended to be used by a Spark application (the user should go through 
Spark submit instead). At the very least they should be `private[spark]`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark yarn-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2350.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2350
    
----
commit 3f941dc0f3ba3ed43cc82e993c828ce8af28c93b
Author: Andrew Or <[email protected]>
Date:   2014-09-09T05:21:43Z

    First cut at simplifying the Client (stable and alpha)
    
    Clearly, a lot of documentation and further work needs to be done
    in ClientBase.scala. This is only the first checkpoint.

commit fabe4c43780c6d11e70d6ce11cfb7b94e54d56fd
Author: Andrew Or <[email protected]>
Date:   2014-09-09T19:26:35Z

    Reuse more code in YarnClientSchedulerBackend
    
    We implement a while loop to monitor an application's state in
    four separate places (stable/Client, alpha/Client, and twice in
    YarnClientSchedulerBackend). This commit reduces this to one.

commit 6de9072d8cc3efbbc1c29a7fc0be6dab51e11a04
Author: Andrew Or <[email protected]>
Date:   2014-09-09T19:48:49Z

    Guard against potential NPE in debug logging mode
    
    The getClientToken (or getClientToAMToken) method in Hadoop
    apparently returns null sometimes. We need to prevent NPEs that
    result from this. Yay documentation.

commit ef7069aee933a97ae60c112d4d2b581678f27e87
Author: Andrew Or <[email protected]>
Date:   2014-09-09T20:32:22Z

    Clean up YarnClientSchedulerBackend more

commit 6c94d79cfe6c007c1ac4f21aeb7e7ed37dc8316c
Author: Andrew Or <[email protected]>
Date:   2014-09-09T23:00:59Z

    Various cleanups in ClientBase and ClientArguments

commit 8766d377e24904c12c820175f8db17e8549e9835
Author: Andrew Or <[email protected]>
Date:   2014-09-10T02:05:51Z

    Heavily add documentation to Client* classes + various clean-ups

commit e4779b6846c9c6228513d03db9f8dfcc595e53de
Author: Andrew Or <[email protected]>
Date:   2014-09-10T20:18:24Z

    Clean up log messages + variable naming in ClientBase

commit 6573c1d4d175ed3c103a9979f87cf9853421b8e0
Author: Andrew Or <[email protected]>
Date:   2014-09-10T22:30:54Z

    Clean up, simplify and document code for setting classpaths

commit 6d74888478140a794be719d1dbe60aed9883ef9b
Author: Andrew Or <[email protected]>
Date:   2014-09-10T22:36:22Z

    Minor comment changes

commit c0587b4d734b5fe6a669eef6ba1a8b6803ced73f
Author: Andrew Or <[email protected]>
Date:   2014-09-10T22:38:05Z

    Merge branch 'master' of github.com:apache/spark into yarn-cleanup

commit ed0b42d8f375fd2219f9626c280063a103aa15a0
Author: Andrew Or <[email protected]>
Date:   2014-09-10T23:04:18Z

    Fix alpha compilation error

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-3477] Clean up code in Yarn Client / Cl...

Reply via email to