GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/2350
[SPARK-3477] Clean up code in Yarn Client / ClientBase
This is part of a broader effort to clean up the Yarn integration code
after #2020.
The high-level changes in this PR include:
- Removing duplicate code, especially across the alpha and stable APIs
- Simplifying unnecessarily complex method signatures and hierarchies
- Renaming unclear variables and methods
- Organizing the logging output produced when the user runs Spark on Yarn
- Adding extensive documentation
- Privatizing classes where possible
I have tested the stable API on a Hadoop 2.4 cluster, submitting a jar that
references classes in other jars in both client and cluster mode. I also made
changes to the alpha API, though I do not have access to an alpha cluster. I
have verified that it compiles, but it would be ideal if others could help
test it.
For those interested in some examples in detail, please read on.
--------------------------------------------------------------------------------------------------------
***Appendix***
- The loop that polls `getApplicationReport` from the RM is duplicated in 4 places:
in the stable `Client`, alpha `Client`, and twice in
`YarnClientSchedulerBackend`. We should not have different loops for client and
cluster deploy modes.
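A minimal sketch of what a single shared polling loop could look like; `getReport` and `isFinished` here are hypothetical stand-ins for `YarnClient#getApplicationReport` and the terminal-state check:

```scala
// Sketch: one polling loop shared by both deploy modes, instead of four copies.
object MonitorSketch {
  // Poll for a report until the predicate says the application has finished,
  // then return the final report.
  def monitorApplication[R](getReport: () => R,
                            isFinished: R => Boolean,
                            intervalMs: Long = 10): R = {
    var report = getReport()
    while (!isFinished(report)) {
      Thread.sleep(intervalMs)
      report = getReport()
    }
    report
  }
}
```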
- There are many fragmented small helper methods that are only used once
and should just be inlined. For instance, `ClientBase#getLocalPath` returns
`null` under certain conditions, and its only caller,
`ClientBase#addFileToClasspath`, checks whether the returned value is `null`.
The caller could simply check the same condition directly, avoiding passing
`null`s around.
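A hypothetical illustration of the pattern being removed, modeled loosely on the two methods above:

```scala
import java.io.File

object ClasspathSketch {
  // Before: the helper returns null when the URI has no "local:" scheme,
  // forcing the caller to null-check.
  def getLocalPath(uri: String): String =
    if (uri.startsWith("local:")) uri.stripPrefix("local:") else null

  // After: the caller checks the same condition directly, so no null ever
  // crosses a method boundary.
  def addFileToClasspath(uri: String, classpath: StringBuilder): Unit = {
    if (uri.startsWith("local:")) {
      if (classpath.nonEmpty) classpath.append(File.pathSeparator)
      classpath.append(uri.stripPrefix("local:"))
    }
  }
}
```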
- In `YarnSparkHadoopUtil#addToEnvironment`, we take in an argument
`classpathSeparator` that always has the same value upstream (i.e.
`File.separator`, which by the way is simply wrong; it should be
`File.pathSeparator`). This argument has been removed from the method
signature and from all upstream callers.
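A sketch of the simplified signature. `File.pathSeparator` (":" on Unix, ";" on Windows) is the correct separator for classpath-style entries; `File.separator` is the directory separator ("/" or "\") and would mangle them:

```scala
import java.io.File
import scala.collection.mutable

object EnvSketch {
  // The classpathSeparator parameter is gone; the correct constant is used
  // directly.
  def addToEnvironment(env: mutable.Map[String, String],
                       key: String,
                       value: String): Unit = {
    env(key) = env.get(key) match {
      case Some(existing) => existing + File.pathSeparator + value
      case None           => value
    }
  }
}
```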
- `ClientBase#copyRemoteFile` is now renamed to `copyFileToRemote`. It was
unclear whether we are copying a remote file to our local file system, or
copying a locally visible file to a remote file system. The method body also
used misleadingly named variables: `val remoteFs` referred to the file system
of the locally visible source file, while `val fs` referred to the remote,
destination file system. These are now renamed `srcFs` and `destFs`
respectively.
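A sketch of the renamed method with the clearer direction of data flow. The real code operates on Hadoop `FileSystem` objects; plain local directories stand in for the two file systems here:

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object CopySketch {
  def copyFileToRemote(srcDir: Path, destDir: Path, fileName: String): Path = {
    val src = srcDir.resolve(fileName)   // on the source (locally visible) side
    val dest = destDir.resolve(fileName) // on the destination (remote) side
    Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING)
  }
}
```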
- We currently log the AM container's environment and resource mappings
directly as Scala collections. This is very hard to read and probably too
verbose for the average Spark user. In other modes (e.g. standalone), we also
don't log the launch commands by default, so this information is now logged
at the `DEBUG` level.
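A sketch of gating the environment dump at debug level and formatting it one entry per line. `logDebug` here mimics the pattern in Spark's `Logging` trait: the by-name parameter means the message string is never built unless debug logging is enabled.

```scala
object LogSketch {
  var debugEnabled = false
  val output = new StringBuilder

  // By-name `msg` is only evaluated when debug logging is on.
  def logDebug(msg: => String): Unit = if (debugEnabled) output.append(msg)

  def logEnvironment(env: Map[String, String]): Unit = {
    logDebug("AM container environment:\n" +
      env.toSeq.sortBy(_._1).map { case (k, v) => s"    $k -> $v" }.mkString("\n"))
  }
}
```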
- None of these classes (`Client`, `ClientBase`, `YarnSparkHadoopUtil`
etc.) is intended to be used by a Spark application (the user should go through
Spark submit instead). At the very least they should be `private[spark]`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark yarn-cleanup
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2350.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2350
----
commit 3f941dc0f3ba3ed43cc82e993c828ce8af28c93b
Author: Andrew Or <[email protected]>
Date: 2014-09-09T05:21:43Z
First cut at simplifying the Client (stable and alpha)
Clearly, a lot of documentation and further work needs to be done
in ClientBase.scala. This is only the first checkpoint.
commit fabe4c43780c6d11e70d6ce11cfb7b94e54d56fd
Author: Andrew Or <[email protected]>
Date: 2014-09-09T19:26:35Z
Reuse more code in YarnClientSchedulerBackend
We implement a while loop to monitor an application's state in
four separate places (stable/Client, alpha/Client, and twice in
YarnClientSchedulerBackend). This commit reduces this to one.
commit 6de9072d8cc3efbbc1c29a7fc0be6dab51e11a04
Author: Andrew Or <[email protected]>
Date: 2014-09-09T19:48:49Z
Guard against potential NPE in debug logging mode
The getClientToken (or getClientToAMToken) method in Hadoop
apparently returns null sometimes. We need to prevent NPEs that
result from this. Yay documentation.
commit ef7069aee933a97ae60c112d4d2b581678f27e87
Author: Andrew Or <[email protected]>
Date: 2014-09-09T20:32:22Z
Clean up YarnClientSchedulerBackend more
commit 6c94d79cfe6c007c1ac4f21aeb7e7ed37dc8316c
Author: Andrew Or <[email protected]>
Date: 2014-09-09T23:00:59Z
Various cleanups in ClientBase and ClientArguments
commit 8766d377e24904c12c820175f8db17e8549e9835
Author: Andrew Or <[email protected]>
Date: 2014-09-10T02:05:51Z
Heavily add documentation to Client* classes + various clean-ups
commit e4779b6846c9c6228513d03db9f8dfcc595e53de
Author: Andrew Or <[email protected]>
Date: 2014-09-10T20:18:24Z
Clean up log messages + variable naming in ClientBase
commit 6573c1d4d175ed3c103a9979f87cf9853421b8e0
Author: Andrew Or <[email protected]>
Date: 2014-09-10T22:30:54Z
Clean up, simplify and document code for setting classpaths
commit 6d74888478140a794be719d1dbe60aed9883ef9b
Author: Andrew Or <[email protected]>
Date: 2014-09-10T22:36:22Z
Minor comment changes
commit c0587b4d734b5fe6a669eef6ba1a8b6803ced73f
Author: Andrew Or <[email protected]>
Date: 2014-09-10T22:38:05Z
Merge branch 'master' of github.com:apache/spark into yarn-cleanup
commit ed0b42d8f375fd2219f9626c280063a103aa15a0
Author: Andrew Or <[email protected]>
Date: 2014-09-10T23:04:18Z
Fix alpha compilation error
----