GitHub user vanzin opened a pull request:
https://github.com/apache/spark/pull/3916
[SPARK-4924] Add a library for launching Spark jobs programmatically.
This change encapsulates all the logic involved in launching a Spark job
into a small Java library that can be easily embedded into other
applications.
The overall goal of this change is twofold, as described in the bug:
- Provide a public API for launching Spark processes. This is a common request
from users, and currently there's no good answer for it.
- Remove a lot of the duplicated code and other coupling that exists in the
different parts of Spark that deal with launching processes.
Much of the duplication was due to the different code needed to build an
application's classpath (and the bootstrapper needed to run the driver in
certain situations), and to the different code needed to parse spark-submit
command line options in different contexts. The change centralizes that logic
as much as possible so that all code paths can rely on the library to handle
it appropriately.
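As an illustration, an application could embed the new library roughly as in the
sketch below. This is a minimal, non-authoritative example: the paths, main class
and application arguments are placeholders, and the builder methods shown reflect
my reading of the `SparkLauncher` API rather than the exact contents of the patch.

    import org.apache.spark.launcher.SparkLauncher;

    public class LaunchExample {
      public static void main(String[] args) throws Exception {
        // Build a spark-submit invocation programmatically instead of
        // shelling out to the bin/ scripts.
        Process spark = new SparkLauncher()
            .setSparkHome("/path/to/spark")           // placeholder: local Spark install
            .setAppResource("/path/to/my-app.jar")    // placeholder: application jar
            .setMainClass("com.example.MyApp")        // placeholder: application main class
            .setMaster("yarn-cluster")
            .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
            .addAppArgs("arg1", "arg2")
            .launch();                                 // returns a plain java.lang.Process

        int exitCode = spark.waitFor();
        System.out.println("Spark application finished with exit code " + exitCode);
      }
    }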
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vanzin/spark SPARK-4924
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3916.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3916
----
commit 6f70eeaf7fdef89d54b45ec38eb0f7220596e537
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-06T21:25:34Z
[SPARK-4924] Add a library for launching Spark jobs programmatically.
This change encapsulates all the logic involved in launching a Spark job
into a small Java library that can be easily embedded into other
applications.
Only the `SparkLauncher` class is meant to be public in the new launcher
lib, but to be able to use these classes elsewhere in Spark some of the
visibility modifiers were relaxed. This also lets us automate some checks
in unit tests, where before there was just a comment that was easily missed.
A subsequent commit will change Spark core and all the shell scripts to use
this library instead of custom code that has to be replicated for different
OSes and, sometimes, also in Spark code.
commit 27be98a4fe1cbaa227f3694caeff476779089b4f
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-06T21:27:11Z
Modify Spark to use launcher lib.
Change the existing scripts under bin/ to use the launcher library,
to avoid code duplication and reduce the amount of coupling between scripts
and Spark code.
Also change some Spark core code to use the library instead of relying on
scripts (either by calling them or with comments saying they should be
kept in sync).
While the library is now included in the assembly (by means of the spark-core
dependency), it's still packaged directly into the final lib/ directory,
because loading a small jar is much faster than loading the huge assembly jar,
which noticeably improves the startup time of Spark jobs.
commit 25c5ae6e0cfe084f96664405331e43a177123b9d
Author: Marcelo Vanzin <[email protected]>
Date: 2014-12-24T19:07:27Z
Centralize SparkSubmit command line parsing.
Use a common base class to parse SparkSubmit command line arguments. This
forces
anyone who wants to add new arguments to modify the shared parser, updating
all
code that needs to know about SparkSubmit options in the process.
Also create some constants to avoid copy & pasting strings around to
actually
process the options.
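The pattern looks roughly like the sketch below. This is illustrative only: the
class name, constants and callbacks are hypothetical stand-ins for whatever the
shared parser in the patch actually defines.

    // Illustrative sketch of a shared spark-submit option parser; the class and
    // method names are hypothetical, not necessarily those used in the patch.
    abstract class SubmitOptionParser {

      // Shared constants so callers don't copy option strings around.
      static final String MASTER = "--master";
      static final String DEPLOY_MODE = "--deploy-mode";
      static final String CLASS = "--class";

      /** Walks the argument list, pairing each known option with its value. */
      void parse(java.util.List<String> args) {
        java.util.Iterator<String> it = args.iterator();
        while (it.hasNext()) {
          String arg = it.next();
          if (arg.equals(MASTER) || arg.equals(DEPLOY_MODE) || arg.equals(CLASS)) {
            String value = it.hasNext() ? it.next() : null;
            if (!handle(arg, value)) {
              break;  // subclass asked to stop (e.g. remaining args belong to the app)
            }
          } else if (!handleUnknown(arg)) {
            break;
          }
        }
      }

      /** Called for each recognized option; subclasses decide what to do with it. */
      protected abstract boolean handle(String opt, String value);

      /** Called for arguments the shared parser doesn't recognize. */
      protected abstract boolean handleUnknown(String opt);
    }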
commit 1b3f6e938ff90e1f0c3fc9e8402f47144834d456
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-05T18:52:15Z
Call SparkSubmit from spark-class launcher for unknown classes.
For new-style launchers, do the launching using SparkSubmit; hopefully
this will be the preferred method of launching new daemons (if any).
Currently it handles the thrift server daemon.
commit 7a01e4adaeec0e2d8a27c3231bec90a2a09b1cbb
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-06T21:16:10Z
Fix pyspark on Yarn.
pyspark (at least) relies on SPARK_HOME (the env variable) to be set
for things to work properly. The launcher wasn't making sure that
variable was set in all cases, so do that. Also, separately, the
Yarn backend didn't seem to propagate that variable to the AM for
some reason, so do that too. (Not sure how things worked previously...)
Extra: add ".pyo" files to .gitignore (these are generated by `python -O`).
commit 4d511e76ca15702bb0903fec6f677a11b25b809f
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-06T21:53:22Z
Fix tools search code.
commit 656374e4c1a2599c1f10c35d2f7e7e08521ee390
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-06T21:57:23Z
Mima fixes.
----