GitHub user vanzin opened a pull request:
https://github.com/apache/spark/pull/3916
[SPARK-4924] Add a library for launching Spark jobs programmatically.
This change encapsulates all the logic involved in launching a Spark job
into a small Java library that can be easily embedded into other
applications.
The overall goal of this change is twofold, as described in the bug:
- Provide a public API for launching Spark processes. This is a common request
from users, and currently there's no good answer for it.
- Remove a lot of the duplicated code and other coupling that exists in the
different parts of Spark that deal with launching processes.
Much of the duplication was due to the different code needed to build an
application's classpath (and the bootstrapper needed to run the driver in
certain situations), and to the different code needed to parse spark-submit
command line options in different contexts. The change centralizes that logic
as much as possible so that all code paths can rely on the library to handle
it appropriately.
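As an illustration, an application could embed the new library roughly as in the
sketch below. This is a minimal, non-authoritative example: the paths, main class
and application arguments are placeholders, and the builder methods shown reflect
my reading of the `SparkLauncher` API rather than the exact contents of the patch.

    import org.apache.spark.launcher.SparkLauncher;

    public class LaunchExample {
      public static void main(String[] args) throws Exception {
        // Build a spark-submit invocation programmatically instead of
        // shelling out to the bin/ scripts.
        Process spark = new SparkLauncher()
            .setSparkHome("/path/to/spark")           // placeholder: local Spark install
            .setAppResource("/path/to/my-app.jar")    // placeholder: application jar
            .setMainClass("com.example.MyApp")        // placeholder: application main class
            .setMaster("yarn-cluster")
            .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
            .addAppArgs("arg1", "arg2")
            .launch();                                 // returns a plain java.lang.Process

        int exitCode = spark.waitFor();
        System.out.println("Spark application finished with exit code " + exitCode);
      }
    }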
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vanzin/spark SPARK-4924
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3916.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3916
----
commit 6f70eeaf7fdef89d54b45ec38eb0f7220596e537
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-06T21:25:34Z
[SPARK-4924] Add a library for launching Spark jobs programmatically.
This change encapsulates all the logic involved in launching a Spark job
into a small Java library that can be easily embedded into other
applications.
Only the `SparkLauncher` class is meant to be public in the new launcher
lib, but to be able to use these classes elsewhere in Spark some of the
visibility modifiers were relaxed. This also lets us automate some checks
in unit tests, where before there was just a comment that was easily missed.
A subsequent commit will change Spark core and all the shell scripts to use
this library instead of custom code that has to be replicated for different
OSes and, sometimes, also in Spark code.
commit 27be98a4fe1cbaa227f3694caeff476779089b4f
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-06T21:27:11Z
Modify Spark to use launcher lib.
Change the existing scripts under bin/ to use the launcher library,
to avoid code duplication and reduce the amount of coupling between scripts
and Spark code.
Also change some Spark core code to use the library instead of relying on
scripts (either by calling them or with comments saying they should be
kept in sync).
While the library is now included in the assembly (by means of the spark-core
dependency), it's still packaged directly into the final lib/ directory,
because loading a small jar is much faster than loading the huge assembly jar,
which noticeably improves the startup time of Spark jobs.
commit 25c5ae6e0cfe084f96664405331e43a177123b9d
Author: Marcelo Vanzin <[email protected]>
Date: 2014-12-24T19:07:27Z
Centralize SparkSubmit command line parsing.
Use a common base class to parse SparkSubmit command line arguments. This
forces
anyone who wants to add new arguments to modify the shared parser, updating
all
code that needs to know about SparkSubmit options in the process.
Also create some constants to avoid copy & pasting strings around to
actually
process the options.
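The pattern looks roughly like the sketch below. This is illustrative only: the
class name, constants and callbacks are hypothetical stand-ins for whatever the
shared parser in the patch actually defines.

    // Illustrative sketch of a shared spark-submit option parser; the class and
    // method names are hypothetical, not necessarily those used in the patch.
    abstract class SubmitOptionParser {

      // Shared constants so callers don't copy option strings around.
      static final String MASTER = "--master";
      static final String DEPLOY_MODE = "--deploy-mode";
      static final String CLASS = "--class";

      /** Walks the argument list, pairing each known option with its value. */
      void parse(java.util.List<String> args) {
        java.util.Iterator<String> it = args.iterator();
        while (it.hasNext()) {
          String arg = it.next();
          if (arg.equals(MASTER) || arg.equals(DEPLOY_MODE) || arg.equals(CLASS)) {
            String value = it.hasNext() ? it.next() : null;
            if (!handle(arg, value)) {
              break;  // subclass asked to stop (e.g. remaining args belong to the app)
            }
          } else if (!handleUnknown(arg)) {
            break;
          }
        }
      }

      /** Called for each recognized option; subclasses decide what to do with it. */
      protected abstract boolean handle(String opt, String value);

      /** Called for arguments the shared parser doesn't recognize. */
      protected abstract boolean handleUnknown(String opt);
    }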
commit 1b3f6e938ff90e1f0c3fc9e8402f47144834d456
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-05T18:52:15Z
Call SparkSubmit from spark-class launcher for unknown classes.
For new-style launchers, do the launching using SparkSubmit; hopefully
this will be the preferred method of launching new daemons (if any).
Currently it handles the thrift server daemon.
commit 7a01e4adaeec0e2d8a27c3231bec90a2a09b1cbb
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-06T21:16:10Z
Fix pyspark on Yarn.
pyspark (at least) relies on SPARK_HOME (the env variable) to be set
for things to work properly. The launcher wasn't making sure that
variable was set in all cases, so do that. Also, separately, the
Yarn backend didn't seem to propagate that variable to the AM for
some reason, so do that too. (Not sure how things worked previously...)
Extra: add ".pyo" files to .gitignore (these are generated by `python -O`).
commit 4d511e76ca15702bb0903fec6f677a11b25b809f
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-06T21:53:22Z
Fix tools search code.
commit 656374e4c1a2599c1f10c35d2f7e7e08521ee390
Author: Marcelo Vanzin <[email protected]>
Date: 2015-01-06T21:57:23Z
Mima fixes.
----