[https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258578#comment-14258578]
Marcelo Vanzin commented on SPARK-4924:
---------------------------------------
I've been playing with some code to achieve this and I have it in a working
state now. I won't post a PR yet because I still need to do more testing,
especially on a live cluster (instead of just local mode), do some cleanup, and
run all the unit tests. Also, I'll be out for around 2 weeks, so I wouldn't be
able to follow up on the PR. :-)
But the code is here: https://github.com/vanzin/spark/tree/SPARK-4924
I ran some simple timing tests to compare the performance (since that was
brought up in offline talks), and here are some numbers:
{noformat}
Build options:
-Phive -Phive-thriftserver -Pyarn -Pyarn-timeline -Dhadoop.version=2.5.0
-Phadoop-2.4
Current master:
$ python ~/tmp/timeit.py 100 bin/pyspark 2>/dev/null
avg: 0.00765615224838, min: 0.00733208656311, max: 0.00919508934021
$ python ~/tmp/timeit.py 100 bin/spark-class --master foo 2>/dev/null
avg: 0.408118853569, min: 0.387609004974, max: 0.474602937698
With launcher lib:
$ python ~/tmp/timeit.py 100 bin/pyspark 2>/dev/null
avg: 0.0700361847878, min: 0.0674149990082, max: 0.0962409973145
$ python ~/tmp/timeit.py 100 bin/spark-class --master foo 2>/dev/null
avg: 0.10673923254, min: 0.0927491188049, max: 0.144665002823
{noformat}
The scripts were modified to not run the final command (e.g. java in the
spark-class case, or python in the pyspark case), so the numbers measure only
the script overhead. The pyspark numbers by themselves are misleading: to
launch the shell you need to run pyspark and, as part of initializing the
gateway, spark-submit (i.e. spark-class) too, so the actual overhead of
starting pyspark is the sum of the two values (pyspark + spark-class): roughly
0.42 on current master vs. 0.18 with the launcher lib.
Long story short: even with the current code, which could still use some
cleanup and maybe optimizations, the launcher is much faster than the current
scripts.
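The timing harness (~/tmp/timeit.py) isn't shown above. As a rough sketch of what such a script might look like (its actual contents are an assumption here), assuming it takes an iteration count followed by the command to run, and reports avg/min/max wall-clock seconds in the format seen in the output above:

```python
import subprocess
import sys
import time

def time_command(iterations, cmd):
    """Run cmd the given number of times; return (avg, min, max) wall-clock seconds."""
    samples = []
    for _ in range(iterations):
        start = time.time()
        # Discard the command's own output; we only care about elapsed time.
        subprocess.call(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.time() - start)
    return sum(samples) / len(samples), min(samples), max(samples)

if __name__ == "__main__":
    n = int(sys.argv[1])
    avg, lo, hi = time_command(n, sys.argv[2:])
    print("avg: %s, min: %s, max: %s" % (avg, lo, hi))
```

Invoked like `python timeit.py 100 bin/pyspark`, this measures the full process start-to-exit time of each run, which matches the "script overhead" methodology described above.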
> Factor out code to launch Spark applications into a separate library
> --------------------------------------------------------------------
>
> Key: SPARK-4924
> URL: https://issues.apache.org/jira/browse/SPARK-4924
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Marcelo Vanzin
> Attachments: spark-launcher.txt
>
>
> One of the questions we run into rather commonly is "how to start a Spark
> application from my Java/Scala program?". There currently isn't a good answer
> to that:
> - Instantiating SparkContext has limitations (e.g., you can only have one
> active context at the moment, plus you lose the ability to submit apps in
> cluster mode)
> - Calling SparkSubmit directly is doable but you lose a lot of the logic
> handled by the shell scripts
> - Calling the shell script directly is doable, but sort of ugly from an API
> point of view.
> I think it would be nice to have a small library that handles that for users.
> On top of that, this library could be used by Spark itself to replace a lot
> of the code in the current shell scripts, which have a lot of duplication.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)