Hey Sandy,

In the long run, the ability to submit driver programs to run inside the
cluster (added in 0.9 as org.apache.spark.deploy.Client) might solve this.
Right now this feature is only available in standalone mode, where it runs
the driver on a worker node, but it is also how YARN works by default, and it
wouldn’t be too hard to do in Mesos. With this, the user could compile a JAR
that excludes Spark and still get Spark on the classpath.
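
For reference, the invocation today looks roughly like this (the master URL,
jar location, and main class are placeholders, and the app jar would be built
with Spark marked as a provided dependency):

    # Rough shape of the standalone deploy client invocation in 0.9;
    # the master URL, jar path, and app class here are placeholders.
    ./bin/spark-class org.apache.spark.deploy.Client launch \
      spark://master-host:7077 \
      hdfs://namenode:8020/user/me/my-app.jar \
      com.example.MyApp arg1 arg2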

This was added in 0.9 as a slightly harder-to-invoke feature, mainly intended
for Spark Streaming (since the cluster can also automatically restart your
driver), but we could build a submission script around it. I’d like to see a
design for such a script that takes all the deploy modes into account, though,
because it would be confusing if it worked one way on YARN and another way on
standalone, for instance. The YARN submit client already does roughly what
you’re looking for.
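
For comparison, the YARN client today is invoked roughly like this (the
assembly path, app jar, and class names are placeholders):

    # Rough shape of the current YARN submit client invocation;
    # the assembly path, app jar, and class name are placeholders.
    SPARK_JAR=<path-to-spark-assembly.jar> \
      ./bin/spark-class org.apache.spark.deploy.yarn.Client \
        --jar my-app.jar \
        --class com.example.MyApp \
        --args yarn-standalone \
        --num-workers 3 \
        --worker-memory 2g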

Matei

On Feb 22, 2014, at 2:08 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hey All,
> 
> I've encountered some confusion about how to run a Spark app from a
> compiled jar and wanted to bring up the recommended way.
> 
> It seems like the current standard options are:
> * Build an uber jar that contains the user jar and all of Spark.
> * Explicitly include the locations of the Spark jars on the client
> machine in the classpath.
> 
> Both of these options have a couple of issues.
> 
> The uber jar means unnecessarily shipping all of Spark (and its
> dependencies) to every executor, as well as including Spark twice on the
> executor classpaths.  It also requires recompiling binaries against the
> latest version whenever the cluster is upgraded, lest executor classpaths
> end up with two different versions of Spark at the same time.
> 
> Explicitly including the Spark jars in the classpath is a huge pain because
> their locations vary significantly between installations and platforms, and
> it makes the invocation more verbose.
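> 
> To make that concrete, a hand-rolled launch ends up looking something like
> this (the jar locations here are made up and differ between installations):
> 
>     # Hypothetical hand-rolled launch; the Spark assembly and config paths
>     # are made up and vary from one installation to the next.
>     java -cp my-app.jar:/opt/spark/assembly/spark-assembly.jar:/etc/hadoop/conf \
>       com.example.MyApp spark://master-host:7077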
> 
> What seems ideal to me is a script that takes a user jar, sets up the Spark
> classpath, and runs it.  This means only the user jar gets shipped across
> the cluster, and the user doesn't need to figure out how to get the Spark
> jars onto the client classpath.  This is similar to the "hadoop jar"
> command commonly used for running MapReduce jobs.
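> 
> For reference, the MapReduce equivalent is just the following; "hadoop jar"
> takes care of the Hadoop classpath itself (the jar and class names are
> placeholders):
> 
>     # "hadoop jar" puts the Hadoop jars on the classpath for you;
>     # the job jar and class names here are placeholders.
>     hadoop jar my-job.jar com.example.MyMapReduceJob <input> <output>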
> 
> The spark-class script seems to do almost exactly this, but I've been told
> it's meant only for internal Spark use (with the possible exception of
> yarn-standalone mode). It doesn't take a user jar as an argument, but one
> can be added by setting the SPARK_CLASSPATH variable.  This script could be
> stabilized for end users.
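> 
> Concretely, the workaround today is something like this (the jar path and
> class name are placeholders):
> 
>     # Current workaround: add the user jar via SPARK_CLASSPATH and let
>     # spark-class supply the Spark jars (names here are placeholders).
>     SPARK_CLASSPATH=/path/to/my-app.jar \
>       ./bin/spark-class com.example.MyApp spark://master-host:7077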
> 
> Another option would be to have a "spark-app" script that does what
> spark-class does, but also masks the decision of whether to run the driver
> in the client process or on the cluster (standalone and YARN each support
> both of these modes).
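> 
> As a sketch, a "spark-app" invocation might look something like this (the
> script and its flags are hypothetical, just to illustrate the idea):
> 
>     # Hypothetical "spark-app" wrapper -- neither the script nor these flags
>     # exist yet; this only illustrates the proposed interface.
>     ./bin/spark-app --class com.example.MyApp --deploy-mode cluster \
>       my-app.jar arg1 arg2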
> 
> Does this all make sense?
> -Sandy
