Hey All,

I've encountered some confusion about how to run a Spark app from a
compiled jar and wanted to discuss what the recommended way should be.

It seems like the current standard options are:
* Build an uber jar that contains the user jar and all of Spark.
* Explicitly add the Spark jars installed on the client machine to the
classpath.

Both of these options have a couple of issues.

For the uber jar, this means unnecessarily shipping all of Spark (and its
dependencies) to every executor, as well as including Spark twice on the
executor classpaths.  It also requires rebuilding the jar against the new
version whenever the cluster is upgraded, lest executor classpaths end up
with two different versions of Spark at the same time.
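
To make the tradeoff concrete, here's roughly what it looks like (the jar
and class names below are made up):

  # The client-side invocation is simple because everything is bundled...
  java -cp my-app-assembly.jar com.example.MyApp

  # ...but my-app-assembly.jar contains spark-core and its transitive
  # dependencies, so all of that gets shipped to every executor, which
  # already has the cluster's copy of Spark on its classpath.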

Explicitly including the Spark jars in the classpath is a huge pain because
their locations can vary significantly between installations and platforms,
and it makes the invocation much more verbose.
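
Roughly what that looks like in practice (the paths here are just for
illustration and will differ between installations):

  java -cp "/opt/spark/conf:/opt/spark/lib/*:my-app.jar" \
    com.example.MyApp arg1 arg2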

What seems ideal to me is a script that takes a user jar, sets up the Spark
classpath, and runs the application.  This way only the user jar gets
shipped across the cluster, and the user doesn't need to figure out how to
get the Spark jars onto the client classpath.  This is similar to the
"hadoop jar" command commonly used for running MapReduce jobs.
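
For comparison, the MapReduce workflow (with a made-up jar and class name):

  # "hadoop jar" sets up the Hadoop classpath on the client, and only the
  # user jar needs to be submitted to the cluster.
  hadoop jar my-app.jar com.example.MyMapReduceJob input output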

The spark-class script seems to do almost exactly this, but I've been told
it's meant only for internal Spark use (with the possible exception of
yarn-standalone mode). It doesn't take a user jar as an argument, but one
can be added by setting the SPARK_CLASSPATH variable.  This script could be
stabilized as a supported entry point for users.
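
For example, something along these lines works today (the class name and
jar path are made up, I'm invoking spark-class from wherever it lives in
the installation, and I'm assuming the app reads its master URL from its
first argument):

  SPARK_CLASSPATH=/path/to/my-app.jar ./spark-class com.example.MyApp \
    spark://master:7077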

Another option would be to have a "spark-app" script that does what
spark-class does, but also masks the decision of whether to run the driver
in the client process or on the cluster (both standalone and YARN have
modes for both of these).
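
To sketch the kind of interface I have in mind (entirely hypothetical; none
of these flags exist today):

  # Run the driver in the client process
  spark-app --deploy-mode client --class com.example.MyApp my-app.jar arg1

  # Ship the driver to the cluster (standalone or YARN)
  spark-app --deploy-mode cluster --class com.example.MyApp my-app.jar arg1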

Does this all make sense?
-Sandy
