That makes sense to me.  I filed
SPARK-1126 <https://spark-project.atlassian.net/browse/SPARK-1126> to
create a spark-jar script that provides a layer over these.
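A rough sketch of what such a spark-jar invocation might look like; the script name and argument order are hypothetical at this point, since SPARK-1126 has only just been filed, and roughly mirror "hadoop jar":

    # Hypothetical: a wrapper that puts the installed Spark jars on the
    # classpath and runs the user's main class from the given jar.
    ./bin/spark-jar my-app.jar com.example.MyApp spark://master:7077 arg1 arg2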


On Sun, Feb 23, 2014 at 6:58 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Yes, it is a supported option. I'm just wondering whether we want to
> create a script for it specifically. Maybe the same script could also allow
> submitting to the cluster or something.
>
> Matei
>
> On Feb 23, 2014, at 1:55 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
> > Is the client-as-driver mode still a supported option (outside of the
> > REPLs), at least for the medium term?  My impression from reading the
> > docs is that it's the most common, if not the recommended, way to submit
> > jobs.  If that's the case, I still think it's important, or at least
> > helpful, to have something for this mode that addresses the issues below.
> >
> >
> > On Sat, Feb 22, 2014 at 10:48 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >
> >> Hey Sandy,
> >>
> >> In the long run, the ability to submit driver programs to run in the
> >> cluster (added in 0.9 as org.apache.spark.deploy.Client) might solve
> >> this. This is a feature currently available in the standalone mode that
> >> runs the driver on a worker node, but it is also how YARN works by
> >> default, and it wouldn't be too bad to do in Mesos. With this, the user
> >> could compile a JAR that excludes Spark and still get Spark on the
> >> classpath.
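A minimal sketch of how the 0.9 standalone-mode Client can be invoked; the master URL, jar location, and class name are placeholders, and the exact options may differ between releases:

    # Submit the driver to run inside the standalone cluster; the jar URL
    # must be reachable from the worker that ends up running the driver.
    ./bin/spark-class org.apache.spark.deploy.Client launch \
      spark://master:7077 \
      hdfs://namenode:8020/user/me/my-app.jar \
      com.example.MyApp \
      arg1 arg2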
> >>
> >> This was added in 0.9 as a slightly harder-to-invoke feature mainly to
> >> be used for Spark Streaming (since the cluster can also automatically
> >> restart your driver), but we can create a script around it for
> >> submissions. I'd like to see a design for such a script that takes into
> >> account all the deploy modes though, because it would be confusing to
> >> use it one way on YARN and another way on standalone, for instance.
> >> Already the YARN submit client kind of does what you're looking for.
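For comparison, a sketch of the 0.9-era YARN submit client mentioned above; the jar paths, class name, and option values are placeholders, and the full option list is in the running-on-YARN docs:

    # The YARN Client ships the Spark assembly and the app jar to the
    # cluster and runs the driver in the YARN ApplicationMaster.
    SPARK_JAR=/path/to/spark-assembly.jar \
      ./bin/spark-class org.apache.spark.deploy.yarn.Client \
        --jar my-app.jar \
        --class com.example.MyApp \
        --num-workers 4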
> >>
> >> Matei
> >>
> >> On Feb 22, 2014, at 2:08 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> >>
> >>> Hey All,
> >>>
> >>> I've encountered some confusion about how to run a Spark app from a
> >>> compiled jar and wanted to bring up the recommended way.
> >>>
> >>> It seems like the current standard options are:
> >>> * Build an uber jar that contains the user jar and all of Spark.
> >>> * Explicitly include the locations of the Spark jars on the client
> >>> machine in the classpath.
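Concretely, the two options look something like this; the jar names and paths are placeholders, and the Spark assembly location in particular varies by installation:

    # Option 1: uber jar. Spark and its dependencies are bundled into the
    # application assembly, so only the one jar is on the classpath.
    java -cp my-app-assembly.jar com.example.MyApp spark://master:7077

    # Option 2: explicit classpath. The user jar plus the locally installed
    # Spark assembly, whose path differs per installation and platform.
    java -cp my-app.jar:/usr/lib/spark/assembly/spark-assembly-0.9.0-hadoop2.2.0.jar \
      com.example.MyApp spark://master:7077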
> >>>
> >>> Both of these options have a couple issues.
> >>>
> >>> For the uber jar, this means unnecessarily sending all of Spark (and
> >>> its dependencies) to every executor, as well as including Spark twice
> >>> in the executor classpaths.  This also requires recompiling binaries
> >>> against the latest version whenever the cluster version is upgraded,
> >>> lest executor classpaths include two different versions of Spark at the
> >>> same time.
> >>>
> >>> Explicitly including the Spark jars in the classpath is a huge pain
> >>> because their locations can vary significantly between different
> >>> installations and platforms, and it makes the invocation more verbose.
> >>>
> >>> What seems ideal to me is a script that takes a user jar, sets up the
> >>> Spark classpath, and runs it.  This means only the user jar gets
> >>> shipped across the cluster, and the user doesn't need to figure out how
> >>> to get the Spark jars onto the client classpath.  This is similar to
> >>> the "hadoop jar" command commonly used for running MapReduce jobs.
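For reference, the "hadoop jar" invocation being compared to; the jar, class, and paths are placeholders:

    # hadoop jar puts the Hadoop client jars on the classpath and runs the
    # given main class from the user's jar; only the user jar is shipped.
    hadoop jar my-mr-job.jar com.example.MyMRJob /input /output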
> >>>
> >>> The spark-class script seems to do almost exactly this, but I've been
> >>> told it's meant only for internal Spark use (with the possible
> >>> exception of yarn-standalone mode).  It doesn't take a user jar as an
> >>> argument, but one can be added by setting the SPARK_CLASSPATH variable.
> >>> This script could be stabilized for user use.
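A sketch of the spark-class workaround described above; the app jar path, class, and master URL are placeholders, and SPARK_CLASSPATH handling may vary between releases:

    # spark-class sets up the Spark jars itself; the user jar is added via
    # SPARK_CLASSPATH since spark-class takes only a class name and args.
    SPARK_CLASSPATH=/path/to/my-app.jar \
      ./bin/spark-class com.example.MyApp spark://master:7077 arg1 arg2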
> >>>
> >>> Another option would be to have a "spark-app" script that does what
> >>> spark-class does, but also masks the decision of whether to run the
> >>> driver in the client process or on the cluster (both standalone and
> >>> YARN have modes for both of these).
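A hypothetical sketch of what such a spark-app invocation might look like; the script name and the flag for choosing where the driver runs are made up for illustration:

    # Hypothetical: the same command whether the driver runs in this process
    # or on the cluster; the script picks the mechanism per deploy mode.
    ./bin/spark-app --deploy-mode cluster --class com.example.MyApp \
      my-app.jar arg1 arg2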
> >>>
> >>> Does this all make sense?
> >>> -Sandy
> >>
> >>
>
>
