That makes sense to me. I filed SPARK-1126 <https://spark-project.atlassian.net/browse/SPARK-1126> to create a spark-jar script that provides a layer over these.
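Roughly, the interface I have in mind is something like this (the script
name comes from the JIRA, but the flags and layout below are just a
sketch, nothing is settled yet):

    # run the driver in the client process, with the script setting up
    # the Spark classpath
    spark-jar --class com.example.MyApp my-app.jar arg1 arg2

    # or hand the driver off to run inside the cluster
    spark-jar --deploy-mode cluster --class com.example.MyApp my-app.jar
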
On Sun, Feb 23, 2014 at 6:58 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Yes, it is a supported option. I'm just wondering whether we want to
> create a script for it specifically. Maybe the same script could also
> allow submitting to the cluster or something.
>
> Matei
>
> On Feb 23, 2014, at 1:55 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> Is the client = driver mode still a supported option (outside of the
>> REPLs), at least for the medium term? My impression from reading the
>> docs is that it's the most common, if not recommended, way to submit
>> jobs. If that's the case, I still think it's important, or at least
>> helpful, to have something for this mode that addresses the issues
>> below.
>>
>> On Sat, Feb 22, 2014 at 10:48 PM, Matei Zaharia
>> <matei.zaha...@gmail.com> wrote:
>>
>>> Hey Sandy,
>>>
>>> In the long run, the ability to submit driver programs to run in the
>>> cluster (added in 0.9 as org.apache.spark.deploy.Client) might solve
>>> this. This feature is currently available in standalone mode, where
>>> it runs the driver on a worker node, but it is also how YARN works by
>>> default, and it wouldn't be too bad to do in Mesos. With this, the
>>> user could compile a JAR that excludes Spark and still get Spark on
>>> the classpath.
>>>
>>> This was added in 0.9 as a slightly harder-to-invoke feature, mainly
>>> to be used for Spark Streaming (since the cluster can also
>>> automatically restart your driver), but we can create a script around
>>> it for submissions. I'd like to see a design for such a script that
>>> takes all the deploy modes into account, though, because it would be
>>> confusing to use it one way on YARN and another way on standalone,
>>> for instance. The YARN submit client already kind of does what you're
>>> looking for.
>>>
>>> Matei
>>>
>>> On Feb 22, 2014, at 2:08 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>
>>>> Hey All,
>>>>
>>>> I've encountered some confusion about how to run a Spark app from a
>>>> compiled jar and wanted to bring up the recommended way.
>>>>
>>>> It seems like the current standard options are:
>>>> * Build an uber jar that contains the user jar and all of Spark.
>>>> * Explicitly include the locations of the Spark jars on the client
>>>>   machine in the classpath.
>>>>
>>>> Both of these options have a couple of issues.
>>>>
>>>> For the uber jar, this means unnecessarily sending all of Spark (and
>>>> its dependencies) to every executor, as well as including Spark
>>>> twice in the executor classpaths. It also requires recompiling
>>>> binaries against the latest version whenever the cluster version is
>>>> upgraded, lest executor classpaths include two different versions of
>>>> Spark at the same time.
>>>>
>>>> Explicitly including the Spark jars in the classpath is a huge pain
>>>> because their locations can vary significantly between installations
>>>> and platforms, and it makes the invocation more verbose.
>>>>
>>>> What seems ideal to me is a script that takes a user jar, sets up
>>>> the Spark classpath, and runs it. That way only the user jar gets
>>>> shipped across the cluster, but the user doesn't need to figure out
>>>> how to get the Spark jars onto the client classpath. This is similar
>>>> to the "hadoop jar" command commonly used for running MapReduce
>>>> jobs.
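>>>>
>>>> For instance, the MapReduce version of this (with made-up class and
>>>> path names) is just:
>>>>
>>>>     # "hadoop jar" assembles the Hadoop classpath itself, so the app
>>>>     # jar only needs to contain user code
>>>>     hadoop jar my-app.jar com.example.MyJob /input /output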
>>>>
>>>> The spark-class script seems to do almost exactly this, but I've
>>>> been told it's meant only for internal Spark use (with the possible
>>>> exception of yarn-standalone mode). It doesn't take a user jar as an
>>>> argument, but one can be added by setting the SPARK_CLASSPATH
>>>> variable. This script could be stabilized for user use.
>>>>
>>>> Another option would be to have a "spark-app" script that does what
>>>> spark-class does, but also masks the decision of whether to run the
>>>> driver in the client process or on the cluster (both standalone and
>>>> YARN have modes for both of these).
>>>>
>>>> Does this all make sense?
>>>> -Sandy
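
For anyone who needs something today: the spark-class workaround
mentioned above looks roughly like the following. (The paths and class
names are made up, and spark-class isn't a stable user-facing
interface, so treat this as a sketch rather than a recommendation.)

    # SPARK_CLASSPATH gets the app jar onto the classpath; spark-class
    # then runs the main class with the Spark jars already in place
    SPARK_CLASSPATH=/path/to/my-app.jar \
      ./bin/spark-class com.example.MyApp arg1 arg2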