Yes, it is a supported option. I’m just wondering whether we want to create a script for it specifically. Maybe the same script could also allow submitting to the cluster or something.
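For reference, here's roughly how the org.apache.spark.deploy.Client path I mention below is invoked today; the exact options may differ between releases, the class and jar names are placeholders, and the jar URL needs to be somewhere the cluster can reach (e.g. an HDFS path):

  ./bin/spark-class org.apache.spark.deploy.Client launch \
    spark://<master-host>:7077 hdfs:///path/to/my-app.jar com.example.MyApp [app options]
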
Matei

On Feb 23, 2014, at 1:55 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Is the client=driver mode still a supported option (outside of the REPLs), at least for the medium term? My impression from reading the docs is that it's the most common, if not recommended, way to submit jobs. If that's the case, I still think it's important, or at least helpful, to have something for this mode that addresses the issues below.
>
> On Sat, Feb 22, 2014 at 10:48 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> Hey Sandy,
>>
>> In the long run, the ability to submit driver programs to run in the cluster (added in 0.9 as org.apache.spark.deploy.Client) might solve this. This is a feature currently available in standalone mode that runs the driver on a worker node, but it is also how YARN works by default, and it wouldn't be too hard to do in Mesos. With this, the user could compile a JAR that excludes Spark and still get Spark on the classpath.
>>
>> This was added in 0.9 as a slightly harder-to-invoke feature, mainly to be used for Spark Streaming (since the cluster can also automatically restart your driver), but we can create a script around it for submissions. I'd like to see a design for such a script that takes all the deploy modes into account, though, because it would be confusing to use it one way on YARN and another way on standalone, for instance. The YARN submit client already does roughly what you're looking for.
>>
>> Matei
>>
>> On Feb 22, 2014, at 2:08 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>
>>> Hey All,
>>>
>>> I've encountered some confusion about how to run a Spark app from a compiled jar and wanted to bring up the recommended way.
>>>
>>> It seems like the current standard options are:
>>> * Build an uber jar that contains the user jar and all of Spark.
>>> * Explicitly include the locations of the Spark jars on the client machine in the classpath.
>>>
>>> Both of these options have a couple of issues.
>>>
>>> For the uber jar, this means unnecessarily sending all of Spark (and its dependencies) to every executor, as well as including Spark twice in the executor classpaths. It also requires recompiling binaries against the latest version whenever the cluster is upgraded, lest executor classpaths include two different versions of Spark at the same time.
>>>
>>> Explicitly including the Spark jars in the classpath is a huge pain because their locations can vary significantly between installations and platforms, and it makes the invocation more verbose.
>>>
>>> What seems ideal to me is a script that takes a user jar, sets up the Spark classpath, and runs it. That way only the user jar gets shipped across the cluster, but the user doesn't need to figure out how to get the Spark jars onto the client classpath. This is similar to the "hadoop jar" command commonly used for running MapReduce jobs.
>>>
>>> The spark-class script seems to do almost exactly this, but I've been told it's meant only for internal Spark use (with the possible exception of yarn-standalone mode). It doesn't take a user jar as an argument, but one can be added by setting the SPARK_CLASSPATH variable. This script could be stabilized for user use.
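>>>
>>> To make that concrete, today it looks something like the following (the class and jar names are placeholders, and exact behavior can vary across versions and installs):
>>>
>>>   SPARK_CLASSPATH=/path/to/my-app.jar ./bin/spark-class com.example.MyApp <app args>
>>>
>>> compared with the MapReduce equivalent:
>>>
>>>   hadoop jar my-app.jar com.example.MyApp <app args>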
>>>
>>> Another option would be to have a "spark-app" script that does what spark-class does, but also masks the decision of whether to run the driver in the client process or on the cluster (both standalone and YARN have modes for both of these).
>>>
>>> Does this all make sense?
>>> -Sandy
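>>>
>>> P.S. Purely hypothetical, but the kind of invocation I'm picturing for "spark-app" would be something like this (the script name and flags are made up; nothing like this exists today):
>>>
>>>   # hypothetical wrapper script; the flags are illustrative only
>>>   spark-app --master spark://<master-host>:7077 --deploy-mode cluster \
>>>     --class com.example.MyApp my-app.jar <app args>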