Is the client-as-driver mode (i.e., running the driver in the client process) still a supported option (outside of the REPLs), at least for the medium term? My impression from reading the docs is that it's the most common, if not the recommended, way to submit jobs. If that's the case, I still think it's important, or at least helpful, to have something for this mode that addresses the issues below.
On Sat, Feb 22, 2014 at 10:48 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Hey Sandy,
>
> In the long run, the ability to submit driver programs to run in the
> cluster (added in 0.9 as org.apache.spark.deploy.Client) might solve this.
> This is a feature currently available in the standalone mode that runs the
> driver on a worker node, but it is also how YARN works by default, and it
> wouldn't be too bad to do in Mesos. With this, the user could compile a JAR
> that excludes Spark and still get Spark on the classpath.
>
> This was added in 0.9 as a slightly harder to invoke feature mainly to be
> used for Spark Streaming (since the cluster can also automatically restart
> your driver), but we can create a script around it for submissions. I'd
> like to see a design for such a script that takes into account all the
> deploy modes though, because it would be confusing to use it one way on
> YARN and another way on standalone for instance. Already the YARN submit
> client kind of does what you're looking for.
>
> Matei
>
> On Feb 22, 2014, at 2:08 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
> > Hey All,
> >
> > I've encountered some confusion about how to run a Spark app from a
> > compiled jar and wanted to bring up the recommended way.
> >
> > It seems like the current standard options are:
> > * Build an uber jar that contains the user jar and all of Spark.
> > * Explicitly include the locations of the Spark jars on the client
> >   machine in the classpath.
> >
> > Both of these options have a couple issues.
> >
> > For the uber jar, this means unnecessarily sending all of Spark (and its
> > dependencies) to every executor, as well as including Spark twice in the
> > executor classpaths. This also requires recompiling binaries against the
> > latest version whenever the cluster version is upgraded, lest executor
> > classpaths include two different versions of Spark at the same time.
> >
> > Explicitly including the Spark jars in the classpath is a huge pain
> > because their locations can vary significantly between different
> > installations and platforms, and makes the invocation more verbose.
> >
> > What seems ideal to me is a script that takes a user jar, sets up the
> > Spark classpath, and runs it. This means only the user jar gets shipped
> > across the cluster, but the user doesn't need to figure out how to get
> > the Spark jars onto the client classpath. This is similar to the
> > "hadoop jar" command commonly used for running MapReduce jobs.
> >
> > The spark-class script seems to do almost exactly this, but I've been
> > told it's meant only for internal Spark use (with the possible exception
> > of yarn-standalone mode). It doesn't take a user jar as an argument, but
> > one can be added by setting the SPARK_CLASSPATH variable. This script
> > could be stabilized for user use.
> >
> > Another option would be to have a "spark-app" script that does what
> > spark-class does, but also masks the decision of whether to run the
> > driver in the client process or on the cluster (both standalone and YARN
> > have modes for both of these).
> >
> > Does this all make sense?
> > -Sandy
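
For concreteness, here's a rough sketch of the application-side setup the thread assumes: Spark is kept out of the application jar (a thin jar rather than an uber jar), and only that jar is registered with the SparkContext so it alone gets shipped to the executors. The app name, jar path, and job body below are placeholders, not anything from this thread:

  import org.apache.spark.{SparkConf, SparkContext}

  // Minimal sketch of a driver compiled into its own thin jar
  // (Spark itself left out of the assembly, e.g. marked "provided").
  // "MyApp" and the jar path are placeholders.
  object MyApp {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("MyApp")
        // The master URL is left to the launcher (e.g. supplied via the
        // spark.master system property) rather than hard-coded here.
        // Ship only the application jar to the executors; the Spark jars
        // are assumed to already be installed on the cluster nodes.
        .setJars(Seq("target/myapp.jar"))
      val sc = new SparkContext(conf)
      val count = sc.parallelize(1 to 1000).count()
      println("count = " + count)
      sc.stop()
    }
  }

Launching a class like this is exactly where the pain shows up today, since the driver JVM itself still needs the Spark jars on its classpath; that's the part a "hadoop jar"-style spark-app script could hide.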