Yes, it is a supported option. I’m just wondering whether we want to create a script for it specifically. Maybe the same script could also allow submitting to the cluster or something.
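For reference, here's roughly how the org.apache.spark.deploy.Client path I mention below is invoked today; the exact options may differ between releases, the class and jar names are placeholders, and the jar URL needs to be somewhere the cluster can reach (e.g. an HDFS path):

  ./bin/spark-class org.apache.spark.deploy.Client launch \
    spark://<master-host>:7077 hdfs:///path/to/my-app.jar com.example.MyApp [app options]
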
Matei

On Feb 23, 2014, at 1:55 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Is the client=driver mode still a supported option (outside of the REPLs), at least for the medium term? My impression from reading the docs is that it's the most common, if not recommended, way to submit jobs. If that's the case, I still think it's important, or at least helpful, to have something for this mode that addresses the issues below.
>
> On Sat, Feb 22, 2014 at 10:48 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> Hey Sandy,
>>
>> In the long run, the ability to submit driver programs to run in the cluster (added in 0.9 as org.apache.spark.deploy.Client) might solve this. This is a feature currently available in standalone mode that runs the driver on a worker node, but it is also how YARN works by default, and it wouldn't be too hard to do in Mesos. With this, the user could compile a JAR that excludes Spark and still get Spark on the classpath.
>>
>> This was added in 0.9 as a slightly harder-to-invoke feature, mainly to be used for Spark Streaming (since the cluster can also automatically restart your driver), but we can create a script around it for submissions. I'd like to see a design for such a script that takes all the deploy modes into account, though, because it would be confusing to use it one way on YARN and another way on standalone, for instance. The YARN submit client already does roughly what you're looking for.
>>
>> Matei
>>
>> On Feb 22, 2014, at 2:08 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>
>>> Hey All,
>>>
>>> I've encountered some confusion about how to run a Spark app from a compiled jar and wanted to bring up the recommended way.
>>>
>>> It seems like the current standard options are:
>>> * Build an uber jar that contains the user jar and all of Spark.
>>> * Explicitly include the locations of the Spark jars on the client machine in the classpath.
>>>
>>> Both of these options have a couple of issues.
>>>
>>> For the uber jar, this means unnecessarily sending all of Spark (and its dependencies) to every executor, as well as including Spark twice in the executor classpaths. It also requires recompiling binaries against the latest version whenever the cluster is upgraded, lest executor classpaths include two different versions of Spark at the same time.
>>>
>>> Explicitly including the Spark jars in the classpath is a huge pain because their locations can vary significantly between installations and platforms, and it makes the invocation more verbose.
>>>
>>> What seems ideal to me is a script that takes a user jar, sets up the Spark classpath, and runs it. That way only the user jar gets shipped across the cluster, but the user doesn't need to figure out how to get the Spark jars onto the client classpath. This is similar to the "hadoop jar" command commonly used for running MapReduce jobs.
>>>
>>> The spark-class script seems to do almost exactly this, but I've been told it's meant only for internal Spark use (with the possible exception of yarn-standalone mode). It doesn't take a user jar as an argument, but one can be added by setting the SPARK_CLASSPATH variable. This script could be stabilized for user use.
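>>>
>>> To make that concrete, today it looks something like the following (the class and jar names are placeholders, and exact behavior can vary across versions and installs):
>>>
>>>   SPARK_CLASSPATH=/path/to/my-app.jar ./bin/spark-class com.example.MyApp <app args>
>>>
>>> compared with the MapReduce equivalent:
>>>
>>>   hadoop jar my-app.jar com.example.MyApp <app args>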
>>>
>>> Another option would be to have a "spark-app" script that does what spark-class does, but also masks the decision of whether to run the driver in the client process or on the cluster (both standalone and YARN have modes for both of these).
>>>
>>> Does this all make sense?
>>> -Sandy
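>>>
>>> P.S. Purely hypothetical, but the kind of invocation I'm picturing for "spark-app" would be something like this (the script name and flags are made up; nothing like this exists today):
>>>
>>>   # hypothetical wrapper script; the flags are illustrative only
>>>   spark-app --master spark://<master-host>:7077 --deploy-mode cluster \
>>>     --class com.example.MyApp my-app.jar <app args>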