Hey all,

I've encountered some confusion about how to run a Spark app from a compiled jar, and wanted to bring up what the recommended way should be.
It seems like the current standard options are:

* Build an uber jar that contains the user jar and all of Spark.
* Explicitly include the locations of the Spark jars on the client machine in the classpath.

Both of these options have a couple of issues.

For the uber jar, this means unnecessarily sending all of Spark (and its dependencies) to every executor, as well as including Spark twice in the executor classpaths. It also requires recompiling binaries against the latest version whenever the cluster version is upgraded, lest the executor classpaths include two different versions of Spark at the same time.

Explicitly including the Spark jars in the classpath is a huge pain because their locations can vary significantly between installations and platforms, and it makes the invocation more verbose.

What seems ideal to me is a script that takes a user jar, sets up the Spark classpath, and runs it. That way only the user jar gets shipped across the cluster, but the user doesn't need to figure out how to get the Spark jars onto the client classpath. This is similar to the "hadoop jar" command commonly used for running MapReduce jobs.

The spark-class script seems to do almost exactly this, but I've been told it's meant only for internal Spark use (with the possible exception of yarn-standalone mode). It doesn't take a user jar as an argument, but one can be added by setting the SPARK_CLASSPATH variable (a rough example is at the bottom of this mail). This script could be stabilized for user use.

Another option would be to have a "spark-app" script that does what spark-class does, but also masks the decision of whether to run the driver in the client process or on the cluster (both standalone and YARN have modes for both of these). A sketch of what that invocation could look like is also at the bottom.

Does this all make sense?

-Sandy
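
For concreteness, here is roughly what the spark-class workaround looks like today. The jar path, main class, master URL, and args below are all made up, and the exact location of spark-class varies between versions and installs (it may be at the root of the distribution or under bin/):

    # Hypothetical example: my-app.jar, com.example.MyApp, and the master URL
    # are placeholders. spark-class sets up the Spark classpath itself;
    # SPARK_CLASSPATH is how the user jar gets sneaked onto it.
    SPARK_CLASSPATH=/path/to/my-app.jar \
      ./bin/spark-class com.example.MyApp spark://master-host:7077 arg1 arg2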
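
And a sketch of what a "spark-app" invocation could look like, mirroring "hadoop jar my-job.jar com.example.MyJob arg1 arg2". To be clear, no such script exists today; the name and the options here are purely illustrative:

    # Purely illustrative: spark-app, --master, and --deploy-mode are not real.
    # The idea is that the user supplies only their jar and main class, and the
    # script hides both the Spark classpath setup and the decision of whether
    # the driver runs in the client process or on the cluster.
    ./bin/spark-app my-app.jar com.example.MyApp \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      arg1 arg2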