I also use spark-submit to launch apps that use Mahout, so I'm not sure what assumptions you are talking about. The first step is to use spark-submit in our own launch script.
The current code calls the CLI mahout script to get classpath info; this should be passed in to spark-submit, so if we launch with spark-submit I think the call to the mahout script would be unnecessary. This also makes it more straightforward to use YARN cluster mode, where the client/driver is launched on some cluster machine that has no script to call. If the SparkMahoutContext is a hard requirement, that's fine. As I said, I don't understand all of the ramifications.

On Nov 27, 2015, at 8:25 PM, Dmitriy Lyubimov <[email protected]> wrote:

I do submits all the time and don't see any problem. It is part of my standard stress-test harness. The Mahout context is conceptual and cannot be removed, nor is it required to be removed in order to run submitted jobs. Submission and contexts are two completely separate concepts. One can submit a job that, for example, doesn't set up a Spark job at all and instead runs an MR job, or just manipulates some HDFS directories, or sets up multiple jobs, or any combination of the above. All submission means is sending an uber jar to an application server and launching a main class there, instead of doing the same locally. Not sure where all these assumptions are coming from.

On Nov 27, 2015 11:33 AM, "Pat Ferrel" <[email protected]> wrote:

> Currently we create a SparkMahoutContext and use "mahout -spark
> classpath" to create the SparkContext. The SparkConf is also accessed
> directly. If we move to using spark-submit for launching the Mahout Shell
> and other drivers, we would need to refactor some of this and change the
> mahout script. It seems desirable to have the driver code create the Spark
> context and rely on spark-submit for any config overrides and params. This
> implies the possible removal (not sure about this) of SparkMahoutContext.
> In general it would be nice if this were done outside of Mahout, or limited
> to the drivers and shell.
> Mahout has become a library that is designed to be backend independent.
> This code was designed before that became a goal, and it is beyond my
> understanding to fully grasp how much work would be involved and what
> would replace it.
>
> The code refactoring needed is not well understood, by me at least. But
> intuition says that with a growing number of backends it might be good to
> clean up the Spark dependencies for context management. This has also been
> a bit of a problem in creating apps that use Mahout, since typical
> spark-submit use cannot be relied on to make config changes; they must be
> made in environment variables only. This arguably non-standard
> manipulation of the context puts limitations and hidden assumptions into
> using Mahout as a library.
>
> Doing all of this implies a fairly large bit of work, I think. The benefit
> is that it will be clearer how to use Mahout as a library, and some
> unneeded code will be cleaned up. I'm not sure I have enough time to do
> all of this myself.
>
> This isn't so much a proposal as a call for discussion.
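[Editor's sketch] The launch-script idea discussed in this thread can be sketched as a small shell fragment: hand an uber jar directly to spark-submit, with config overrides passed via `--conf` rather than environment variables, so no local `mahout -spark classpath` call is needed on whichever machine ends up running the driver (e.g. in YARN cluster mode). The jar path and driver class below are hypothetical placeholders, not real Mahout artifacts; the `spark.*` keys are standard Spark properties.

```shell
#!/bin/sh
# Hypothetical launch-script sketch for the refactor discussed above.
# DRIVER_JAR and MAIN_CLASS are placeholders, not real Mahout artifacts.
DRIVER_JAR="target/my-mahout-app-assembly.jar"   # uber jar bundling Mahout
MAIN_CLASS="com.example.MyMahoutDriver"          # hypothetical driver class

# Config overrides go through spark-submit itself, not environment
# variables, so driver code can simply build `new SparkConf()` and pick
# them up -- even in yarn cluster mode where there is no mahout script.
CMD="spark-submit --master yarn --deploy-mode cluster"
CMD="$CMD --conf spark.executor.memory=4g"
CMD="$CMD --conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
CMD="$CMD --class $MAIN_CLASS $DRIVER_JAR"

# Echo the assembled command instead of exec'ing it, so the sketch can
# be inspected without a running cluster.
echo "$CMD"
```

The point of the sketch is only the division of labor: spark-submit ships the uber jar and carries all configuration, while the driver creates its own context from the conf it is handed.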
