Re: using spark-submit to launch CLI jobs

Pat Ferrel Sun, 29 Nov 2015 10:34:12 -0800

BTW I agree with a later reply from Dmitriy that real use of Mahout generally 
will employ spark-submit so the motivation is primarily related to launching 
app/driver level things in Mahout. But these have broken several times now 
partly due to Mahout not following the spark-submit conventions (ever changing 
though they may be).

One other motivation is that the Spark bindings' mahoutSparkContext function 
calls the mahout script to get a classpath and then creates a Spark Context. It 
might be good to make this private to Mahout (used only in the test suites) so 
users don’t see this as the only or preferred way to create a 
SparkMahoutContext, which seems better constructed from a Spark Context.

    implicit val sc = <Spark Context creation code>
    implicit val mc = SparkDistributedContext( sc )

Since the drivers are sometimes used as examples of employing Mahout with 
Spark, we could change them to use the above method. For the same reason, 
launching them with spark-submit is the right example to give.

If no one else is particularly interested in this bit of refactoring and there 
are no contrary opinions to the above, I’m inclined to do this as I have time.

On Nov 28, 2015, at 10:55 AM, Pat Ferrel <[email protected]> wrote:

I use spark-submit also to launch apps that use Mahout so not sure what 
assumptions you are talking about. The first thing is to use spark-submit in 
our own launch script.

The current code calls the CLI mahout script to get classpath info. That info 
should be passed in to spark-submit instead, so if we launch with spark-submit 
I think the call to the mahout script would be unnecessary. This makes it more 
straightforward to use with Yarn cluster mode, where the client/driver is 
launched on some cluster machine where there would be no script to call.
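
As a rough sketch of what "passing the classpath in to spark-submit" could
look like, the helper below assembles a spark-submit command line from an
explicit list of jars instead of asking the mahout script for a classpath. The
object name SubmitArgs, its parameters, and every jar and class name used with
it are hypothetical and not part of Mahout. Only the spark-submit flags
(--class, --master, --deploy-mode, --jars, --conf) are real.

```scala
// Sketch only: builds the argument list for a spark-submit launch.
// SubmitArgs and its parameter names are hypothetical, not Mahout API.
object SubmitArgs {
  def build(
      mainClass: String,                       // driver main class (assumed name)
      appJar: String,                          // application / uber jar
      extraJars: Seq[String],                  // jars the mahout script would have listed
      conf: Map[String, String] = Map.empty,   // Spark config overrides
      appArgs: Seq[String] = Seq.empty,        // arguments for the driver itself
      master: String = "yarn",
      deployMode: String = "cluster"): Seq[String] = {
    // --jars expects one comma-separated list, and is omitted when empty
    val jars =
      if (extraJars.isEmpty) Seq.empty[String]
      else Seq("--jars", extraJars.mkString(","))
    // each override becomes its own "--conf key=value" pair; sorted for stable output
    val confs = conf.toSeq.sortBy(_._1).flatMap { case (k, v) =>
      Seq("--conf", s"$k=$v")
    }
    Seq("spark-submit", "--class", mainClass,
      "--master", master, "--deploy-mode", deployMode) ++
      jars ++ confs ++ Seq(appJar) ++ appArgs
  }
}
```

Joining the result with spaces (for example, SubmitArgs.build(...).mkString(" "))
gives a command a launch script could run. The driver then has no need to call
back into the mahout script, which is what matters on a cluster machine where
that script is not installed.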

If the SparkMahoutContext is a hard requirement that’s fine. As I said, I don’t 
understand all of those ramifications. 

On Nov 27, 2015, at 8:25 PM, Dmitriy Lyubimov <[email protected]> wrote:

I do submits all the time, don't see any problem. It is part of my standard
stress test harness.

Mahout context is conceptual and cannot be removed, nor is it required to
be removed in order to run submitted jobs. Submission and contexts are two
completely separate concepts. One can submit a job that, for example, doesn't
set up a Spark job at all and runs an MR job, or just manipulates some HDFS
directories, or sets up multiple jobs or combinations of all of the above.
All submission means is sending an uber jar to an application server and
launching a main class there, instead of doing the same locally. Not sure
where all these assumptions are coming from.

On Nov 27, 2015 11:33 AM, "Pat Ferrel" <[email protected]> wrote:

> Currently we create a SparkMahoutContext, and use “mahout -spark
> classpath” to create the SparkContext. The SparkConf is also directly
> accessed. If we move to using spark-submit for launching the Mahout Shell
> and other drivers we would need to refactor some of this and change the
> mahout script. It seems desirable to have any driver code create the Spark
> context and rely on spark-submit for any config overrides and params. This
> implies the possible removal (not sure about this) of SparkMahoutContext.
> In general it would be nice if this were done outside of Mahout, or limited
> to the drivers and shell. Mahout has become a library that is designed to
> be backend independent. This code was designed before that became a goal,
> and it is beyond my understanding to fully grasp how much work would be
> involved and what would replace it.
> 
> The code refactoring needed is not well understood, by me at least. But
> intuition says that with a growing number of backends it might be good to
> clean up the Spark dependencies for context management. This has also been
> a bit of a problem in creating apps that use Mahout, since typical
> spark-submit use cannot be relied on to make config changes; they must be
> made in environment variables only. This arguably non-standard
> manipulation of the context puts limitations and hidden assumptions into
> using Mahout as a library.
> 
> Doing all of this implies a fairly large bit of work, I think. The benefits
> are that it will be clearer how to use Mahout as a library and that some
> unneeded code gets cleaned up. I’m not sure I have enough time to do all
> of this myself.
> 
> This isn’t so much a proposal as a call for discussion.
