Rather than respond to these point by point: I read the votes as -1 and neutral on touching this part of Mahout, so I am currently disinclined to mess with it. I'll concentrate on documenting how to use Mahout with external apps.
On Nov 29, 2015, at 9:21 PM, Dmitriy Lyubimov <[email protected]> wrote:

On Sun, Nov 29, 2015 at 10:33 AM, Pat Ferrel <[email protected]> wrote:

> BTW I agree with a later reply from Dmitriy that real use of Mahout
> generally will employ spark-submit

I never said that. I said I use it for stress tests, to test out certain components of algorithms under pressure. For the "real thing" I can unfortunately only use coarse-grained, long-living driver-side session management, because fine-grained scheduling and (god forbid) submit take Spark's strong-scaling properties (in terms of iterations) from awful under coarse-grained scheduling to impossible, due to my algorithm specifics and product requirements. Spark submits are inherently evil when it comes to exploratory analysis.

> So the motivation is primarily related to launching app/driver-level
> things in Mahout. But these have broken several times now, partly due to
> Mahout not following the spark-submit conventions (ever changing though
> they may be).
>
> One other motivation is that the Spark bindings' mahoutSparkContext
> function calls the mahout script to get a classpath and then creates a
> Spark Context. It might be good to make this private to Mahout (used only
> in the test suites) so users don't see this as the only or preferred way
> to create a SparkDistributedContext, which seems better constructed from
> a Spark Context:
>
> implicit val sc = <Spark Context creation code>
> implicit val mc = SparkDistributedContext( sc )

Again, I don't see a problem. If we _carefully_ study the context factory method [1], we will notice that it has a parameter, addMahoutJars, which can be set to false, in which case the factory method does none of what you imply. It doesn't require calling out to the mahout script; it doesn't even require MAHOUT_HOME. On top of that, it allows you to add your own jars (the 'customJars' parameter) or even override whatever you like in SparkConf directly (the 'sparkConf' parameter). I don't know what more it could possibly expose to let you do whatever you want with the context.

If for some reason you don't have control over how the context creation parameters are applied, and you absolutely must wrap an existing Spark context, that is also possible by just doing 'new SparkDistributedContext(sc)', and in fact I am guilty of having done that in a few situations as well. But I think there's a good reason to discourage it, because we do want to assert certain identities in the context to make sure it is still workable. Not many, just the Kryo serialization for now, but what we assert is immaterial; what is material is that we do want to retain some control over context parameters, if nothing else than to validate them. So yes, the factory method is still preferable, even for spark-submit applications. Perhaps a tutorial on how to do that would be a good thing, but I don't see what could be essentially better than what exists now.

Maybe when/if Mahout has a much richer library of standard implementations, so that someone actually wants to use command lines to run them, it may be worth giving these spark-submit options (as you know, I am sceptical about command lines though). If you want to implement a submit option for the cross-recommender in addition to any CLI that exists, sure, go ahead; it is easy with the existing API.

Let me know if you need any further information beyond what's provided here.

[1] https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala line 58
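For the documentation mentioned above, the two routes Dmitriy describes would look roughly as follows. First, a minimal sketch of the factory-method route, assuming the mahoutSparkContext signature linked at [1] (master URL and app name, plus the customJars, sparkConf and addMahoutJars parameters named in the thread); the master URL, app name, jar path and memory setting are placeholders:

import org.apache.spark.SparkConf
import org.apache.mahout.sparkbindings._

// Build the Mahout distributed context through the factory method without
// calling out to the mahout script or requiring MAHOUT_HOME: disable Mahout
// jar discovery and pass application jars / Spark settings explicitly.
implicit val mc = mahoutSparkContext(
  "local[4]",                              // master URL (placeholder)
  "mahout-external-app",                   // app name (placeholder)
  customJars = Seq("/path/to/my-app.jar"), // hypothetical extra application jar
  sparkConf = new SparkConf().set("spark.executor.memory", "4g"),
  addMahoutJars = false                    // skip the mahout-script classpath lookup
)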

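Second, a sketch of wrapping an existing Spark context, assuming the SparkDistributedContext constructor takes the SparkContext directly as in the snippet quoted above; since this bypasses the factory method's checks, the Kryo serializer (the one identity the thread says Mahout currently asserts) is set explicitly:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.sparkbindings.SparkDistributedContext

// Wrap a SparkContext you already own (e.g. one set up for spark-submit).
// The factory method's validation is bypassed here, so set the Kryo
// serializer yourself; per the thread, that is the one setting Mahout
// currently asserts on the context.
val conf = new SparkConf()
  .setAppName("mahout-external-app")       // placeholder app name
  .setMaster("local[4]")                   // placeholder master URL
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

implicit val sc = new SparkContext(conf)
implicit val mc = new SparkDistributedContext(sc)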