Rather than respond to these point by point: I read the votes as -1 and neutral on touching this part of Mahout, so I am currently disinclined to mess with it. I'll concentrate on documenting how to use Mahout with external apps.
On Nov 29, 2015, at 9:21 PM, Dmitriy Lyubimov <[email protected]> wrote:

On Sun, Nov 29, 2015 at 10:33 AM, Pat Ferrel <[email protected]> wrote:

> BTW I agree with a later reply from Dmitriy that real use of Mahout
> generally will employ spark-submit

I never said that. I said I use it for stress tests, to test out certain components of algorithms under pressure. For the "real thing" I can unfortunately only use coarse-grained, long-living driver-side session management, because fine-grained scheduling and (god forbid) submit take Spark's strong-scaling properties (in terms of iterations) from awful under coarse-grained scheduling to impossible, due to my algorithm specifics and product requirements. Spark submits are inherently evil when it comes to exploratory analysis.

> So the motivation is primarily related to launching app/driver-level
> things in Mahout. But these have broken several times now, partly due to
> Mahout not following the spark-submit conventions (ever changing though
> they may be).
>
> One other motivation is that the Spark bindings' mahoutSparkContext
> function calls the mahout script to get a classpath and then creates a
> Spark Context. It might be good to make this private to Mahout (used only
> in the test suites) so users don't see this as the only or preferred way
> to create a SparkDistributedContext, which seems better constructed from
> a Spark Context:
>
> implicit val sc = <Spark Context creation code>
> implicit val mc = SparkDistributedContext( sc )

Again, I don't see a problem. If we _carefully_ study the context factory method [1], we will notice that it has a parameter, addMahoutJars, which can be set to false, in which case the factory method does none of what you imply. It doesn't require calling out to the mahout script; it doesn't even require MAHOUT_HOME. On top of that, it allows you to add your own jars (the 'customJars' parameter) or even override whatever you like in SparkConf directly (the 'sparkConf' parameter). I don't know what more it could possibly expose to let you do whatever you want with the context.

If for some reason you don't have control over how the context creation parameters are applied, and you absolutely must wrap an existing Spark context, that is also possible by just doing 'new SparkDistributedContext(sc)', and in fact I am guilty of having done that in a few situations as well. But I think there's a good reason to discourage it, because we do want to assert certain identities in the context to make sure it is still workable. Not many, just the Kryo serialization for now, but what we assert is immaterial; what is material is that we do want to retain some control over context parameters, if nothing else than to validate them. So yes, the factory method is still preferable, even for spark-submit applications. Perhaps a tutorial on how to do that would be a good thing, but I don't see what could be essentially better than what exists now.

Maybe when/if Mahout has a much richer library of standard implementations, so that someone actually wants to use command lines to run them, it may be worth giving these spark-submit options (as you know, I am sceptical about command lines though). If you want to implement a submit option for the cross-recommender in addition to any CLI that exists, sure, go ahead; it is easy with the existing API.

Let me know if you need any further information beyond what's provided here.

[1] https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala line 58
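For the documentation mentioned above, the two routes Dmitriy describes would look roughly as follows. First, a minimal sketch of the factory-method route, assuming the mahoutSparkContext signature linked at [1] (master URL and app name, plus the customJars, sparkConf and addMahoutJars parameters named in the thread); the master URL, app name, jar path and memory setting are placeholders:

import org.apache.spark.SparkConf
import org.apache.mahout.sparkbindings._

// Build the Mahout distributed context through the factory method without
// calling out to the mahout script or requiring MAHOUT_HOME: disable Mahout
// jar discovery and pass application jars / Spark settings explicitly.
implicit val mc = mahoutSparkContext(
  "local[4]",                              // master URL (placeholder)
  "mahout-external-app",                   // app name (placeholder)
  customJars = Seq("/path/to/my-app.jar"), // hypothetical extra application jar
  sparkConf = new SparkConf().set("spark.executor.memory", "4g"),
  addMahoutJars = false                    // skip the mahout-script classpath lookup
)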

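Second, a sketch of wrapping an existing Spark context, assuming the SparkDistributedContext constructor takes the SparkContext directly as in the snippet quoted above; since this bypasses the factory method's checks, the Kryo serializer (the one identity the thread says Mahout currently asserts) is set explicitly:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.sparkbindings.SparkDistributedContext

// Wrap a SparkContext you already own (e.g. one set up for spark-submit).
// The factory method's validation is bypassed here, so set the Kryo
// serializer yourself; per the thread, that is the one setting Mahout
// currently asserts on the context.
val conf = new SparkConf()
  .setAppName("mahout-external-app")       // placeholder app name
  .setMaster("local[4]")                   // placeholder master URL
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

implicit val sc = new SparkContext(conf)
implicit val mc = new SparkDistributedContext(sc)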