I also use spark-submit to launch apps that use Mahout, so I'm not sure what assumptions you are talking about. The first step is to use spark-submit in our own launch script.
The current code calls the CLI mahout script to get classpath info; this should be passed in to spark-submit, so if we launch with spark-submit I think the call to the mahout script would be unnecessary. This also makes it more straightforward to use YARN cluster mode, where the client/driver is launched on some cluster machine that has no script to call. If the SparkMahoutContext is a hard requirement, that's fine. As I said, I don't understand all of the ramifications.

On Nov 27, 2015, at 8:25 PM, Dmitriy Lyubimov <[email protected]> wrote:

I do submits all the time and don't see any problem. It is part of my standard stress-test harness. The Mahout context is conceptual and cannot be removed, nor is it required to be removed in order to run submitted jobs. Submission and contexts are two completely separate concepts. One can submit a job that, for example, doesn't set up a Spark job at all and instead runs an MR job, or just manipulates some HDFS directories, or sets up multiple jobs, or any combination of the above. All submission means is sending an uber jar to an application server and launching a main class there, instead of doing the same locally. Not sure where all these assumptions are coming from.

On Nov 27, 2015 11:33 AM, "Pat Ferrel" <[email protected]> wrote:

> Currently we create a SparkMahoutContext and use "mahout -spark
> classpath" to create the SparkContext. The SparkConf is also accessed
> directly. If we move to using spark-submit for launching the Mahout Shell
> and other drivers, we would need to refactor some of this and change the
> mahout script. It seems desirable to have the driver code create the Spark
> context and rely on spark-submit for any config overrides and params. This
> implies the possible removal (not sure about this) of SparkMahoutContext.
> In general it would be nice if this were done outside of Mahout, or limited
> to the drivers and shell.
> Mahout has become a library that is designed to be backend independent.
> This code was designed before that became a goal, and it is beyond my
> understanding to fully grasp how much work would be involved and what
> would replace it.
>
> The code refactoring needed is not well understood, by me at least. But
> intuition says that with a growing number of backends it might be good to
> clean up the Spark dependencies for context management. This has also been
> a bit of a problem in creating apps that use Mahout, since typical
> spark-submit use cannot be relied on to make config changes; they must be
> made in environment variables only. This arguably non-standard
> manipulation of the context puts limitations and hidden assumptions into
> using Mahout as a library.
>
> Doing all of this implies a fairly large bit of work, I think. The benefit
> is that it will be clearer how to use Mahout as a library, and some
> unneeded code will be cleaned up. I'm not sure I have enough time to do
> all of this myself.
>
> This isn't so much a proposal as a call for discussion.
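[Editor's sketch] The launch-script idea discussed in this thread can be sketched as a small shell fragment: hand an uber jar directly to spark-submit, with config overrides passed via `--conf` rather than environment variables, so no local `mahout -spark classpath` call is needed on whichever machine ends up running the driver (e.g. in YARN cluster mode). The jar path and driver class below are hypothetical placeholders, not real Mahout artifacts; the `spark.*` keys are standard Spark properties.

```shell
#!/bin/sh
# Hypothetical launch-script sketch for the refactor discussed above.
# DRIVER_JAR and MAIN_CLASS are placeholders, not real Mahout artifacts.
DRIVER_JAR="target/my-mahout-app-assembly.jar"   # uber jar bundling Mahout
MAIN_CLASS="com.example.MyMahoutDriver"          # hypothetical driver class

# Config overrides go through spark-submit itself, not environment
# variables, so driver code can simply build `new SparkConf()` and pick
# them up -- even in yarn cluster mode where there is no mahout script.
CMD="spark-submit --master yarn --deploy-mode cluster"
CMD="$CMD --conf spark.executor.memory=4g"
CMD="$CMD --conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
CMD="$CMD --class $MAIN_CLASS $DRIVER_JAR"

# Echo the assembled command instead of exec'ing it, so the sketch can
# be inspected without a running cluster.
echo "$CMD"
```

The point of the sketch is only the division of labor: spark-submit ships the uber jar and carries all configuration, while the driver creates its own context from the conf it is handed.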
