Sounds good. I hope to be working on one for naive Bayes; it's been a bit hectic lately, so hopefully sooner rather than later. I'll have a better understanding of the CLI code then.
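For what it's worth, below is a rough sketch of the kind of generic Spark-option handling I was picturing for the parseSparkOptions question in the thread below. It is only an illustration: the `-D:key=value` flag syntax, the object name, and the helper are placeholders of mine, not existing MahoutOptionParser code.

```scala
// Rough sketch only -- not actual MahoutOptionParser code. The -D:key=value
// flag syntax and all names here are placeholder assumptions.
object SparkOptionParsingSketch {

  // Matches a generic Spark setting passed as -D:spark.some.key=value
  private val DOption = "-D:(.+?)=(.+)".r

  /** Split raw CLI args into (generic Spark settings, remaining driver args). */
  def extractSparkSettings(args: Seq[String]): (Map[String, String], Seq[String]) = {
    val (dArgs, rest) = args.partition(_.startsWith("-D:"))
    val settings = dArgs.collect { case DOption(k, v) => k -> v }.toMap
    (settings, rest)
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "-D:spark.kryoserializer.buffer.mb=64",
      "-D:spark.akka.frameSize=30",
      "--input", "/tmp/data")

    val (sparkSettings, driverArgs) = extractSparkSettings(sample)
    println(s"spark settings: $sparkSettings") // pairs to copy onto a SparkConf later
    println(s"driver args:    $driverArgs")    // left for the driver's own parser
  }
}
```

The idea is just to peel off anything that looks like a generic Spark setting before the driver-specific parsing runs, then copy the resulting map onto the SparkConf, so new Spark options would not require new parsing code in every driver.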
> Subject: Re: Spark options
> From: [email protected]
> Date: Wed, 12 Nov 2014 10:49:43 -0800
> To: [email protected]
>
> Andrew, when you get to creating a driver maybe we should take another look
> at how to launch them. I'll add the -Dxxx=yyy option for now.
>
>
> On Nov 12, 2014, at 9:46 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> I do not object to the driver CLI using that. I was only skeptical about shell
> startup. And I also want these to be part of the official documented Spark API.
> (Are these classes it?) If they are not a stable API, we'd have trouble doing a
> major dependency update. If we only depend on the RDD API, the updates are easier.
>
> But... if anyone wants to engineer and verify a patch that uses these to
> launch the Mahout shell, and it works, I don't have a really strong basis for
> objection aside from the API stability concern.
>
> On Wed, Nov 12, 2014 at 8:33 AM, Pat Ferrel <[email protected]> wrote:
>
> > Yes, the drivers support executor memory directly too.
> >
> > What was the reason you didn't want to use the Spark submit process for
> > executing drivers? I understand we have to find our jars and set up Kryo.
> >
> > On Nov 11, 2014, at 6:00 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > Which is why I explicitly configure executor memory on the client. Although
> > even that interpretation depends on the resource manager A LOT, it seems.
> >
> > On Tue, Nov 11, 2014 at 5:49 PM, Pat Ferrel <[email protected]> wrote:
> >
> >> The submit code is the only place that documents which options are needed
> >> by clients AFAICT. It is pretty complicated and heavily laden with checks
> >> for which cluster manager is being used. I'd feel a lot better if we were
> >> using it. There is no way any of us are going to be able to test on all
> >> those configurations.
> >>
> >> spark-env.sh is mostly for launching the cluster, not the client, but there
> >> seem to be exceptions like executor memory.
> >>
> >>
> >> On Nov 11, 2014, at 2:18 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>
> >> These files, if I read them correctly, are for spawning yet another
> >> process. I don't see how that may work for the shell.
> >>
> >> I am also not convinced that spark-env is important for the client.
> >>
> >>
> >> On Tue, Nov 11, 2014 at 2:09 PM, Pat Ferrel <[email protected]> wrote:
> >>
> >>> I was thinking -Dx=y too, seems like a good idea.
> >>>
> >>> But we should also support setting them the way Spark documents in
> >>> spark-env.sh, and the two links Andrew found may solve that in a
> >>> maintainable way. Maybe we get the SparkConf from a new mahoutSparkConf
> >>> function, which handles all env-supplied setup. For the drivers it can be
> >>> done in the base class, allowing CLI overrides later. Then the SparkConf
> >>> is finally passed in to mahoutSparkContext, where as little as possible
> >>> is changed in the conf.
> >>>
> >>> I'll look at this for the drivers. Should be easy to add to the shell.
> >>>
> >>> On Nov 11, 2014, at 12:36 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>
> >>> IMO you just need to modify `mahout spark-shell` to propagate -Dx=y
> >>> parameters to the java startup call and all should be fine.
> >>>
> >>> On Tue, Nov 11, 2014 at 12:23 PM, Andrew Palumbo <[email protected]> wrote:
> >>>
> >>>> I've run into this problem starting the `mahout spark-shell` script, i.e.
> >>>> needing to set spark.kryoserializer.buffer.mb and spark.akka.frameSize.
> >>>> I've been temporarily hard coding them for now while developing.
> >>>>
> >>>> I'm just getting familiar with what you've done with the CLI drivers. For
> >>>> #2, could we borrow option parsing code/methods from Spark [1] [2] at
> >>>> each Spark release and somehow add this to
> >>>> MahoutOptionParser.parseSparkOptions?
> >>>>
> >>>> I'll hopefully be doing some CLI work soon and have a better understanding.
> >>>>
> >>>> [1]
> >>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala
> >>>> [2]
> >>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
> >>>>
> >>>>> From: [email protected]
> >>>>> Subject: Spark options
> >>>>> Date: Wed, 5 Nov 2014 09:48:59 -0800
> >>>>> To: [email protected]
> >>>>>
> >>>>> Spark has a launch script as Hadoop does. We use the Hadoop launcher
> >>>>> script but not the Spark one. When starting up your Spark cluster there
> >>>>> is a spark-env.sh script that can set a bunch of environment variables.
> >>>>> In our own mahoutSparkContext function, which takes the place of the
> >>>>> Spark submit script and launcher, we don't account for most of the
> >>>>> environment variables.
> >>>>>
> >>>>> Unless I missed something, this means most of the documented options
> >>>>> will be ignored unless a user of Mahout parses and sets them in their
> >>>>> own SparkConf. The Mahout CLI drivers don't do this for all possible
> >>>>> options, only supporting a few like job name and spark.executor.memory.
> >>>>>
> >>>>> The question is how best to handle these Spark options. There seem to be
> >>>>> two options:
> >>>>> 1) use Spark's launch mechanism for drivers but allow some options to be
> >>>>> overridden in the CLI
> >>>>> 2) add parsing the env for options and set up the SparkConf defaults in
> >>>>> mahoutSparkContext with those variables.
> >>>>>
> >>>>> The downside of #2 is that as variables change we'll have to reflect
> >>>>> those in our code. I forget why #1 is not an option, but Dmitriy has
> >>>>> been consistently against it -- in any case it would mean a fair bit of
> >>>>> refactoring, I believe.
> >>>>>
> >>>>> Any opinions or corrections?
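To make the precedence discussed in the thread above concrete, here is a minimal, hypothetical sketch of what a mahoutSparkConf helper could look like, assuming spark-env.sh style environment defaults first, then any -Dspark.* system properties (which `new SparkConf()` already loads when it is constructed with defaults), then explicit CLI overrides last. None of this is existing Mahout code; the object and function names are placeholders.

```scala
// Hypothetical sketch of the "mahoutSparkConf" idea, not existing Mahout code.
// Precedence (lowest to highest): spark-env.sh style env variables,
// -Dspark.*=... system properties, explicit CLI overrides.
import org.apache.spark.SparkConf

object MahoutSparkConfSketch {

  def mahoutSparkConf(cliOverrides: Map[String, String] = Map.empty): SparkConf = {
    // new SparkConf() already copies any -Dspark.*=... java system properties
    // set by the launcher script into the conf.
    val conf = new SparkConf()

    // One env variable the cluster scripts commonly set (spark-env.sh),
    // applied only if it was not already configured via system properties.
    sys.env.get("SPARK_EXECUTOR_MEMORY").foreach { mem =>
      if (!conf.contains("spark.executor.memory")) conf.set("spark.executor.memory", mem)
    }

    // CLI flags win over everything else.
    cliOverrides.foreach { case (k, v) => conf.set(k, v) }
    conf
  }

  def main(args: Array[String]): Unit = {
    val conf = mahoutSparkConf(Map(
      "spark.kryoserializer.buffer.mb" -> "64",
      "spark.akka.frameSize" -> "30"))
    conf.getAll.foreach { case (k, v) => println(s"$k = $v") }
  }
}
```

Loading the defaults earliest and applying CLI flags last means a user's explicit settings always win, which matches option #2 from the original mail while still letting spark-env.sh do its usual job on the cluster side.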
