Andrew, when you get to creating a driver, maybe we should take another look at how we launch them. I'll add the -Dxxx=yyy option for now.
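For reference, a minimal sketch of how the -D route can reach the context, assuming the launcher script simply forwards -Dspark.*=value flags to the java call as JVM system properties. The sparkConfFromSystemProps helper below is illustrative only (new SparkConf() already loads spark.* system properties by itself); the resulting conf would then be handed to mahoutSparkContext:

import org.apache.spark.SparkConf
import scala.collection.JavaConverters._

// Copy any -Dspark.*=value JVM system properties onto a SparkConf. This is
// what new SparkConf() does internally when loadDefaults is true; it is
// spelled out here only to make the flow visible.
def sparkConfFromSystemProps(base: SparkConf): SparkConf =
  System.getProperties.asScala
    .filter { case (k, _) => k.startsWith("spark.") }
    .foldLeft(base) { case (conf, (k, v)) => conf.set(k, v) }

// e.g. a driver launched with -Dspark.executor.memory=4g -Dspark.kryoserializer.buffer.mb=32
val conf = sparkConfFromSystemProps(new SparkConf())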
On Nov 12, 2014, at 9:46 AM, Dmitriy Lyubimov <[email protected]> wrote:

I do not object to the driver CLI using that. I was only skeptical about shell startup. And I also want these to be part of the official, documented Spark API (are these classes it?). If they are not a stable API, we'd have trouble doing a major dependency update. If we only depend on the RDD API, the updates are easier.

But... if anyone wants to engineer and verify a patch that uses these to launch the Mahout shell, and it works, I don't have a really strong basis for objection aside from the API stability concern.

On Wed, Nov 12, 2014 at 8:33 AM, Pat Ferrel <[email protected]> wrote:

> Yes, the drivers support setting executor memory directly too.
>
> What was the reason you didn't want to use the Spark submit process for executing drivers? I understand we have to find our jars and set up Kryo.
>
> On Nov 11, 2014, at 6:00 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> Which is why I explicitly configure executor memory on the client. Although even that interpretation depends on the resource manager A LOT, it seems.
>
> On Tue, Nov 11, 2014 at 5:49 PM, Pat Ferrel <[email protected]> wrote:
>
>> The submit code is the only place that documents which options are needed by clients, AFAICT. It is pretty complicated and heavily laden with checks for which cluster manager is being used. I'd feel a lot better if we were using it. There is no way any of us are going to be able to test on all those configurations.
>>
>> spark-env.sh is mostly for launching the cluster, not the client, but there seem to be exceptions like executor memory.
>>
>> On Nov 11, 2014, at 2:18 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> These files, if I read them correctly, are for spawning yet another process. I don't see how that would work for the shell.
>>
>> I am also not convinced that spark-env is important for the client.
>>
>> On Tue, Nov 11, 2014 at 2:09 PM, Pat Ferrel <[email protected]> wrote:
>>
>>> I was thinking -Dx=y too; seems like a good idea.
>>>
>>> But we should also support setting them the way Spark documents in spark-env.sh, and the two links Andrew found may solve that in a maintainable way. Maybe we get the SparkConf from a new mahoutSparkConf function, which handles all env-supplied setup. For the drivers it can be done in the base class, allowing CLI overrides later. Then the SparkConf is finally passed in to mahoutSparkContext, where as little as possible is changed in the conf.
>>>
>>> I'll look at this for the drivers. It should be easy to add to the shell.
>>>
>>> On Nov 11, 2014, at 12:36 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> IMO you just need to modify `mahout spark-shell` to propagate -Dx=y parameters to the java startup call and all should be fine.
>>>
>>> On Tue, Nov 11, 2014 at 12:23 PM, Andrew Palumbo <[email protected]> wrote:
>>>
>>>> I've run into this problem starting $ mahout spark-shell, i.e. needing to set spark.kryoserializer.buffer.mb and spark.akka.frameSize. I've been temporarily hard coding them for now while developing.
>>>>
>>>> I'm just getting familiar with what you've done with the CLI drivers. For #2, could we borrow option parsing code/methods from Spark [1] [2] at each (Spark) release and somehow add this to MahoutOptionParser.parseSparkOptions?
>>>>
>>>> I'll hopefully be doing some CLI work soon and have a better understanding.
>>>>
>>>> [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala
>>>> [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
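For concreteness, a rough sketch of the kind of hard coding mentioned above, and of the generic pass-through that a parseSparkOptions-style method could replace it with; the values and the applySparkSettings helper are only illustrative, not existing Mahout code:

import org.apache.spark.SparkConf

// Temporary hard coding of the two settings mentioned above (example values only).
val conf = new SparkConf()
  .set("spark.kryoserializer.buffer.mb", "32")
  .set("spark.akka.frameSize", "30")

// A generic pass-through would replace the hard coding: fold whatever spark.*
// key/value pairs the option parser collected into the same conf. The settings
// map is a stand-in for what a parseSparkOptions-style method might return.
def applySparkSettings(conf: SparkConf, settings: Map[String, String]): SparkConf =
  settings.foldLeft(conf) { case (c, (k, v)) => c.set(k, v) }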
>>>>> From: [email protected]
>>>>> Subject: Spark options
>>>>> Date: Wed, 5 Nov 2014 09:48:59 -0800
>>>>> To: [email protected]
>>>>>
>>>>> Spark has a launch script, as Hadoop does. We use the Hadoop launcher script but not the Spark one. When starting up your Spark cluster there is a spark-env.sh script that can set a bunch of environment variables. In our own mahoutSparkContext function, which takes the place of the Spark submit script and launcher, we don't account for most of those environment variables.
>>>>>
>>>>> Unless I missed something, this means most of the documented options will be ignored unless a user of Mahout parses and sets them in their own SparkConf. The Mahout CLI drivers don't do this for all possible options, only supporting a few like job name and spark.executor.memory.
>>>>>
>>>>> The question is how best to handle these Spark options. There seem to be two options:
>>>>> 1) use Spark's launch mechanism for drivers but allow some settings to be overridden in the CLI
>>>>> 2) parse the env for options and set up the SparkConf defaults in mahoutSparkContext with those variables
>>>>>
>>>>> The downside of #2 is that as the variables change we'll have to reflect those changes in our code. I forget why #1 is not an option, but Dmitriy has been consistently against it; in any case it would mean a fair bit of refactoring, I believe.
>>>>>
>>>>> Any opinions or corrections?
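To make option #2 above concrete, a minimal sketch of what a mahoutSparkConf helper might do. The variable-to-property mappings are purely illustrative and would have to be kept in sync with Spark's documentation, which is exactly the maintenance downside noted above; using setIfMissing keeps them as defaults that -D system properties or later CLI options can still override:

import org.apache.spark.SparkConf

// Build the SparkConf in one place, folding in environment variables that
// spark-env.sh commonly exports. The mappings below are examples only.
def mahoutSparkConf(): SparkConf = {
  val conf = new SparkConf()  // already picks up -Dspark.* JVM system properties
  sys.env.get("SPARK_EXECUTOR_MEMORY").foreach(m => conf.setIfMissing("spark.executor.memory", m))
  sys.env.get("SPARK_EXECUTOR_CORES").foreach(c => conf.setIfMissing("spark.executor.cores", c))
  conf
}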
