Sounds good. I hope to be working on one for naive Bayes; it's been a bit hectic lately, so hopefully sooner rather than later. I'll have a better understanding of the CLI code then.
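For what it's worth, below is a rough sketch of the kind of generic Spark-option handling I was picturing for the parseSparkOptions question in the thread below. It is only an illustration: the `-D:key=value` flag syntax, the object name, and the helper are placeholders of mine, not existing MahoutOptionParser code.

```scala
// Rough sketch only -- not actual MahoutOptionParser code. The -D:key=value
// flag syntax and all names here are placeholder assumptions.
object SparkOptionParsingSketch {

  // Matches a generic Spark setting passed as -D:spark.some.key=value
  private val DOption = "-D:(.+?)=(.+)".r

  /** Split raw CLI args into (generic Spark settings, remaining driver args). */
  def extractSparkSettings(args: Seq[String]): (Map[String, String], Seq[String]) = {
    val (dArgs, rest) = args.partition(_.startsWith("-D:"))
    val settings = dArgs.collect { case DOption(k, v) => k -> v }.toMap
    (settings, rest)
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "-D:spark.kryoserializer.buffer.mb=64",
      "-D:spark.akka.frameSize=30",
      "--input", "/tmp/data")

    val (sparkSettings, driverArgs) = extractSparkSettings(sample)
    println(s"spark settings: $sparkSettings") // pairs to copy onto a SparkConf later
    println(s"driver args:    $driverArgs")    // left for the driver's own parser
  }
}
```

The idea is just to peel off anything that looks like a generic Spark setting before the driver-specific parsing runs, then copy the resulting map onto the SparkConf, so new Spark options would not require new parsing code in every driver.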
> Subject: Re: Spark options
> From: [email protected]
> Date: Wed, 12 Nov 2014 10:49:43 -0800
> To: [email protected]
>
> Andrew, when you get to creating a driver maybe we should take another look
> at how to launch them. I'll add the -Dxxx=yyy option for now.
>
>
> On Nov 12, 2014, at 9:46 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> I do not object to the driver CLI using that. I was only skeptical about shell
> startup. And I also want these to be part of the official documented Spark API.
> (Are these classes it?) If they are not a stable API, we'd have trouble doing a
> major dependency update. If we only depend on the RDD API, the updates are easier.
>
> But... if anyone wants to engineer and verify a patch that uses these to
> launch the Mahout shell, and it works, I don't have a really strong basis for
> objection aside from the API stability concern.
>
> On Wed, Nov 12, 2014 at 8:33 AM, Pat Ferrel <[email protected]> wrote:
>
> > Yes, the drivers support executor memory directly too.
> >
> > What was the reason you didn't want to use the Spark submit process for
> > executing drivers? I understand we have to find our jars and set up Kryo.
> >
> > On Nov 11, 2014, at 6:00 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > Which is why I explicitly configure executor memory on the client. Although
> > even that interpretation depends on the resource manager A LOT, it seems.
> >
> > On Tue, Nov 11, 2014 at 5:49 PM, Pat Ferrel <[email protected]> wrote:
> >
> >> The submit code is the only place that documents which options are needed
> >> by clients AFAICT. It is pretty complicated and heavily laden with checks
> >> for which cluster manager is being used. I'd feel a lot better if we were
> >> using it. There is no way any of us are going to be able to test on all
> >> those configurations.
> >>
> >> spark-env.sh is mostly for launching the cluster, not the client, but there
> >> seem to be exceptions like executor memory.
> >>
> >>
> >> On Nov 11, 2014, at 2:18 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>
> >> These files, if I read them correctly, are for spawning yet another
> >> process. I don't see how that may work for the shell.
> >>
> >> I am also not convinced that spark-env is important for the client.
> >>
> >>
> >> On Tue, Nov 11, 2014 at 2:09 PM, Pat Ferrel <[email protected]> wrote:
> >>
> >>> I was thinking -Dx=y too, seems like a good idea.
> >>>
> >>> But we should also support setting them the way Spark documents in
> >>> spark-env.sh, and the two links Andrew found may solve that in a
> >>> maintainable way. Maybe we get the SparkConf from a new mahoutSparkConf
> >>> function, which handles all env-supplied setup. For the drivers it can be
> >>> done in the base class, allowing CLI overrides later. Then the SparkConf
> >>> is finally passed in to mahoutSparkContext, where as little as possible
> >>> is changed in the conf.
> >>>
> >>> I'll look at this for the drivers. Should be easy to add to the shell.
> >>>
> >>> On Nov 11, 2014, at 12:36 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>
> >>> IMO you just need to modify `mahout spark-shell` to propagate -Dx=y
> >>> parameters to the java startup call and all should be fine.
> >>>
> >>> On Tue, Nov 11, 2014 at 12:23 PM, Andrew Palumbo <[email protected]> wrote:
> >>>
> >>>> I've run into this problem starting the `mahout spark-shell` script, i.e.
> >>>> needing to set spark.kryoserializer.buffer.mb and spark.akka.frameSize.
> >>>> I've been temporarily hard coding them for now while developing.
> >>>>
> >>>> I'm just getting familiar with what you've done with the CLI drivers. For
> >>>> #2, could we borrow option parsing code/methods from Spark [1] [2] at
> >>>> each Spark release and somehow add this to
> >>>> MahoutOptionParser.parseSparkOptions?
> >>>>
> >>>> I'll hopefully be doing some CLI work soon and have a better understanding.
> >>>>
> >>>> [1]
> >>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala
> >>>> [2]
> >>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
> >>>>
> >>>>> From: [email protected]
> >>>>> Subject: Spark options
> >>>>> Date: Wed, 5 Nov 2014 09:48:59 -0800
> >>>>> To: [email protected]
> >>>>>
> >>>>> Spark has a launch script as Hadoop does. We use the Hadoop launcher
> >>>>> script but not the Spark one. When starting up your Spark cluster there
> >>>>> is a spark-env.sh script that can set a bunch of environment variables.
> >>>>> In our own mahoutSparkContext function, which takes the place of the
> >>>>> Spark submit script and launcher, we don't account for most of the
> >>>>> environment variables.
> >>>>>
> >>>>> Unless I missed something, this means most of the documented options
> >>>>> will be ignored unless a user of Mahout parses and sets them in their
> >>>>> own SparkConf. The Mahout CLI drivers don't do this for all possible
> >>>>> options, only supporting a few like job name and spark.executor.memory.
> >>>>>
> >>>>> The question is how best to handle these Spark options. There seem to be
> >>>>> two options:
> >>>>> 1) use Spark's launch mechanism for drivers but allow some options to be
> >>>>> overridden in the CLI
> >>>>> 2) add parsing the env for options and set up the SparkConf defaults in
> >>>>> mahoutSparkContext with those variables.
> >>>>>
> >>>>> The downside of #2 is that as variables change we'll have to reflect
> >>>>> those in our code. I forget why #1 is not an option, but Dmitriy has
> >>>>> been consistently against it -- in any case it would mean a fair bit of
> >>>>> refactoring, I believe.
> >>>>>
> >>>>> Any opinions or corrections?
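To make the precedence discussed in the thread above concrete, here is a minimal, hypothetical sketch of what a mahoutSparkConf helper could look like, assuming spark-env.sh style environment defaults first, then any -Dspark.* system properties (which `new SparkConf()` already loads when it is constructed with defaults), then explicit CLI overrides last. None of this is existing Mahout code; the object and function names are placeholders.

```scala
// Hypothetical sketch of the "mahoutSparkConf" idea, not existing Mahout code.
// Precedence (lowest to highest): spark-env.sh style env variables,
// -Dspark.*=... system properties, explicit CLI overrides.
import org.apache.spark.SparkConf

object MahoutSparkConfSketch {

  def mahoutSparkConf(cliOverrides: Map[String, String] = Map.empty): SparkConf = {
    // new SparkConf() already copies any -Dspark.*=... java system properties
    // set by the launcher script into the conf.
    val conf = new SparkConf()

    // One env variable the cluster scripts commonly set (spark-env.sh),
    // applied only if it was not already configured via system properties.
    sys.env.get("SPARK_EXECUTOR_MEMORY").foreach { mem =>
      if (!conf.contains("spark.executor.memory")) conf.set("spark.executor.memory", mem)
    }

    // CLI flags win over everything else.
    cliOverrides.foreach { case (k, v) => conf.set(k, v) }
    conf
  }

  def main(args: Array[String]): Unit = {
    val conf = mahoutSparkConf(Map(
      "spark.kryoserializer.buffer.mb" -> "64",
      "spark.akka.frameSize" -> "30"))
    conf.getAll.foreach { case (k, v) => println(s"$k = $v") }
  }
}
```

Loading the defaults earliest and applying CLI flags last means a user's explicit settings always win, which matches option #2 from the original mail while still letting spark-env.sh do its usual job on the cluster side.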
