RE: Straw poll re: H2O ?

Saikat Kanjilal Mon, 28 Apr 2014 15:45:07 -0700

My main question at this point (and I realize this discussion has been had in
multiple places, in parts, but not in enough detail) is this: given that the
user stays in DSL land, when would I use one backend versus another, and how
easy is it to get up and running with one backend versus the other?
Keenly following all discussions in the meantime.


> Date: Mon, 28 Apr 2014 14:13:35 -0700
> Subject: Re: Straw poll re: H2O ?
> From: [email protected]
> To: [email protected]
> CC: [email protected]
> 
> Saikat, Pat,
> 
> For background, please refer to the "Mahout DSL vs Spark" discussion thread
> for the general direction in which the integration is being explored. With
> that background, I would like to present some counter questions:
> 
> 1. Why is the DSL claiming to have (in its vision) logical vs physical
> separation if not for providing multiple compute backends?
> 
> 2. Does the proposal of having a new DSL backend in the future (e.g.
> stratosphere as suggested elsewhere) make you:
> -- propose mahout-stratosphere as a different top level project?
> -- worry that stratosphere would be a dependency to Mahout?
> -- worry that you won't be able to say "Future of Mahout is Spark .. but it
> also supports stratosphere"?
> -- worry that as a user/committer/contributor you have to worry about a new
> framework?
> -- resist having a DSL backend for stratosphere because Hadoop vendors may
> not support it?
> 
> Obviously no, since they are all just different DSL backends.
> 
> Have you guys embraced the idea that the DSL allows for multiple backends
> (Spark being the first to get implemented)? or Not? Hence I do not
> understand the "problem" here.
> 
> Thanks
> 
> On Mon, Apr 28, 2014 at 1:29 PM, Saikat Kanjilal <[email protected]> wrote:
> 
> > I would echo Pat's sentiments spot on related to the goal of supporting
> > both spark and H2O confusing folks that are interested in using, committing
> > to and trying to understand where Mahout is headed small to medium term.
> > I hate to throw this out but given the amount of "sometimes not so nice
> > back and forths I've seen on issue 1500" I really wonder whether we should
> > have mahout-spark and mahout-h2o as two different top level projects
> > potentially supporting a different set of algorithms underneath, yes I know
> > tying mahout to a particular technology goes against the initial vision
> > but given the churn I'm seeing I'm not sure I understand what the current
> > vision even is :)
> >
> > > Subject: Re: Straw poll re: H2O ?
> > > From: [email protected]
> > > Date: Mon, 28 Apr 2014 13:17:03 -0700
> > > CC: [email protected]
> > > To: [email protected]
> > >
> > > I haven’t heard a good explanation of what this project is. There should
> > be some small step like implementing an algo on h2o that takes the same input
> > as a current Hadoop Mahout job and produce the same result or do one not
> > already in Mahout. At least it will answer some technical questions and
> > shouldn’t take a lot of support from current committers to produce.
> > >
> > > I’m still not convinced that this is the primary thing that should drive
> > making it a Mahout dependency.
> > >
> > > I’m highly dubious of actively supporting and working on Mahout for
> > Spark and h2o. Not for technical reasons but because rebooting Mahout on
> > two platforms seems a non-starter. No project manager in the commercial
> > world would allow that sort of thing. And rightly so, it confuses users,
> > committers, contributors. You shouldn’t have a great deal of redundancy or
> > competing efforts _inside_ a project even an open source one. That’s for
> > separate projects and the incubator, right? There are plenty of examples of
> > going that route, Spark itself is redundant with Hadoop in many ways. Would
> > Apache accept h2o as a parallel project to Spark, if so why not do that?
> > >
> > > Question: Where do we (Mahout user, committer, contributor) invest
> > extremely precious time learning new languages, frameworks, architecture,
> > configurations, optimizations?
> > >
> > > Answer: Many will simply not choose but wait and see, or go elsewhere.
> > >
> > > Why? Because we fail to communicate “the future of Mahout is Spark
> > first—period” It keeps coming out "Spark and, well, h2o too”
> > >
> > > That is a momentum killer.  If we’re agreed on “Spark first” then
> > there’s no need to incubate Mahout 2, Spark and Mahout have already gone
> > through that and though Dmitriy’s DSL and Scala shell work is entirely new,
> > to the end user the jobs, input and output, and functionality will look
> > like a v2. People dealing with internals will see a different world but
> > they should be a minority of users and will hopefully like what they see.
> > >
> > >
> > > Somewhat off subject notes on external politics:
> > >
> > > We really need to make sure Mahout stays in all the big distros. That
> > means Sebastian’s comments are spot on: "The best way to help Mahout is to
> > pick up some of the work that needs to be done with regards to
> > documentation, examples, Hadoop 2 compatibility and designing the future,
> > especially with regards to dataframes”  All the distros are hadoop 2.
> > >
> > > Incubating Mahout 2 as another project is surely a way out of the
> > distros, another momentum killer.
> > >
> > > Another political question is whether an h2o dependency would be an
> > issue to the distros. If we are going to put big efforts into h2o let’s see
> > how that plays out first. Spark is already supported by them, even
> > Hortonworks has taken a first step with 2.1. If Mahout is in a distro the
> > distro will be asked to support it, that’s what they are paid for. Do they
> > want to support h2o? I have no idea how they would react to that but it
> > affects Mahout.
> > >
> > >
> > > For all these reasons I’d be -1 to any big-bang integration.
> > >
> > >
> > > > On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> > > >
> > > > +1. I don't think anyone said anything, privately or publicly, about
> > h2o
> > > > integration being a bad idea. It's just there's more than one way to
> > do it,
> > > > so debate is focusing on exploration of pluses and minuses of each
> > > > individual proposal (as they come to light). Part of difficulty here
> > was
> > > > that the expertise intersection of all parts being connected and
> > integrated
> > > > has been pretty poor on individual basis. So we have to go by scenarios
> > > > where a group of specialized experts tries to figure out the solution.
> > > >
> > > > w.r.t. incubation proposals, it seems dubious for a number of reasons.
> > > >
> > > > Reason 1 is that these projects are the primary factor moving Mahout
> > > > anywhere forward. Without them, given "bye-bye mapreduce" jira, there's
> > > > frankly not much left in Mahout, so it is reflection of more or less
> > common
> > > > opinion that the project would just spiral down on its own if the
> > things
> > > > stay status-quo.
> > > >
> > > > Reason 2 is that there are good (not irreplaceable, but good)
> > components in
> > > > Mahout that these efforts depend on. Therefore, incubation would be
> > faced
> > > > with the prospect of having dependencies on a project that on its own is
> > > > winding down. Not good for incubation side.
> > > >
> > > > Reason 3 is that current effort is (IMO) minimalistic enough not to
> > warrant
> > > > a new project. It simply doesn't, and can't have the scale of things
> > like
> > > > Spark or Hadoop eco. There would be just not enough substance for a new
> > > > project at this point. I don't feel very strong about this point
> > though.
> > > >
> > > >
> > > > On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <[email protected]>
> > wrote:
> > > >
> > > >> We all should calm down here and remind ourselves why we are doing
> > this
> > > >> whole thing: Because we love open source and want to have a vibrant
> > > >> community and a great piece of software.
> > > >>
> > > >> Mahout has come a long way and is at a crossroads right now, so it's
> > only
> > > >> natural that there are heated discussions. But, we should immediately
> > stop
> > > >> the fingerpointing and related stuff, we have managed to avoid this
> > since
> > > >> Mahout's inception and we should continue to do so.
> > > >>
> > > >> The best way to help Mahout is to pick up some of the work that needs
> > to
> > > >> be done with regards to documentation, examples, Hadoop 2
> > compatibility and
> > > >> designing the future, especially with regards to dataframes e.g.
> > > >>
> > > >> We agreed to give the h2O guys a shot for exploration of a possible
> > > >> integration into Mahout. We should be grateful that they are
> > investing a
> > > >> lot of time into this, and should help wherever we can. Once they
> > come up
> > > >> with a concrete proposal or patch, we will have a look at it, have a
> > deep,
> > > >> technical and polite discussion, and make a decision afterwards.
> > > >>
> > > >> --sebastian
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On 04/28/2014 07:42 PM, Anand Avati wrote:
> > > >>
> > > >>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <[email protected]> wrote:
> > > >>>
> > > >>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
> > > >>>> <[email protected]> wrote:
> > > >>>>
> > > >>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly
> > what
> > > >>>>>
> > > >>>> is encouraging the h2o work to be done a bit apart. It simply isn't
> > > >>>> efficient to have to answer so many off-topic points whenever any
> > reports
> > > >>>> on work in progress are given.
> > > >>>>
> > > >>>>>
> > > >>>>> I think this has been the off-topic here.
> > > >>>>>
> > > >>>>> Calling my comments "emotional" or "non-technical", or _loosely_
> > > >>>>>
> > > >>>> paraphrasing me.
> > > >>>>
> > > >>>> Yes, the personal finger-pointing parts don't belong and don't
> > > >>>> convince anyone, let's skip those.
> > > >>>>
> > > >>>>
> > > >>> +1. Let's skip those.
> > > >>>
> > > >>>
> > > >>> From the sidelines, I see a bunch of work intended for Mahout
> > > >>>
> > > >>>> proceeding outside the community such as it is, and even Apache. Of
> > > >>>> course, contributions are always prepped externally to some degree.
> > I
> > > >>>> create, debug, change patches before posting them, maybe checking in
> > > >>>> early on choices that others may want input on.
> > > >>>>
> > > >>>> This is a large-ish change being proposed, IIUC. I can see one
> > person
> > > >>>> who publicly, and at least two who privately, have clear
> > reservations
> > > >>>> about this direction.
> > > >>>>
> > > >>>
> > > >>>
> > > >>> It will probably be a large-ish change, indeed. But my personal take
> > is
> > > >>> that non-technical aspects of the debate are unfortunately taking
> > > >>> precedence over real technical parts. Please refer to email thread
> > "Mahout
> > > >>> DSL vs Spark".
> > > >>>
> > > >>>
> > > >>>
> > > >>> It certainly appears funny vis-a-vis the "Apache
> > > >>>> way" to work on a contribution *because* one (or more) other
> > > >>>> committers aren't convinced.
> > > >>>>
> > > >>>>
> > > >>> As mentioned in the referred email thread, a lot of the technical
> > issues
> > > >>> which got addressed in the work which was carried out outside of
> > Apache,
> > > >>> was really sorting out and highlighting build and classloader related
> > > >>> challenges on the H2O side. There was little motivation to carry out
> > those
> > > >>> discussions on the Mahout lists as it was really ~99% H2O specific
> > > >>> discussions and noise/spam to the Mahout community.
> > > >>>
> > > >>> I don't think that's important to dither about. What is, is this: if
> > a
> > > >>>
> > > >>>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
> > > >>>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
> > > >>>> yet, but it seems like a quite possible outcome.
> > > >>>>
> > > >>>>
> > > >>> As an outsider, my opinion is that the proposed need for a VOTE is a
> > > >>> largely masqueraded problem built around the perception of
> > disagreement
> > > >>> over something vague, abstract and inaccurate. And therefore
> > premature.
> > > >>> That being said the PMC may vote on any issues/non-issues it may
> > please.
> > > >>>
> > > >>> Would be a shame to do a lot of work, intending it for a commit, and
> > > >>>
> > > >>>> then find there is not consensus.
> > > >>>>
> > > >>>>
> > > >>> Exactly the kind of inaccurate perception I meant. While we are (at
> > least
> > > >>> I
> > > >>> am) exploring the best fit model for integration, and exploration by
> > > >>> definition involves taking potentially wrong steps and backtracking
> > if
> > > >>> necessary, the perception unfortunately seems to be that the proposed
> > > >>> intermediate (potentially wrong) steps are some kind of pre-decided
> > plan
> > > >>> of
> > > >>> action. So, no, there WOULDN'T be a lot of work intended for a commit
> > > >>> against consensus.
> > > >>>
> > > >>> So is it better to figure out earlier than later whether these 2+
> > > >>>
> > > >>>> parallel tracks have enough commonality to coexist?
> > > >>>>
> > > >>>
> > > >>>
> > > >>> Whether two parallel tracks (I assume the spark track and the H2O
> > track?)
> > > >>> have enough commonality to coexist - one way you surely cannot get the
> > right
> > > >>> answer for this (except by co-incidence) is by taking a vote from a
> > group
> > > >>> who are experts in only either one of those tracks. From what I see,
> > most
> > > >>> of the opposition has been due to a combination of lack of
> > understanding
> > > >>> of
> > > >>> H2O and (welcome) skepticism. If, as a contributor, I find there is
> > no
> > > >>> natural or beneficial way to co-exist with Spark, I wouldn't waste
> > my time
> > > >>> writing code, and for sure am not dependent on another group's vote
> > to
> > > >>> make
> > > >>> that decision for me.
> > > >>>
> > > >>> Avati
> > > >>>
> > > >>>
> > > >>
> > > >
> >
> >
