Re: Straw poll re: H2O ?

Frank Scholten Tue, 29 Apr 2014 06:04:37 -0700

On Apr 28, 2014, at 23:45, Pat Ferrel <[email protected]> wrote:

> I favor iterative dev (implement quick then refactor to perfection) and so no 
> I would not try to make the DSL or specific imports support more than one 
> backend _until_ we support one.
> Doing two at once? I’ve seen too many projects that fail because of this. Why 
> not tackle equal support for multiple backends as a refactoring task after 
> one backend is fully implemented?
> 
> Dmitriy is free to do as he wishes, of course, but he is spending an awful 
> lot of time dealing with this. 
> 
> Multiple backends sounds fine in principle but the need has not been 
> demonstrated, therefore it seems like a question to be deferred.
> 
> Have you considered doing a much scaled back, single algo implementation as 
> an example? It wouldn’t require changing the DSL or any code already in 
> Mahout and would add something useful quickly. All you need is file level 
> compatibility. Besides it might go towards demonstrating the need.


+1

> 
> 
> On Apr 28, 2014, at 2:13 PM, Anand Avati <[email protected]> wrote:
> 
> Saikat, Pat,
> 
> For background, please refer to the "Mahout DSL vs Spark" discussion
> for the general direction in which the integration is being explored. With
> that background, I would like to present some counter questions:
> 
> 1. Why is the DSL claiming to have (in its vision) logical vs physical
> separation if not for providing multiple compute backends?
> 
> 2. Does the proposal of having a new DSL backend in the future (e.g.,
> stratosphere as suggested elsewhere) make you:
> -- propose mahout-stratosphere as a different top level project?
> -- worry that stratosphere would be a dependency to Mahout?
> -- worry that you won't be able to say "Future of Mahout is Spark .. but it
> also supports stratosphere"?
> -- worry that as a user/committer/contributor you have to worry about a new
> framework?
> -- resist having a DSL backend for stratosphere because Hadoop vendors may
> not support it?
> 
> Obviously no, since they are all just different DSL backends.
> 
> Have you guys embraced the idea that the DSL allows for multiple backends
> (Spark being the first to get implemented)? or Not? Hence I do not
> understand the "problem" here.
> 
> Thanks
> 
> On Mon, Apr 28, 2014 at 1:29 PM, Saikat Kanjilal <[email protected]> wrote:
> 
>> I would echo Pat's sentiments spot on related to the goal of supporting
>> both spark and H2O confusing folks that are interested in using, committing
>> to and trying to understand where Mahout is headed small to medium term.
>> I hate to throw this out but given the amount of "sometimes not so nice"
>> back and forths I've seen on issue 1500, I really wonder whether we should
>> have mahout-spark and mahout-h2o as two different top level projects
>> potentially supporting a different set of algorithms underneath, yes I know
>> tying mahout to a particular technology goes against the initial vision
>> but given the churn I'm seeing I'm not sure I understand what the current
>> vision even is :)
>> 
>>> Subject: Re: Straw poll re: H2O ?
>>> From: [email protected]
>>> Date: Mon, 28 Apr 2014 13:17:03 -0700
>>> CC: [email protected]
>>> To: [email protected]
>>> 
>>> I haven’t heard a good explanation of what this project is. There should
>> be some small step like implementing an algo on h2o that takes the same input
>> as a current Hadoop Mahout job and produce the same result or do one not
>> already in Mahout. At least it will answer some technical questions and
>> shouldn’t take a lot of support from current committers to produce.
>>> 
>>> I’m still not convinced that this is the primary thing that should drive
>> making it a Mahout dependency.
>>> 
>>> I’m highly dubious of actively supporting and working on Mahout for
>> Spark and h2o. Not for technical reasons but because rebooting Mahout on
>> two platforms seems a non-starter. No project manager in the commercial
>> world would allow that sort of thing. And rightly so, it confuses users,
>> committers, contributors. You shouldn’t have a great deal of redundancy or
>> competing efforts _inside_ a project even an open source one. That’s for
>> separate projects and the incubator, right? There are plenty of examples of
>> going that route, Spark itself is redundant with Hadoop in many ways. Would
>> Apache accept h2o as a parallel project to Spark, if so why not do that?
>>> 
>>> Question: Where do we (Mahout user, committer, contributor) invest
>> extremely precious time learning new languages, frameworks, architecture,
>> configurations, optimizations?
>>> 
>>> Answer: Many will simply not choose but wait and see, or go elsewhere.
>>> 
>>> Why? Because we fail to communicate “the future of Mahout is Spark
>> first—period” It keeps coming out "Spark and, well, h2o too”
>>> 
>>> That is a momentum killer.  If we’re agreed on “Spark first” then
>> there’s no need to incubate Mahout 2, Spark and Mahout have already gone
>> through that and though Dmitriy’s DSL and Scala shell work is entirely new,
>> to the end user the jobs, input and output, and functionality will look
>> like a v2. People dealing with internals will see a different world but
>> they should be a minority of users and will hopefully like what they see.
>>> 
>>> 
>>> Somewhat off subject notes on external politics:
>>> 
>>> We really need to make sure Mahout stays in all the big distros. That
>> means Sebastian’s comments are spot on: "The best way to help Mahout is to
>> pick up some of the work that needs to be done with regards to
>> documentation, examples, Hadoop 2 compatibility and designing the future,
>> especially with regards to dataframes”  All the distros are hadoop 2.
>>> 
>>> Incubating Mahout 2 as another project is surely a way out of the
>> distros, another momentum killer.
>>> 
>>> Another political question is whether an h2o dependency would be an
>> issue to the distros. If we are going to put big efforts into h2o let’s see
>> how that plays out first. Spark is already supported by them, even
>> Hortonworks has taken a first step with 2.1. If Mahout is in a distro the
>> distro will be asked to support it, that’s what they are paid for. Do they
>> want to support h2o? I have no idea how they would react to that but it
>> affects Mahout.
>>> 
>>> 
>>> For all these reasons I’d be -1 to any big-bang integration.
>>> 
>>> 
>>>> On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <[email protected]>
>> wrote:
>>>> 
>>>> +1. I don't think anyone said anything, privately or publicly, about
>> h2o
>>>> integration being a bad idea. It's just there's more than one way to
>> do it,
>>>> so debate is focusing on exploration of pluses and minuses of each
>>>> individual proposal (as they come to light). Part of difficulty here
>> was
>>>> that the expertise intersection of all parts being connected and
>> integrated
>>>> has been pretty poor on individual basis. So we have to go by scenarios
>>>> where a group of specialized experts tries to figure out the solution.
>>>> 
>>>> w.r.t. incubation proposals, they seem dubious for a number of reasons.
>>>> 
>>>> Reason 1 is that these projects are the primary factor moving Mahout
>>>> anywhere forward. Without them, given "bye-bye mapreduce" jira, there's
>>>> frankly not much left in Mahout, so it is reflection of more or less
>> common
>>>> opinion that the project would just spiral down on its own if the
>> things
>>>> stay status-quo.
>>>> 
>>>> Reason 2 is that there are good (not irreplaceable, but good)
>> components in
>>>> Mahout that these efforts depend on. Therefore, incubation would be
>> faced
>>>> with the prospect of having dependencies on a project that on its own is
>>>> winding down. Not good for incubation side.
>>>> 
>>>> Reason 3 is that current effort is (IMO) minimalistic enough not to
>> warrant
>>>> a new project. It simply doesn't, and can't have the scale of things
>> like
>>>> Spark or Hadoop eco. There would be just not enough substance for a new
>>>> project at this point. I don't feel very strongly about this point
>> though.
>>>> 
>>>> 
>>>> On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <[email protected]>
>> wrote:
>>>> 
>>>>> We all should calm down here and remind ourselves why we are doing
>> this
>>>>> whole thing: Because we love open source and want to have a vibrant
>>>>> community and a great piece of software.
>>>>> 
>>>>> Mahout has come a long way and is at a crossroads right now, so it's
>> only
>>>>> natural that there are heated discussions. But, we should immediately
>> stop
>>>>> the fingerpointing and related stuff, we have managed to avoid this
>> since
>>>>> Mahout's inception and we should continue to do so.
>>>>> 
>>>>> The best way to help Mahout is to pick up some of the work that needs
>> to
>>>>> be done with regards to documentation, examples, Hadoop 2
>> compatibility and
>>>>> designing the future, especially with regards to dataframes e.g.
>>>>> 
>>>>> We agreed to give the h2O guys a shot for exploration of a possible
>>>>> integration into Mahout. We should be grateful that they are
>> investing a
>>>>> lot of time into this, and should help wherever we can. Once they
>> come up
>>>>> with a concrete proposal or patch, we will have a look at it, have a
>> deep,
>>>>> technical and polite discussion, and make a decision afterwards.
>>>>> 
>>>>> --sebastian
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 04/28/2014 07:42 PM, Anand Avati wrote:
>>>>> 
>>>>>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <[email protected]> wrote:
>>>>>> 
>>>>>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
>>>>>>> <[email protected]> wrote:
>>>>>>> 
>>>>>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly
>> what
>>>>>>> is encouraging the h2o work to be done a bit apart. It simply isn't
>>>>>>> efficient to have to answer so many off-topic points whenever any
>> reports
>>>>>>> on work in progress are given.
>>>>>>> 
>>>>>>>> 
>>>>>>>> I think this has been the off-topic here.
>>>>>>>> 
>>>>>>>> Calling my comments "emotional" or "non-technical", or _loosely_
>>>>>>> paraphrasing me.
>>>>>>> 
>>>>>>> Yes, the personal finger-pointing parts don't belong and don't
>>>>>>> convince anyone, let's skip those.
>>>>>> +1. Let's skip those.
>>>>>> 
>>>>>> 
>>>>>> From the sidelines, I see a bunch of work intended for Mahout
>>>>>> 
>>>>>>> proceeding outside the community such as it is, and even Apache. Of
>>>>>>> course, contributions are always prepped externally to some degree.
>> I
>>>>>>> create, debug, change patches before posting them, maybe checking in
>>>>>>> early on choices that others may want input on.
>>>>>>> 
>>>>>>> This is a large-ish change being proposed, IIUC. I can see one
>> person
>>>>>>> who publicly, and at least two who privately, have clear
>> reservations
>>>>>>> about this direction.
>>>>>> 
>>>>>> 
>>>>>> It will probably be a large-ish change, indeed. But my personal take
>> is
>>>>>> that non-technical aspects of the debate are unfortunately taking
>>>>>> precedence over real technical parts. Please refer to email thread
>> "Mahout
>>>>>> DSL vs Spark".
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> It certainly appears funny vis-a-vis the "Apache
>>>>>>> way" to work on a contribution *because* one (or more) other
>>>>>>> committers aren't convinced.
>>>>>> As mentioned in the referred email thread, a lot of the technical
>> issues
>>>>>> which got addressed in the work which was carried out outside of
>> Apache,
>>>>>> was really sorting out and highlighting build and classloader related
>>>>>> challenges on the H2O side. There was little motivation to carry out
>> those
>>>>>> discussions on the Mahout lists as it was really ~99% H2O specific
>>>>>> discussions and noise/spam to the Mahout community.
>>>>>> 
>>>>>> I don't think that's important to dither about. What is, is this: if
>> a
>>>>>> 
>>>>>>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
>>>>>>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
>>>>>>> yet, but it seems like a quite possible outcome.
>>>>>> As an outsider, my opinion is that the proposed need for a VOTE is a
>>>>>> largely masqueraded problem built around the perception of
>> disagreement
>>>>>> over something vague, abstract and inaccurate. And therefore
>> premature.
>>>>>> That being said the PMC may vote on any issues/non-issues it may
>> please.
>>>>>> 
>>>>>> Would be a shame to do a lot of work, intending it for a commit, and
>>>>>> 
>>>>>>> then find there is not consensus.
>>>>>> Exactly the kind of inaccurate perception I meant. While we are (at
>> least
>>>>>> I
>>>>>> am) exploring the best fit model for integration, and exploration by
>>>>>> definition involves taking potentially wrong steps and backtracking
>> if
>>>>>> necessary, the perception unfortunately seems to be that the proposed
>>>>>> intermediate (potentially wrong) steps are some kind of pre-decided
>> plan
>>>>>> of
>>>>>> action. So, no, there WOULDN'T be a lot of work intended for a commit
>>>>>> against consensus.
>>>>>> 
>>>>>> So is it better to figure out earlier than later whether these 2+
>>>>>> 
>>>>>>> parallel tracks have enough commonality to coexist?
>>>>>> 
>>>>>> 
>>>>>> Whether two parallel tracks (I assume the spark track and the H2O
>> track?)
>>>>>> have enough commonality to exist - one way you surely cannot get the
>> right
>>>>>> answer for this (except by coincidence) is by taking a vote from a
>> group
>>>>>> who are experts in only either one of those tracks. From what I see,
>> most
>>>>>> of the opposition has been due to a combination of lack of
>> understanding
>>>>>> of
>>>>>> H2O and (welcome) skepticism. If, as a contributor, I find there is
>> no
>>>>>> natural or beneficial way to co-exist with Spark, I wouldn't waste
>> my time
>>>>>> writing code, and for sure am not dependent on another group's vote
>> to
>>>>>> make
>>>>>> that decision for me.
>>>>>> 
>>>>>> Avati
>
