On Apr 28, 2014, at 23:45, Pat Ferrel <[email protected]> wrote:
> I favor iterative dev (implement quick then refactor to perfection) and so no
> I would not try to make the DSL or specific imports support more than one
> backend _until_ we support one.
> Doing two at once? I’ve seen too many projects that fail because of this. Why
> not tackle equal support for multiple backends as a refactoring task after
> one backend is fully implemented?
>
> Dmitriy is free to do as he wishes, of course, but he is spending an awful
> lot of time dealing with this.
>
> Multiple backends sounds fine in principle but the need has not been
> demonstrated, therefore it seems like a question to be deferred.
>
> Have you considered doing a much scaled back, single algo implementation as
> an example? It wouldn’t require changing the DSL or any code already in
> Mahout and would add something useful quickly. All you need is file level
> compatibility. Besides it might go towards demonstrating the need.
+1
>
>
> On Apr 28, 2014, at 2:13 PM, Anand Avati <[email protected]> wrote:
>
> Saikat, Pat,
>
> For background, please refer to the "Mahout DSL vs Spark" discussion
> for the general direction in which the integration is being explored. With
> that background, I would like to present some counter questions:
>
> 1. Why is the DSL claiming to have (in its vision) logical vs physical
> separation if not for providing multiple compute backends?
>
> 2. Does the proposal of having a new DSL backend in the future (e.g.
> stratosphere as suggested elsewhere) make you:
> -- propose mahout-stratosphere as a different top level project?
> -- worry that stratosphere would be a dependency to Mahout?
> -- worry that you won't be able to say "Future of Mahout is Spark .. but it
> also supports stratosphere"?
> -- worry that as a user/committer/contributor you have to worry about a new
> framework?
> -- resist having a DSL backend for stratosphere because Hadoop vendors may
> not support it?
>
> Obviously no, since they are all just different DSL backends.
>
> Have you guys embraced the idea that the DSL allows for multiple backends
> (Spark being the first to get implemented)? or Not? Hence I do not
> understand the "problem" here.
>
> Thanks
>
> On Mon, Apr 28, 2014 at 1:29 PM, Saikat Kanjilal <[email protected]> wrote:
>
>> I would echo Pat's sentiments spot on related to the goal of supporting
>> both spark and H2O confusing folks that are interested in using, committing
>> to and trying to understand where Mahout is headed small to medium term.
>> I hate to throw this out but given the amount of "sometimes not so nice
>> back and forths I've seen on issue 1500" I really wonder whether we should
>> have mahout-spark and mahout-h2o as two different top level projects
>> potentially supporting a different set of algorithms underneath, yes I know
>> tying mahout to a particular technology goes against the initial vision
>> but given the churn I'm seeing I'm not sure I understand what the current
>> vision even is :)
>>
>>> Subject: Re: Straw poll re: H2O ?
>>> From: [email protected]
>>> Date: Mon, 28 Apr 2014 13:17:03 -0700
>>> CC: [email protected]
>>> To: [email protected]
>>>
>>> I haven’t heard a good explanation of what this project is. There should
>>> be some small step like implementing an algo on h2o that takes the same input
>>> as a current Hadoop Mahout job and produce the same result or do one not
>>> already in Mahout. At least it will answer some technical questions and
>>> shouldn’t take a lot of support from current committers to produce.
>>>
>>> I’m still not convinced that this is the primary thing that should drive
>>> making it a Mahout dependency.
>>>
>>> I’m highly dubious of actively supporting and working on Mahout for
>>> Spark and h2o. Not for technical reasons but because rebooting Mahout on
>>> two platforms seems a non-starter. No project manager in the commercial
>>> world would allow that sort of thing. And rightly so, it confuses users,
>>> committers, contributors. You shouldn’t have a great deal of redundancy or
>>> competing efforts _inside_ a project even an open source one. That’s for
>>> separate projects and the incubator, right? There are plenty of examples of
>>> going that route, Spark itself is redundant with Hadoop in many ways. Would
>>> Apache accept h2o as a parallel project to Spark, if so why not do that?
>>>
>>> Question: Where do we (Mahout user, committer, contributor) invest
>>> extremely precious time learning new languages, frameworks, architecture,
>>> configurations, optimizations?
>>>
>>> Answer: Many will simply not choose but wait and see, or go elsewhere.
>>>
>>> Why? Because we fail to communicate “the future of Mahout is Spark
>>> first—period” It keeps coming out "Spark and, well, h2o too”
>>>
>>> That is a momentum killer. If we’re agreed on “Spark first” then
>>> there’s no need to incubate Mahout 2, Spark and Mahout have already gone
>>> through that and though Dmitriy’s DSL and Scala shell work is entirely new,
>>> to the end user the jobs, input and output, and functionality will look
>>> like a v2. People dealing with internals will see a different world but
>>> they should be a minority of users and will hopefully like what they see.
>>>
>>>
>>> Somewhat off subject notes on external politics:
>>>
>>> We really need to make sure Mahout stays in all the big distros. That
>>> means Sebastian’s comments are spot on: "The best way to help Mahout is to
>>> pick up some of the work that needs to be done with regards to
>>> documentation, examples, Hadoop 2 compatibility and designing the future,
>>> especially with regards to dataframes” All the distros are hadoop 2.
>>>
>>> Incubating Mahout 2 as another project is surely a way out of the
>>> distros, another momentum killer.
>>>
>>> Another political question is whether an h2o dependency would be an
>>> issue to the distros. If we are going to put big efforts into h2o let’s see
>>> how that plays out first. Spark is already supported by them, even
>>> Hortonworks has taken a first step with 2.1. If Mahout is in a distro the
>>> distro will be asked to support it, that’s what they are paid for. Do they
>>> want to support h2o? I have no idea how they would react to that but it
>>> affects Mahout.
>>>
>>>
>>> For all these reasons I’d be -1 to any big-bang integration.
>>>
>>>
>>>> On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>> +1. I don't think anyone said anything, privately or publicly, about h2o
>>>> integration being a bad idea. It's just there's more than one way to do it,
>>>> so debate is focusing on exploration of pluses and minuses of each
>>>> individual proposal (as they come to light). Part of difficulty here was
>>>> that the expertise intersection of all parts being connected and integrated
>>>> has been pretty poor on individual basis. So we have to go by scenarios
>>>> where a group of specialized experts tries to figure out the solution.
>>>>
>>>> w.r.t. incubation proposals, it seems dubious for a number of reasons.
>>>>
>>>> Reason 1 is that these projects are the primary factor moving Mahout
>>>> anywhere forward. Without them, given "bye-bye mapreduce" jira, there's
>>>> frankly not much left in Mahout, so it is reflection of more or less common
>>>> opinion that the project would just spiral down on its own if the things
>>>> stay status-quo.
>>>>
>>>> Reason 2 is that there are good (not irreplaceable, but good) components in
>>>> Mahout that these efforts depend on. Therefore, incubation would be faced
>>>> with a perspective of having dependencies on a project that on its own is
>>>> winding down. Not good for incubation side.
>>>>
>>>> Reason 3 is that current effort is (IMO) minimalistic enough not to warrant
>>>> a new project. It simply doesn't, and can't have the scale of things like
>>>> Spark or Hadoop eco. There would be just not enough substance for a new
>>>> project at this point. I don't feel very strong about this point though.
>>>>
>>>>
>>>> On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <[email protected]> wrote:
>>>>
>>>>> We all should calm down here and remind ourselves why we are doing this
>>>>> whole thing: Because we love open source and want to have a vibrant
>>>>> community and a great piece of software.
>>>>>
>>>>> Mahout has come a long way and is at a crossroads right now, so it's only
>>>>> natural that there are heated discussions. But, we should immediately stop
>>>>> the fingerpointing and related stuff, we have managed to avoid this since
>>>>> Mahout's inception and we should continue to do so.
>>>>>
>>>>> The best way to help Mahout is to pick up some of the work that needs to
>>>>> be done with regards to documentation, examples, Hadoop 2 compatibility and
>>>>> designing the future, especially with regards to dataframes e.g.
>>>>>
>>>>> We agreed to give the H2O guys a shot for exploration of a possible
>>>>> integration into Mahout. We should be grateful that they are investing a
>>>>> lot of time into this, and should help wherever we can. Once they come up
>>>>> with a concrete proposal or patch, we will have a look at it, have a deep,
>>>>> technical and polite discussion, and make a decision afterwards.
>>>>>
>>>>> --sebastian
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 04/28/2014 07:42 PM, Anand Avati wrote:
>>>>>
>>>>>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <[email protected]> wrote:
>>>>>>
>>>>>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly what
>>>>>>> is encouraging the h2o work to be done a bit apart. It simply isn't
>>>>>>> efficient to have to answer so many off-topic points whenever any reports
>>>>>>> on work in progress are given.
>>>>>>>
>>>>>>>>
>>>>>>>> I think this has been the off-topic here.
>>>>>>>>
>>>>>>>> Calling my comments "emotional" or "non-technical", or _loosely_
>>>>>>> paraphrasing me.
>>>>>>>
>>>>>>> Yes, the personal finger-pointing parts don't belong and don't
>>>>>>> convince anyone, let's skip those.
>>>>>> +1. Let's skip those.
>>>>>>
>>>>>>
>>>>>> From the sidelines, I see a bunch of work intended for Mahout
>>>>>>
>>>>>>> proceeding outside the community such as it is, and even Apache. Of
>>>>>>> course, contributions are always prepped externally to some degree. I
>>>>>>> create, debug, change patches before posting them, maybe checking in
>>>>>>> early on choices that others may want input on.
>>>>>>>
>>>>>>> This is a large-ish change being proposed, IIUC. I can see one person
>>>>>>> who publicly, and at least two who privately, have clear reservations
>>>>>>> about this direction.
>>>>>>
>>>>>>
>>>>>> It will probably be a large-ish change, indeed. But my personal take is
>>>>>> that non-technical aspects of the debate are unfortunately taking
>>>>>> precedence over real technical parts. Please refer to email thread "Mahout
>>>>>> DSL vs Spark".
>>>>>>
>>>>>>
>>>>>>
>>>>>> It certainly appears funny vis-a-vis the "Apache
>>>>>>> way" to work on a contribution *because* one (or more) other
>>>>>>> committers aren't convinced.
>>>>>> As mentioned in the referred email thread, a lot of the technical issues
>>>>>> which got addressed in the work which was carried out outside of Apache,
>>>>>> was really sorting out and highlighting build and classloader related
>>>>>> challenges on the H2O side. There was little motivation to carry out those
>>>>>> discussions on the Mahout lists as it was really ~99% H2O specific
>>>>>> discussions and noise/spam to the Mahout community.
>>>>>>
>>>>>> I don't think that's important to dither about. What is, is this: if a
>>>>>>
>>>>>>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
>>>>>>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
>>>>>>> yet, but it seems like a quite possible outcome.
>>>>>> As an outsider, my opinion is that the proposed need for a VOTE is a
>>>>>> largely masqueraded problem built around the perception of disagreement
>>>>>> over something vague, abstract and inaccurate. And therefore premature.
>>>>>> That being said the PMC may vote on any issues/non-issues it may please.
>>>>>>
>>>>>> Would be a shame to do a lot of work, intending it for a commit, and
>>>>>>
>>>>>>> then find there is not consensus.
>>>>>> Exactly the kind of inaccurate perception I meant. While we are (at least I
>>>>>> am) exploring the best fit model for integration, and exploration by
>>>>>> definition involves taking potentially wrong steps and backtracking if
>>>>>> necessary, the perception unfortunately seems to be that the proposed
>>>>>> intermediate (potentially wrong) steps are some kind of pre-decided plan of
>>>>>> action. So, no, there WOULDN'T be a lot of work intended for a commit
>>>>>> against consensus.
>>>>>>
>>>>>> So is it better to figure out earlier than later whether these 2+
>>>>>>
>>>>>>> parallel tracks have enough commonality to coexist?
>>>>>>
>>>>>>
>>>>>> Whether two parallel tracks (I assume the spark track and the H2O track?)
>>>>>> have enough commonality to exist - one way you surely cannot get the right
>>>>>> answer for this (except by co-incidence) is by taking a vote from a group
>>>>>> who are experts in only either one of those tracks. From what I see, most
>>>>>> of the opposition has been due to a combination of lack of understanding of
>>>>>> H2O and (welcome) skepticism. If, as a contributor, I find there is no
>>>>>> natural or beneficial way to co-exist with Spark, I wouldn't waste my time
>>>>>> writing code, and for sure am not dependent on another group's vote to make
>>>>>> that decision for me.
>>>>>>
>>>>>> Avati
>
