On Mon, Apr 28, 2014 at 2:45 PM, Pat Ferrel <[email protected]> wrote:
> I favor iterative dev (implement quick then refactor to perfection) and so no I would not try to make the DSL or specific imports support more than one backend _until_ we support one. Doing two at once? I’ve seen too many projects that fail because of this. Why not tackle equal support for multiple backends as a refactoring task after one backend is fully implemented?
>
> Dmitriy is free to do as he wishes, of course, but he is spending an awful lot of time dealing with this.
>
> Multiple backends sounds fine in principle but the need has not been demonstrated, therefore it seems like a question to be deferred.

I'm explicitly staying away from responding to these points before we rat-hole down a discussion of best principles and opinions. As you say, it is best deferred till we have demonstrable work.

> Have you considered doing a much scaled back, single algo implementation as an example? It wouldn’t require changing the DSL or any code already in Mahout and would add something useful quickly. All you need is file level compatibility. Besides it might go towards demonstrating the need.

The initial approach of allowing algos to be implemented in Java did follow this path. There is an example KMeans implemented in Java in "Mahout style" (https://github.com/tdunning/h2o-matrix/blob/master/src/main/java/ai/h2o/algo/KMeans.java) that works, even in a distributed way. But the focus now has been to do something similar through the DSL rather than Java lang, which is really the interesting part of the integration.

I can only hope that the entire effort does not get prematurely dismissed or killed purely on a matter of principle!

Thanks

On Apr 28, 2014, at 2:13 PM, Anand Avati <[email protected]> wrote:
>
> Saikat, Pat,
>
> For background, please refer to the "Mahout DSL vs Spark" discussion for the general direction in which the integration is being explored.
> With that background, I would like to present some counter questions:
>
> 1. Why is the DSL claiming to have (in its vision) logical vs physical separation if not for providing multiple compute backends?
>
> 2. Does the proposal of having a new DSL backend in the future (e.g. stratosphere as suggested elsewhere) make you:
> -- propose mahout-stratosphere as a different top level project?
> -- worry that stratosphere would be a dependency to Mahout?
> -- worry that you won't be able to say "Future of Mahout is Spark .. but it also supports stratosphere"?
> -- worry that as a user/committer/contributor you have to worry about a new framework?
> -- resist having a DSL backend for stratosphere because Hadoop vendors may not support it?
>
> Obviously no, since they are all just different DSL backends.
>
> Have you guys embraced the idea that the DSL allows for multiple backends (Spark being the first to get implemented)? or Not? Hence I do not understand the "problem" here.
>
> Thanks
>
> On Mon, Apr 28, 2014 at 1:29 PM, Saikat Kanjilal <[email protected]> wrote:
>
> > I would echo Pat's sentiments spot on related to the goal of supporting both spark and H2O confusing folks that are interested in using, committing to and trying to understand where Mahout is headed small to medium term. I hate to throw this out but given the amount of "sometimes not so nice back and forths I've seen on issue 1500" I really wonder whether we should have mahout-spark and mahout-h2o as two different top level projects potentially supporting a different set of algorithms underneath, yes I know tying mahout to a particular technology goes against the initial vision but given the churn I'm seeing I'm not sure I understand what the current vision even is :)
> >
> >> Subject: Re: Straw poll re: H2O ?
> >> From: [email protected]
> >> Date: Mon, 28 Apr 2014 13:17:03 -0700
> >> CC: [email protected]
> >> To: [email protected]
> >>
> >> I haven’t heard a good explanation of what this project is. There should be some small step like implementing an algo on h2o that takes the same input as a current Hadoop Mahout job and produces the same result, or doing one not already in Mahout. At least it will answer some technical questions and shouldn’t take a lot of support from current committers to produce.
> >>
> >> I’m still not convinced that this is the primary thing that should drive making it a Mahout dependency.
> >>
> >> I’m highly dubious of actively supporting and working on Mahout for Spark and h2o. Not for technical reasons but because rebooting Mahout on two platforms seems a non-starter. No project manager in the commercial world would allow that sort of thing. And rightly so, it confuses users, committers, contributors. You shouldn’t have a great deal of redundancy or competing efforts _inside_ a project even an open source one. That’s for separate projects and the incubator, right? There are plenty of examples of going that route, Spark itself is redundant with Hadoop in many ways. Would Apache accept h2o as a parallel project to Spark, if so why not do that?
> >>
> >> Question: Where do we (Mahout user, committer, contributor) invest extremely precious time learning new languages, frameworks, architecture, configurations, optimizations?
> >>
> >> Answer: Many will simply not choose but wait and see, or go elsewhere.
> >>
> >> Why? Because we fail to communicate “the future of Mahout is Spark first, period”. It keeps coming out “Spark and, well, h2o too”
> >>
> >> That is a momentum killer.
> >> If we’re agreed on “Spark first” then there’s no need to incubate Mahout 2, Spark and Mahout have already gone through that and though Dmitriy’s DSL and Scala shell work is entirely new, to the end user the jobs, input and output, and functionality will look like a v2. People dealing with internals will see a different world but they should be a minority of users and will hopefully like what they see.
> >>
> >>
> >> Somewhat off subject notes on external politics:
> >>
> >> We really need to make sure Mahout stays in all the big distros. That means Sebastian’s comments are spot on: “The best way to help Mahout is to pick up some of the work that needs to be done with regards to documentation, examples, Hadoop 2 compatibility and designing the future, especially with regards to dataframes” All the distros are hadoop 2.
> >>
> >> Incubating Mahout 2 as another project is surely a way out of the distros, another momentum killer.
> >>
> >> Another political question is whether an h2o dependency would be an issue to the distros. If we are going to put big efforts into h2o let’s see how that plays out first. Spark is already supported by them, even Hortonworks has taken a first step with 2.1. If Mahout is in a distro the distro will be asked to support it, that’s what they are paid for. Do they want to support h2o? I have no idea how they would react to that but it affects Mahout.
> >>
> >>
> >> For all these reasons I’d be -1 to any big-bang integration.
> >>
> >>
> >>> On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>
> >>> +1. I don't think anyone said anything, privately or publicly, about h2o integration being a bad idea. It's just there's more than one way to do it, so debate is focusing on exploration of pluses and minuses of each individual proposal (as they come to light).
> >>> Part of difficulty here was that the expertise intersection of all parts being connected and integrated has been pretty poor on individual basis. So we have to go by scenarios where a group of specialized experts tries to figure out the solution.
> >>>
> >>> w.r.t. incubation proposals, it seems dubious for a number of reasons.
> >>>
> >>> Reason 1 is that these projects are the primary factor moving Mahout anywhere forward. Without them, given "bye-bye mapreduce" jira, there's frankly not much left in Mahout, so it is reflection of more or less common opinion that the project would just spiral down on its own if the things stay status-quo.
> >>>
> >>> Reason 2 is that there are good (not irreplaceable, but good) components in Mahout that these efforts depend on. Therefore, incubation would be faced with a prospect of having dependencies on project that on its own is winding down. Not good for incubation side.
> >>>
> >>> Reason 3 is that current effort is (IMO) minimalistic enough not to warrant a new project. It simply doesn't, and can't have the scale of things like Spark or Hadoop eco. There would be just not enough substance for a new project at this point. I don't feel very strong about this point though.
> >>>
> >>>
> >>> On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <[email protected]> wrote:
> >>>
> >>>> We all should calm down here and remind ourselves why we are doing this whole thing: Because we love open source and want to have a vibrant community and a great piece of software.
> >>>>
> >>>> Mahout has come a long way and is at a crossroads right now, so it's only natural that there are heated discussions. But, we should immediately stop the fingerpointing and related stuff, we have managed to avoid this since Mahout's inception and we should continue to do so.
> >>>>
> >>>> The best way to help Mahout is to pick up some of the work that needs to be done with regards to documentation, examples, Hadoop 2 compatibility and designing the future, especially with regards to dataframes e.g.
> >>>>
> >>>> We agreed to give the h2O guys a shot for exploration of a possible integration into Mahout. We should be grateful that they are investing a lot of time into this, and should help wherever we can. Once they come up with a concrete proposal or patch, we will have a look at it, have a deep, technical and polite discussion, and make a decision afterwards.
> >>>>
> >>>> --sebastian
> >>>>
> >>>>
> >>>> On 04/28/2014 07:42 PM, Anand Avati wrote:
> >>>>
> >>>>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <[email protected]> wrote:
> >>>>>
> >>>>>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA) <[email protected]> wrote:
> >>>>>>
> >>>>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly what is encouraging the h2o work to be done a bit apart. It simply isn't efficient to have to answer so many off-topic points whenever any reports on work in progress are given.
> >>>>>>>
> >>>>>>> I think this has been the off-topic here.
> >>>>>>>
> >>>>>>> Calling my comments "emotional" or "non-technical", or _loosely_ paraphrasing me.
> >>>>>>
> >>>>>> Yes, the personal finger-pointing parts don't belong and don't convince anyone, let's skip those.
> >>>>>
> >>>>> +1. Let's skip those.
> >>>>>
> >>>>>> From the sidelines, I see a bunch of work intended for Mahout proceeding outside the community such as it is, and even Apache. Of course, contributions are always prepped externally to some degree.
> >>>>>> I create, debug, change patches before posting them, maybe checking in early on choices that others may want input on.
> >>>>>>
> >>>>>> This is a large-ish change being proposed, IIUC. I can see one person who publicly, and at least two who privately, have clear reservations about this direction.
> >>>>>
> >>>>> It will probably be a large-ish change, indeed. But my personal take is that non-technical aspects of the debate are unfortunately taking precedence over real technical parts. Please refer to email thread "Mahout DSL vs Spark".
> >>>>>
> >>>>>> It certainly appears funny vis-a-vis the "Apache way" to work on a contribution *because* one (or more) other committers aren't convinced.
> >>>>>
> >>>>> As mentioned in the referred email thread, a lot of the technical issues which got addressed in the work which was carried out outside of Apache, was really sorting out and highlighting build and classloader related challenges on the H2O side. There was little motivation to carry out those discussions on the Mahout lists as it was really ~99% H2O specific discussions and noise/spam to the Mahout community.
> >>>>>
> >>>>>> I don't think that's important to dither about. What is, is this: if a big-bang patch landed tomorrow, I wonder if it would pass a VOTE? Nobody can pre-judge his/her opinion on a proposal that's not tabled yet, but it seems like a quite possible outcome.
> >>>>>
> >>>>> As an outsider, my opinion is that the proposed need for a VOTE is a largely masqueraded problem built around the perception of disagreement over something vague, abstract and inaccurate. And therefore premature. That being said the PMC may vote on any issues/non-issues it may please.
> >>>>>
> >>>>>> Would be a shame to do a lot of work, intending it for a commit, and then find there is not consensus.
> >>>>>
> >>>>> Exactly the kind of inaccurate perception I meant. While we are (at least I am) exploring the best fit model for integration, and exploration by definition involves taking potentially wrong steps and backtracking if necessary, the perception unfortunately seems to be that the proposed intermediate (potentially wrong) steps are some kind of pre-decided plan of action. So, no, there WOULDN'T be a lot of work intended for a commit against consensus.
> >>>>>
> >>>>>> So is it better to figure out earlier than later whether these 2+ parallel tracks have enough commonality to coexist?
> >>>>>
> >>>>> Whether two parallel tracks (I assume the spark track and the H2O track?) have enough commonality to exist - one way you surely cannot get the right answer for this (except by co-incidence) is by taking a vote from a group who are experts in only either one of those tracks. From what I see, most of the opposition has been due to a combination of lack of understanding of H2O and (welcome) skepticism. If, as a contributor, I find there is no natural or beneficial way to co-exist with Spark, I wouldn't waste my time writing code, and for sure am not dependent on another group's vote to make that decision for me.
> >>>>>
> >>>>> Avati
