My main question at this point (and I realize this has been discussed in parts in multiple places, but not in enough detail) is: given that the user stays in DSL land, when would I use one backend versus another, and how easy is it to get up and running with one backend versus the other? Keenly following all discussions in the meantime.
> Date: Mon, 28 Apr 2014 14:13:35 -0700
> Subject: Re: Straw poll re: H2O ?
> From: [email protected]
> To: [email protected]
> CC: [email protected]
>
> Saikat, Pat,
>
> For background, please refer to the "Mahout DSL vs Spark" discussion for the general direction in which the integration is being explored. With that background, I would like to present some counter questions:
>
> 1. Why is the DSL claiming to have (in its vision) logical vs physical separation if not for providing multiple compute backends?
>
> 2. Does the proposal of having a new DSL backend in the future (e.g. stratosphere as suggested elsewhere) make you:
> -- propose mahout-stratosphere as a different top level project?
> -- worry that stratosphere would be a dependency to Mahout?
> -- worry that you won't be able to say "Future of Mahout is Spark .. but it also supports stratosphere"?
> -- worry that as a user/committer/contributor you have to worry about a new framework?
> -- resist having a DSL backend for stratosphere because Hadoop vendors may not support it?
>
> Obviously no, since they are all just different DSL backends.
>
> Have you guys embraced the idea that the DSL allows for multiple backends (Spark being the first to get implemented)? or not? Hence I do not understand the "problem" here.
>
> Thanks
>
> On Mon, Apr 28, 2014 at 1:29 PM, Saikat Kanjilal <[email protected]> wrote:
>
> > I would echo Pat's sentiments spot on related to the goal of supporting both spark and H2O confusing folks that are interested in using, committing to and trying to understand where Mahout is headed small to medium term.
> > I hate to throw this out but given the amount of "sometimes not so nice back and forths I've seen on issue 1500" I really wonder whether we should have mahout-spark and mahout-h2o as two different top level projects potentially supporting a different set of algorithms underneath, yes I know tying mahout to a particular technology goes against the initial vision but given the churn I'm seeing I'm not sure I understand what the current vision even is :)
> >
> > > Subject: Re: Straw poll re: H2O ?
> > > From: [email protected]
> > > Date: Mon, 28 Apr 2014 13:17:03 -0700
> > > CC: [email protected]
> > > To: [email protected]
> > >
> > > I haven’t heard a good explanation of what this project is. There should be some small step like implementing an algo on h2o to take the same input as a current Hadoop Mahout job and produce the same result or do one not already in Mahout. At least it will answer some technical questions and shouldn’t take a lot of support from current committers to produce.
> > >
> > > I’m still not convinced that this is the primary thing that should drive making it a Mahout dependency.
> > >
> > > I’m highly dubious of actively supporting and working on Mahout for Spark and h2o. Not for technical reasons but because rebooting Mahout on two platforms seems a non-starter. No project manager in the commercial world would allow that sort of thing. And rightly so, it confuses users, committers, contributors. You shouldn’t have a great deal of redundancy or competing efforts _inside_ a project even an open source one. That’s for separate projects and the incubator, right? There are plenty of examples of going that route, Spark itself is redundant with Hadoop in many ways. Would Apache accept h2o as a parallel project to Spark, if so why not do that?
> > >
> > > Question: Where do we (Mahout user, committer, contributor) invest extremely precious time learning new languages, frameworks, architecture, configurations, optimizations?
> > >
> > > Answer: Many will simply not choose but wait and see, or go elsewhere.
> > >
> > > Why? Because we fail to communicate “the future of Mahout is Spark first, period”. It keeps coming out “Spark and, well, h2o too”.
> > >
> > > That is a momentum killer. If we’re agreed on “Spark first” then there’s no need to incubate Mahout 2, Spark and Mahout have already gone through that and though Dmitriy’s DSL and Scala shell work is entirely new, to the end user the jobs, input and output, and functionality will look like a v2. People dealing with internals will see a different world but they should be a minority of users and will hopefully like what they see.
> > >
> > >
> > > Somewhat off subject notes on external politics:
> > >
> > > We really need to make sure Mahout stays in all the big distros. That means Sebastian’s comments are spot on: “The best way to help Mahout is to pick up some of the work that needs to be done with regards to documentation, examples, Hadoop 2 compatibility and designing the future, especially with regards to dataframes”. All the distros are hadoop 2.
> > >
> > > Incubating Mahout 2 as another project is surely a way out of the distros, another momentum killer.
> > >
> > > Another political question is whether an h2o dependency would be an issue to the distros. If we are going to put big efforts into h2o let’s see how that plays out first. Spark is already supported by them, even Hortonworks has taken a first step with 2.1. If Mahout is in a distro the distro will be asked to support it, that’s what they are paid for. Do they want to support h2o? I have no idea how they would react to that but it affects Mahout.
> > >
> > >
> > > For all these reasons I’d be -1 to any big-bang integration.
> > >
> > >
> > > > On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >
> > > > +1. I don't think anyone said anything, privately or publicly, about h2o integration being a bad idea. It's just there's more than one way to do it, so debate is focusing on exploration of pluses and minuses of each individual proposal (as they come to light). Part of difficulty here was that the expertise intersection of all parts being connected and integrated has been pretty poor on individual basis. So we have to go by scenarios where a group of specialized experts tries to figure out the solution.
> > > >
> > > > w.r.t. incubation proposals, it seems dubious for a number of reasons.
> > > >
> > > > Reason 1 is that these projects are the primary factor moving Mahout anywhere forward. Without them, given "bye-bye mapreduce" jira, there's frankly not much left in Mahout, so it is reflection of more or less common opinion that the project would just spiral down on its own if the things stay status-quo.
> > > >
> > > > Reason 2 is that there are good (not irreplaceable, but good) components in Mahout that these efforts depend on. Therefore, incubation would be faced with a perspective of having dependencies on project that on its own is winding down. Not good for incubation side.
> > > >
> > > > Reason 3 is that current effort is (IMO) minimalistic enough not to warrant a new project. It simply doesn't, and can't have the scale of things like Spark or Hadoop eco. There would be just not enough substance for a new project at this point. I don't feel very strong about this point though.
> > > >
> > > >
> > > > On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <[email protected]> wrote:
> > > >
> > > >> We all should calm down here and remind ourselves why we are doing this whole thing: Because we love open source and want to have a vibrant community and a great piece of software.
> > > >>
> > > >> Mahout has come a long way and is at a crossroads right now, so it's only natural that there are heated discussions. But, we should immediately stop the fingerpointing and related stuff, we have managed to avoid this since Mahout's inception and we should continue to do so.
> > > >>
> > > >> The best way to help Mahout is to pick up some of the work that needs to be done with regards to documentation, examples, Hadoop 2 compatibility and designing the future, especially with regards to dataframes e.g.
> > > >>
> > > >> We agreed to give the h2O guys a shot for exploration of a possible integration into Mahout. We should be grateful that they are investing a lot of time into this, and should help wherever we can. Once they come up with a concrete proposal or patch, we will have a look at it, have a deep, technical and polite discussion, and make a decision afterwards.
> > > >>
> > > >> --sebastian
> > > >>
> > > >>
> > > >> On 04/28/2014 07:42 PM, Anand Avati wrote:
> > > >>
> > > >>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <[email protected]> wrote:
> > > >>>
> > > >>>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA) <[email protected]> wrote:
> > > >>>>
> > > >>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly what is encouraging the h2o work to be done a bit apart. It simply isn't efficient to have to answer so many off-topic points whenever any reports on work in progress are given.
> > > >>>>>
> > > >>>>> I think this has been the off-topic here.
> > > >>>>>
> > > >>>>> Calling my comments "emotional" or "non-technical", or _loosely_ paraphrasing me.
> > > >>>>
> > > >>>> Yes, the personal finger-pointing parts don't belong and don't convince anyone, let's skip those.
> > > >>>
> > > >>> +1. Let's skip those.
> > > >>>
> > > >>>> From the sidelines, I see a bunch of work intended for Mahout proceeding outside the community such as it is, and even Apache. Of course, contributions are always prepped externally to some degree. I create, debug, change patches before posting them, maybe checking in early on choices that others may want input on.
> > > >>>>
> > > >>>> This is a large-ish change being proposed, IIUC. I can see one person who publicly, and at least two who privately, have clear reservations about this direction.
> > > >>>
> > > >>> It will probably be a large-ish change, indeed. But my personal take is that, non-technical aspects of the debate is unfortunately taking precedence over real technical parts. Please refer to email thread "Mahout DSL vs Spark".
> > > >>>
> > > >>>> It certainly appears funny vis-a-vis the "Apache way" to work on a contribution *because* one (or more) other committers aren't convinced.
> > > >>>
> > > >>> As mentioned in the referred email thread, a lot of the technical issues which got addressed in the work which was carried out outside of Apache, was really sorting out and highlighting build and classloader related challenges on the H2O side.
> > > >>> There was little motivation to carry out those discussions on the Mahout lists as it was really ~99% H2O specific discussions and noise/spam to the Mahout community.
> > > >>>
> > > >>>> I don't think that's important to dither about. What is, is this: if a big-bang patch landed tomorrow, I wonder if it would pass a VOTE? Nobody can pre-judge his/her opinion on a proposal that's not tabled yet, but it seems like a quite possible outcome.
> > > >>>
> > > >>> As an outsider, my opinion is that the proposed need for a VOTE is a largely masqueraded problem built around the perception of disagreement over something vague, abstract and inaccurate. And therefore premature. That being said the PMC may vote on any issues/non-issues it may please.
> > > >>>
> > > >>>> Would be a shame to do a lot of work, intending it for a commit, and then find there is not consensus.
> > > >>>
> > > >>> Exactly the kind of inaccurate perception I meant. While we are (at least I am) exploring the best fit model for integration, and exploration by definition involves taking potentially wrong steps and backtracking if necessary, the perception unfortunately seems to be that the proposed intermediate (potentially wrong) steps are some kind of pre-decided plan of action. So, no, there WOULDN'T be a lot of work intended for a commit against consensus.
> > > >>>
> > > >>>> So is it better to figure out earlier than later whether these 2+ parallel tracks have enough commonality to coexist?
> > > >>>
> > > >>> Whether two parallel tracks (I assume the spark track and the H2O track?)
> > > >>> have enough commonality to exist - one way you surely cannot get the right answer for this (except by co-incidence) is by taking a vote from a group who are experts in only either one of those tracks. From what I see, most of the opposition has been due to a combination of lack of understanding of H2O and (welcome) skepticism. If, as a contributor, I find there is no natural or beneficial way to co-exist with Spark, I wouldn't waste my time writing code, and for sure am not dependent on another group's vote to make that decision for me.
> > > >>>
> > > >>> Avati
