Re: Straw poll re: H2O ?

Pat Ferrel Wed, 30 Apr 2014 07:44:27 -0700

On Apr 30, 2014, at 7:06 AM, Ted Dunning <[email protected]> wrote:


The motivation to contribute comes from h2o.

My motivation to accept comes from the fact that they have machine learning
codes that are as fast as what google has internally.  They completely
crush all of the spark efforts on speed.

The sample bit was more to experiment with ways to bring the mahout frame
of mind together with the h2o implementation capabilities.  The real
contribution will be linked to the DSL work, not the old API.




On Wed, Apr 30, 2014 at 3:52 PM, Pat Ferrel <[email protected]> wrote:

> Where is the motivation to integrate now coming from? A sample
> bit—Kmeans—was integrated with Mahout-ish input? How did that stack up to
> say MLlib?
> 
> 
> On Apr 30, 2014, at 2:36 AM, Ellen Friedman <[email protected]>
> wrote:
> 
> I am weighing in here on issues of great concern but non-technical.
> 
> 1. One of the great things about Mahout is the community – not an easy
> thing to have achieved given that people are dispersed geographically
> and there is no single focus or company backing the project. In short,
> the people who make Mahout are doing something cool.
> 
> Suggestions to try to break it into different groups, Mahout-Spark and
> Mahout2o, run counter to this success. Why fragment it at exactly the
> moment when new contributors (from 0xdata) are coming forward? The
> spirit of this project has been inclusive. Let's not change that now.
> 
> 2. Sebastian pointed out:
> 
> "We agreed to give the h2O guys a shot for exploration of a possible
> integration into Mahout. We should be grateful that they are investing
> a lot of time into this, and should help wherever we can. Once they
> come up with a concrete proposal or patch, we will have a look at it,
> have a deep, technical and polite discussion, and make a decision
> afterwards."
> 
> +1
> 
> We agreed to explore the h2o option. Why use lots of time and
> energy in re-visiting and second guessing that decision? Let it go
> forward, likely some great things will emerge for Mahout, and if not,
> then we say "thank you" to h2o contributors for giving it a try.
> 
> As the guys from h2o are adding new resources to do this development,
> it is not really detracting anything from Mahout's resources except
> when someone opens one of these discussions that lead to fragmentation
> and distraction. I'm not a coder and not as technical as any of you,
> but from my view it seems to be the talk and not the development that
> is distracting.
> 
> 3. Over the last year, there has been growing and widespread interest
> in Mahout from the outside world, and now, with the new changes to
> support Scala, Spark and h2o (possibly Stratosphere later) the growing
> interest has turned into excitement. This is a great time for the
> project – tons of effort but moving toward a big result.
> 
> Users will have some excellent new choices, all parts of Mahout will
> benefit. And if in the future it is seen that some of the new features
> are not being widely or successfully used, they will be deprecated, as
> was done during the big clean-up of the 0.8 release. New choices, new
> ways to use Mahout, new people getting involved – this is excellent.
> 
> 4. My thought is, stick together, embrace change, welcome newcomers
> and be very proud to be building the new Mahout.
> 
> 
> 
> On 4/29/14, Sebastian Schelter <[email protected]> wrote:
>> For reasons of transparency in this discussion, I should add that I am a
>> committer on the upcoming Stratosphere ASF podling, co-worker of the
>> main developers and have contributed to it as part of my PhD.
>> 
>> On 04/29/2014 09:23 PM, Sebastian Schelter wrote:
>>> Anand,
>>> 
>>> I'm trying to answer some of your questions, and my answers highlight
>>> the points that I would like to see clarified about h2o.
>>> 
>>> On 04/28/2014 11:13 PM, Anand Avati wrote:
>>> 
>>>> 1. Why is the DSL claiming to have (in its vision) logical vs physical
>>>> separation if not for providing multiple compute backends?
>>> 
>>> This is not a claim or a vision, the DSL already has this separation.
>>> Take for example o.a.m.sparkbindings.drm.plan.OpAtA, that's the logical
>>> operator for executing a Transpose-Times-Self matrix multiplication. In
>>> o.a.m.sparkbindings.blas.AtA you will find two physical operator
>>> implementations for that. The choice of which one to use depends on whether
>>> there is enough memory to hold certain intermediary results in memory.
>>> 
>>> The primary intention of a separation into logical and physical
>>> operators is to allow for a declarative programming style on the user's
>>> side and for an optimizer on the system side which automatically chooses
>>> the optimal physical operator for the execution of a specific program.
>>> 
>>> This choice of the physical operator might depend on the shape and
>>> amount of the data processed as well as on the underlying available
>>> resources. *The separation into logical and physical operators clearly
>>> doesn't imply to have multiple backends*. It only makes it very easy to
>>> support them.
>>> 
>>>> 
>>>> 2. Does the proposal of having a new DSL backend in the future (e.g.
>>>> stratosphere as suggested elsewhere) make you:
>>> 
>>>> -- worry that stratosphere would be a dependency to Mahout?
>>> 
>>> Stratosphere has been accepted as an incubator project in the ASF
>>> recently, so the worry about such a dependency is naturally less than
>>> about an externally managed project like h2o.
>>> 
>>>> -- worry that as a user/committer/contributor you have to worry about a
>>>> new
>>>> framework?
>>> 
>>> In my eyes, there is a big difference between Spark/Stratosphere and
>>> h2o. Spark and Stratosphere have a clearly defined programming and
>>> execution model. They execute programs that are composed of a DAG of
>>> operators. The set of operators has clearly defined semantics and
>>> parallelization strategies. If you compare their operators, you will
>>> find that they offer pretty much the same in slightly different flavors.
>>> For both, there are scientific papers that in detail explain all these
>>> things.
>>> 
>>> I have asked about a detailed description of h2o's programming model and
>>> execution model and I searched the documentation, but I haven't been
>>> able to find something that clearly describes how things are done. I
>>> would love to read up on this, but until I'm presented with this, I have
>>> to assume that such a principled foundation is missing.
>>> 
>>> 
>>> --sebastian
>>> 
>> 
>> 
> 
>
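
Sebastian's point about logical versus physical operators can be illustrated
with a small sketch. This is not Mahout code: only the name OpAtA is borrowed
from the thread, while the two physical functions, the memory-budget rule, and
the helper names are hypothetical and assume a non-empty dense matrix stored
as a list of rows. One logical operator states what to compute (A' * A), and a
tiny "optimizer" picks one of two physical implementations on a single backend.

```python
from dataclasses import dataclass
from typing import Callable, List

Matrix = List[List[float]]


@dataclass(frozen=True)
class OpAtA:
    """Logical operator: states *what* to compute (A' * A), not how."""
    a: Matrix


def ata_dense(a: Matrix) -> Matrix:
    """Physical operator 1 (hypothetical): materialize A' and multiply."""
    n_cols = len(a[0])
    a_t = [[row[j] for row in a] for j in range(n_cols)]
    return [
        [sum(x * y for x, y in zip(a_t[i], a_t[j])) for j in range(n_cols)]
        for i in range(n_cols)
    ]


def ata_row_outer(a: Matrix) -> Matrix:
    """Physical operator 2 (hypothetical): sum per-row outer products.

    Only the n_cols x n_cols accumulator is held besides the input rows.
    """
    n_cols = len(a[0])
    acc = [[0.0] * n_cols for _ in range(n_cols)]
    for row in a:
        for i, x in enumerate(row):
            for j, y in enumerate(row):
                acc[i][j] += x * y
    return acc


def choose_physical(op: OpAtA, memory_budget_cells: int) -> Callable[[Matrix], Matrix]:
    """Toy optimizer: the selection rule here is an assumption for illustration."""
    cells_needed = len(op.a) * len(op.a[0])  # size of the transposed copy
    return ata_dense if cells_needed <= memory_budget_cells else ata_row_outer


def execute(op: OpAtA, memory_budget_cells: int) -> Matrix:
    """Run the logical operator with whichever physical operator was chosen."""
    return choose_physical(op, memory_budget_cells)(op.a)
```

Calling execute(OpAtA(a), budget) with a large or a small budget returns the
same matrix through different code paths, which is the sense in which the
separation does not by itself imply multiple backends.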
