0xdata (now called H2O) is developing an integration with Spark in a
project called Sparkling Water [1]. It creates a new RDD type that can
connect to an H2O cluster and pass higher-order functions into the ML
flow for execution.
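
Roughly, the usage looks like this (a sketch based on the Sparkling
Water README; treat the class and method names as assumptions, since
the API is still evolving):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.h2o.H2OContext

  // Assumes the sparkling-water assembly jar is on the classpath.
  val sc = new SparkContext(new SparkConf().setAppName("sw-sketch"))
  val h2o = new H2OContext(sc).start()  // H2O nodes start inside the Spark executors

  // Publish an RDD to H2O so its ML flows can run over the data.
  val rdd = sc.parallelize(1 to 1000).map(i => (i, i * 2.0))
  val frame = h2o.asH2OFrame(rdd)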

The easiest way to use H2O is through its R binding [2][3], but I
think we would want to interact with H2O via its REST APIs [4].
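
For instance, listing the frames a running H2O node knows about should
be a plain HTTP GET (a minimal sketch; the host, port, and /3/Frames
endpoint are assumptions taken from the docs and worth double-checking):

  import java.net.{HttpURLConnection, URL}
  import scala.io.Source

  // 54321 is H2O's default REST port.
  val url = new URL("http://localhost:54321/3/Frames")
  val conn = url.openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("GET")
  val json = Source.fromInputStream(conn.getInputStream).mkString
  conn.disconnect()
  println(json)  // JSON description of the frames in the cluster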

- Henry

[1] https://github.com/h2oai/sparkling-water
[2] http://www.slideshare.net/anqifu1/big-data-science-with-h2o-in-r
[3] http://docs.h2o.ai/Ruser/rtutorial.html
[4] http://docs.h2o.ai/developuser/rest.html

On Wed, Jan 7, 2015 at 3:10 AM, Stephan Ewen <se...@apache.org> wrote:
> Thanks Henry!
>
> Do you know of a good source that gives pointers or examples of how
> to interact with H2O?
>
> Stephan
>
>
> On Sun, Jan 4, 2015 at 7:14 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> The idea to work with H2O sounds really interesting.
>>
>> In terms of the Mahout DSL this would mean that we have to translate a
>> Flink dataset into H2O's basic abstraction of distributed data and vice
>> versa. Everything other than writing to disk with one system and reading
>> from there with the other is probably non-trivial and hard to realize.
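>>
>> A minimal sketch of that disk-based baseline (assuming a shared
>> filesystem and H2O's /3/ImportFiles REST endpoint; both are
>> assumptions):
>>
>>   import org.apache.flink.api.scala._
>>   import java.net.{HttpURLConnection, URL}
>>
>>   val env = ExecutionEnvironment.getExecutionEnvironment
>>   val data = env.fromElements((1.0, 2.0), (3.0, 4.0))
>>
>>   // 1) Flink writes the dataset somewhere both systems can read.
>>   data.writeAsCsv("file:///shared/exchange/data.csv")
>>   env.execute("export for h2o")
>>
>>   // 2) Ask H2O over REST to import the same files.
>>   val url = new URL(
>>     "http://localhost:54321/3/ImportFiles?path=/shared/exchange/data.csv")
>>   val conn = url.openConnection().asInstanceOf[HttpURLConnection]
>>   conn.setRequestMethod("GET")
>>   println(conn.getResponseCode)  // 200 => H2O picked the files up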
>> On Jan 4, 2015 9:18 AM, "Henry Saputra" <henry.sapu...@gmail.com> wrote:
>>
>> > Happy new year all!
>> >
>> > I like the idea of adding an ML module to Flink.
>> >
>> > As I have mentioned to Kostas, Stephan, and Robert before, I would
>> > love to see if we could work with the H2O project [1]; it seems the
>> > community has already added support for it as an Apache Mahout
>> > backend binding [2].
>> >
>> > That way we might get some additional scalable ML algorithms, such
>> > as deep learning.
>> >
>> > Definitely would love to help with this initiative =)
>> >
>> > - Henry
>> >
>> > [1] https://github.com/h2oai/h2o-dev
>> > [2] https://issues.apache.org/jira/browse/MAHOUT-1500
>> >
>> > On Fri, Jan 2, 2015 at 6:46 AM, Stephan Ewen <se...@apache.org> wrote:
>> > > Hi everyone!
>> > >
>> > > Happy new year, first of all, and I hope you had a nice
>> > > end-of-the-year season.
>> > >
>> > > I thought that it is a good time now to officially kick off the
>> > > creation of a library of machine learning algorithms. There are a
>> > > lot of individual artifacts and algorithms floating around which we
>> > > should consolidate.
>> > >
>> > > The machine-learning library in Flink would stand on two legs:
>> > >
>> > >  - A collection of efficient implementations for common problems and
>> > > algorithms, e.g., regression (logistic), clustering (k-means, Canopy),
>> > > matrix factorization (ALS), ...
>> > >
>> > >  - An adapter to the linear algebra DSL in Apache Mahout.
>> > >
>> > > In the long run, the goal would be to be able to mix and match code
>> > > from both parts.
>> > > The linear algebra DSL is very convenient when it comes to quickly
>> > > composing an algorithm, or some custom pre- and post-processing steps.
>> > > For some complex algorithms, however, a low level system specific
>> > > implementation is necessary to make the algorithm efficient.
>> > > Being able to call the tailored algorithms from the DSL would
>> > > combine the benefits.
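>> > >
>> > > To make the convenience concrete: ordinary least squares composes in
>> > > a few lines of the DSL (a sketch; drmX/drmY stand for already-loaded
>> > > distributed matrices, and the imports follow the math-scala bindings):
>> > >
>> > >   import org.apache.mahout.math.scalabindings._
>> > >   import org.apache.mahout.math.drm._
>> > >   import RLikeOps._
>> > >   import RLikeDrmOps._
>> > >
>> > >   // Normal equations: beta = (X'X)^-1 X'y; the big products run
>> > >   // distributed, only the small results are collected in-core.
>> > >   val XtX = (drmX.t %*% drmX).collect
>> > >   val Xty = (drmX.t %*% drmY).collect(::, 0)
>> > >   val beta = solve(XtX, Xty)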
>> > >
>> > >
>> > > As a concrete initial step, I suggest doing the following:
>> > >
>> > > 1) We create a dedicated maven sub-project for that ML library
>> > > (flink-lib-ml). The project gets two sub-projects, one for the
>> > > collection of specialized algorithms, one for the Mahout DSL.
>> > >
>> > > 2) We add the code for the existing specialized algorithms. As
>> > > follow-up work, we need to consolidate data types between those
>> > > algorithms, to ensure that they can easily be combined/chained.
>> > >
>> > > 3) The code for the Flink bindings to the Mahout DSL will actually
>> > > reside in the Mahout project, which we need to add as a dependency
>> > > to flink-lib-ml.
>> > >
>> > > 4) We add some examples of Mahout DSL algorithms, and a template for
>> > > how to use them within Flink programs (a rough sketch of such a
>> > > template follows below).
>> > >
>> > > 5) Create a good introductory readme.md, outlining this structure.
>> > > The readme can also track the implemented algorithms and the ones
>> > > we put on the roadmap.
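>> > >
>> > > For (4), a rough template of what such a program could look like
>> > > (entirely hypothetical, since the Flink bindings do not exist yet;
>> > > FlinkDistributedContext is a made-up name):
>> > >
>> > >   import org.apache.flink.api.scala._
>> > >   import org.apache.mahout.math.scalabindings._
>> > >   import org.apache.mahout.math.drm._
>> > >   import RLikeDrmOps._
>> > >
>> > >   val env = ExecutionEnvironment.getExecutionEnvironment
>> > >   // Hypothetical context that runs DRM operators as Flink jobs.
>> > >   implicit val ctx = new FlinkDistributedContext(env)
>> > >
>> > >   val drmA = drmParallelize(dense((1, 2), (3, 4)))
>> > >   val ata = (drmA.t %*% drmA).collect  // executed as Flink dataflows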
>> > >
>> > >
>> > > Comments welcome :-)
>> > >
>> > >
>> > > Greetings,
>> > > Stephan
>> >
>>
