0xdata (now called H2O) is developing an integration with Spark in a project called Sparkling Water [1]. It creates a new RDD type that connects to an H2O cluster and passes higher-order functions into the ML flow for execution.
The easiest way to use H2O is with the R binding [2][3], but I think we would want to interact with H2O via the REST APIs [4].

- Henry

[1] https://github.com/h2oai/sparkling-water
[2] http://www.slideshare.net/anqifu1/big-data-science-with-h2o-in-r
[3] http://docs.h2o.ai/Ruser/rtutorial.html
[4] http://docs.h2o.ai/developuser/rest.html

On Wed, Jan 7, 2015 at 3:10 AM, Stephan Ewen <se...@apache.org> wrote:
> Thanks Henry!
>
> Do you know of a good source that gives pointers or examples of how to
> interact with H2O?
>
> Stephan
>
>
> On Sun, Jan 4, 2015 at 7:14 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> The idea to work with H2O sounds really interesting.
>>
>> In terms of the Mahout DSL, this would mean that we have to translate a
>> Flink dataset into H2O's basic abstraction of distributed data and vice
>> versa. Everything other than writing to disk with one system and reading
>> from there with the other is probably non-trivial and hard to realize.
>> On Jan 4, 2015 9:18 AM, "Henry Saputra" <henry.sapu...@gmail.com> wrote:
>>
>> > Happy new year, all!
>> >
>> > I like the idea of adding an ML module to Flink.
>> >
>> > As I have mentioned to Kostas, Stephan, and Robert before, I would
>> > love to see if we could work with the H2O project [1], and it seems
>> > the community has added support for it as an Apache Mahout backend
>> > binding [2].
>> >
>> > So we might get some additional scalable ML algorithms, like deep
>> > learning.
>> >
>> > Definitely would love to help with this initiative =)
>> >
>> > - Henry
>> >
>> > [1] https://github.com/h2oai/h2o-dev
>> > [2] https://issues.apache.org/jira/browse/MAHOUT-1500
>> >
>> > On Fri, Jan 2, 2015 at 6:46 AM, Stephan Ewen <se...@apache.org> wrote:
>> > > Hi everyone!
>> > >
>> > > Happy new year, first of all, and I hope you had a nice
>> > > end-of-the-year season.
>> > >
>> > > I thought that now is a good time to officially kick off the
>> > > creation of a library of machine learning algorithms. There are a
>> > > lot of individual artifacts and algorithms floating around which we
>> > > should consolidate.
>> > >
>> > > The machine-learning library in Flink would stand on two legs:
>> > >
>> > > - A collection of efficient implementations for common problems and
>> > > algorithms, e.g., regression (logistic), clustering (k-means,
>> > > Canopy), matrix factorization (ALS), ...
>> > >
>> > > - An adapter to the linear algebra DSL in Apache Mahout.
>> > >
>> > > In the long run, the goal would be to mix and match code from both
>> > > parts. The linear algebra DSL is very convenient when it comes to
>> > > quickly composing an algorithm, or some custom pre- and
>> > > post-processing steps. For some complex algorithms, however, a
>> > > low-level, system-specific implementation is necessary to make the
>> > > algorithm efficient. Being able to call the tailored algorithms from
>> > > the DSL would combine the benefits.
>> > >
>> > >
>> > > As a concrete initial step, I suggest the following:
>> > >
>> > > 1) We create a dedicated Maven sub-project for the ML library
>> > > (flink-lib-ml). The project gets two sub-projects: one for the
>> > > collection of specialized algorithms, one for the Mahout DSL.
>> > >
>> > > 2) We add the code for the existing specialized algorithms. As
>> > > follow-up work, we need to consolidate data types between those
>> > > algorithms, to ensure that they can easily be combined/chained.
>> > >
>> > > 3) The code for the Flink bindings to the Mahout DSL will actually
>> > > reside in the Mahout project, which we need to add as a dependency
>> > > to flink-lib-ml.
>> > >
>> > > 4) We add some examples of Mahout DSL algorithms, and a template for
>> > > how to use them within Flink programs.
>> > >
>> > > 5) Create a good introductory readme.md, outlining this structure.
>> > > The readme can also track the implemented algorithms and the ones we
>> > > put on the roadmap.
>> > >
>> > >
>> > > Comments welcome :-)
>> > >
>> > >
>> > > Greetings,
>> > > Stephan
>>
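[Editor's note] Henry's suggestion in the thread above is to drive H2O through its REST API [4]. A minimal sketch of what that could look like, with assumptions flagged: a local H2O node on what I believe is its default port (54321), and H2O-3-style endpoint paths such as `/3/Frames` — consult the REST docs in [4] for the authoritative routes.

```python
# Sketch of talking to H2O over REST. Assumptions (not verified here): a
# local H2O node on the default port 54321, and H2O-3-style endpoints such
# as /3/Frames -- see the REST API docs referenced in the thread.
import json
import urllib.parse
import urllib.request

H2O_BASE = "http://localhost:54321"

def h2o_url(endpoint, **params):
    """Build the request URL for an H2O REST endpoint."""
    query = urllib.parse.urlencode(params)
    return f"{H2O_BASE}/{endpoint}" + (f"?{query}" if query else "")

def h2o_get(endpoint, **params):
    """Issue a GET request and decode the JSON response."""
    with urllib.request.urlopen(h2o_url(endpoint, **params)) as resp:
        return json.load(resp)

# Example calls (require a live H2O node, so they are left commented out):
# frames = h2o_get("3/Frames")   # list the frames held by the cluster
```

The same URL-building helper would serve POST endpoints as well; the point is only that a REST binding needs nothing beyond an HTTP client and a JSON parser, which keeps the Flink side free of H2O library dependencies.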
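[Editor's note] Till's observation — that anything beyond exchanging data through shared storage is hard — suggests a pragmatic first bridge: one system writes an agreed-upon format (say CSV) to a path both can reach, and the other imports it. A toy illustration in plain Python; the format and helper names here are hypothetical stand-ins, not the actual Flink or H2O APIs.

```python
# Toy illustration of the "write with one system, read with the other"
# exchange described in the thread. CSV and these helper names are
# assumptions for illustration; in practice a Flink data sink would write
# the file and H2O's file import would read it.
import csv
import os
import tempfile

def export_rows(path, header, rows):
    """Stand-in for the producing system writing to shared storage."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

def import_rows(path):
    """Stand-in for the consuming system importing from shared storage."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        return next(reader), list(reader)

# Round trip through a shared file:
shared = os.path.join(tempfile.mkdtemp(), "exchange.csv")
export_rows(shared, ["id", "x"], [["1", "2.5"], ["2", "3.5"]])
header, rows = import_rows(shared)
```

The cost of this approach is a full materialization to disk on every crossing, which is exactly why Till calls everything else non-trivial: avoiding the round trip means teaching each system the other's in-memory representation.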