Happy new year, all! I like the idea of adding an ML module to Flink.
As I have mentioned to Kostas, Stephan, and Robert before, I would love to
see if we could work with the H2O project [1], and it seems the community
has added support for it as an Apache Mahout backend binding [2]. So we
might get some additional scalable ML algorithms, such as deep learning.
Definitely would love to help with this initiative =)

- Henry

[1] https://github.com/h2oai/h2o-dev
[2] https://issues.apache.org/jira/browse/MAHOUT-1500

On Fri, Jan 2, 2015 at 6:46 AM, Stephan Ewen <se...@apache.org> wrote:
> Hi everyone!
>
> Happy new year, first of all, and I hope you had a nice end-of-the-year
> season.
>
> I thought that now is a good time to officially kick off the creation of
> a library of machine learning algorithms. There are a lot of individual
> artifacts and algorithms floating around which we should consolidate.
>
> The machine-learning library in Flink would stand on two legs:
>
> - A collection of efficient implementations for common problems and
>   algorithms, e.g., regression (logistic), clustering (k-means, Canopy),
>   matrix factorization (ALS), ...
>
> - An adapter to the linear algebra DSL in Apache Mahout.
>
> In the long run, the goal would be to mix and match code from both parts.
> The linear algebra DSL is very convenient when it comes to quickly
> composing an algorithm, or some custom pre- and post-processing steps.
> For some complex algorithms, however, a low-level, system-specific
> implementation is necessary to make the algorithm efficient.
> Being able to call the tailored algorithms from the DSL would combine the
> benefits of both.
>
> As a concrete initial step, I suggest the following:
>
> 1) We create a dedicated Maven sub-project for the ML library
>    (flink-lib-ml). The project gets two sub-projects: one for the
>    collection of specialized algorithms, one for the Mahout DSL.
>
> 2) We add the code for the existing specialized algorithms. As follow-up
>    work, we need to consolidate the data types between those algorithms,
>    to ensure that they can easily be combined/chained.
>
> 3) The code for the Flink bindings to the Mahout DSL will actually reside
>    in the Mahout project, which we need to add as a dependency to
>    flink-lib-ml.
>
> 4) We add some examples of Mahout DSL algorithms, and a template for how
>    to use them within Flink programs.
>
> 5) We create a good introductory readme.md, outlining this structure. The
>    readme can also track the implemented algorithms and the ones we put
>    on the roadmap.
>
> Comments welcome :-)
>
> Greetings,
> Stephan
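
To make the "quickly composing an algorithm" point above concrete, here is a
rough, untested sketch of ordinary least squares solved via the normal
equations in Mahout's Samsara (math-scala) DSL. The drmX matrix would
typically come from drmParallelize or drmDfsRead against a Flink-backed
DistributedContext; how exactly the Flink bindings from MAHOUT-1500 will
expose that context is still open, so treat the setup and the helper name
ols as placeholders rather than a definitive API.

    // Rough sketch: ordinary least squares via the normal equations in the
    // Samsara DSL. The algorithm code is backend-agnostic; only the
    // DistributedContext behind drmX decides whether it runs on Flink,
    // Spark, or locally.
    import org.apache.mahout.math._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // drmX: distributed row matrix of features, y: in-core target vector.
    def ols(drmX: DrmLike[Int], y: Vector): Vector = {
      val drmXtX = drmX.t %*% drmX          // distributed X'X
      val drmXty = drmX.t %*% y             // distributed X'y (single column)
      val inCoreXtX = drmXtX.collect        // small dense matrix, collect it
      val inCoreXty = drmXty.collect(::, 0) // only column as an in-core vector
      solve(inCoreXtX, inCoreXty)           // in-core solve for the weights
    }

If it works out that way, point 3) above would mostly mean providing the
Flink-backed context and operators on the Mahout side, while algorithm code
written in the DSL stays unchanged.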