+1 for the initial steps, which I can implement.

On Sat, Jan 3, 2015 at 8:15 PM, Till Rohrmann <trohrm...@apache.org> wrote:
> Hi,
>
> happy new year everyone. I hope you all had some relaxing holidays.
>
> I really like the idea of having a machine learning library because this
> allows users to quickly solve problems without having to dive too deep
> into the system. Moreover, it is a good way to show what the system is
> capable of in terms of expressibility and programming paradigms.
>
> Currently, we already have more or less optimised versions of several ML
> algorithms implemented with Flink. I'm aware of the following
> implementations: PageRank, ALS, KMeans, ConnectedComponents. I think that
> these algorithms constitute a good foundation for the ML library.
>
> I like the idea of having optimised algorithms which can be mixed with
> Mahout DSL code. As far as I can tell, the interoperation should not be
> too difficult if the "future" Flink backend is used to execute the Mahout
> DSL program. Internally, the Mahout DSL performs its operations on a
> row-wise partitioned matrix which is represented as a
> DataSet[(Key, Vector)]. Providing some wrapper functions to transform
> different matrix representations into the row-wise representation should
> be the first step.
>
> Another idea could be to investigate to what extent Flink can interact
> with the Parameter Server and which algorithms could be adapted to
> benefit from such systems.
>
> Greetings,
>
> Till
>
> On Fri, Jan 2, 2015 at 3:46 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> Hi everyone!
>>
>> Happy new year, first of all, and I hope you had a nice end-of-the-year
>> season.
>>
>> I thought that now is a good time to officially kick off the creation of
>> a library of machine learning algorithms. There are a lot of individual
>> artifacts and algorithms floating around which we should consolidate.
>>
>> The machine-learning library in Flink would stand on two legs:
>>
>> - A collection of efficient implementations for common problems and
>> algorithms, e.g., regression (logistic), clustering (k-means, Canopy),
>> matrix factorization (ALS), ...
>>
>> - An adapter to the linear algebra DSL in Apache Mahout.
>>
>> In the long run, the goal would be to mix and match code from both
>> parts. The linear algebra DSL is very convenient when it comes to
>> quickly composing an algorithm, or some custom pre- and post-processing
>> steps. For some complex algorithms, however, a low-level, system-specific
>> implementation is necessary to make the algorithm efficient. Being able
>> to call the tailored algorithms from the DSL would combine the benefits
>> of both.
>>
>> As a concrete initial step, I suggest doing the following:
>>
>> 1) We create a dedicated Maven sub-project for the ML library
>> (flink-lib-ml). The project gets two sub-projects: one for the
>> collection of specialized algorithms, one for the Mahout DSL.
>>
>> 2) We add the code for the existing specialized algorithms. As follow-up
>> work, we need to consolidate the data types between those algorithms, to
>> ensure that they can easily be combined/chained.
>>
>> 3) The code for the Flink bindings to the Mahout DSL will actually
>> reside in the Mahout project, which we need to add as a dependency to
>> flink-lib-ml.
>>
>> 4) We add some examples of Mahout DSL algorithms, and a template for how
>> to use them within Flink programs.
>>
>> 5) We create a good introductory readme.md outlining this structure. The
>> readme can also track the implemented algorithms and the ones we put on
>> the roadmap.
>>
>> Comments welcome :-)
>>
>> Greetings,
>> Stephan
>>
>
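To make the row-wise representation Till mentions concrete, here is a plain-Scala sketch with no Flink or Mahout dependency: a `Seq` stands in for the `DataSet`, an `Array[Double]` for the `Vector`, and the names `rowify` and `scaleRows` are made up for illustration, not actual Mahout or Flink API.

```scala
// Plain-Scala model of the row-wise partitioned matrix layout the
// Mahout DSL uses on its backend. In real code this would be a
// DataSet[(Key, Vector)]; here local collections stand in for it.
object RowWiseSketch {

  type Key = Int
  type Vector = Array[Double]

  // Wrapper: convert a dense row-major matrix into the row-wise
  // (key, vector) representation, keyed by row index. This is the
  // kind of transformation the proposed wrapper functions would do.
  def rowify(dense: Array[Array[Double]]): Seq[(Key, Vector)] =
    dense.zipWithIndex.map { case (row, idx) => (idx, row) }.toSeq

  // An operation that stays inside the row-wise layout: scale every
  // row by a constant, keeping the row keys intact.
  def scaleRows(rows: Seq[(Key, Vector)], factor: Double): Seq[(Key, Vector)] =
    rows.map { case (k, v) => (k, v.map(_ * factor)) }
}
```

For example, `rowify(Array(Array(1.0, 2.0), Array(3.0, 4.0)))` produces the rows `(0, [1.0, 2.0])` and `(1, [3.0, 4.0])`; an actual wrapper would build the same pairs as a distributed `DataSet`.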
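Stephan's step 2, consolidating data types so that algorithms can be chained, could look roughly like the following plain-Scala sketch. `LabeledVector` and both stage functions are hypothetical names invented here, not actual flink-lib-ml types; the point is only that a shared type lets a preprocessing step and a specialized algorithm compose directly.

```scala
// Sketch of step 2: one shared data type so specialized algorithms
// can be chained. LabeledVector and both stages are hypothetical.
object ChainingSketch {

  // A candidate common data type for the ML library.
  case class LabeledVector(label: Double, features: Array[Double])

  // Stage 1: a preprocessing step (min-max scaling of each feature
  // dimension to [0, 1]) that consumes and produces the shared type.
  // Assumes a non-empty input with uniform dimensionality.
  def normalize(data: Seq[LabeledVector]): Seq[LabeledVector] = {
    val dims = data.head.features.length
    val mins = (0 until dims).map(d => data.map(_.features(d)).min)
    val maxs = (0 until dims).map(d => data.map(_.features(d)).max)
    data.map { lv =>
      val scaled = lv.features.zipWithIndex.map { case (x, d) =>
        val range = maxs(d) - mins(d)
        if (range == 0) 0.0 else (x - mins(d)) / range
      }
      lv.copy(features = scaled)
    }
  }

  // Stage 2: a stand-in "specialized algorithm" (a trivial mean-label
  // predictor) accepting the same type, so the stages chain directly.
  def meanLabel(data: Seq[LabeledVector]): Double =
    data.map(_.label).sum / data.size
}
```

With a consolidated type, chaining is just composition: `ChainingSketch.meanLabel(ChainingSketch.normalize(data))`, which is the property the proposal wants for the real algorithms.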