Kicking off the Machine Learning Library

Stephan Ewen Fri, 02 Jan 2015 06:47:33 -0800

Hi everyone!

First of all, happy new year! I hope you had a nice end-of-the-year
season.


I think now is a good time to officially kick off the creation of
a library of machine learning algorithms. There are a lot of individual
artifacts and algorithms floating around that we should consolidate.

The machine-learning library in Flink would stand on two legs:

 - A collection of efficient implementations for common problems and
algorithms, e.g., regression (logistic), clustering (k-Means, Canopy),
matrix factorization (ALS), ...

 - An adapter to the linear algebra DSL in Apache Mahout.

In the long run, it would be the goal to be able to mix and match code from
both parts.
The linear algebra DSL is very convenient when it comes to quickly
composing an algorithm, or some custom pre- and post-processing steps.
For some complex algorithms, however, a low-level, system-specific
implementation is necessary to make the algorithm efficient.
Being able to call the tailored algorithms from the DSL would combine the
benefits.


As a concrete initial step, I suggest we do the following:

1) We create a dedicated Maven sub-project for the ML library
(flink-lib-ml). The project gets two sub-projects, one for the collection
of specialized algorithms and one for the Mahout DSL.

2) We add the code for the existing specialized algorithms. As follow-up
work, we need to consolidate the data types used by those algorithms, to
ensure that they can easily be combined/chained.

3) The code for the Flink bindings to the Mahout DSL will actually reside
in the Mahout project, which we need to add as a dependency to flink-lib-ml.

4) We add some examples of Mahout DSL algorithms, and a template for how
to use them within Flink programs.

5) Create a good introductory readme.md, outlining this structure. The
readme can also track the implemented algorithms and the ones we put on the
roadmap.
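
To make the data type consolidation in step 2 more tangible, here is a
minimal sketch in plain Java. It is an illustration only and not existing
Flink or Mahout code: the names LabeledVector, Stage, then, scale and
featureMeans are all hypothetical, and it works on in-memory lists rather
than Flink data sets. The point is only that once every algorithm consumes
and produces the same types, chaining them becomes a one-liner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical shared data type that all algorithms would agree on.
final class LabeledVector {
    final double label;
    final double[] features;

    LabeledVector(double label, double[] features) {
        this.label = label;
        this.features = features;
    }
}

// Hypothetical common interface: one processing step from I to O.
interface Stage<I, O> {
    O apply(I input);

    // Chains this stage with the next one; only type-checks if the
    // output type of this stage matches the input type of the next.
    default <R> Stage<I, R> then(Stage<O, R> next) {
        return input -> next.apply(this.apply(input));
    }
}

final class ChainingSketch {

    // Stand-in for a pre-processing step: multiplies every feature by a factor.
    static Stage<List<LabeledVector>, List<LabeledVector>> scale(double factor) {
        return vectors -> {
            List<LabeledVector> out = new ArrayList<>();
            for (LabeledVector v : vectors) {
                double[] scaled = new double[v.features.length];
                for (int i = 0; i < scaled.length; i++) {
                    scaled[i] = v.features[i] * factor;
                }
                out.add(new LabeledVector(v.label, scaled));
            }
            return out;
        };
    }

    // Stand-in for an algorithm: computes the per-dimension mean of the features.
    // Assumes a non-empty list of vectors that all have the same length.
    static Stage<List<LabeledVector>, double[]> featureMeans() {
        return vectors -> {
            double[] means = new double[vectors.get(0).features.length];
            for (LabeledVector v : vectors) {
                for (int i = 0; i < means.length; i++) {
                    means[i] += v.features[i] / vectors.size();
                }
            }
            return means;
        };
    }

    public static void main(String[] args) {
        List<LabeledVector> data = Arrays.asList(
                new LabeledVector(1.0, new double[] {1.0, 2.0}),
                new LabeledVector(0.0, new double[] {3.0, 4.0}));

        // Both stages share LabeledVector, so they can be chained directly.
        double[] means = scale(2.0).then(featureMeans()).apply(data);
        System.out.println(Arrays.toString(means)); // prints [4.0, 6.0]
    }
}
```

Without a shared type, each pair of algorithms would need its own
conversion code, which is exactly what the consolidation is meant to avoid.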


Comments welcome :-)


Greetings,
Stephan
