Hi,

Happy New Year, everyone! I hope you all had some relaxing holidays.

I really like the idea of having a machine learning library, because it
allows users to quickly solve problems without having to dive too deep into
the system. Moreover, it is a good way to show what the system is capable
of in terms of expressiveness and programming paradigms.

We already have more or less optimised versions of several ML algorithms
implemented with Flink. I'm aware of the following implementations:
PageRank, ALS, KMeans, ConnectedComponents. I think these algorithms
constitute a good foundation for the ML library.

I like the idea of having optimised algorithms that can be mixed with
Mahout DSL code. As far as I can tell, the interoperation should not be too
difficult if the "future" Flink backend is used to execute the Mahout DSL
program. Internally, the Mahout DSL performs its operations on a row-wise
partitioned matrix which is represented as a DataSet[(Key, Vector)]. A
first step should be to provide wrapper functions that transform different
matrix representations into this row-wise representation.
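
For illustration, here is a minimal sketch of such a wrapper (the object
and method names are made up and I fix the key type to Long; none of this
is existing API):

  import org.apache.flink.api.scala._
  import org.apache.mahout.math.{DenseVector, Vector}

  object MatrixWrappers {
    // Wrap each indexed row of doubles into a Mahout Vector, yielding
    // the row-wise (Key, Vector) matrix representation the DSL uses.
    def toRowWiseMatrix(
        rows: DataSet[(Long, Array[Double])]): DataSet[(Long, Vector)] =
      rows.map { row => (row._1, new DenseVector(row._2): Vector) }
  }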

Another idea could be to investigate to what extent Flink can interact
with a parameter server, and which algorithms could be adapted to benefit
from such systems.

Greetings,

Till

On Fri, Jan 2, 2015 at 3:46 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi everyone!
>
> Happy new year, first of all, and I hope you had a nice end-of-the-year
> season.
>
> I think now is a good time to officially kick off the creation of
> a library of machine learning algorithms. There are a lot of individual
> artifacts and algorithms floating around that we should consolidate.
>
> The machine-learning library in Flink would stand on two legs:
>
>  - A collection of efficient implementations for common problems and
> algorithms, e.g., regression (logistic), clustering (k-means, Canopy),
> matrix factorization (ALS), ...
>
>  - An adapter to the linear algebra DSL in Apache Mahout.
>
> In the long run, the goal would be to mix and match code from both
> parts.
> The linear algebra DSL is very convenient for quickly composing an
> algorithm, or some custom pre- and post-processing steps.
> For some complex algorithms, however, a low-level, system-specific
> implementation is necessary to make the algorithm efficient.
> Being able to call the tailored algorithms from the DSL would combine
> the benefits.
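>
> As a rough sketch of what that mix could look like: a tailored Flink
> implementation exposed behind a DSL-level signature (the object and
> method are hypothetical; DrmLike is Mahout's distributed matrix type):
>
>   import org.apache.mahout.math.drm.DrmLike
>
>   // tailored, system-specific ALS, callable from DSL programs;
>   // it returns the two factor matrices as DRMs again
>   object FlinkALS {
>     def factorize(data: DrmLike[Long], rank: Int)
>       : (DrmLike[Long], DrmLike[Long]) = ???
>   }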
>
>
> As a concrete initial step, I suggest the following:
>
> 1) We create a dedicated Maven sub-project for the ML library
> (flink-lib-ml). The project gets two sub-projects: one for the collection
> of specialized algorithms, one for the Mahout DSL bindings.
>
> 2) We add the code for the existing specialized algorithms. As follow-up
> work, we need to consolidate data types between those algorithms, to
> ensure that they can easily be combined/chained.
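>
> For example, consolidation could mean one shared type (the name below is
> just a suggestion) that all algorithms consume and produce:
>
>   case class LabeledVector(label: Double, features: Array[Double])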
>
> 3) The code for the Flink bindings to the Mahout DSL will actually reside
> in the Mahout project, which we need to add as a dependency to
> flink-lib-ml.
>
> 4) We add some examples of Mahout DSL algorithms, and a template showing
> how to use them within Flink programs.
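>
> For instance (the DSL operators below exist in Mahout's math-scala
> module, but the Flink distributed context is hypothetical until the
> bindings exist):
>
>   import org.apache.mahout.math.drm._
>   import org.apache.mahout.math.drm.RLikeDrmOps._
>
>   // placeholder: to be provided by the future Flink backend
>   implicit val ctx: DistributedContext = ???
>
>   // read a distributed row matrix and compute A^T %*% A
>   val drmA = drmDfsRead("hdfs:///path/to/matrixA")
>   val ata  = (drmA.t %*% drmA).collect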
>
> 5) We create a good introductory readme.md outlining this structure. The
> readme can also track the implemented algorithms and the ones we put on
> the roadmap.
>
>
> Comments welcome :-)
>
>
> Greetings,
> Stephan
>
