[ https://issues.apache.org/jira/browse/FLINK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357223#comment-14357223 ]

Till Rohrmann commented on FLINK-1537:
--------------------------------------

You're right that it's best if the algorithms are implemented in a general 
fashion so that you can plug in different regularizers or cost functions, for 
example. Mahout is indeed a good example to learn from.

What I was aiming at is more the general building blocks of ML algorithms. In 
many cases, distributed algorithms distribute the data. Then some local 
processing is done, which produces some local state. This local state often 
has to be communicated to the other worker nodes to obtain a consistent global 
state. Starting from this global state, the next local computation can be 
started. 
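
To make this more concrete, here is a minimal sketch of that pattern with the 
current {{DataSet}} API: a bulk iteration broadcasts the global state to the 
workers, each partition computes a local contribution, and the contributions 
are merged into the next global state. The toy regression data, the single 
weight and the learning rate are purely illustrative assumptions, not a 
proposal for the actual FlinkML API.

{code:scala}
// Sketch of the local-compute / global-sync pattern with bulk iterations
// and broadcast variables. All concrete names and numbers are illustrative.
import org.apache.flink.api.scala._
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import scala.collection.JavaConverters._

object LocalGlobalSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Distributed data: (feature, label) pairs of a toy 1-D regression set.
    val data: DataSet[(Double, Double)] =
      env.fromCollection(Seq((1.0, 2.0), (2.0, 4.1), (3.0, 5.9)))

    // Global state: a single weight, refined over bulk iterations.
    val initialWeight: DataSet[Double] = env.fromElements(0.0)

    val trainedWeight = initialWeight.iterate(100) { weight =>
      // Local processing: every partition computes contributions against the
      // broadcast global state ...
      val gradients = data
        .map(new RichMapFunction[(Double, Double), Double] {
          private var w: Double = _

          override def open(parameters: Configuration): Unit = {
            w = getRuntimeContext
              .getBroadcastVariable[Double]("weight").asScala.head
          }

          override def map(p: (Double, Double)): Double = {
            val (x, y) = p
            (w * x - y) * x // gradient of the squared error for one point
          }
        })
        .withBroadcastSet(weight, "weight")

      // ... which are then merged into a new consistent global state.
      val gradientSum = gradients.reduce(_ + _)
      weight.crossWithTiny(gradientSum) { (w, g) => w - 0.01 * g }
    }

    trainedWeight.print()
  }
}
{code}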

Depending on the algorithm, for example whether it is stochastic or not, one 
only needs a random subset of the data or the complete local partition. If you 
only need a stochastic subset, then often you will also only need the 
corresponding subset of the global state to perform your local computations. 
Sometimes the global state is so huge that it cannot be kept on a single 
machine and has to be stored in parallel.
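
A rough sketch of this case, under the assumption that the model is kept as a 
distributed {{DataSet}} of (blockId, weight) entries and that each data point 
knows which block it touches (both hypothetical choices), could look like the 
following: a random mini-batch is drawn from the local partitions and only the 
referenced model blocks are shipped via a join, instead of broadcasting the 
full model.

{code:scala}
// Sketch of pairing a random mini-batch with only the blocks of a
// partitioned, parallel model that the batch actually touches.
import org.apache.flink.api.scala._
import scala.util.Random

object PartitionedStateSketch {
  // A data point referencing the model block it depends on (assumption).
  case class Point(blockId: Int, x: Double, y: Double)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val points: DataSet[Point] = env.fromCollection(
      for (i <- 1 to 1000) yield Point(i % 10, i.toDouble, 2.0 * i))

    // Global state kept as a distributed DataSet instead of on one machine:
    // one (blockId, weight) entry per model block.
    val model: DataSet[(Int, Double)] =
      env.fromCollection((0 until 10).map(b => (b, 0.0)))

    // Stochastic subset: keep roughly 10% of the local partition.
    val miniBatch = points.filter(_ => Random.nextDouble() < 0.1)

    // Only the model blocks referenced by the mini-batch are shipped,
    // via a join on the block id, rather than broadcasting the full model.
    val localContributions = miniBatch
      .join(model).where(_.blockId).equalTo(_._1)
      .apply { (p, block) =>
        val (blockId, w) = block
        (blockId, (w * p.x - p.y) * p.x) // per-point contribution
      }

    // Aggregate the contributions per block to form the next global state.
    val updates = localContributions
      .groupBy(0)
      .reduce((a, b) => (a._1, a._2 + b._2))

    updates.print()
  }
}
{code}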

The question now is how these operations can be realized within Flink to allow 
an efficient implementation of a multitude of machine learning algorithms. For 
example, local state can be stored either as part of a stateful operator or as 
part of the elements contained in the {{DataSet}}.
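
The second option could, for instance, look like the following sketch, where 
each element carries its own local state through a bulk iteration; the Element 
type and the update rule are made up for illustration. The alternative would 
be to keep that state inside a stateful (Rich) operator instead.

{code:scala}
// Sketch of storing the local state inside the elements of the DataSet
// itself, so no stateful operator is needed. Names are illustrative.
import org.apache.flink.api.scala._

object StateInElementsSketch {
  // Each element carries its data and its current local state (assumption).
  case class Element(value: Double, localState: Double)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val elements: DataSet[Element] =
      env.fromCollection((1 to 5).map(i => Element(i.toDouble, 0.0)))

    // Each bulk iteration refines the state stored inside the elements.
    val refined = elements.iterate(10) { current =>
      current.map(e => e.copy(localState = 0.5 * (e.localState + e.value)))
    }

    refined.print()
  }
}
{code}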

> GSoC project: Machine learning with Apache Flink
> ------------------------------------------------
>
>                 Key: FLINK-1537
>                 URL: https://issues.apache.org/jira/browse/FLINK-1537
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Till Rohrmann
>            Priority: Minor
>              Labels: gsoc2015, java, machine_learning, scala
>
> Currently, the Flink community is setting up the infrastructure for a machine 
> learning library for Flink. The goal is to provide a set of highly optimized 
> ML algorithms and to offer a high level linear algebra abstraction to easily 
> do data pre- and post-processing. By defining a set of commonly used data 
> structures on which the algorithms work it will be possible to define complex 
> processing pipelines. 
> The Mahout DSL is a good fit to serve as the linear algebra language in 
> Flink. It has to be evaluated which means are required to allow an easy 
> transition between the high-level abstraction and the optimized algorithms.
> The machine learning library offers multiple starting points for a GSoC 
> project. Amongst others, the following projects are conceivable.
> * Extension of Flink's machine learning library by additional ML algorithms
> ** Stochastic gradient descent
> ** Distributed dual coordinate ascent
> ** SVM
> ** Gaussian mixture EM
> ** Decision trees
> ** ...
> * Integration of Flink with the Mahout DSL to support a high level linear 
> algebra abstraction
> * Integration of H2O with Flink to benefit from H2O's sophisticated machine 
> learning algorithms
> * Implementation of a parameter server like distributed global state storage 
> facility for Flink. This also includes the extension of Flink to support 
> asynchronous iterations and update messages.
> Your own ideas for possible contributions in the field of machine learning 
> are highly welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
