[ 
https://issues.apache.org/jira/browse/FLINK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1537:
---------------------------------
    Comment: was deleted

(was: Implementing a decision tree algorithm first is definitely the right way
to go. If you implement it, it would be an awesome contribution to Flink. I
also think it's the best way to get used to Flink's API, so it's a win-win
situation :-)

Have a look at the recently opened [machine learning
PR|https://github.com/apache/flink/pull/479], which loosely defines interfaces
for {{Learner}} and {{Transformer}}. A {{Learner}} is an algorithm that takes
a {{DataSet[A]}} and fits a model to this data. In the case of a decision tree,
the input data would be labeled vectors and the output would be the learned
tree. A {{Transformer}} simply takes a {{DataSet[A]}} and transforms it into a
{{DataSet[B]}}. A feature extractor or data whitening would be an example of
that. {{Transformer}}s can be chained arbitrarily as long as their types match.
A {{Learner}} terminates a transformer pipeline. If you stick to this model in
your implementation, then one can prepend any {{Transformer}} to the decision
tree learner. This makes creating a data analysis pipeline really easy. If I
can help you with the implementation, let me know.
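
As a rough sketch of how such a pipeline could fit together: the trait and
method names below ({{chainTransformer}}, {{chainLearner}}, {{Scaler}},
{{StumpLearner}}) are assumptions made for illustration and are not the
interfaces from the PR. A plain {{Seq[A]}} stands in for {{DataSet[A]}} so the
example runs without Flink, and a trivial threshold learner stands in for a
real decision tree.

```scala
// Hypothetical sketch: Seq[A] stands in for Flink's DataSet[A]; these traits
// are illustrative and NOT the interfaces defined in the PR.
trait Learner[A, M] {
  def fit(input: Seq[A]): M
}

trait Transformer[A, B] { self =>
  def transform(input: Seq[A]): Seq[B]

  // Chain another transformer; this only compiles if the types match.
  def chainTransformer[C](next: Transformer[B, C]): Transformer[A, C] =
    new Transformer[A, C] {
      def transform(input: Seq[A]): Seq[C] =
        next.transform(self.transform(input))
    }

  // Terminate the pipeline with a learner.
  def chainLearner[M](learner: Learner[B, M]): Learner[A, M] =
    new Learner[A, M] {
      def fit(input: Seq[A]): M = learner.fit(self.transform(input))
    }
}

final case class LabeledVector(label: Double, features: Vector[Double])

// Toy transformer: rescales every feature by a constant factor.
class Scaler(factor: Double) extends Transformer[LabeledVector, LabeledVector] {
  def transform(input: Seq[LabeledVector]): Seq[LabeledVector] =
    input.map(v => v.copy(features = v.features.map(_ * factor)))
}

// Toy model standing in for a learned tree: one threshold on feature 0.
final case class Stump(threshold: Double) {
  def predict(features: Vector[Double]): Double =
    if (features(0) > threshold) 1.0 else 0.0
}

// Toy learner: uses the mean of feature 0 as the split threshold.
object StumpLearner extends Learner[LabeledVector, Stump] {
  def fit(input: Seq[LabeledVector]): Stump = {
    val xs = input.map(_.features(0))
    Stump(xs.sum / xs.size)
  }
}

object PipelineExample {
  def main(args: Array[String]): Unit = {
    val data = Seq(
      LabeledVector(0.0, Vector(1.0)),
      LabeledVector(1.0, Vector(3.0))
    )
    // Prepend a transformer to the learner, then fit the whole pipeline.
    val pipeline = new Scaler(2.0).chainLearner(StumpLearner)
    val model = pipeline.fit(data)
    println(model.threshold)
  }
}
```

Because the type parameters have to line up at every chaining step, a
mismatched pipeline is rejected at compile time rather than at runtime.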

A deep learning framework is also really intriguing, but at the same time
highly ambitious. So far, we haven't made an effort to implement deep
learning algorithms with Flink. I know that there is the [H2O
project|https://github.com/h2oai/h2o-dev], which does distributed deep learning.
However, their underlying data model is different from ours. If I'm not
mistaken, they store the data column-wise whereas we store it row-wise.
I don't know what difference this makes. The first step would probably be to
evaluate Flink's potential for deep learning and then to come up with a
prototype.)

> GSoC project: Machine learning with Apache Flink
> ------------------------------------------------
>
>                 Key: FLINK-1537
>                 URL: https://issues.apache.org/jira/browse/FLINK-1537
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Till Rohrmann
>            Priority: Minor
>              Labels: gsoc2015, java, machine_learning, scala
>
> Currently, the Flink community is setting up the infrastructure for a machine
> learning library for Flink. The goal is to provide a set of highly optimized
> ML algorithms and to offer a high-level linear algebra abstraction that makes
> data pre- and post-processing easy. By defining a set of commonly used data
> structures on which the algorithms work, it will be possible to define complex
> processing pipelines.
> The Mahout DSL is a good fit for the linear algebra language in Flink. It
> still has to be evaluated what means must be provided to allow an easy
> transition between the high-level abstraction and the optimized algorithms.
> The machine learning library offers multiple starting points for a GSoC
> project. Among others, the following projects are conceivable:
> * Extension of Flink's machine learning library by additional ML algorithms
> ** Stochastic gradient descent
> ** Distributed dual coordinate ascent
> ** SVM
> ** Gaussian mixture EM
> ** Decision trees
> ** ...
> * Integration of Flink with the Mahout DSL to support a high-level linear
> algebra abstraction
> * Integration of H2O with Flink to benefit from H2O's sophisticated machine 
> learning algorithms
> * Implementation of a parameter-server-like distributed global state storage
> facility for Flink. This also includes extending Flink to support
> asynchronous iterations and update messages.
> Your own ideas for a possible contribution in the field of the machine
> learning library are highly welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
