[
https://issues.apache.org/jira/browse/FLINK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357165#comment-14357165
]
Till Rohrmann commented on FLINK-1537:
--------------------------------------
Implementing a decision tree algorithm first is definitely the right way to go.
If you implemented it, it would be an awesome contribution to Flink. I also
think it's the best way to get used to Flink's API. Thus, it's a win-win
situation :-)
Look at the recently opened [machine learning
PR|https://github.com/apache/flink/pull/479] which loosely defines interfaces
for {{Learner}} and {{Transformer}}. A {{Learner}} is an algorithm which takes
a {{DataSet[A]}} and fits a model to this data. In the case of a decision tree,
the input data would be a labeled vector and the output would be the learned
tree. A {{Transformer}} simply takes a {{DataSet[A]}} and transforms it into a
{{DataSet[B]}}. A feature extractor or data whitening would be an example of
that. {{Transformer}}s can be chained arbitrarily as long as their types match.
A {{Learner}} terminates a transformer pipeline. If you stuck to this model
with your implementation, then one could prepend any {{Transformer}} to the
decision tree learner. This makes creating a data analysis pipeline really
easy. If I can help you with the implementation, then let me know.
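To make the {{Learner}}/{{Transformer}} idea a bit more concrete, here is a
minimal sketch of how such a pipeline could look. It is not the code from the
PR: a plain Scala {{Seq}} stands in for {{DataSet}}, and all names ({{chain}},
{{chainLearner}}, {{CenterFeatures}}, {{StumpLearner}}, {{Stump}}) are made up
for illustration. The learner is only a depth-one stump on the first feature,
not a full decision tree.

```scala
// Illustrative sketch only: Seq stands in for Flink's DataSet, and every
// name below is hypothetical rather than taken from the linked PR.
object PipelineSketch {
  final case class LabeledVector(label: Double, features: Vector[Double])

  // A Learner takes a collection of A and fits a model M to it.
  trait Learner[A, M] {
    def fit(input: Seq[A]): M
  }

  // A Transformer turns a collection of A into a collection of B.
  trait Transformer[A, B] { self =>
    def transform(input: Seq[A]): Seq[B]

    // Chain another transformer; this only compiles if the types match.
    def chain[C](next: Transformer[B, C]): Transformer[A, C] =
      new Transformer[A, C] {
        def transform(input: Seq[A]): Seq[C] =
          next.transform(self.transform(input))
      }

    // Terminate the pipeline with a learner.
    def chainLearner[M](learner: Learner[B, M]): Learner[A, M] =
      new Learner[A, M] {
        def fit(input: Seq[A]): M = learner.fit(self.transform(input))
      }
  }

  // Example transformer: subtract the per-dimension mean from every vector.
  object CenterFeatures extends Transformer[LabeledVector, LabeledVector] {
    def transform(input: Seq[LabeledVector]): Seq[LabeledVector] =
      if (input.isEmpty) input
      else {
        val n = input.size.toDouble
        val dims = input.head.features.size
        val mean = Vector.tabulate(dims)(d => input.map(_.features(d)).sum / n)
        input.map { v =>
          v.copy(features = v.features.zip(mean).map { case (x, m) => x - m })
        }
      }
  }

  // Learned model: one split on feature 0.
  final case class Stump(threshold: Double, below: Double, atOrAbove: Double) {
    def predict(features: Vector[Double]): Double =
      if (features(0) < threshold) below else atOrAbove
  }

  // Example learner: split feature 0 at its mean, predict the majority label
  // on each side of the split.
  object StumpLearner extends Learner[LabeledVector, Stump] {
    private def majority(labels: Seq[Double]): Double =
      labels.groupBy(identity).maxBy(_._2.size)._1

    def fit(input: Seq[LabeledVector]): Stump = {
      require(input.nonEmpty, "need at least one training example")
      val threshold = input.map(_.features(0)).sum / input.size
      val (left, right) = input.partition(_.features(0) < threshold)
      val fallback = majority(input.map(_.label))
      Stump(
        threshold,
        if (left.isEmpty) fallback else majority(left.map(_.label)),
        if (right.isEmpty) fallback else majority(right.map(_.label)))
    }
  }

  // Prepending a transformer to the learner yields a learner again.
  val pipeline: Learner[LabeledVector, Stump] =
    CenterFeatures.chainLearner(StumpLearner)
}
```

With this shape, {{PipelineSketch.pipeline.fit(trainingData)}} first centers
the features and then fits the stump, and further transformers could be put in
front with {{chain}} as long as their input and output types line up.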
A deep learning framework is also something really intriguing but at the same
time highly ambitious. So far, we haven't made an effort to implement deep
learning algorithms with Flink. I know that there is the [H2O
project|https://github.com/h2oai/h2o-dev] which does distributed deep learning.
However, their underlying data model is different from ours. If I'm not
mistaken, then they store the data column-wise whereas we store it row-wise.
I don't know what difference this makes. The first thing would probably be to
evaluate Flink's potential for deep learning and then to come up with a
prototype.
> GSoC project: Machine learning with Apache Flink
> ------------------------------------------------
>
> Key: FLINK-1537
> URL: https://issues.apache.org/jira/browse/FLINK-1537
> Project: Flink
> Issue Type: New Feature
> Reporter: Till Rohrmann
> Priority: Minor
> Labels: gsoc2015, java, machine_learning, scala
>
> Currently, the Flink community is setting up the infrastructure for a machine
> learning library for Flink. The goal is to provide a set of highly optimized
> ML algorithms and to offer a high level linear algebra abstraction to easily
> do data pre- and post-processing. By defining a set of commonly used data
> structures on which the algorithms work it will be possible to define complex
> processing pipelines.
> The Mahout DSL constitutes a good fit to be used as the linear algebra
> language in Flink. It has to be evaluated which means have to be provided to
> allow an easy transition between the high level abstraction and the optimized
> algorithms.
> The machine learning library offers multiple starting points for a GSoC
> project. Amongst others, the following projects are conceivable.
> * Extension of Flink's machine learning library by additional ML algorithms
> ** Stochastic gradient descent
> ** Distributed dual coordinate ascent
> ** SVM
> ** Gaussian mixture EM
> ** DecisionTrees
> ** ...
> * Integration of Flink with the Mahout DSL to support a high level linear
> algebra abstraction
> * Integration of H2O with Flink to benefit from H2O's sophisticated machine
> learning algorithms
> * Implementation of a parameter server like distributed global state storage
> facility for Flink. This also includes the extension of Flink to support
> asynchronous iterations and update messages.
> Your own ideas for possible contributions in the field of the machine
> learning library are highly welcome.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)