[ 
https://issues.apache.org/jira/browse/FLINK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357238#comment-14357238
 ] 

Sachin Goel commented on FLINK-1537:
------------------------------------

Yes. I completely agree with the transformer-learner chain design methodology. 
The decision tree I'll write will provide an interface for first specifying 
the structure of the data, i.e., the tuple: the types, ranges, etc., and any 
other statistics that could help with the learning.
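
As a rough illustration of what such a structure-specifying interface could 
look like, here is a minimal sketch in plain Python. It is not Flink code and 
not the actual design; FieldSpec, validate, and the "numeric"/"categorical" 
labels are hypothetical names chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one field (coordinate) of the input tuple."""
    name: str
    kind: str                      # "numeric" or "categorical" (assumed labels)
    low: Optional[float] = None    # known lower bound, if any
    high: Optional[float] = None   # known upper bound, if any


def validate(row: Sequence[object], schema: List[FieldSpec]) -> bool:
    """Check that a row matches the declared types and ranges."""
    if len(row) != len(schema):
        return False
    for value, spec in zip(row, schema):
        if spec.kind == "numeric":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
            if spec.low is not None and value < spec.low:
                return False
            if spec.high is not None and value > spec.high:
                return False
        elif spec.kind == "categorical":
            if not isinstance(value, str):
                return False
        else:
            return False  # unknown kind in the schema
    return True


# Example schema for a two-field tuple (illustrative values only).
schema = [
    FieldSpec("age", "numeric", low=0, high=120),
    FieldSpec("colour", "categorical"),
]
```

With this schema, validate((30, "red"), schema) accepts the row, while a row 
with an out-of-range age or a missing field is rejected.
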
I do not myself see how it makes a difference whether the data is stored 
column-wise or row-wise, although it might have some far-reaching consequences 
for how the learning process proceeds. In fact, column-wise storage seems like 
a valid idea for a learning process which treats each coordinate one by one. 
It might help by providing all the values of one particular coordinate in one 
go, so that some statistics can be learned on it, which might help in a better 
learning process. In fact, in a decision tree implementation on big data, it 
becomes prudent to learn such a statistic to ensure that only a reasonable 
number of splits on the data are considered. I will look into how this could 
be achieved with a row-style data representation.
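
The point about per-coordinate statistics can be sketched as follows. This is 
a simplified, single-machine illustration in plain Python, not Flink code: it 
turns row-wise records into columns and keeps only a bounded number of 
candidate split thresholds per coordinate. The function names and the 
max_splits parameter are hypothetical, and the evenly spaced subset used here 
is a stand-in for whatever statistic (for example approximate quantiles) a 
real distributed implementation would compute.

```python
from typing import List, Sequence


def columns_from_rows(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn row-wise records into column-wise lists, one per coordinate."""
    return [list(col) for col in zip(*rows)]


def candidate_splits(column: Sequence[float], max_splits: int) -> List[float]:
    """Return at most max_splits thresholds for one coordinate."""
    values = sorted(set(column))
    # Midpoints between adjacent distinct values are the only thresholds
    # that can change how the data is partitioned.
    midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if len(midpoints) <= max_splits:
        return midpoints
    # Keep an evenly spaced subset so only a bounded number is evaluated.
    step = len(midpoints) / max_splits
    return [midpoints[int(i * step)] for i in range(max_splits)]


rows = [(1.0, 10.0), (2.0, 10.0), (3.0, 20.0), (4.0, 20.0), (5.0, 30.0)]
columns = columns_from_rows(rows)
splits = [candidate_splits(col, max_splits=2) for col in columns]
```

With a row-style representation, the same statistic would have to be gathered 
by grouping or aggregating values per coordinate first, which is the step the 
column-wise layout gives for free.
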
As for the deep learning framework, you are indeed right. I am not sure 
whether anyone has yet evaluated the potential of deep learning on a 
distributed system. I will look into the H2O project's implementation of 
this. As yet, I'm still not sure whether deep learning can be as fast on 
distributed systems as it is on GPUs.

> GSoC project: Machine learning with Apache Flink
> ------------------------------------------------
>
>                 Key: FLINK-1537
>                 URL: https://issues.apache.org/jira/browse/FLINK-1537
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Till Rohrmann
>            Priority: Minor
>              Labels: gsoc2015, java, machine_learning, scala
>
> Currently, the Flink community is setting up the infrastructure for a machine 
> learning library for Flink. The goal is to provide a set of highly optimized 
> ML algorithms and to offer a high level linear algebra abstraction to easily 
> do data pre- and post-processing. By defining a set of commonly used data 
> structures on which the algorithms work it will be possible to define complex 
> processing pipelines. 
> The Mahout DSL constitutes a good fit to be used as the linear algebra 
> language in Flink. It has to be evaluated which means have to be provided to 
> allow an easy transition between the high level abstraction and the optimized 
> algorithms.
> The machine learning library offers multiple starting points for a GSoC 
> project. Amongst others, the following projects are conceivable.
> * Extension of Flink's machine learning library by additional ML algorithms
> ** Stochastic gradient descent
> ** Distributed dual coordinate ascent
> ** SVM
> ** Gaussian mixture EM
> ** DecisionTrees
> ** ...
> * Integration of Flink with the Mahout DSL to support a high level linear 
> algebra abstraction
> * Integration of H2O with Flink to benefit from H2O's sophisticated machine 
> learning algorithms
> * Implementation of a parameter server like distributed global state storage 
> facility for Flink. This also includes the extension of Flink to support 
> asynchronous iterations and update messages.
> Own ideas for a possible contribution on the field of the machine learning 
> library are highly welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
