[ 
https://issues.apache.org/jira/browse/FLINK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353019#comment-14353019
 ] 

Till Rohrmann commented on FLINK-1537:
--------------------------------------

Some of the more system-side topics could be:

* Tree-based aggregations: Make aggregations more efficient by introducing a 
hierarchical aggregation scheme. Instead of sending the combined results from 
each worker to a single worker, send subsets of the results to different 
workers, which aggregate them and repeat the step until only one worker 
remains. Many ML algorithms, namely all those that have to aggregate partial 
update messages into a global state, will benefit from that.

* Asynchronous iterations and global state management: Support for 
non-synchronized iterations and a means to handle asynchronously produced 
update messages (e.g. a parameter server). Asynchronous iterations would allow 
us to better support stochastic algorithms, where it is not crucial to process 
all data points and gather all update messages. This is becoming more and more 
important because data volumes increase steadily and we no longer always have 
to process all of the data. That is what makes many of these algorithms scale 
very well.

* Persistence of local state and optimistic recovery: Many ML algorithms 
distribute the data and perform a series of operations on the individual chunks 
of data. The result is some local state for each partition, which is later 
merged back into the global state. If one of the machines crashes, it would be 
good to have persisted this local state so that it can be recovered. Currently, 
Flink does not support this. Furthermore, when an iterative algorithm loses 
local state, there are means to restart it from a point that is nearer to the 
optimum than a random initialization. That way, a restart would not be too 
costly. See here for a more in-depth description of 
[optimistic recovery|http://ssc.io/wp-content/uploads/2011/12/db0247-schelterPS.pdf].

* Distributed profiling: The hardest part of developing algorithms on a 
distributed system is understanding what happens inside the system. Thus, tools 
to analyze the internal behaviour of Flink would be tremendously helpful. One 
such tool would be distributed profiling: one would have to start profiling on 
each worker node and later aggregate the partial results to obtain the full 
picture.
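
To make the tree-based aggregation item above more concrete, here is a minimal 
single-process sketch in Python. It is not Flink code and does not use any 
Flink API; the names tree_aggregate, combine, fan_in and add_vectors are 
hypothetical and chosen only for illustration. Each element of the input list 
stands for the combined result of one worker, and each pass of the loop stands 
for one level of the aggregation tree.

```python
from functools import reduce
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def tree_aggregate(
    partials: List[T],
    combine: Callable[[T, T], T],
    fan_in: int = 2,
) -> Tuple[T, int]:
    """Combine per-worker partial results level by level.

    In every round the remaining results are split into groups of at most
    `fan_in` elements, and each group is reduced to a single value (in a real
    system, by a different worker). Rounds repeat until one value remains.
    Returns the final value and the number of rounds that were needed.
    """
    if not partials:
        raise ValueError("need at least one partial result")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")

    level = list(partials)
    rounds = 0
    while len(level) > 1:
        # One tree level: every group of `fan_in` results is merged into one.
        level = [
            reduce(combine, level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
        rounds += 1
    return level[0], rounds


def add_vectors(a: List[float], b: List[float]) -> List[float]:
    """Element-wise sum, e.g. for merging partial gradient updates."""
    return [x + y for x, y in zip(a, b)]


# Eight simulated workers, each holding a two-element partial update.
partials = [[float(w), 1.0] for w in range(8)]
total, rounds = tree_aggregate(partials, add_vectors, fan_in=2)
```

With eight workers and a fan-in of two, the final receiver merges two inputs 
instead of eight, at the price of three rounds instead of one. The sketch 
assumes that combine is associative, which holds for sums of update vectors.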

> GSoC project: Machine learning with Apache Flink
> ------------------------------------------------
>
>                 Key: FLINK-1537
>                 URL: https://issues.apache.org/jira/browse/FLINK-1537
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Till Rohrmann
>            Priority: Minor
>              Labels: gsoc2015, java, machine_learning, scala
>
> Currently, the Flink community is setting up the infrastructure for a machine 
> learning library for Flink. The goal is to provide a set of highly optimized 
> ML algorithms and to offer a high level linear algebra abstraction to easily 
> do data pre- and post-processing. By defining a set of commonly used data 
> structures on which the algorithms work it will be possible to define complex 
> processing pipelines. 
> The Mahout DSL is a good fit for the linear algebra language in Flink. It 
> still has to be evaluated which means must be provided to allow an easy 
> transition between the high level abstraction and the optimized algorithms.
> The machine learning library offers multiple starting points for a GSoC 
> project. Amongst others, the following projects are conceivable.
> * Extension of Flink's machine learning library by additional ML algorithms
> ** Stochastic gradient descent
> ** Distributed dual coordinate ascent
> ** SVM
> ** Gaussian mixture EM
> ** DecisionTrees
> ** ...
> * Integration of Flink with the Mahout DSL to support a high level linear 
> algebra abstraction
> * Integration of H2O with Flink to benefit from H2O's sophisticated machine 
> learning algorithms
> * Implementation of a parameter server like distributed global state storage 
> facility for Flink. This also includes the extension of Flink to support 
> asynchronous iterations and update messages.
> Your own ideas for a possible contribution in the field of the machine 
> learning library are highly welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
