[ 
https://issues.apache.org/jira/browse/SYSTEMML-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372499#comment-16372499
 ] 

Matthias Boehm commented on SYSTEMML-2083:
------------------------------------------

Great to hear that [~Guobao] - I would recommend subscribing to our mailing 
list, but you can also view the archive (see 
https://systemml.apache.org/community for the links). I will post some 
organizational comments there shortly to bring everybody onto the same page. 
Subsequently, we can enter more detailed discussions regarding the topic.

Just to clarify one aspect to ensure the topic is not misunderstood: this epic 
is NOT about integrating an "off-the-shelf parameter server" but about building 
language and runtime support from scratch in SystemML. However, SystemML 
already supports general linear algebra programs as well as deep-learning 
primitives, which we will reuse completely. So the major components to develop 
here are API extensions (a builtin function, a Keras2DML extension), as well as 
local and distributed runtime support. The runtime support primarily involves 
data and program shipping (where we will reuse many primitives from the 
existing runtime of parfor loops), efficient parameter exchange, and a modular 
design for different update strategies. 
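
For illustration, here is a rough DML sketch of what such a builtin function 
and its plugged-in update/aggregation functions might look like (all names, 
signatures, and parameters below are placeholders for discussion, not a final 
design):

{code}
# hypothetical sketch -- all names and signatures are placeholders
# local update function: compute gradients on one mini-batch
gradients = function(matrix[double] features, matrix[double] labels,
    list[unknown] model, list[unknown] hyperparams)
  return (list[unknown] gradients)
{
  W = as.matrix(model["W"]);
  pred = features %*% W;                                # linear-model forward pass
  dW = t(features) %*% (pred - labels) / nrow(features); # mini-batch gradient
  gradients = list(W=dW);
}

# aggregation function: apply the gradients to the global parameters
aggregation = function(list[unknown] model, list[unknown] gradients,
    list[unknown] hyperparams)
  return (list[unknown] model2)
{
  lr = as.scalar(hyperparams["lr"]);
  W = as.matrix(model["W"]) - lr * as.matrix(gradients["W"]);
  model2 = list(W=W);
}

# the builtin ties data, model, and the two functions together
X = rand(rows=10000, cols=100);
y = X %*% rand(rows=100, cols=1);
model = paramserv(model=list(W=matrix(0, rows=100, cols=1)),
  features=X, labels=y, upd="gradients", agg="aggregation",
  mode="LOCAL", utype="BSP", freq="BATCH", epochs=10, batchsize=64,
  k=4, hyperparams=list(lr=1e-4));
{code}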

> Language and runtime for parameter servers
> ------------------------------------------
>
>                 Key: SYSTEMML-2083
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2083
>             Project: SystemML
>          Issue Type: Epic
>            Reporter: Matthias Boehm
>            Priority: Major
>              Labels: gsoc2018
>         Attachments: image-2018-02-14-12-18-48-932.png, 
> image-2018-02-14-12-21-00-932.png, image-2018-02-14-12-31-37-563.png
>
>
> SystemML already provides a rich set of execution strategies ranging from 
> local operations to large-scale computation on MapReduce or Spark. In this 
> context, we support both data-parallel (multi-threaded or distributed 
> operations) and task-parallel computation (multi-threaded or distributed 
> parfor loops). This epic aims to complement the existing execution 
> strategies with language and runtime primitives for parameter servers, 
> i.e., model-parallel execution. We use the terminology of model-parallel 
> execution with distributed data and distributed model to differentiate 
> these strategies from the existing data-parallel operations. Target 
> applications are distributed deep learning and mini-batch algorithms in 
> general. These new abstractions will help make SystemML a unified framework 
> for small- and large-scale machine learning that supports all three major 
> execution strategies in a single framework.
>  
> A major challenge is the integration of stateful parameter servers and their 
> common push/pull primitives into an otherwise functional (and thus, 
> stateless) language. We will approach this challenge via a new builtin 
> function {{paramserv}}, which internally maintains state but at the same 
> time fits into the runtime framework of stateless operations.
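>
> For illustration, from the caller's perspective such a builtin could remain 
> referentially transparent: the current model goes in as a value and the 
> updated model comes out as a value, while all mutable state (parameter 
> versions, gradient queues) stays internal to the operation. A hypothetical 
> sketch, reusing the placeholder functions from the comment above:
> {code}
> model2 = paramserv(model=model1, features=X, labels=y,
>   upd="gradients", agg="aggregation", epochs=1, batchsize=64,
>   hyperparams=list(lr=1e-4));
> # model1 is unchanged; the trained parameters are only in model2
> {code}
>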
> Furthermore, we are interested in providing (1) different runtime backends 
> (local and distributed), (2) different parameter server modes (synchronous, 
> asynchronous, hogwild!, stale-synchronous), (3) different update frequencies 
> (batch, multi-batch, epoch), as well as (4) different architectures for 
> distributed data (1 parameter server, k workers) and distributed model (k1 
> parameter servers, k2 workers). 
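>
> For illustration, these four dimensions might surface as arguments of the 
> builtin function (a hypothetical sketch; all argument names and values are 
> placeholders, not a final design):
> {code}
> model2 = paramserv(model=model1, features=X, labels=y,
>   upd="gradients", agg="aggregation",
>   mode="SPARK", utype="ASP", freq="EPOCH",  # (1) backend, (2) mode, (3) frequency
>   k=8, epochs=10, batchsize=64,             # (4) number of workers
>   hyperparams=list(lr=1e-4));
> {code}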
>  
> *Note for GSoC students:* This is a large project which will be broken down 
> into sub-projects, so everybody will have their share of the pie.
> *Prerequisites:* Java; machine learning experience is a plus but not required.


