[ https://issues.apache.org/jira/browse/SYSTEMML-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361878#comment-16361878 ]

Matthias Boehm edited comment on SYSTEMML-2083 at 2/13/18 6:27 AM:
-------------------------------------------------------------------

Great - thanks for your interest [~mpgovinda]. Below I try to give you a couple 
of pointers and a more concrete idea of the project. Additional details 
regarding the individual sub-tasks will follow later and will likely evolve over 
time. Note that we're happy to help out and guide you where needed. 

SystemML is an existing system for a broad range of machine learning 
algorithms. Users write their algorithms in an R- or Python-like syntax with 
abstract data types for scalars, matrices, and frames as well as operations such 
as linear algebra, element-wise operations, aggregations, indexing, and 
statistical functions. SystemML then automatically compiles these scripts into 
hybrid runtime plans of single-node and distributed operations on MapReduce or 
Spark according to data and cluster characteristics. For more details, please 
refer to our website (https://systemml.apache.org/) as well as our "SystemML on 
Spark" paper (http://www.vldb.org/pvldb/vol9/p1425-boehm.pdf).

In the past, we primarily focused on data- and task-parallel execution 
strategies (as described above), but in recent years we also added support for 
deep learning, including an nn script library, various builtin functions for 
specific layers, as well as native and GPU operations. 
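
For example, a layer from the nn library can be used from a script roughly like 
this (a sketch based on the library's affine and relu layers; the sizes are 
illustrative):
{code}
source("nn/layers/affine.dml") as affine
source("nn/layers/relu.dml") as relu

# initialize and apply one hidden layer
[W1, b1] = affine::init(784, 256)
H = relu::forward(affine::forward(X, W1, b1))
{code}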

This epic aims to extend these capabilities with execution strategies for 
parameter servers. We want to build alternative runtime backends as a 
foundation, which would already enable users to easily select their preferred 
strategy for local or distributed execution. Later (not part of this project), 
we would like to further extend this to the automatic selection of these 
strategies.

Specifically, this project aims to introduce a new builtin function 
{{paramserv}} that can be called at script level:
{code}
[model'] = paramserv(model, X, y, X_val, y_val, fun1,
   mode=ASYNC, freq=EPOCH, agg=..., epochs=100, batchsize=64, k=7,
   checkpointing=...)
{code}
where we pass an existing (e.g., for transfer learning) or otherwise 
initialized {{model}}, the training feature and label matrices {{X}}, {{y}}, 
the validation features and labels {{X_val}}, {{y_val}}, a batch update 
function {{fun1}} specified in SystemML's R- or Python-like language, an update 
strategy {{mode}} along with its update frequency {{freq}} (e.g., per batch or 
epoch), an aggregation function {{agg}}, the number of {{epochs}}, the 
{{batchsize}}, the degree of parallelism {{k}}, and a checkpointing strategy. 
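
To make this concrete, the batch update function could itself be an ordinary 
DML function; the following is a hypothetical sketch for a linear model (the 
exact signature is part of the design work in this epic):
{code}
# Hypothetical batch update function (signature not final):
# computes the gradient of a linear model for one mini-batch
gradients = function(matrix[double] model, matrix[double] X_batch,
                     matrix[double] y_batch)
  return (matrix[double] grad)
{
  grad = t(X_batch) %*% (X_batch %*% model - y_batch)
}
{code}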

The core of the project then deals with implementing the runtime for this 
builtin function in Java for both local, multi-threaded execution and 
distributed execution on top of Spark. The advantage of building the distributed 
parameter servers on top of the data-parallel Spark framework is a seamless 
integration with the rest of SystemML (e.g., where the input feature matrix 
{{X}} can be a large RDD). Since the update and aggregation functions are 
expressed in SystemML's language, we can simply reuse the existing runtime 
(control flow, instructions, and matrix operations) and concentrate on building 
the alternative parameter update mechanisms.
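
For instance, the aggregation function {{agg}} could be a plain DML function 
that the parameter server invokes to merge worker results into the global model 
(again a hypothetical sketch, not a fixed API):
{code}
# Hypothetical aggregation function (signature not final):
# applies the averaged gradients of k workers to the global model
aggregate = function(matrix[double] model, matrix[double] gradSum,
                     int k, double lr)
  return (matrix[double] model2)
{
  model2 = model - lr * (gradSum / k)
}
{code}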



> Language and runtime for parameter servers
> ------------------------------------------
>
>                 Key: SYSTEMML-2083
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2083
>             Project: SystemML
>          Issue Type: Epic
>            Reporter: Matthias Boehm
>            Priority: Major
>              Labels: gsoc2018
>
> SystemML already provides a rich set of execution strategies ranging from 
> local operations to large-scale computation on MapReduce or Spark. In this 
> context, we support both data-parallel (multi-threaded or distributed 
> operations) as well as task-parallel computation (multi-threaded or 
> distributed parfor loops). This epic aims to complement the existing 
> execution strategies with language and runtime primitives for parameter 
> servers, i.e., model-parallel execution. We use the terminology of 
> model-parallel execution with distributed data and distributed model to 
> differentiate them from the existing data-parallel operations. Target 
> applications are distributed deep learning and mini-batch algorithms in 
> general. These new abstractions will help make SystemML a unified framework 
> for small- and large-scale machine learning that supports all three major 
> execution strategies in a single framework.
>  
> A major challenge is the integration of stateful parameter servers and their 
> common push/pull primitives into an otherwise functional (and thus, 
> stateless) language. We will approach this challenge via a new builtin 
> function {{paramserv}}, which internally maintains state but at the same time 
> fits into the runtime framework of stateless operations.
> Furthermore, we are interested in providing (1) different runtime backends 
> (local and distributed), (2) different parameter server modes (synchronous, 
> asynchronous, Hogwild!, stale-synchronous), (3) different update frequencies 
> (batch, multi-batch, epoch), as well as (4) different architectures for 
> distributed data (1 parameter server, k workers) and distributed model (k1 
> parameter servers, k2 workers). 


