[ 
https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473322#comment-16473322
 ] 

Matthias Boehm edited comment on SYSTEMML-2087 at 5/13/18 1:01 AM:
-------------------------------------------------------------------

Once we come closer to this task, it would be good to flesh out the details in
terms of subtasks. For example, we need to decide (1) how to distribute the
data (for the different distribution schemes) to the individual workers, (2)
how to do the worker setup and cleanup (e.g., directories for local evictions;
most of this functionality can be reused from parfor, but it would be good to
clarify what exactly it entails), (3) how to implement the parameter exchange,
and (4) how to handle task failures and preemption. Regarding the latter, I
would recommend starting simple with something like "once a worker is brought
up, it pulls the current state of the model", with checkpointing done in a
centralized manner.
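
To make points (3) and (4) concrete, here is a minimal single-process sketch of
the "pull on startup, checkpoint centrally" idea. It is not SystemML code: the
names ParameterServer, Worker, pull, push and push_gradients are hypothetical,
the model is a plain dict of scalars, and plain gradient averaging stands in
for whatever aggregation the real backend would use.

```python
class ParameterServer:
    """Hypothetical in-process stand-in for a remote parameter server."""

    def __init__(self, model, num_workers, lr=0.1):
        self.model = dict(model)            # current global model state
        self.num_workers = num_workers
        self.lr = lr
        self.pending = {}                   # worker id -> gradients of this superstep
        self.checkpoint = dict(self.model)  # centralized checkpoint, kept only here
        self.step = 0

    def pull(self):
        """Return a copy of the current model state."""
        return dict(self.model)

    def push(self, worker_id, grads):
        """Collect gradients; update once all workers reported (BSP barrier)."""
        self.pending[worker_id] = grads
        if len(self.pending) < self.num_workers:
            return False                    # barrier not reached yet
        for name in self.model:
            avg = sum(g[name] for g in self.pending.values()) / self.num_workers
            self.model[name] -= self.lr * avg
        self.pending.clear()
        self.step += 1
        self.checkpoint = dict(self.model)  # checkpoint after every superstep
        return True


class Worker:
    """Hypothetical worker; it keeps no state that must survive a failure."""

    def __init__(self, worker_id, server):
        self.worker_id = worker_id
        self.server = server
        self.model = server.pull()          # pull current state when brought up

    def push_gradients(self, grads):
        return self.server.push(self.worker_id, grads)


server = ParameterServer({"w": 1.0}, num_workers=2, lr=0.5)
workers = [Worker(i, server) for i in range(2)]
workers[0].push_gradients({"w": 1.0})
workers[1].push_gradients({"w": 3.0})      # barrier reached, model is updated

# A worker restarted after a failure or preemption just pulls the current state.
replacement = Worker(0, server)
```

Because the checkpoint lives only on the server side, recovering a worker in
this sketch needs no worker-local recovery logic. A replacement that reuses the
failed worker's id also overwrites any gradient that worker had already pushed
in the current superstep.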


was (Author: mboehm7):
Once we come closer to this task, it would be good to flesh out the details in
terms of subtasks. For example, we need to decide (1) how to distribute the
data (for the different distribution schemes) to the individual workers, (2)
how to implement the parameter exchange, and (3) how to handle task failures
and preemption. Regarding the latter, I would recommend starting simple with
something like once a worker is brought up, it pulls the current state of the
model, with checkpointing done in a centralized manner.

> Initial version of distributed spark backend
> --------------------------------------------
>
>                 Key: SYSTEMML-2087
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2087
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>
> This part aims to implement BSP (bulk synchronous parallel) execution for the
> distributed Spark backend. Hence, the idea is to be able to launch a remote
> parameter server and the workers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
