[ 
https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473322#comment-16473322
 ] 

Matthias Boehm commented on SYSTEMML-2087:
------------------------------------------

Once we come closer to this task, it would be good to flesh out the details in
terms of subtasks. For example, we need to decide (1) how to distribute the
data (for the different distribution schemes) to the individual workers, (2)
how to implement the parameter exchange, and (3) how to handle task failures
and preemption. Regarding the latter, I would recommend starting simple: once a
worker is brought up, it pulls the current state of the model, and
checkpointing is done in a centralized manner.
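
To make the "start simple" option for (3) concrete, here is a minimal
single-process sketch in Python. It is only an illustration under stated
assumptions, not the SystemML design: the names ParameterServer, Worker, pull,
push, and checkpoint_every are hypothetical, the "remote" server is a plain
in-process object, and the update rule is a plain gradient step chosen for
brevity.

```python
import copy
import threading


class ParameterServer:
    """Hypothetical in-process stand-in for a remote parameter server."""

    def __init__(self, model, checkpoint_every=2):
        self._model = dict(model)
        self._lock = threading.Lock()
        self._updates = 0
        self._checkpoint_every = checkpoint_every  # assumed policy: every N updates
        self.checkpoints = []  # centralized: only the server records checkpoints

    def pull(self):
        # Return a copy so workers never share mutable state with the server.
        with self._lock:
            return copy.deepcopy(self._model)

    def push(self, gradients, lr=0.5):
        # Apply a simple gradient step, then checkpoint centrally if due.
        with self._lock:
            for name, grad in gradients.items():
                self._model[name] -= lr * grad
            self._updates += 1
            if self._updates % self._checkpoint_every == 0:
                self.checkpoints.append(copy.deepcopy(self._model))


class Worker:
    """Hypothetical worker that holds no state that must survive a failure."""

    def __init__(self, server):
        self.server = server
        # On start-up (first launch or restart), pull the current model.
        self.model = server.pull()

    def step(self, gradients):
        self.server.push(gradients)
        self.model = self.server.pull()


server = ParameterServer({"w": 4.0})
w1 = Worker(server)
w1.step({"w": 2.0})  # w: 4.0 -> 3.0
w1.step({"w": 2.0})  # w: 3.0 -> 2.0, second update triggers a checkpoint

# Simulate failure or preemption of w1: a replacement worker simply
# pulls the current model and continues; no worker-side recovery logic.
w2 = Worker(server)
```

With this scheme a restarted worker needs no recovery protocol of its own, and
only the server has to persist anything; the cost is that the server is a
single point of failure between checkpoints.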

> Initial version of distributed spark backend
> --------------------------------------------
>
>                 Key: SYSTEMML-2087
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2087
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>
> This part aims to implement BSP (bulk synchronous parallel) for the
> distributed Spark backend. The idea is to be able to launch a remote
> parameter server and the workers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
