[ 
https://issues.apache.org/jira/browse/SYSTEMML-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2421:
--------------------------------
    Description: This task introduces checkpointing to guarantee that a 
worker can recover from a previous failure. In detail, once a worker is 
brought up, it pulls the current state of the model, which consists of each 
worker's progress (i.e., which batch iteration and epoch is being executed). 
Checkpointing could be set to EPOCH10, which means that every 10 epochs the 
state will be persisted to a centralized file on the server side.  (was: This 
task introduces checkpointing to guarantee that a worker can recover from a 
previous failure. In detail, once a worker is brought up, it pulls the current 
state of the model. Checkpointing could be set to EPOCH10, which means that 
every 10 epochs the state will be persisted to a centralized file on the 
server side.)

> Task error and preemption handles
> ---------------------------------
>
>                 Key: SYSTEMML-2421
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2421
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>
> This task introduces checkpointing to guarantee that a worker can recover 
> from a previous failure. In detail, once a worker is brought up, it pulls 
> the current state of the model, which consists of each worker's progress 
> (i.e., which batch iteration and epoch is being executed). Checkpointing 
> could be set to EPOCH10, which means that every 10 epochs the state will be 
> persisted to a centralized file on the server side.
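
The checkpointing scheme described above could be sketched roughly as follows. This is only an illustration in Python, not SystemML code: the class CheckpointStore, its methods, the JSON file layout, and the 1-based epoch numbering are all assumptions made for the example. Only the EPOCH10 setting and the idea of persisting each worker's epoch and batch to a centralized server-side file come from the description.

```python
import json
import os
import re
import tempfile


def parse_checkpoint_setting(setting):
    """Parse a setting such as "EPOCH10" into an epoch interval (here 10)."""
    match = re.fullmatch(r"EPOCH(\d+)", setting)
    if match is None or int(match.group(1)) < 1:
        raise ValueError(f"unsupported checkpoint setting: {setting!r}")
    return int(match.group(1))


class CheckpointStore:
    """Hypothetical server-side store that tracks each worker's progress."""

    def __init__(self, path, setting="EPOCH10"):
        self.path = path
        self.interval = parse_checkpoint_setting(setting)
        self.progress = {}  # worker id (str) -> {"epoch": int, "batch": int}

    def report(self, worker_id, epoch, batch):
        """Record in memory which epoch and batch a worker is executing."""
        self.progress[worker_id] = {"epoch": epoch, "batch": batch}

    def end_of_epoch(self, epoch):
        """Persist the state when the (assumed 1-based) epoch hits the interval."""
        if epoch % self.interval != 0:
            return False
        # Write to a temporary file and rename it, so that a crash during
        # the write cannot leave a half-written checkpoint behind.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.progress, f)
        os.replace(tmp_path, self.path)
        return True

    def pull(self, worker_id):
        """Return the last persisted progress of a worker, or None."""
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f).get(worker_id)
```

In this sketch, a restarted worker would call pull() with its own id and resume from the returned epoch and batch. Progress reported after the most recent checkpoint is lost on failure, which is the trade-off of persisting only every 10 epochs.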



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
