[
https://issues.apache.org/jira/browse/SYSTEMML-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
LI Guobao updated SYSTEMML-2421:
--------------------------------
Description: This task introduces checkpointing so that a worker can recover
from a previous failure. In detail, once a worker is brought up, it pulls the
current state of the model. The checkpointing could be set to EPOCH10, which
means that the state is persisted to a file on the worker side every 10
epochs. (was: This task introduces checkpointing so that a task can recover
from failure. In detail, once a worker is brought up, it pulls the current
state of the model. The checkpointing could be set to EPOCH10, which means
that the state is persisted to a file every 10 epochs.)
> Task error and preemption handles
> ---------------------------------
>
> Key: SYSTEMML-2421
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2421
> Project: SystemML
> Issue Type: Sub-task
> Reporter: LI Guobao
> Assignee: LI Guobao
> Priority: Major
>
> This task introduces checkpointing so that a worker can recover from a
> previous failure. In detail, once a worker is brought up, it pulls the
> current state of the model. The checkpointing could be set to EPOCH10, which
> means that the state is persisted to a file on the worker side every 10
> epochs.
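
The checkpointing behaviour described above could be sketched roughly as
follows. This is only an illustration in Python, not SystemML code: the
function names, the pickle file format, and the `train_one_epoch` callback are
all hypothetical, and the sketch assumes a restarted worker resumes from its
own latest checkpoint file. Pulling the current model state from elsewhere is
not modelled here.

```python
import os
import pickle
import re
import tempfile


def parse_checkpoint_interval(setting):
    """Turn a setting such as "EPOCH10" into the integer 10 (hypothetical parser)."""
    match = re.fullmatch(r"EPOCH(\d+)", setting)
    if match is None or int(match.group(1)) < 1:
        raise ValueError("unsupported checkpointing setting: %r" % setting)
    return int(match.group(1))


def save_checkpoint(path, epoch, model):
    """Persist the state by writing a temp file and renaming it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump({"epoch": epoch, "model": model}, handle)
    # The rename keeps the previous checkpoint intact if the worker dies mid-write.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint file exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run_worker(path, setting, total_epochs, initial_model, train_one_epoch):
    """Train up to total_epochs, checkpointing every N epochs (N from the setting)."""
    interval = parse_checkpoint_interval(setting)
    saved = load_checkpoint(path)
    if saved is None:
        epoch, model = 0, initial_model  # fresh start
    else:
        epoch, model = saved["epoch"], saved["model"]  # recover after a failure
    while epoch < total_epochs:
        model = train_one_epoch(model)
        epoch += 1
        if epoch % interval == 0:
            save_checkpoint(path, epoch, model)
    return epoch, model
```

With the setting EPOCH10, a worker that fails somewhere between epoch 20 and
epoch 30 would restart from the state saved at epoch 20, so at most the work
of one checkpoint interval is repeated.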