[ https://issues.apache.org/jira/browse/SINGA-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644280#comment-14644280 ]
ASF subversion and git services commented on SINGA-12:
------------------------------------------------------

Commit 06163950bff355ce3c83764ab51f07ee95993e09 in incubator-singa's branch refs/heads/master from Wei Wang
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=0616395 ]

SINGA-12 Support Checkpoint and Restore

Fix a bug in the Resume function, which generated errors when irregular files were put into the checkpoint folder. Irregular files are now reported and ignored.

> Support Checkpoint and Restore
> ------------------------------
>
>                 Key: SINGA-12
>                 URL: https://issues.apache.org/jira/browse/SINGA-12
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: Sheng Wang
>            Assignee: Sheng Wang
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> With checkpoint support, we can provide the following features:
> 1. Failure Recovery: when a task fails during training, we can
> recover the task from the latest checkpoint;
> 2. Continuous Training: when the user checks the trained model and finds that
> more steps are needed, they can continue the training;
> 3. Parameter Reuse: from a previously trained model, we can create a new
> model by adding new layers on top of it, and reuse the parameters during
> training.
> Checkpoints should be taken on the server side every few steps. In
> addition, a final checkpoint is made when the task finishes.
> During restore, the servers/workers are first set up as normal, and
> after that the parameters are loaded from the checkpoint file.
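
For illustration only, the behaviour described in the commit message (report and ignore irregular files in the checkpoint folder, then resume from the latest checkpoint) could be sketched roughly as below. This is not SINGA's code: the function name find_latest_checkpoint and the "step<N>.bin" file naming are assumptions made up for the example.

```python
import re
from pathlib import Path

# Hypothetical naming scheme, not taken from SINGA: "step<N>.bin".
CHECKPOINT_NAME = re.compile(r"^step(\d+)\.bin$")


def find_latest_checkpoint(folder):
    """Return (path, step) of the newest checkpoint in folder, or None.

    Entries that are not regular files, or whose names do not match the
    assumed pattern, are reported and ignored instead of raising an error.
    """
    latest = None
    for entry in sorted(Path(folder).iterdir()):
        match = CHECKPOINT_NAME.match(entry.name)
        if not entry.is_file() or match is None:
            print(f"ignoring irregular file: {entry.name}")
            continue
        step = int(match.group(1))
        if latest is None or step > latest[1]:
            latest = (entry, step)
    return latest
```

A caller would set up the servers/workers as normal first, then load parameters from the returned path if the result is not None.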