[ 
https://issues.apache.org/jira/browse/SINGA-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635149#comment-14635149
 ] 

ASF subversion and git services commented on SINGA-12:
------------------------------------------------------

Commit 729a5c48a0142ffad3265b84a69642997167fae9 in incubator-singa's branch 
refs/heads/master from wang wei
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=729a5c4 ]

SINGA-12 Support Checkpoint and Restore

Checkpointing is done in the Worker class and controlled by two model 
configuration fields: checkpoint_after and checkpoint_frequency.
Only the Params that own their param values in the first group are checkpointed.
The name, version and values of each Param are dumped to disk (the path is 
<workspace>/checkpoint/step<training step>-worker<worker id>.bin).
A snapshot may be split into multiple files because the neural net is 
partitioned across multiple workers.
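
A minimal sketch of these fields in model.conf; checkpoint_after and 
checkpoint_frequency are the fields introduced by this commit, while the 
values and the assumption that checkpoint_after marks the first checkpointed 
step are illustrative:

    # start checkpointing at step 1000, then every 500 steps
    checkpoint_after: 1000
    checkpoint_frequency: 500

Under these assumptions, snapshots would be written at steps 1000, 1500, 
2000, and so on.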

The checkpoint files can be used in two ways:

application 1: to restore (resume) the training by setting the command line 
argument -resume=true.
The Resume function of the Trainer finds the files of the latest snapshot 
and adds them to the model.conf's checkpoint field.
It also sets the model config's step field to the snapshot step (extracted 
from the file name).
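
A minimal sketch of the snapshot-selection step, assuming checkpoint file 
names of the form step<N>-worker<M>.bin; the function names and types below 
are illustrative, not SINGA's actual Trainer code:

    #include <cstdio>
    #include <algorithm>
    #include <string>
    #include <vector>

    // Extract the training step N from a name like "step1500-worker0.bin";
    // returns -1 if the name does not match the expected pattern.
    int ExtractStep(const std::string& filename) {
      int step = -1, worker = -1;
      if (std::sscanf(filename.c_str(), "step%d-worker%d.bin",
                      &step, &worker) == 2)
        return step;
      return -1;
    }

    // Select all files that belong to the latest snapshot, i.e., the files
    // whose embedded step equals the maximum step over all candidates.
    std::vector<std::string> LatestSnapshot(
        const std::vector<std::string>& files) {
      int latest = -1;
      for (const auto& f : files)
        latest = std::max(latest, ExtractStep(f));
      std::vector<std::string> selected;
      if (latest < 0) return selected;  // no valid checkpoint files
      for (const auto& f : files)
        if (ExtractStep(f) == latest)
          selected.push_back(f);
      return selected;
    }

The selected paths would then be appended to the checkpoint field, and the 
step field set to the extracted step.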

application 2: as the pre-training result for another model. Users have to 
configure the new model's checkpoint field with the paths of all checkpoint 
files of the same snapshot.
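
A hedged sketch of such a configuration, assuming checkpoint is a repeated 
string field in model.conf (the paths below are illustrative):

    # two checkpoint files from the same snapshot (step 5000) of a model
    # pre-trained with 2 workers
    checkpoint: "pretrain_ws/checkpoint/step5000-worker0.bin"
    checkpoint: "pretrain_ws/checkpoint/step5000-worker1.bin"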

The Worker's InitLocalParam initializes Params from checkpoint files if they 
are available; otherwise it initializes them randomly using the user-configured 
init method.
Param objects are matched by name. If a Param is not configured with a name, 
the NeuralNet class automatically creates one based on the name of the layer 
to which the Param belongs. In this case,
for application 2, users have to configure the names of the new model's 
params carefully to match the names of the params from the pre-trained model;
for application 1, the worker-server topology cannot be changed.
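
A minimal sketch of the name-based matching in InitLocalParam, assuming the 
checkpoint has already been loaded into a name-to-values map; the types and 
the map representation are illustrative, not SINGA's actual code:

    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Param {
      std::string name;          // auto-generated from the layer name if unset
      std::vector<float> values;
      void RandomInit() { /* user-configured init method would run here */ }
    };

    // Restore a Param from the checkpoint when its name is found there;
    // otherwise fall back to the configured random initialization.
    void InitLocalParam(
        Param* p,
        const std::unordered_map<std::string, std::vector<float>>& checkpoint) {
      auto it = checkpoint.find(p->name);
      if (it != checkpoint.end())
        p->values = it->second;  // matched by name: restore values
      else
        p->RandomInit();         // no match: random init
    }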

Restoring params that are partitioned due to model partitioning is not 
supported: if the pre-training is done with 2 workers while the new model is 
trained with 3 workers, the same original param is partitioned in different 
ways and hence its slices cannot be matched (e.g., a 600-dimensional param 
would be split into two slices of 300 versus three slices of 200).


> Support Checkpoint and Restore
> ------------------------------
>
>                 Key: SINGA-12
>                 URL: https://issues.apache.org/jira/browse/SINGA-12
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: Sheng Wang
>            Assignee: Sheng Wang
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> With checkpoint support, we can provide the following features:
> 1. Failure Recovery: when a task fails during training, we can recover it 
> from the latest checkpoint;
> 2. Continuous Training: when the user checks the trained model and finds 
> that more steps are needed, he or she can continue the training;
> 3. Parameter Reuse: from a previously trained model, we can create a new 
> model by adding new layers on top of it, and reuse its parameters during 
> training.
> Checkpointing should be done on the server side every few steps. In 
> addition, a final checkpoint will be made when the task finishes.
> During restore, the servers/workers will first be set up as normal, and 
> after that parameters will be loaded from the checkpoint file.


