eric-haibin-lin commented on a change in pull request #14885: [Fit-API] Adress
PR comments
URL: https://github.com/apache/incubator-mxnet/pull/14885#discussion_r282146462
##########
File path: python/mxnet/gluon/contrib/estimator/estimator.py
##########
@@ -222,28 +230,30 @@ def evaluate(self,
def fit(self, train_data,
Review comment:
Any thoughts on how to make resuming from a checkpoint easier? With the
current code, if I want to train for 90 epochs and training fails at epoch 30,
the steps I need to take to resume training are:
- create net
- call net.load_parameters with the correct .params filename
- create trainer
- call trainer.load_states with the correct .states filename
- create an estimator
- call fit with epoch = 60?
Since we know the naming scheme of the checkpoints, could we do all of this
automatically by loading the last checkpoint?
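
To illustrate the idea, here is a rough sketch of the lookup step only, using
just the standard library. The helper names (find_latest_checkpoint,
remaining_epochs) and the "<prefix>-epoch<N>.params" / "<prefix>-epoch<N>.states"
naming scheme are assumptions made for illustration, not what the checkpoint
handler in this PR actually writes. It also assumes N counts completed epochs.

```python
import os
import re


def find_latest_checkpoint(model_dir, prefix):
    """Return (epoch, params_path, states_path) for the newest checkpoint.

    Assumes a hypothetical naming scheme "<prefix>-epoch<N>.params" with a
    matching "<prefix>-epoch<N>.states" file next to it. Checkpoints that are
    missing either file are skipped. Returns None if nothing usable is found.
    """
    pattern = re.compile(r"^" + re.escape(prefix) + r"-epoch(\d+)\.params$")
    best = None
    for name in os.listdir(model_dir):
        match = pattern.match(name)
        if match is None:
            continue
        epoch = int(match.group(1))
        params_path = os.path.join(model_dir, name)
        # Derive the trainer-state filename from the parameter filename.
        states_path = params_path[:-len(".params")] + ".states"
        if not os.path.isfile(states_path):
            continue  # incomplete checkpoint, e.g. the job died mid-save
        if best is None or epoch > best[0]:
            best = (epoch, params_path, states_path)
    return best


def remaining_epochs(total_epochs, completed_epochs):
    """Epochs still to run, assuming the checkpoint number counts completed epochs."""
    return max(total_epochs - completed_epochs, 0)
```

With something like this, the resume path could pass params_path to
net.load_parameters and states_path to trainer.load_states, then call fit with
remaining_epochs(90, epoch), instead of the user wiring up filenames by hand.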