wkcn opened a new issue #15533: The problems about SGD with momentum when learning rate changes
URL: https://github.com/apache/incubator-mxnet/issues/15533

Hi, there.

Currently, the SGD (Stochastic Gradient Descent) update in MXNet is applied as:

```
rescaled_grad = lr * (rescale_grad * clip(grad, clip_gradient) + wd * weight)
state = momentum * state + rescaled_grad
weight = weight - state
```

I found two problems with this update.

1. Loss of floating-point accuracy

For SGD with momentum, the variable `state` stores the gradients multiplied by the learning rate. However, the learning rate is usually a small value, such as 1e-3, which makes `state` much smaller than the gradient. This may lose accuracy.

2. The case when the learning rate changes

When the learning rate changes, `state` still holds the gradients multiplied by the old learning rate, so the momentum carried over is scaled by the old rate rather than the new one.

Solution:

We should update the implementation of SGD with momentum, while keeping compatibility with old optimizer states in mind.

```
rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
state = momentum * state + rescaled_grad
weight = weight - lr * state
```
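To make the second problem concrete, here is a minimal scalar sketch of both update rules in plain Python. It is not MXNet code: the function names `sgd_lr_in_state` and `sgd_lr_at_update` are hypothetical, and for brevity it assumes `rescale_grad = 1`, no gradient clipping, and `wd = 0`.

```python
import math


def sgd_lr_in_state(weight, grads, lrs, momentum=0.9):
    """Current form: the learning rate is folded into the momentum state."""
    state = 0.0
    for grad, lr in zip(grads, lrs):
        state = momentum * state + lr * grad
        weight = weight - state
    return weight


def sgd_lr_at_update(weight, grads, lrs, momentum=0.9):
    """Proposed form: the state holds raw gradients; lr is applied at the step."""
    state = 0.0
    for grad, lr in zip(grads, lrs):
        state = momentum * state + grad
        weight = weight - lr * state
    return weight


# With a constant learning rate the two forms agree up to rounding.
constant_lrs = [0.1, 0.1, 0.1]
same = math.isclose(
    sgd_lr_in_state(0.0, [1.0, 1.0, 1.0], constant_lrs),
    sgd_lr_at_update(0.0, [1.0, 1.0, 1.0], constant_lrs),
)
print("constant lr, results agree:", same)

# When the learning rate drops from 0.1 to 0.01 they diverge: the current
# form still carries momentum scaled by the old rate into the second step.
decayed_lrs = [0.1, 0.01]
print(sgd_lr_in_state(0.0, [1.0, 1.0], decayed_lrs))   # about -0.2
print(sgd_lr_at_update(0.0, [1.0, 1.0], decayed_lrs))  # about -0.119
```

In this toy run the second step of the current form is about 0.1, almost entirely driven by the old learning rate, while the proposed form takes a step of about 0.019 that reflects the new rate.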
