wkcn opened a new issue #15533: The problems about SGD with momentum when 
learning rate changes
URL: https://github.com/apache/incubator-mxnet/issues/15533
 
 
   Hi, there.
   Currently, SGD (Stochastic Gradient Descent) with momentum in MXNet is computed as:
   ```
   rescaled_grad = lr * (rescale_grad * clip(grad, clip_gradient) + wd * weight)
   state = momentum * state + rescaled_grad
   weight = weight - state
   ```
   I found two problems with this update rule.
   1. Loss of floating-point precision
   For SGD with momentum, the variable `state` stores the gradients 
multiplied by the learning rate. However, the learning rate is usually a small 
value, such as 1e-3, which makes `state` much smaller in magnitude than the 
gradient. This may lose precision.
   
   2. The case when the learning rate changes
   When the learning rate changes, `state` still holds gradients multiplied 
by the old learning rate, so the accumulated momentum is applied at the old 
learning rate rather than the new one. This is wrong.
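
To make the second problem concrete, here is a minimal scalar sketch of the update rule quoted above. It is not the MXNet code: the function name `sgd_momentum_step_current` is hypothetical, and it assumes `rescale_grad = 1`, `wd = 0`, no gradient clipping, and a constant gradient of 1.0. After three steps at `lr = 0.1`, the learning rate is set to 0, yet the weight still moves because `state` carries steps scaled by the old learning rate.

```python
def sgd_momentum_step_current(weight, state, grad, lr, momentum=0.9):
    """One step of the rule quoted above, simplified to scalars.

    Assumptions (not from MXNet): rescale_grad = 1, wd = 0, no clipping.
    """
    state = momentum * state + lr * grad  # lr is baked into the state
    weight = weight - state
    return weight, state


weight, state = 1.0, 0.0

# Three steps with lr = 0.1 and a constant gradient of 1.0.
for _ in range(3):
    weight, state = sgd_momentum_step_current(weight, state, grad=1.0, lr=0.1)
weight_before = weight

# Drop the learning rate to zero. One might expect the weight to stop moving.
weight, state = sgd_momentum_step_current(weight, state, grad=1.0, lr=0.0)
step_taken = weight_before - weight

print("step taken with lr = 0:", step_taken)
```

The step taken with `lr = 0` is nonzero: it equals `momentum` times the previous `state`, which was accumulated entirely at the old learning rate.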
   
   Solution:
   We should update the implementation of SGD with momentum so that the 
learning rate is applied when the weight is updated, while keeping 
compatibility with old optimizer states.
   ```
   rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
   state = momentum * state + rescaled_grad
   weight = weight - lr * state
   ```
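
One possible way to handle old optimizer states, sketched under assumptions: if the learning rate was constant while an old state was accumulated, the old state equals the learning rate times the state the proposed rule would have produced, so dividing by that learning rate converts it. The helper `migrate_state` and both step functions below are hypothetical scalar illustrations (with `rescale_grad = 1`, `wd = 0`, no clipping), not MXNet APIs.

```python
def step_current(weight, state, grad, lr, momentum=0.9):
    """Current rule (simplified): lr is folded into the state."""
    state = momentum * state + lr * grad
    return weight - state, state


def step_proposed(weight, state, grad, lr, momentum=0.9):
    """Proposed rule (simplified): lr is applied at the weight update."""
    state = momentum * state + grad
    return weight - lr * state, state


def migrate_state(old_state, old_lr):
    """Hypothetical helper: convert an old-style state to the new convention.

    Only valid if old_lr was constant while old_state was accumulated.
    """
    return old_state / old_lr


lr = 0.1
w_old, s_old = 1.0, 0.0
for _ in range(3):
    w_old, s_old = step_current(w_old, s_old, grad=1.0, lr=lr)

# Switch to the proposed rule, starting from the migrated state.
w_new, s_new = w_old, migrate_state(s_old, lr)

# With an unchanged lr, both rules take the same next step.
w_old, s_old = step_current(w_old, s_old, grad=1.0, lr=lr)
w_new, s_new = step_proposed(w_new, s_new, grad=1.0, lr=lr)
print("weights agree:", abs(w_old - w_new) < 1e-12)

# With lr = 0, the proposed rule leaves the weight untouched.
w_frozen, _ = step_proposed(w_new, s_new, grad=1.0, lr=0.0)
print("weight frozen at lr = 0:", w_frozen == w_new)
```

Under these assumptions the two rules coincide while the learning rate stays fixed and differ only once it changes. A state saved after one or more learning rate changes cannot be converted exactly this way.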
