eric-haibin-lin closed pull request #13754: [DOC] Fix Adam optimizer doc with bias correction term
URL: https://github.com/apache/incubator-mxnet/pull/13754
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:
diff --git a/python/mxnet/optimizer/optimizer.py b/python/mxnet/optimizer/optimizer.py
index 442a11d0220..be5bd7aa551 100644
--- a/python/mxnet/optimizer/optimizer.py
+++ b/python/mxnet/optimizer/optimizer.py
@@ -1021,13 +1021,14 @@ class Adam(Optimizer):
Stochastic Optimization*, available at http://arxiv.org/abs/1412.6980.
If the storage types of grad is ``row_sparse``, and ``lazy_update`` is True, \
- **lazy updates** are applied by::
+ **lazy updates** at step t are applied by::
for row in grad.indices:
rescaled_grad[row] = clip(grad[row] * rescale_grad + wd * weight[row], clip_gradient)
m[row] = beta1 * m[row] + (1 - beta1) * rescaled_grad[row]
v[row] = beta2 * v[row] + (1 - beta2) * (rescaled_grad[row]**2)
- w[row] = w[row] - learning_rate * m[row] / (sqrt(v[row]) + epsilon)
+ lr = learning_rate * sqrt(1 - beta2**t) / (1 - beta1**t)
+ w[row] = w[row] - lr * m[row] / (sqrt(v[row]) + epsilon)
The lazy update only updates the mean and var for the weights whose row_sparse
gradient indices appear in the current batch, rather than updating it for all indices.
@@ -1035,12 +1036,13 @@ class Adam(Optimizer):
throughput for some applications. However, it provides slightly different semantics than
the original update, and may lead to different empirical results.
- Otherwise, **standard updates** are applied by::
+ Otherwise, **standard updates** at step t are applied by::
rescaled_grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
m = beta1 * m + (1 - beta1) * rescaled_grad
v = beta2 * v + (1 - beta2) * (rescaled_grad**2)
- w = w - learning_rate * m / (sqrt(v) + epsilon)
+ lr = learning_rate * sqrt(1 - beta2**t) / (1 - beta1**t)
+ w = w - lr * m / (sqrt(v) + epsilon)
This optimizer accepts the following parameters in addition to those accepted
by :class:`.Optimizer`.
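For context, the bias-corrected standard update described in the docstring above can be sketched in plain NumPy. This is a minimal illustration, not MXNet's actual kernel; the function name `adam_step` and its defaults are illustrative, and the bias-correction factor `sqrt(1 - beta2**t) / (1 - beta1**t)` follows Kingma & Ba's Adam paper:

```python
import numpy as np

def adam_step(w, m, v, grad, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8,
              wd=0.0, rescale_grad=1.0, clip_gradient=None):
    """One dense Adam step with bias correction (t is the 1-based step count)."""
    rescaled = grad * rescale_grad + wd * w
    if clip_gradient is not None:
        rescaled = np.clip(rescaled, -clip_gradient, clip_gradient)
    # Exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * rescaled
    v = beta2 * v + (1 - beta2) * rescaled**2
    # Bias-corrected learning rate, per Kingma & Ba (2014)
    lr = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    w = w - lr * m / (np.sqrt(v) + epsilon)
    return w, m, v
```

Running this on a simple quadratic, the early steps are roughly `learning_rate` in magnitude because the bias correction cancels the cold-start shrinkage of `m` and `v`.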
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services