Github user njayaram2 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/272#discussion_r192245605
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
\[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot
u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate
learning and avoid local minima when using gradient descent. We also support
Nesterov's accelerated gradient due to its look-ahead characteristics. \\
+Here we need to introduce two new variables, namely velocity and momentum.
The momentum must be in the range 0 to 1, where 0 means no momentum. The
momentum value is responsible for damping the velocity and is analogous to the
coefficient of friction. \\
+In classical momentum, you first correct the velocity and step with that
velocity, whereas in Nesterov momentum, you first step in the velocity direction
and then make a correction to the velocity vector based on the new location. \\
--- End diff ---
`step with that velocity` is a little confusing to me. Do we have some
source where it is defined this way?
If it's any better, can we use the following text to say what the
difference between momentum and NAG is (source is
http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf):
```
... the key difference between momentum and Nesterov's accelerated
gradient is that momentum computes the gradient before applying
the velocity, while Nesterov's accelerated gradient computes the
gradient after doing so.
```
If `step with that velocity` is a standard way of defining it, then I am
okay with it.
This comment applies to user and online docs too.
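For what it's worth, a minimal sketch of the two update rules might make the distinction concrete in the design doc. The notation below ($\mu$ for the momentum coefficient, $\eta$ for the learning rate, $v$ for the velocity, $w$ for the weights) is just my suggestion, not taken from the diff:
```latex
% Classical momentum: the gradient is computed at the current weights w,
% i.e. before the velocity is applied.
v \leftarrow \mu v - \eta \nabla f(w), \qquad w \leftarrow w + v

% Nesterov's accelerated gradient: the gradient is computed at the
% look-ahead point w + \mu v, i.e. after the velocity has been applied.
v \leftarrow \mu v - \eta \nabla f(w + \mu v), \qquad w \leftarrow w + v
```
Something along these lines (here, or in the user/online docs) could replace the `step with that velocity` wording if we decide to change it.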
---