Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/272#discussion_r192245605

    --- Diff: doc/design/modules/neural-network.tex ---
    @@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
     \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
     where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
    +\paragraph{Momentum updates.}
    +Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate learning and avoid local minima when using gradient descent. We also support Nesterov's accelerated gradient due to its look-ahead characteristics. \\
    +Here we introduce two new variables, namely velocity and momentum. The momentum value must be in the range 0 to 1, where 0 means no momentum; it is responsible for damping the velocity and is analogous to a coefficient of friction. \\
    +In classical momentum you first correct the velocity and then step with that velocity, whereas in Nesterov momentum you first step in the velocity direction and then correct the velocity vector based on the new location. \\
    --- End diff --

    `step with that velocity` is a little confusing to me. Do we have some source where it is defined this way? If it's any better, can we use the following text to describe the difference between momentum and NAG (source is http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf):

    ```
    ... the key difference between momentum and Nesterov's accelerated gradient is that momentum computes the gradient before applying the velocity, while Nesterov's accelerated gradient computes the gradient after doing so.
    ```

    If `step with that velocity` is a standard way of defining it, then I am okay with it. This comment applies to the user and online docs too.
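
For reference, the distinction being discussed (gradient computed before vs. after applying the velocity) can be sketched as follows. This is an illustrative sketch only, not MADlib's implementation; the names `mu` (momentum coefficient) and `eta` (learning rate) are assumptions, and the objective is a toy 1-D quadratic:

```python
# Hedged sketch: classical momentum vs. Nesterov's accelerated gradient (NAG)
# on f(theta) = theta^2. Not MADlib code; parameter names are illustrative.

def grad(theta):
    """Gradient of the toy objective f(theta) = theta^2."""
    return 2.0 * theta

def momentum_step(theta, v, mu=0.9, eta=0.1):
    # Classical momentum: the gradient is computed at the CURRENT position,
    # i.e. BEFORE applying the velocity.
    v = mu * v - eta * grad(theta)
    return theta + v, v

def nesterov_step(theta, v, mu=0.9, eta=0.1):
    # NAG: the gradient is computed at the look-ahead point theta + mu * v,
    # i.e. AFTER applying the velocity.
    v = mu * v - eta * grad(theta + mu * v)
    return theta + v, v

# Both variants drive theta toward the minimum at 0 on this toy problem.
theta_m, v_m = 1.0, 0.0
theta_n, v_n = 1.0, 0.0
for _ in range(200):
    theta_m, v_m = momentum_step(theta_m, v_m)
    theta_n, v_n = nesterov_step(theta_n, v_n)
```

The only difference between the two functions is where `grad` is evaluated, which is exactly the phrasing from the Sutskever thesis quoted above.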
---