[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user asfgit closed the pull request at: https://github.com/apache/madlib/pull/272 ---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r194151705

--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +122,26 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate learning and avoid local minima when using gradient descent. We also support nesterov's accelarated gradient due to its look ahead characteristics. \\
+Here we need to introduce two new variables namely velocity and momentum. momentum must be in the range 0 to 1, where 0 means no momentum.
+The velocity is the same size as the coefficient and is accumulated in the direction of persistent reduction, which speeds up the optimization. The momentum value is responsible for damping the velocity and is analogous to the coefficient of friction. \\
+In classical momentum we first correct the velocity, and then update the model with that velocity, whereas in Nesterov momentum, we first move the model in the direction of momentum*velocity , then correct the velocity and finally use the updated model to calculate the gradient. The main difference being that in classical momentum, we compute the gradient before updating the model whereas in nesterov we first update the model and then compute the gradient from the updated position.\\
--- End diff --

`momentum*velocity ,` -> `momentum*velocity,`. The extra space before the comma is moving the `,` to the next line in the pdf.

---
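The two procedures described in the diff above can be sketched in a few lines of Python. This is an illustrative sketch only, not MADlib code; all names here (`update_classical`, `update_nesterov`, `grad`, `eta`, `mu`) are hypothetical, and the Nesterov variant follows the three-step form used in the design-doc text (look-ahead step, velocity update, model update).

```
# Illustrative sketch only -- not MADlib code. All names are hypothetical.
import numpy as np

def update_classical(u, v, grad, eta=0.01, mu=0.9):
    """Classical momentum: gradient is taken at the current model u,
    the velocity is corrected first, then the model moves by the velocity."""
    g = grad(u)                # gradient at the current position
    v = mu * v - eta * g       # velocity update (damped by mu)
    u = u + v                  # model update
    return u, v

def update_nesterov(u, v, grad, eta=0.01, mu=0.9):
    """Nesterov momentum: first move the model by mu*v (look-ahead),
    then take the gradient at that position and correct velocity/model."""
    u = u + mu * v             # look-ahead step
    g = grad(u)                # gradient at the look-ahead position
    v = mu * v - eta * g       # velocity update
    u = u - eta * g            # model update from the look-ahead position
    return u, v

# toy objective f(u) = 0.5*||u||^2, so grad(u) = u
u, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(5):
    u, v = update_nesterov(u, v, grad=lambda w: w)
```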
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r194151826

--- Diff: doc/design/modules/neural-network.tex ---
@@ -196,17 +221,28 @@ \subsubsection{The $\mathit{Gradient}$ Function}
 \end{algorithmic}
 \end{algorithm}
-\begin{algorithm}[mlp-train-iteration$(X, Y, \eta)$] \label{alg:mlp-train-iteration}
+\begin{algorithm}[mlp-train-iteration$(X, Y, \eta, mu)$] \label{alg:mlp-train-iteration}
--- End diff --

`mu` -> `\mu`

---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r192246910

--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate learning and avoid local minima when using gradient descent. We also support nesterov's accelarated gradient due to its look ahead characteristics. \\
+Here we need to introduce two new variables namely velocity and momentum. momentum must be in the range 0 to 1, where 0 means no momentum. The momentum value is responsible for damping the velocity and is analogous to the coefficient of friction. \\
+In classical momentum you first correct the velocity and step with that velocity, whereas in Nesterov momentum you first step in the velocity direction then make a correction to the velocity vector based on the new location. \\
+Classic momentum update
+\[\begin{aligned}
+\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} \text{ (velocity update)} \\ % $\eta$,\\ * $\nabla f(u)$
+\mathit{u} \set \mathit{u} + \mathit{v} \\
+\end{aligned}\]
+
+Nesterov momentum update
+\[\begin{aligned}
+\mathit{u} \set \mathit{u} + \mathit{mu} * \mathit{v} \text{ (nesterov's initial coefficient update )} \\
+\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} \text{ (velocity update)} \\ % $\eta$,\\ * $\nabla f(u)$
+\mathit{u} \set \mathit{u} - \eta * \mathit{gradient} \\
--- End diff --

Can we left-align these and other momentum related equations?

---
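One possible way to address the left-alignment request, shown here only as a sketch: amsmath's `flalign*` with a trailing `&` on each row keeps the equations flush with the left margin. This assumes the design doc already loads amsmath and defines the `\set` macro used in the diff, and it uses `\mu` per the earlier review comment.

```
Classic momentum update
\begin{flalign*}
& \mathit{v} \set \mu * \mathit{v} - \eta * \mathit{gradient} \quad \text{(velocity update)} & \\
& \mathit{u} \set \mathit{u} + \mathit{v} &
\end{flalign*}
```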
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r192246463

--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate learning and avoid local minima when using gradient descent. We also support nesterov's accelarated gradient due to its look ahead characteristics. \\
+Here we need to introduce two new variables namely velocity and momentum. momentum must be in the range 0 to 1, where 0 means no momentum. The momentum value is responsible for damping the velocity and is analogous to the coefficient of friction. \\
--- End diff --

We only describe the range for momentum, but nothing about velocity before the next line begins. Can we say a bit about what velocity is before we describe how it is used?

---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r192251198

--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
--- End diff --

Please add your name to the list of authors in the beginning of the file. Maybe add the history for this file as well?

---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r192245605

--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate learning and avoid local minima when using gradient descent. We also support nesterov's accelarated gradient due to its look ahead characteristics. \\
+Here we need to introduce two new variables namely velocity and momentum. momentum must be in the range 0 to 1, where 0 means no momentum. The momentum value is responsible for damping the velocity and is analogous to the coefficient of friction. \\
+In classical momentum you first correct the velocity and step with that velocity, whereas in Nesterov momentum you first step in the velocity direction then make a correction to the velocity vector based on the new location. \\
--- End diff --

`step with that velocity` is a little confusing to me. Do we have some source where it is defined this way? If it's any better, can we use the following text to say what the difference between momentum and NAG is (source is http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf):

```
... the key difference between momentum and Nesterov's accelerated gradient is that momentum computes the gradient before applying the velocity, while Nesterov's accelerated gradient computes the gradient after doing so.
```

If `step with that velocity` is a standard way of defining it, then I am okay with it. This comment applies to user and online docs too.

---
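For reference, the distinction quoted above from Sutskever's thesis is often written compactly as below. This is a sketch reusing the design doc's `\set` assignment notation, with $u$ the coefficients, $v$ the velocity, $\mu$ the momentum, $\eta$ the learning rate, and $f$ the objective: classical momentum evaluates the gradient at $u$, Nesterov at the look-ahead point $u + \mu v$.

```
\[
\text{Classical momentum:}\quad v \set \mu v - \eta \, \nabla f(u), \qquad u \set u + v
\]
\[
\text{Nesterov momentum:}\quad v \set \mu v - \eta \, \nabla f(u + \mu v), \qquad u \set u + v
\]
```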
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r192248589

--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate learning and avoid local minima when using gradient descent. We also support nesterov's accelarated gradient due to its look ahead characteristics. \\
+Here we need to introduce two new variables namely velocity and momentum. momentum must be in the range 0 to 1, where 0 means no momentum. The momentum value is responsible for damping the velocity and is analogous to the coefficient of friction. \\
+In classical momentum you first correct the velocity and step with that velocity, whereas in Nesterov momentum you first step in the velocity direction then make a correction to the velocity vector based on the new location. \\
+Classic momentum update
+\[\begin{aligned}
+\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} \text{ (velocity update)} \\ % $\eta$,\\ * $\nabla f(u)$
+\mathit{u} \set \mathit{u} + \mathit{v} \\
+\end{aligned}\]
+
+Nesterov momentum update
+\[\begin{aligned}
+\mathit{u} \set \mathit{u} + \mathit{mu} * \mathit{v} \text{ (nesterov's initial coefficient update )} \\
+\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} \text{ (velocity update)} \\ % $\eta$,\\ * $\nabla f(u)$
--- End diff --

Instead of `gradient`, can we replace it with the actual math notation? I think the gradient at this line is: `\frac{\partial f}{\partial u_{k-1}^{sj}}`

The gradient in the next line should be w.r.t `v` instead of `u`.

---
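As an illustration of that suggestion (a sketch only, reusing the coefficient indexing $u_{k-1}^{sj}$ from the backpropagation section and a matching per-coefficient velocity $v_{k-1}^{sj}$, which is not notation from the PR), the velocity update line would then read roughly:

```
\[
\mathit{v}_{k-1}^{sj} \set \mu \cdot \mathit{v}_{k-1}^{sj} - \eta \cdot \frac{\partial f}{\partial u_{k-1}^{sj}} \text{ (velocity update)}
\]
```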
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191945714

--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -197,6 +244,7 @@ MLP::loss(
         const model_type,
         const independent_variables_type,
         const dependent_variable_type _true) {
+
--- End diff --

+1 will do

---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191942290

--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1781,7 +1799,7 @@ class MLPMinibatchPreProcessor:
         summary_table_columns = summary_table_columns[0]
         required_columns = (self.DEPENDENT_VARNAME, self.INDEPENDENT_VARNAME,
-            self.CLASS_VALUES, self.GROUPING_COL)
+            self.CLASS_VALUES, self.GROUPING_COL, self.DEPENDENT_VARTYPE)
         if set(required_columns) <= set(summary_table_columns):
             self.preprocessed_summary_dict = summary_table_columns
         else:
--- End diff --

+1 will do

---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191942284

--- Diff: src/ports/postgres/modules/convex/mlp.sql_in ---
@@ -1474,13 +1480,15 @@ CREATE AGGREGATE MADLIB_SCHEMA.mlp_minibatch_step(
     /* warm_start_coeff */    DOUBLE PRECISION[],
     /* lambda */              DOUBLE PRECISION,
     /* batch_size */          INTEGER,
-    /* n_epochs */            INTEGER
+    /* n_epochs */            INTEGER,
+    /* momentum */            DOUBLE PRECISION,
+    /* is_nesterov */         BOOLEAN
 )(
     STYPE=DOUBLE PRECISION[],
     SFUNC=MADLIB_SCHEMA.mlp_minibatch_transition,
     m4_ifdef(`__POSTGRESQL__', `', `prefunc=MADLIB_SCHEMA.mlp_minibatch_merge,')
     FINALFUNC=MADLIB_SCHEMA.mlp_minibatch_final,
-    INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0}'
+    INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
 );
-
--- End diff --

+1 will do

---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191575057

--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1781,7 +1799,7 @@ class MLPMinibatchPreProcessor:
         summary_table_columns = summary_table_columns[0]
         required_columns = (self.DEPENDENT_VARNAME, self.INDEPENDENT_VARNAME,
-            self.CLASS_VALUES, self.GROUPING_COL)
+            self.CLASS_VALUES, self.GROUPING_COL, self.DEPENDENT_VARTYPE)
         if set(required_columns) <= set(summary_table_columns):
             self.preprocessed_summary_dict = summary_table_columns
         else:
--- End diff --

Can you please update the optimizer params in online docs?

---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191533727

--- Diff: src/modules/convex/type/model.hpp ---
@@ -126,45 +129,96 @@ struct MLPModel {
         for (k = 0; k < N; k ++) {
             size += static_cast((n[k] + 1) * (n[k+1]));
         }
+        //TODO conditionally assign size based on momentum
--- End diff --

Is this TODO still valid?

---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191574947

--- Diff: src/ports/postgres/modules/convex/mlp.sql_in ---
@@ -1474,13 +1480,15 @@ CREATE AGGREGATE MADLIB_SCHEMA.mlp_minibatch_step(
     /* warm_start_coeff */    DOUBLE PRECISION[],
     /* lambda */              DOUBLE PRECISION,
     /* batch_size */          INTEGER,
-    /* n_epochs */            INTEGER
+    /* n_epochs */            INTEGER,
+    /* momentum */            DOUBLE PRECISION,
+    /* is_nesterov */         BOOLEAN
 )(
     STYPE=DOUBLE PRECISION[],
     SFUNC=MADLIB_SCHEMA.mlp_minibatch_transition,
     m4_ifdef(`__POSTGRESQL__', `', `prefunc=MADLIB_SCHEMA.mlp_minibatch_merge,')
     FINALFUNC=MADLIB_SCHEMA.mlp_minibatch_final,
-    INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0}'
+    INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
 );
-
--- End diff --

Can you please update the user docs with momentum and nesterov related optimizer params? Is it also a good time to update the MLP design doc?

---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191539168

--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -197,6 +244,7 @@ MLP::loss(
         const model_type,
         const independent_variables_type,
         const dependent_variable_type _true) {
+
--- End diff --

The loss computation code in this function is duplicated in `getLoss()` function. Can we call that function to remove the code duplication?

---
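For context, the loss that the reviewer suggests factoring into a single helper computes clipped binary cross-entropy for classification and half the squared error for regression, matching the C++ shown in the diff later in this thread. A rough numpy sketch of that computation follows; the names are hypothetical and this is not the actual MADlib `getLoss()` signature.

```
# Rough sketch of the loss helper (illustrative only, not MADlib code).
import numpy as np

def get_loss(y_true, y_estimated, is_classification, clip=1e-10):
    y_true = np.asarray(y_true, dtype=float)
    y_est = np.asarray(y_estimated, dtype=float)
    if is_classification:
        # clip to avoid log(0), then binary cross-entropy summed over outputs
        y_est = np.clip(y_est, clip, 1.0 - clip)
        return -np.sum(y_true * np.log(y_est) + (1.0 - y_true) * np.log(1.0 - y_est))
    # regression: half squared error
    return 0.5 * np.sum((y_est - y_true) ** 2)

# example: one-hot target vs. estimated class probabilities
print(get_loss([0, 1, 0], [0.1, 0.8, 0.1], is_classification=True))
```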
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/272#discussion_r191537654

--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -126,68 +157,84 @@ MLP::getLossAndUpdateModel(
         const Matrix _true_batch,
         const double ) {
-    uint16_t num_layers = static_cast(model.u.size()); // assuming nu. of layers >= 1
-    Index num_rows_in_batch = x_batch.rows();
     double total_loss = 0.;
+    // model is updated with the momentum step (i.e. velocity vector)
+    // if Nesterov Accelerated Gradient is enabled
+    model.nesterovUpdatePosition();
-    // gradient added over the batch
-    std::vector total_gradient_per_layer(num_layers);
-    for (Index k=0; k < num_layers; ++k)
+    // initialize gradient vector
+    std::vector total_gradient_per_layer(model.num_layers);
+    for (Index k=0; k < model.num_layers; ++k) {
         total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(), model.u[k].cols());
+    }
+    std::vector net, o, delta;
+    Index num_rows_in_batch = x_batch.rows();
     for (Index i=0; i < num_rows_in_batch; i++){
+        // gradient and loss added over the batch
         ColumnVector x = x_batch.row(i);
         ColumnVector y_true = y_true_batch.row(i);
-        std::vector net, o, delta;
         feedForward(model, x, net, o);
         backPropogate(y_true, o.back(), net, model, delta);
-        for (Index k=0; k < num_layers; k++){
+        // compute the gradient
+        for (Index k=0; k < model.num_layers; k++){
             total_gradient_per_layer[k] += o[k] * delta[k].transpose();
         }
-        // loss computation
-        ColumnVector y_estimated = o.back();
-        if(model.is_classification){
-            double clip = 1.e-10;
-            y_estimated = y_estimated.cwiseMax(clip).cwiseMin(1.-clip);
-            total_loss += - (y_true.array()*y_estimated.array().log()
-                + (-y_true.array()+1)*(-y_estimated.array()+1).log()).sum();
-        }
-        else{
-            total_loss += 0.5 * (y_estimated - y_true).squaredNorm();
-        }
+        // compute the loss
+        total_loss += getLoss(y_true, o.back(), model.is_classification);
     }
-    for (Index k=0; k < num_layers; k++){
+
+    // convert gradient to a gradient update vector
+    //  1. normalize to per row update
+    //  2. discount by stepsize
+    //  3. add regularization
+    //  4. make negative
+    for (Index k=0; k < model.num_layers; k++){
         Matrix regularization = MLP::lambda * model.u[k];
         regularization.row(0).setZero(); // Do not update bias
-        model.u[k] -=
-            stepsize *
-            (total_gradient_per_layer[k] / static_cast(num_rows_in_batch) +
-             regularization);
+        total_gradient_per_layer[k] = -stepsize * (total_gradient_per_layer[k] / static_cast(num_rows_in_batch) +
+                                                   regularization);
+        model.updateVelocity(total_gradient_per_layer[k], k);
+        model.updatePosition(total_gradient_per_layer[k], k);
     }
+
     return total_loss;
+
 }
+
 template
 void MLP::gradientInPlace(
         model_type ,
         const independent_variables_type,
         const dependent_variable_type _true,
-        const double) {
-    size_t N = model.u.size(); // assuming nu. of layers >= 1
+        const double)
+{
+    model.nesterovUpdatePosition();
+
     std::vector net, o, delta;
     feedForward(model, x, net, o);
     backPropogate(y_true, o.back(), net, model, delta);
-    for (size_t k=0; k < N; k++){
+    for (Index k=0; k < model.num_layers; k++){
         Matrix regularization = MLP::lambda*model.u[k];
         regularization.row(0).setZero(); // Do not update bias
-        model.u[k] -= stepsize * (o[k] * delta[k].transpose() + regularization);
+        if (model.momentum > 0){
+            Matrix gradient = -stepsize * (o[k] * delta[k].transpose() + regularization);
+            model.updateVelocity(gradient, k);
+            model.updatePosition(gradient, k);
+        }
+        else {
+            // Updating model inline instead of using updatePosition since
+            // updatePosition creates a copy of the gradient making it slower.
--- End diff --

This comment is a little confusing. We do pass along the gradient by reference to `updatePosition()`. If the
---
[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...
GitHub user kaknikhil opened a pull request:

    https://github.com/apache/madlib/pull/272

    MLP: Add momentum and nesterov to gradient updates.

    JIRA: MADLIB-1210

    We refactored the minibatch code to separate out the momentum and model update functions. We initially were using the same function to get the loss and gradient for both igd and minibatch, but the overhead of creating and updating the total_gradient_per_layer variable made igd slower. So we decided not to use the same code and are now calling the model and momentum update functions for both igd and minibatch.

    Co-authored-by: Rahul Iyer
    Co-authored-by: Jingyi Mei

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib feature/mlp_momentum

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/272.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #272

commit 176e197f48732443ce658c5d02cefc8c45e7ff52
Author: Rahul Iyer
Date: 2018-05-02T12:25:48Z

    MLP: Add momentum and nesterov to gradient updates.

    JIRA: MADLIB-1210

    We refactored the minibatch code to separate out the momentum and model update functions. We initially were using the same function to get the loss and gradient for both igd and minibatch, but the overhead of creating and updating the total_gradient_per_layer variable made igd slower. So we decided not to use the same code and are now calling the model and momentum update functions for both igd and minibatch.

    Co-authored-by: Rahul Iyer
    Co-authored-by: Jingyi Mei

---