[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-06-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/272


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-06-08 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r194151705
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +122,26 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot 
u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
 
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate 
learning and avoid local minima when using gradient descent. We also support 
Nesterov's accelerated gradient due to its look-ahead characteristics. \\
+Here we need to introduce two new variables, namely velocity and momentum. 
The momentum must be in the range 0 to 1, where 0 means no momentum.
+The velocity is the same size as the coefficient and is accumulated in the 
direction of persistent reduction, which speeds up the optimization. The 
momentum value is responsible for damping the velocity and is analogous to the 
coefficient of friction. \\
+In classical momentum we first correct the velocity and then update the 
model with that velocity, whereas in Nesterov momentum we first move the model 
in the direction of momentum*velocity , then correct the velocity and finally 
use the updated model to calculate the gradient. The main difference is that 
in classical momentum we compute the gradient before updating the model, 
whereas in Nesterov we first update the model and then compute the gradient 
from the updated position.\\
--- End diff --

`momentum*velocity ,` -> `momentum*velocity,`. The extra space before the 
comma is moving the `,` to the next line in the pdf.
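The classical vs. Nesterov update rules quoted above can be sketched in plain Python (an illustrative sketch mirroring the quoted equations, not MADlib's actual C++ implementation; `grad`, `mu`, and `eta` are hypothetical names for the gradient function, momentum, and stepsize):

```python
def classical_momentum_step(u, v, grad, mu, eta):
    # Classical momentum: compute the gradient at the current position u,
    # correct the velocity, then update the model with that velocity.
    v = mu * v - eta * grad(u)
    u = u + v
    return u, v

def nesterov_momentum_step(u, v, grad, mu, eta):
    # Nesterov: first move the model in the direction of momentum*velocity
    # (the look-ahead), then compute the gradient at the updated position
    # and use it for both the velocity and position updates.
    u = u + mu * v          # look-ahead step
    g = grad(u)             # gradient from the updated position
    v = mu * v - eta * g    # velocity update
    u = u - eta * g         # position update
    return u, v
```

The only structural difference is where `grad` is evaluated: at `u` for classical momentum, and at the look-ahead point `u + mu * v` for Nesterov.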


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-06-08 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r194151826
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -196,17 +221,28 @@ \subsubsection{The $\mathit{Gradient}$ Function}
 \end{algorithmic}
 \end{algorithm}
 
-\begin{algorithm}[mlp-train-iteration$(X, Y, \eta)$] 
\label{alg:mlp-train-iteration}
+\begin{algorithm}[mlp-train-iteration$(X, Y, \eta, mu)$] 
\label{alg:mlp-train-iteration}
--- End diff --

`mu` -> `\mu`


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r192246910
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot 
u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
 
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate 
learning and avoid local minima when using gradient descent. We also support 
Nesterov's accelerated gradient due to its look-ahead characteristics. \\
+Here we need to introduce two new variables, namely velocity and momentum. 
The momentum must be in the range 0 to 1, where 0 means no momentum. The 
momentum value is responsible for damping the velocity and is analogous to the 
coefficient of friction. \\
+In classical momentum you first correct the velocity and step with that 
velocity, whereas in Nesterov momentum you first step in the velocity direction 
then make a correction to the velocity vector based on the new location. \\
+Classic momentum update
+\[\begin{aligned}
+\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} 
\text{ (velocity update)} \\ % $\eta$,\\ *  $\nabla f(u)$
+\mathit{u} \set \mathit{u} + \mathit{v} \\
+\end{aligned}\]
+
+Nesterov momentum update
+\[\begin{aligned}
+\mathit{u} \set \mathit{u} + \mathit{mu} * \mathit{v} \text{ 
(Nesterov's initial coefficient update)} \\
+\mathit{v} \set \mathit{mu} * \mathit{v} -  \eta * \mathit{gradient} 
\text{ (velocity update)} \\ % $\eta$,\\ *  $\nabla f(u)$
+\mathit{u} \set \mathit{u} - \eta * \mathit{gradient} \\
--- End diff --

Can we left-align these and other momentum related equations?
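One way to get the left alignment the reviewer asks for is the `flalign*` environment from `amsmath` (a sketch only; `\set` is assumed to be the design doc's existing assignment macro):

```latex
% flalign* pushes content toward the margins; a leading & plus a
% trailing & makes each equation sit flush left.
\begin{flalign*}
& \mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient}
  \text{ (velocity update)} & \\
& \mathit{u} \set \mathit{u} + \mathit{v} &
\end{flalign*}
```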


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r192246463
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot 
u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
 
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate 
learning and avoid local minima when using gradient descent. We also support 
Nesterov's accelerated gradient due to its look-ahead characteristics. \\
+Here we need to introduce two new variables, namely velocity and momentum. 
The momentum must be in the range 0 to 1, where 0 means no momentum. The 
momentum value is responsible for damping the velocity and is analogous to the 
coefficient of friction. \\
--- End diff --

We only describe the range for momentum, but nothing about velocity before 
the next line begins. Can we say a bit about what velocity is before we 
describe how it is used?


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r192251198
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot 
u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
--- End diff --

Please add your name to the list of authors in the beginning of the file. 
Maybe add the history for this file as well?


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r192245605
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot 
u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
 
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate 
learning and avoid local minima when using gradient descent. We also support 
Nesterov's accelerated gradient due to its look-ahead characteristics. \\
+Here we need to introduce two new variables, namely velocity and momentum. 
The momentum must be in the range 0 to 1, where 0 means no momentum. The 
momentum value is responsible for damping the velocity and is analogous to the 
coefficient of friction. \\
+In classical momentum you first correct the velocity and step with that 
velocity, whereas in Nesterov momentum you first step in the velocity direction 
then make a correction to the velocity vector based on the new location. \\
--- End diff --

`step with that velocity` is a little confusing to me. Do we have some 
source where it is defined this way?
If it's any better, can we use the following text to say what the 
difference between momentum and NAG is (source is 
http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf):
```
... the key difference between momentum and Nesterov’s
accelerated gradient is that momentum computes the gradient before applying 
the velocity, while Nesterov’s
accelerated gradient computes the gradient after doing so.
```
If `step with that velocity` is a standard way of defining it, then I am 
okay with it.

This comment applies to user and online docs too.
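The quoted thesis wording can be checked numerically: given identical state, the two methods differ only in the point at which the gradient is evaluated. A toy example (hypothetical values; `f(w) = w**2`):

```python
mu, eta = 0.9, 0.1
u, v = 2.0, -0.5
grad = lambda w: 2 * w   # gradient of f(w) = w**2

# Momentum: gradient computed before applying the velocity.
g_cm = grad(u)            # evaluated at u = 2.0
# NAG: gradient computed after applying the velocity (look-ahead point).
g_nag = grad(u + mu * v)  # evaluated at u + mu*v = 1.55
```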


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r192248589
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -117,6 +117,24 @@ \subsubsection{Backpropagation}
 \[\boxed{\delta_{k}^j = \sum_{t=1}^{n_{k+1}} \left( \delta_{k+1}^t \cdot 
u_{k}^{jt} \right) \cdot \phi'(\mathit{net}_{k}^j)}\]
 where $k = 1,...,N-1$, and $j = 1,...,n_{k}$.
 
+\paragraph{Momentum updates.}
+Momentum\cite{momentum_ilya}\cite{momentum_cs231n} can help accelerate 
learning and avoid local minima when using gradient descent. We also support 
Nesterov's accelerated gradient due to its look-ahead characteristics. \\
+Here we need to introduce two new variables, namely velocity and momentum. 
The momentum must be in the range 0 to 1, where 0 means no momentum. The 
momentum value is responsible for damping the velocity and is analogous to the 
coefficient of friction. \\
+In classical momentum you first correct the velocity and step with that 
velocity, whereas in Nesterov momentum you first step in the velocity direction 
then make a correction to the velocity vector based on the new location. \\
+Classic momentum update
+\[\begin{aligned}
+\mathit{v} \set \mathit{mu} * \mathit{v} - \eta * \mathit{gradient} 
\text{ (velocity update)} \\ % $\eta$,\\ *  $\nabla f(u)$
+\mathit{u} \set \mathit{u} + \mathit{v} \\
+\end{aligned}\]
+
+Nesterov momentum update
+\[\begin{aligned}
+\mathit{u} \set \mathit{u} + \mathit{mu} * \mathit{v} \text{ 
(Nesterov's initial coefficient update)} \\
+\mathit{v} \set \mathit{mu} * \mathit{v} -  \eta * \mathit{gradient} 
\text{ (velocity update)} \\ % $\eta$,\\ *  $\nabla f(u)$
--- End diff --

Instead of `gradient`, can we replace it with the actual math notation?
I think the gradient at this line is: `\frac{\partial f}{\partial 
u_{k-1}^{sj}}`
The gradient in the next line should be w.r.t `v` instead of `u`.
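For illustration, applying the reviewer's suggestion to the classic update would give something like the following (a sketch only; `\set` is assumed to be the design doc's assignment macro, and the indices follow the reviewer's comment):

```latex
\[\begin{aligned}
v_{k}^{sj} &\set \mathit{mu} \cdot v_{k}^{sj}
  - \eta \cdot \frac{\partial f}{\partial u_{k-1}^{sj}}
  \quad \text{(velocity update)} \\
u_{k-1}^{sj} &\set u_{k-1}^{sj} + v_{k}^{sj}
\end{aligned}\]
```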


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191945714
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -197,6 +244,7 @@ MLP<Model, Tuple>::loss(
 const model_type                    &model,
 const independent_variables_type    &x,
 const dependent_variable_type       &y_true) {
+
--- End diff --

+1 will do


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191942290
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1781,7 +1799,7 @@ class MLPMinibatchPreProcessor:
 summary_table_columns = summary_table_columns[0]
 
 required_columns = (self.DEPENDENT_VARNAME, 
self.INDEPENDENT_VARNAME,
-self.CLASS_VALUES, self.GROUPING_COL)
+self.CLASS_VALUES, self.GROUPING_COL, 
self.DEPENDENT_VARTYPE)
 if set(required_columns) <= set(summary_table_columns):
 self.preprocessed_summary_dict = summary_table_columns
 else:
--- End diff --

+1 will do


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191942284
  
--- Diff: src/ports/postgres/modules/convex/mlp.sql_in ---
@@ -1474,13 +1480,15 @@ CREATE AGGREGATE MADLIB_SCHEMA.mlp_minibatch_step(
 /* warm_start_coeff */DOUBLE PRECISION[],
 /* lambda */  DOUBLE PRECISION,
 /* batch_size */  INTEGER,
-/* n_epochs */INTEGER
+/* n_epochs */INTEGER,
+/* momentum */DOUBLE PRECISION,
+/* is_nesterov */ BOOLEAN
 )(
 STYPE=DOUBLE PRECISION[],
 SFUNC=MADLIB_SCHEMA.mlp_minibatch_transition,
 m4_ifdef(`__POSTGRESQL__', `', 
`prefunc=MADLIB_SCHEMA.mlp_minibatch_merge,')
 FINALFUNC=MADLIB_SCHEMA.mlp_minibatch_final,
-INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0}'
+INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
 );
 -
 
--- End diff --

+1 will do


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191575057
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1781,7 +1799,7 @@ class MLPMinibatchPreProcessor:
 summary_table_columns = summary_table_columns[0]
 
 required_columns = (self.DEPENDENT_VARNAME, 
self.INDEPENDENT_VARNAME,
-self.CLASS_VALUES, self.GROUPING_COL)
+self.CLASS_VALUES, self.GROUPING_COL, 
self.DEPENDENT_VARTYPE)
 if set(required_columns) <= set(summary_table_columns):
 self.preprocessed_summary_dict = summary_table_columns
 else:
--- End diff --

Can you please update the optimizer params in online docs?


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191533727
  
--- Diff: src/modules/convex/type/model.hpp ---
@@ -126,45 +129,96 @@ struct MLPModel {
 for (k = 0; k < N; k ++) {
 size += static_cast<uint32_t>((n[k] + 1) * (n[k+1]));
 }
+//TODO conditionally assign size based on momentum
--- End diff --

Is this TODO still valid?


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191574947
  
--- Diff: src/ports/postgres/modules/convex/mlp.sql_in ---
@@ -1474,13 +1480,15 @@ CREATE AGGREGATE MADLIB_SCHEMA.mlp_minibatch_step(
 /* warm_start_coeff */DOUBLE PRECISION[],
 /* lambda */  DOUBLE PRECISION,
 /* batch_size */  INTEGER,
-/* n_epochs */INTEGER
+/* n_epochs */INTEGER,
+/* momentum */DOUBLE PRECISION,
+/* is_nesterov */ BOOLEAN
 )(
 STYPE=DOUBLE PRECISION[],
 SFUNC=MADLIB_SCHEMA.mlp_minibatch_transition,
 m4_ifdef(`__POSTGRESQL__', `', 
`prefunc=MADLIB_SCHEMA.mlp_minibatch_merge,')
 FINALFUNC=MADLIB_SCHEMA.mlp_minibatch_final,
-INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0}'
+INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
 );
 -
 
--- End diff --

Can you please update the user docs with momentum and nesterov related 
optimizer params? Is it also a good time to update the MLP design doc?


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191539168
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -197,6 +244,7 @@ MLP<Model, Tuple>::loss(
 const model_type                    &model,
 const independent_variables_type    &x,
 const dependent_variable_type       &y_true) {
+
--- End diff --

The loss computation code in this function is duplicated in `getLoss()` 
function. Can we call that function to remove the code duplication?
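For illustration, the suggested deduplication amounts to routing both call sites through one shared loss helper. A hedged Python analogue of the quoted C++ logic (NumPy stands in for Eigen; `get_loss` mirrors the quoted clipping and cross-entropy code):

```python
import numpy as np

def get_loss(y_true, y_estimated, is_classification):
    # Shared loss helper, mirroring the quoted C++: clip predictions to
    # avoid log(0), then cross-entropy for classification, else 0.5 * SSE.
    if is_classification:
        clip = 1.0e-10
        y_estimated = np.clip(y_estimated, clip, 1.0 - clip)
        return -np.sum(y_true * np.log(y_estimated)
                       + (1.0 - y_true) * np.log(1.0 - y_estimated))
    return 0.5 * np.sum((y_estimated - y_true) ** 2)
```

With a helper like this, both `loss()` and `getLossAndUpdateModel()` can call the same code path instead of duplicating the computation.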


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191537654
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -126,68 +157,84 @@ MLP<Model, Tuple>::getLossAndUpdateModel(
 const Matrix &x_batch,
 const Matrix &y_true_batch,
 const double &stepsize) {
 
-uint16_t num_layers = static_cast<uint16_t>(model.u.size()); // 
assuming nu. of layers >= 1
-Index num_rows_in_batch = x_batch.rows();
 double total_loss = 0.;
+// model is updated with the momentum step (i.e. velocity vector)
+// if Nesterov Accelerated Gradient is enabled
+model.nesterovUpdatePosition();
 
-// gradient added over the batch
-std::vector<Matrix> total_gradient_per_layer(num_layers);
-for (Index k=0; k < num_layers; ++k)
+// initialize gradient vector
+std::vector<Matrix> total_gradient_per_layer(model.num_layers);
+for (Index k=0; k < model.num_layers; ++k) {
 total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(),
model.u[k].cols());
+}
 
+std::vector<ColumnVector> net, o, delta;
+Index num_rows_in_batch = x_batch.rows();
 for (Index i=0; i < num_rows_in_batch; i++){
+// gradient and loss added over the batch
 ColumnVector x = x_batch.row(i);
 ColumnVector y_true = y_true_batch.row(i);
 
-std::vector<ColumnVector> net, o, delta;
 feedForward(model, x, net, o);
 backPropogate(y_true, o.back(), net, model, delta);
 
-for (Index k=0; k < num_layers; k++){
+// compute the gradient
+for (Index k=0; k < model.num_layers; k++){
 total_gradient_per_layer[k] += o[k] * delta[k].transpose();
 }
 
-// loss computation
-ColumnVector y_estimated = o.back();
-if(model.is_classification){
-double clip = 1.e-10;
-y_estimated = y_estimated.cwiseMax(clip).cwiseMin(1.-clip);
-total_loss += - (y_true.array()*y_estimated.array().log()
-   + 
(-y_true.array()+1)*(-y_estimated.array()+1).log()).sum();
-}
-else{
-total_loss += 0.5 * (y_estimated - y_true).squaredNorm();
-}
+// compute the loss
+total_loss += getLoss(y_true, o.back(), model.is_classification);
 }
-for (Index k=0; k < num_layers; k++){
+
+// convert gradient to a gradient update vector
+//  1. normalize to per row update
+//  2. discount by stepsize
+//  3. add regularization
+//  4. make negative
+for (Index k=0; k < model.num_layers; k++){
 Matrix regularization = MLP::lambda * model.u[k];
 regularization.row(0).setZero(); // Do not update bias
-model.u[k] -=
-stepsize *
-(total_gradient_per_layer[k] / 
static_cast<double>(num_rows_in_batch) +
- regularization);
+total_gradient_per_layer[k] = -stepsize * 
(total_gradient_per_layer[k] / static_cast<double>(num_rows_in_batch) +
+  regularization);
+model.updateVelocity(total_gradient_per_layer[k], k);
+model.updatePosition(total_gradient_per_layer[k], k);
 }
+
 return total_loss;
+
 }
 
+
template <class Model, class Tuple>
 void
 MLP<Model, Tuple>::gradientInPlace(
 model_type  &model,
 const independent_variables_type &x,
 const dependent_variable_type   &y_true,
-const double &stepsize) {
-size_t N = model.u.size(); // assuming nu. of layers >= 1
+const double &stepsize)
+{
+model.nesterovUpdatePosition();
+
 std::vector net, o, delta;
 
 feedForward(model, x, net, o);
 backPropogate(y_true, o.back(), net, model, delta);
 
-for (size_t k=0; k < N; k++){
+for (Index k=0; k < model.num_layers; k++){
 Matrix regularization = MLP::lambda*model.u[k];
 regularization.row(0).setZero(); // Do not update bias
-model.u[k] -= stepsize * (o[k] * delta[k].transpose() + 
regularization);
+if (model.momentum > 0){
+Matrix gradient = -stepsize * (o[k] * delta[k].transpose() + 
regularization);
+model.updateVelocity(gradient, k);
+model.updatePosition(gradient, k);
+}
+else {
+// Updating model inline instead of using updatePosition since
+// updatePosition creates a copy of the gradient making it 
slower.
--- End diff --

This comment is a little confusing. We do pass along the gradient by 
reference to `updatePosition()`. If the 


---

[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-24 Thread kaknikhil
GitHub user kaknikhil opened a pull request:

https://github.com/apache/madlib/pull/272

MLP: Add momentum and nesterov to gradient updates.

JIRA: MADLIB-1210

We refactored the minibatch code to separate out the momentum and model
update functions. We initially were using the same function to get the
loss and gradient for both igd and minibatch but the overhead of
creating and updating the total_gradient_per_layer variable made igd
slower. So we decided not to use the same code and are now calling the
model and momentum update functions for both igd and minibatch

Co-authored-by: Rahul Iyer
Co-authored-by: Jingyi Mei 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/mlp_momentum

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/272.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #272


commit 176e197f48732443ce658c5d02cefc8c45e7ff52
Author: Rahul Iyer 
Date:   2018-05-02T12:25:48Z

MLP: Add momentum and nesterov to gradient updates.

JIRA: MADLIB-1210

We refactored the minibatch code to separate out the momentum and model
update functions. We initially were using the same function to get the
loss and gradient for both igd and minibatch but the overhead of
creating and updating the total_gradient_per_layer variable made igd
slower. So we decided not to use the same code and are now calling the
model and momentum update functions for both igd and minibatch

Co-authored-by: Rahul Iyer
Co-authored-by: Jingyi Mei 




---