oljike opened a new issue #16969: Gradient accumulation in Module
URL: https://github.com/apache/incubator-mxnet/issues/16969

Hi! I am trying to implement a simple MLP on MNIST with gradient accumulation using the MX Module API. `accum_step` is the number of gradient accumulation steps. I am doing the following:

1. Bind the model with `grad_req="add"`.
2. Run `forward(batch)` and `backward()` on each batch, accumulating gradients.
3. Every `accum_step` iterations, run `model.update()` and then zero the gradients with `model._exec_group.grad_arrays *= 0`.

My problem is that the model is not training at all, i.e. the score does not change. (Without gradient accumulation, with `grad_req='write'`, the model trains perfectly.)

Here is the full code to reproduce:

```python
import os
import mxnet as mx

data = mx.symbol.Variable('data')
fc1 = mx.symbol.FullyConnected(data, name='fc1', num_hidden=128)
act1 = mx.symbol.Activation(fc1, name='relu1', act_type='relu')
fc2 = mx.symbol.FullyConnected(act1, name='fc2', num_hidden=64)
act2 = mx.symbol.Activation(fc2, name='relu2', act_type='relu')
fc3 = mx.symbol.FullyConnected(act2, name='fc3', num_hidden=10)
softmax = mx.symbol.SoftmaxOutput(fc3, name='softmax')

accum = True
batch_size = 20 if accum else 100

train_dataiter = mx.io.MNISTIter(
    image=os.path.join("mnist", "train-images-idx3-ubyte"),
    label=os.path.join("mnist", "train-labels-idx1-ubyte"),
    data_shape=(784,), batch_size=batch_size,
    shuffle=True, flat=True, silent=False, seed=10)
val_dataiter = mx.io.MNISTIter(
    image=os.path.join("mnist", "t10k-images-idx3-ubyte"),
    label=os.path.join("mnist", "t10k-labels-idx1-ubyte"),
    data_shape=(784,), batch_size=batch_size,
    shuffle=True, flat=True, silent=False)

mod = mx.mod.Module(softmax)
mod.bind(data_shapes=train_dataiter.provide_data,
         label_shapes=train_dataiter.provide_label,
         grad_req='add' if accum else 'write')
mod.init_params()
mod.init_optimizer(optimizer_params={'learning_rate': 0.01, 'momentum': 0.9})

metric = mx.metric.create('acc')
n_epoch = 10
accum_step = 5

for i_epoch in range(n_epoch):
    for i_iter, batch in enumerate(train_dataiter):
        mod.forward(batch)
        mod.update_metric(metric, batch.label)
        mod.backward()
        if accum:
            if i_iter % accum_step == 0 and i_iter > 0:
                mod.update()
                mod._exec_group.grad_arrays *= 0  # intended to zero the gradients
        else:
            mod.update()
    for name, val in metric.get_name_value():
        print('epoch %03d: %s=%f' % (i_epoch, name, val))
    metric.reset()
    train_dataiter.reset()
```
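A likely culprit is the zeroing line: `grad_arrays` is a plain Python list (of per-device gradient arrays), so `grad_arrays *= 0` empties the list rather than zeroing the arrays it held, and the executor's gradient buffers are never reset. A minimal NumPy sketch of the difference (the list-of-arrays layout here is a stand-in for Module's internal structure, not the real `_exec_group`):

```python
import numpy as np

# Stand-in for Module's grad_arrays: a Python list of gradient buffers.
grad_arrays = [np.ones((2, 2)), np.ones((3,))]

# What the code above does: `list *= 0` EMPTIES the list.
broken = list(grad_arrays)
broken *= 0
print(broken)  # -> []

# What gradient accumulation needs: zero each buffer in place,
# so the executor keeps pointing at the same (now-cleared) arrays.
for g in grad_arrays:
    g[:] = 0
print(all((g == 0).all() for g in grad_arrays))  # -> True
```

With Module itself, the analogous fix would presumably be an in-place loop over the internal buffers, e.g. `for grads in mod._exec_group.grad_arrays: for g in grads: g[:] = 0` (hedged: `_exec_group` is a private attribute, so its exact nesting may differ by version). Note also that with `grad_req='add'` the summed gradient is `accum_step` times larger than a single-batch gradient, so one may want to scale it, for example via the optimizer's `rescale_grad` parameter.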
