oljike opened a new issue #16969: Gradient accumulation in Module
URL: https://github.com/apache/incubator-mxnet/issues/16969

Hi! I am trying to implement a simple MLP on MNIST with gradient accumulation using the MX Module API. `accum_step` is the number of gradient accumulation steps. I am doing the following:

1. Bind the model with `grad_req="add"`.
2. Run `forward(batch)` and `backward()` on each batch, accumulating gradients.
3. Every `accum_step` iterations, run `model.update()` and then zero the gradients with `model._exec_group.grad_arrays *= 0`.

My problem is that the model is not training at all, i.e. the score does not change. (Without gradient accumulation, with `grad_req='write'`, the model trains perfectly.)

Here is the full code to reproduce:

```python
import os
import mxnet as mx

data = mx.symbol.Variable('data')
fc1 = mx.symbol.FullyConnected(data, name='fc1', num_hidden=128)
act1 = mx.symbol.Activation(fc1, name='relu1', act_type='relu')
fc2 = mx.symbol.FullyConnected(act1, name='fc2', num_hidden=64)
act2 = mx.symbol.Activation(fc2, name='relu2', act_type='relu')
fc3 = mx.symbol.FullyConnected(act2, name='fc3', num_hidden=10)
softmax = mx.symbol.SoftmaxOutput(fc3, name='softmax')

accum = True
batch_size = 20 if accum else 100

train_dataiter = mx.io.MNISTIter(
    image=os.path.join("mnist", "train-images-idx3-ubyte"),
    label=os.path.join("mnist", "train-labels-idx1-ubyte"),
    data_shape=(784,), batch_size=batch_size,
    shuffle=True, flat=True, silent=False, seed=10)
val_dataiter = mx.io.MNISTIter(
    image=os.path.join("mnist", "t10k-images-idx3-ubyte"),
    label=os.path.join("mnist", "t10k-labels-idx1-ubyte"),
    data_shape=(784,), batch_size=batch_size,
    shuffle=True, flat=True, silent=False)

mod = mx.mod.Module(softmax)
mod.bind(data_shapes=train_dataiter.provide_data,
         label_shapes=train_dataiter.provide_label,
         grad_req='add' if accum else 'write')
mod.init_params()
mod.init_optimizer(optimizer_params={'learning_rate': 0.01, 'momentum': 0.9})

metric = mx.metric.create('acc')
n_epoch = 10
accum_step = 5

for i_epoch in range(n_epoch):
    for i_iter, batch in enumerate(train_dataiter):
        mod.forward(batch)
        mod.update_metric(metric, batch.label)
        mod.backward()
        if accum:
            if i_iter % accum_step == 0 and i_iter > 0:
                mod.update()
                mod._exec_group.grad_arrays *= 0  # intended to zero the gradients
        else:
            mod.update()
    for name, val in metric.get_name_value():
        print('epoch %03d: %s=%f' % (i_epoch, name, val))
    metric.reset()
    train_dataiter.reset()
```
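A likely culprit is the zeroing line: `grad_arrays` is a plain Python list (of per-device gradient arrays), so `grad_arrays *= 0` empties the list rather than zeroing the arrays it held, and the executor's gradient buffers are never reset. A minimal NumPy sketch of the difference (the list-of-arrays layout here is a stand-in for Module's internal structure, not the real `_exec_group`):

```python
import numpy as np

# Stand-in for Module's grad_arrays: a Python list of gradient buffers.
grad_arrays = [np.ones((2, 2)), np.ones((3,))]

# What the code above does: `list *= 0` EMPTIES the list.
broken = list(grad_arrays)
broken *= 0
print(broken)  # -> []

# What gradient accumulation needs: zero each buffer in place,
# so the executor keeps pointing at the same (now-cleared) arrays.
for g in grad_arrays:
    g[:] = 0
print(all((g == 0).all() for g in grad_arrays))  # -> True
```

With Module itself, the analogous fix would presumably be an in-place loop over the internal buffers, e.g. `for grads in mod._exec_group.grad_arrays: for g in grads: g[:] = 0` (hedged: `_exec_group` is a private attribute, so its exact nesting may differ by version). Note also that with `grad_req='add'` the summed gradient is `accum_step` times larger than a single-batch gradient, so one may want to scale it, for example via the optimizer's `rescale_grad` parameter.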
