keerthanvasist opened a new issue #17818: RNN operator produces inconsistent 
gradients for h2h_bias for stacked RNNs
URL: https://github.com/apache/incubator-mxnet/issues/17818
 
 
   The RNN operator produces inconsistent gradients for h2h_bias in the topmost layer of a stacked RNN across MXNet variants. I compared osx-cpu-mkl and linux-gpu-mkl (both built from source); in the CPU variant, the gradient is all zeros.
   
   These differences lead to exceptions in DJL. I also confirmed from Python that the gradient values differ between the variants.
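   
   For reference, here is a minimal sketch of how the gradients can be compared when both contexts are available in a single build (the original comparison was across separate osx-cpu-mkl and linux-gpu-mkl builds, so this is only an approximation; the helper name `h2h_bias_grad_sums` is mine, not part of the report):
   
   ```python
   import mxnet as mx
   from mxnet import gluon, np, npx
   from mxnet import autograd as ag
   
   npx.set_np()
   
   def h2h_bias_grad_sums(ctx):
       # Run the same 2-layer LSTM forward/backward on the given context
       # and report the gradient sum of each layer's h2h_bias.
       # Note: even with the same seed, CPU and GPU RNGs can initialize
       # parameters differently; the relevant signal here is whether a
       # gradient comes back identically zero.
       mx.random.seed(1234, ctx=ctx)
       net = gluon.rnn.LSTM(64, num_layers=2, layout='NTC')
       net.initialize(ctx=ctx)
       data = np.random.uniform(0, 10, size=(32, 28, 28), ctx=ctx)
       with ag.record():
           out = net(data)
           loss = out.sum()
       loss.backward()
       return {name: float(p.grad(ctx).sum())
               for name, p in net.collect_params().items()
               if 'h2h_bias' in name}
   
   print(h2h_bias_grad_sums(mx.cpu()))
   if mx.context.num_gpus() > 0:
       print(h2h_bias_grad_sums(mx.gpu(0)))
   ```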
   
   
   
   
   
   ### Error Message
   There is no error message from Python, but in DJL we have observed NaN values during training.
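   
   A quick way to surface such NaNs during training is to scan parameter values and gradients after each step. The sketch below uses the self-inequality trick (NaN is the only value that compares unequal to itself), so it does not depend on any particular `isnan` operator being available; `report_nans` is an illustrative helper name:
   
   ```python
   def report_nans(net):
       # NaN != NaN, so (x != x).any() flags NaNs in an ndarray.
       for name, param in net.collect_params().items():
           if (param.data() != param.data()).any():
               print("NaN in value of:", name)
           if param.grad_req != 'null' and (param.grad() != param.grad()).any():
               print("NaN in gradient of:", name)
   ```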
   
   ## To Reproduce
   I used the following script to look at gradients. 
   
   ```python
   import mxnet as mx
   from mxnet import gluon
   from mxnet import np, npx
   from mxnet import autograd as ag
   
   npx.set_np()
   
   def check_param(net):
       # Print value and gradient statistics for every LSTM parameter,
       # iterating over parameter types, layers, directions, and gates.
       param_dict = net.collect_params()
       keys = ('lstm0_{}{}_{}_{}'.format(d, l, g, t)
               for t in ['weight', 'bias']
               for l in range(2)
               for d in ['l', 'r'][:1]  # forward direction only
               for g in ['i2h', 'h2h'])
       for key in keys:
           param = param_dict[key]
           print("checking param: " + str(param))
           print("weight sum: " + str(param.data().sum()))
           print("weight mean: " + str(param.data().mean()))
           print("weight max: " + str(param.data().max()))
           print("weight min: " + str(param.data().min()))
           if param.grad_req != "null":
               print("checking the gradient of param: " + str(param))
               print("grad sum: " + str(param.grad().sum()))
               print("grad mean: " + str(param.grad().mean()))
               print("grad max: " + str(param.grad().max()))
               print("grad min: " + str(param.grad().min()))
   
   def print_ndarray_stats(ndarray, name):
       # Print summary statistics and the shape of an ndarray.
       print("#####", name, "#####")
       print("sum: " + str(ndarray.sum()))
       print("mean: " + str(ndarray.mean()))
       print("max: " + str(ndarray.max()))
       print("min: " + str(ndarray.min()))
       print("Shape: " + str(ndarray.shape))
   
   batch = 32
   time = 28
   channel = 28
   state = 64
   num_layers = 2
   
   mx.random.seed(1234)
   data = np.random.uniform(0, 10, size=(batch, time, channel))
   mx.random.seed(1234)
   labels = np.random.uniform(0, 1, size=(batch, time, state))
   
   net = gluon.rnn.LSTM(state, num_layers=num_layers,
                        h2h_weight_initializer=mx.initializer.Xavier(),
                        i2h_weight_initializer=mx.initializer.Xavier(),
                        layout='NTC')
   loss = gluon.loss.SoftmaxCrossEntropyLoss(sparse_label=False, from_logits=True)
   net.collect_params().initialize()
   
   with ag.record():
       z = net(data)
       L = loss(z, labels).mean()
       print_ndarray_stats(z, "OUTPUT")
       print("Loss = ", L)
   L.backward()
   check_param(net)
   ```
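   
   As a follow-up on the same `net`, the symptom can be made explicit by scanning for gradients that are identically zero; on the CPU variant, the topmost layer's h2h_bias shows up here:
   
   ```python
   # Run after L.backward() above: flag parameters whose entire gradient is zero.
   for name, param in net.collect_params().items():
       if param.grad_req != 'null' and (param.grad() == 0).all():
           print('all-zero gradient:', name)
   ```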
   
   ### Steps to reproduce
   
   1. Install the affected MXNet variant (e.g. osx-cpu-mkl or linux-gpu-mkl).
   2. Run the script above.
