zixuanweeei commented on issue #17086: [MKLDNN] RNN Op gradient computation is broken
URL: https://github.com/apache/incubator-mxnet/issues/17086#issuecomment-569213067

Hi, @liuzh91 @szhengac. We have posted https://github.com/apache/incubator-mxnet/pull/17183 to fix the gradient explosion issue in RNN Backward. Thanks for reporting this issue again. It would be greatly appreciated if you could test this patch. Thanks.

BTW, we got the training log below:

```
❯ python word_language_model.py --log-interval=1
/path/to/mxnet/python/mxnet/optimizer/optimizer.py:167: UserWarning: WARNING: New optimizer gluonnlp.optimizer.lamb.LAMB is overriding existing optimizer mxnet.optimizer.optimizer.LAMB
  Optimizer.opt_registry[name].__name__))
Namespace(alpha=2, batch_size=80, beta=1, bptt=70, clip=0.25, dropout=0.4, dropout_e=0.1, dropout_h=0.2, dropout_i=0.65, emsize=400, epochs=750, eval_only=False, gpu=None, log_interval=1, lr=30, lr_update_factor=0.1, lr_update_interval=30, model='lstm', nhid=1150, nlayers=3, ntasgd=False, optimizer='sgd', save='model.params', test_mode=False, tied=False, wd=1.2e-06, weight_dropout=0.5)
Use AWDRNN
AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): HybridSequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 1150, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(None -> 33278, linear)
  )
)
[Epoch 0 Batch 1/372] current loss 20.50, ppl 796977445.38, throughput 18.37 samples/s, lr 30.86
[Epoch 0 Batch 2/372] current loss 9.51, ppl 13511.50, throughput 39.56 samples/s, lr 28.29
[Epoch 0 Batch 3/372] current loss 17.53, ppl 41003388.51, throughput 40.65 samples/s, lr 27.43
[Epoch 0 Batch 4/372] current loss 9.45, ppl 12761.47, throughput 40.39 samples/s, lr 27.43
[Epoch 0 Batch 5/372] current loss 14.34, ppl 1695623.66, throughput 35.59 samples/s, lr 31.71
[Epoch 0 Batch 6/372] current loss 9.40, ppl 12113.46, throughput 35.10 samples/s, lr 32.14
[Epoch 0 Batch 7/372] current loss 8.56, ppl 5232.00, throughput 37.62 samples/s, lr 30.00
[Epoch 0 Batch 8/372] current loss 9.32, ppl 11163.67, throughput 42.00 samples/s, lr 26.57
[Epoch 0 Batch 9/372] current loss 8.44, ppl 4642.37, throughput 61.95 samples/s, lr 17.14
[Epoch 0 Batch 10/372] current loss 8.92, ppl 7494.76, throughput 41.39 samples/s, lr 27.00
```
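
For anyone testing the patch, one rough way to check a run like the one above is to parse the per-batch lines and confirm that loss and ppl stay finite. The sketch below is only an illustration, not part of the patch or of the training script. The regular expression is written against the log format shown above, and `parse_log`, `ppl_matches_loss`, and `looks_stable` are hypothetical helper names. The assumption that ppl equals exp(loss) is inferred from the printed numbers, not taken from the script.

```python
import math
import re

# Matches the per-batch lines in the log above. Values are captured loosely
# so that "nan" or "inf" (a symptom of exploding gradients) also parse.
LOG_LINE = re.compile(
    r"\[Epoch (\d+) Batch (\d+)/(\d+)\] current loss ([^,\s]+), "
    r"ppl ([^,\s]+), throughput ([^,\s]+) samples/s, lr ([^,\s]+)"
)


def parse_log(text):
    """Return one dict per batch line found in the given log text."""
    rows = []
    for m in LOG_LINE.finditer(text):
        rows.append({
            "epoch": int(m.group(1)),
            "batch": int(m.group(2)),
            "loss": float(m.group(4)),
            "ppl": float(m.group(5)),
            "lr": float(m.group(7)),
        })
    return rows


def ppl_matches_loss(row, tol=0.006):
    """Check log(ppl) against the printed loss (assumes ppl = exp(loss)).

    The loss is printed with two decimals, so the tolerance only needs to
    cover rounding.
    """
    if not (math.isfinite(row["ppl"]) and row["ppl"] > 0):
        return False
    return abs(math.log(row["ppl"]) - row["loss"]) <= tol


def looks_stable(rows):
    """True if every parsed batch reports a finite loss and ppl."""
    return all(
        math.isfinite(r["loss"]) and math.isfinite(r["ppl"]) for r in rows
    )


# Two lines copied from the log above.
sample = (
    "[Epoch 0 Batch 1/372] current loss 20.50, ppl 796977445.38, "
    "throughput 18.37 samples/s, lr 30.86\n"
    "[Epoch 0 Batch 2/372] current loss 9.51, ppl 13511.50, "
    "throughput 39.56 samples/s, lr 28.29\n"
)

if __name__ == "__main__":
    rows = parse_log(sample)
    print(looks_stable(rows), all(ppl_matches_loss(r) for r in rows))
```

A finite loss on every batch does not by itself show that the gradients are correct. It only rules out the NaN/inf symptom. Comparing the loss curve against a run without MKLDNN would be a stronger check.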
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
