stephenrawls edited a comment on issue #15278: fixing var-seq-len rnn backward() operator
URL: https://github.com/apache/incubator-mxnet/pull/15278#issuecomment-503774511

Just to keep the ticket updated, I have confirmed the following facts:

1. If I set each `sequence_length` entry to the maximum sequence length, then the gradients of the reference net and the var-seq-len net do match.
2. When I turn on cuDNN debug logging, I *am* calling the appropriate "unpacked enabled" version of the cuDNN API, and the expected seq-len values are passed in. That is, I set:

```
export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG=/home/ec2-user/cudnn.dbg.log
```

and in the resulting log I see:

```
I! CuDNN (v7501) function cudnnRNNForwardTrainingEx() called:
...
    paddingMode: type=cudnnRNNPaddingMode_t; val=CUDNN_RNN_PADDED_IO_ENABLED (1);
...
    i! seqLengthArray: type=int; val=[10,7,10,11,8,3,5,11,6,2];
```

This matches the corresponding call to the backward data function:

```
I! CuDNN (v7501) function cudnnRNNBackwardDataEx() called:
...
    paddingMode: type=cudnnRNNPaddingMode_t; val=CUDNN_RNN_PADDED_IO_ENABLED (1);
...
    seqLengthArray: type=int; val=[10,7,10,11,8,3,5,11,6,2];
```

and likewise for `cudnnRNNBackwardWeightsEx()`.

My suspicion now is that the reference net's gradient is losing floating-point precision because it passes through extra reverse/concat/etc. operations. I am going to consider another way of constructing the reference net for testing the gradient.
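One way to make the comparison robust to exactly this kind of precision loss is to compare gradients with explicit relative/absolute tolerances rather than bitwise equality. A minimal NumPy sketch, purely illustrative (the `compare_grads` helper is hypothetical, not part of the MXNet test suite), which also demonstrates why reordered float32 accumulation can perturb results:

```python
import numpy as np

def compare_grads(g_test, g_ref, rtol=1e-3, atol=1e-5):
    """Compare two gradient arrays with tolerances instead of exact equality.

    Hypothetical helper, not from the MXNet test suite. Returns whether the
    arrays match within tolerance, plus the max absolute and relative errors.
    """
    abs_err = float(np.max(np.abs(g_test - g_ref)))
    denom = np.maximum(np.abs(g_ref), atol)
    rel_err = float(np.max(np.abs(g_test - g_ref) / denom))
    ok = bool(np.allclose(g_test, g_ref, rtol=rtol, atol=atol))
    return ok, abs_err, rel_err

# Why extra reverse/concat ops can matter: float32 addition is not
# associative, so changing the order of accumulation perturbs low-order bits.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)
seq_sum = np.float32(0.0)
for v in x:
    seq_sum += v          # strictly sequential accumulation
vec_sum = x.sum()         # NumPy's internal (pairwise) accumulation
print(abs(float(seq_sum) - float(vec_sum)))  # small but typically nonzero
```

With a tolerance-based check like this, small accumulation-order differences between the var-seq-len net and the reference net would no longer register as gradient mismatches.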
