Hi Naveen,

The problem you see with the loss comes from gradient clipping: with AMP the 
gradients are scaled by the loss scale, so clipping them against your usual 
threshold effectively clips them much harder than intended and training 
stagnates. For it to work you need to apply the same loss scale to the value 
you use for clipping. This is currently possible in two ways: either use the 
amp.unscale API to unscale the gradients before clipping, or (currently quite 
hacky, there is an open issue [1] to expose it properly) multiply your 
intended global gradient norm by trainer._amp_loss_scaler.loss_scale.
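
For example, the two options look roughly like this in the training loop 
(untested sketch; max_norm stands for whatever threshold you currently pass to 
clip_global_norm, and grads/ctx are as in the example script):

from mxnet import gluon
from mxnet.contrib import amp

# ... inside the training loop, after backward() on the scaled loss ...
grads = [p.grad(ctx) for p in model.collect_params().values()]

# Option 1: unscale the gradients first, then clip with the usual threshold.
amp.unscale(trainer)
gluon.utils.clip_global_norm(grads, max_norm)

# Option 2: keep the gradients scaled, but scale the threshold to match
# (workaround until the loss scale is exposed through a public API):
# gluon.utils.clip_global_norm(grads, max_norm * trainer._amp_loss_scaler.loss_scale)

trainer.step(1)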

Gradient clipping with AMP is a problem people commonly run into, and it 
should be covered in the tutorial. I intend to update the tutorial with an 
example of this, together with other changes intended to bring AMP out of 
contrib.

Regarding performance - it is hard to say what the reason is without profiling 
the application; there could be multiple different bottlenecks here other than 
the actual computation on the GPU.
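
If you want to dig in, the built-in profiler is a good starting point 
(minimal sketch; the output filename and the profiled region are just 
placeholders):

import mxnet as mx

mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='lm_profile.json')
mx.profiler.set_state('run')

# ... run a few training iterations here ...
mx.nd.waitall()                     # make sure all asynchronous work finished

mx.profiler.set_state('stop')
print(mx.profiler.dumps())          # aggregated per-operator statistics

Looking at how much time is spent in the actual GPU kernels versus data 
loading and synchronization should tell you whether AMP can help at all here.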

Hope this helps :-)
Przemek

On 2020/05/01 05:14:39, Naveen Swamy <mnnav...@gmail.com> wrote: 
> Hello,
> I am trying to use AMP on an RNN model, however I am not seeing higher
> throughput with AMP. Also, the loss seems to have stagnated. I am
> wondering if I am missing something.
> 
> Also, has AMP been tested on any RNN models, and are there any
> benchmarks? I would appreciate some input here.
> 
> I used the RNN model here [1] and followed the tutorial in [2], the output
> of the runs are
> ----
> Without AMP:
> mxnet-lm$ python train.py --cuda --tied --nhid 1500 --emsize 1500 --epochs
> 60  --dropout 0.65 --model gru --batch_size 128
> 
> [Epoch 3 Batch 200/13] loss 6.47, ppl 648.24, throughput 675.94 samples/s
> [Epoch 3 Batch 400/13] loss 6.30, ppl 543.20, throughput 679.51 samples/s
> [Epoch 3] time cost 90.29s, valid loss 5.97, valid ppl 392.94
> test loss 5.89, test ppl 361.69
> [Epoch 4 Batch 200/13] loss 6.15, ppl 470.58, throughput 676.46 samples/s
> [Epoch 4 Batch 400/13] loss 6.01, ppl 408.21, throughput 679.51 samples/s
> [Epoch 4] time cost 90.27s, valid loss 5.69, valid ppl 296.89
> 
> test loss 5.63, test ppl 277.58
> ----
> With AMP:
> 
> (gluonnlp) ubuntu@ip-172-30-0-140:~/mxnet-lm$ python train.py --cuda --tied
> --nhid 1500 --emsize 1500 --epochs 60  --dropout 0.65 --model gru
> --batch_size 128 --amp True
> Namespace(amp=True, batch_size=128, bptt=35, clip=0.25, cuda=True,
> dropout=0.65, emsize=1500, epochs=60, export_model=False, gcthreshold=0.5,
> gctype='none', hybridize=False, log_interval=200, lr=20, model='gru',
> nhid=1500, nlayers=2, save='model.params', static_alloc=False,
> static_shape=False, tied=True)
> using AMP
> INFO:root:Using AMP
> [Epoch 3 Batch 200/13] loss 10.43, ppl 34026.18, throughput 685.66 samples/s
> [Epoch 3 Batch 400/13] loss 10.38, ppl 32150.51, throughput 688.99 samples/s
> [Epoch 3] time cost 89.04s, valid loss 10.36, valid ppl 31650.83
> test loss 10.36, test ppl 31626.99
> INFO:root:AMP: increasing loss scale to 131072.000000
> [Epoch 4 Batch 200/13] loss 10.42, ppl 33642.12, throughput 686.83 samples/s
> [Epoch 4 Batch 400/13] loss 10.37, ppl 31839.51, throughput 689.55 samples/s
> ----
> 
> changes made to the training loop after initializing amp and the trainer:
> 
> with autograd.record():
>     output, hidden = model(data, hidden)
>     # Here L is a vector of size batch_size * bptt
>     L = loss(output, target)
>     L = L / (args.bptt * args.batch_size)
>     with amp.scale_loss(L, trainer) as scaled_loss:
>         mx.autograd.backward(scaled_loss)
> 
> ----
> [1]
> https://github.com/apache/incubator-mxnet/blob/master/example/gluon/word_language_model/train.py
> 
> [2]
> https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/amp.html
> 
> Thanks, Naveen
> 
