Re: Using AMP

2020-05-01 Thread Naveen Swamy
Thanks Przemek, I appreciate your input. Let me apply the loss scale to the
gradient clipping value and run the experiment again.

On Fri, May 1, 2020 at 11:20 AM Przemysław Trędak wrote:

> Just realized I did not actually link to the issue I mentioned; it is
> https://github.com/apache/incubator-mxnet/issues/17507
>
> On 2020/05/01 18:19:27, Przemysław Trędak wrote:
> > Hi Naveen,
> >
> > The problem that you see with the loss comes from the fact that the model
> > clips the gradient, which in the case of AMP is scaled by the loss scale.
> > For clipping to work you need to apply the same loss scale to the value
> > you use to clip the gradients. This is currently possible in two ways:
> > either use the amp.unscale API to unscale the gradients before clipping,
> > or (currently quite hackily, there is an open issue [1] to expose it
> > properly) multiply your intended global norm of the gradients by
> > trainer._amp_loss_scaler.loss_scale.
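As a minimal sketch of the second option (assuming the clip_global_norm-based
clipping used in the word language model script linked below as [1]; model,
ctx, args.clip and trainer are taken from that script and the exact clip value
is illustrative only):

    from mxnet import gluon

    # Sketch only: clip against the intended global norm multiplied by the
    # current loss scale, so the threshold matches the AMP-scaled gradients.
    grads = [p.grad(ctx) for p in model.collect_params().values()]
    scale = trainer._amp_loss_scaler.loss_scale
    gluon.utils.clip_global_norm(grads, args.clip * scale)
    # Alternative (first option above): unscale the gradients with the
    # amp.unscale API and then clip against args.clip directly.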
> >
> > Gradient clipping with AMP is a common problem people run into and it
> > should be covered in the tutorial. I intend to update the tutorial with an
> > example of this, together with other changes intended to bring AMP out of
> > contrib.
> >
> > Regarding performance - it is quite hard to say what the reason is without
> > profiling the application; there could be multiple different bottlenecks
> > here, other than the actual computation on the GPU.
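As a starting point for profiling, a rough sketch using the built-in MXNet
profiler around a few training batches (the filename and the placement in the
loop are placeholders, not part of the example script):

    import mxnet as mx

    # Record operator, memory and API statistics for a few iterations.
    mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                           filename='amp_profile.json')
    mx.profiler.set_state('run')
    # ... run a handful of training batches here ...
    mx.nd.waitall()                # wait for asynchronous work to finish
    mx.profiler.set_state('stop')
    print(mx.profiler.dumps())     # aggregated stats; full trace in the JSON file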
> >
> > Hope this helps :-)
> > Przemek
> >
> > On 2020/05/01 05:14:39, Naveen Swamy  wrote:
> > > Hello,
> > > I am trying to use AMP on an RNN model; however, I am not seeing higher
> > > throughput with AMP, and the loss seems to have stagnated. I am
> > > wondering if I am missing something.
> > >
> > > Also, has AMP been tested on any RNN models, and are there any
> > > benchmarks? I would appreciate some input here.
> > >
> > > I used the RNN model here [1] and followed the tutorial in [2]; the
> > > output of the runs is:
> > > 
> > > Without AMP:
> > > mxnet-lm$ python train.py --cuda --tied --nhid 1500 --emsize 1500 --epochs 60 --dropout 0.65 --model gru --batch_size 128
> > >
> > > [Epoch 3 Batch 200/13] loss 6.47, ppl 648.24, throughput 675.94 samples/s
> > > [Epoch 3 Batch 400/13] loss 6.30, ppl 543.20, throughput 679.51 samples/s
> > > [Epoch 3] time cost 90.29s, valid loss 5.97, valid ppl 392.94
> > > test loss 5.89, test ppl 361.69
> > > [Epoch 4 Batch 200/13] loss 6.15, ppl 470.58, throughput 676.46 samples/s
> > > [Epoch 4 Batch 400/13] loss 6.01, ppl 408.21, throughput 679.51 samples/s
> > > [Epoch 4] time cost 90.27s, valid loss 5.69, valid ppl 296.89
> > >
> > > test loss 5.63, test ppl 277.58
> > > 
> > > With AMP:
> > >
> > > (gluonnlp) ubuntu@ip-172-30-0-140:~/mxnet-lm$ python train.py --cuda --tied --nhid 1500 --emsize 1500 --epochs 60 --dropout 0.65 --model gru --batch_size 128 --amp True
> > > Namespace(amp=True, batch_size=128, bptt=35, clip=0.25, cuda=True,
> > > dropout=0.65, emsize=1500, epochs=60, export_model=False, gcthreshold=0.5,
> > > gctype='none', hybridize=False, log_interval=200, lr=20, model='gru',
> > > nhid=1500, nlayers=2, save='model.params', static_alloc=False,
> > > static_shape=False, tied=True)
> > > using AMP
> > > INFO:root:Using AMP
> > > [Epoch 3 Batch 200/13] loss 10.43, ppl 34026.18, throughput 685.66 samples/s
> > > [Epoch 3 Batch 400/13] loss 10.38, ppl 32150.51, throughput 688.99 samples/s
> > > [Epoch 3] time cost 89.04s, valid loss 10.36, valid ppl 31650.83
> > > test loss 10.36, test ppl 31626.99
> > > INFO:root:AMP: increasing loss scale to 131072.00
> > > [Epoch 4 Batch 200/13] loss 10.42, ppl 33642.12, throughput 686.83 samples/s
> > > [Epoch 4 Batch 400/13] loss 10.37, ppl 31839.51, throughput 689.55 samples/s
> > > 
> > >
> > > Changes made to the training loop after initializing amp and the trainer:
> > >
> > > with autograd.record():
> > >     output, hidden = model(data, hidden)
> > >     # Here L is a vector of size batch_size * bptt
> > >     L = loss(output, target)
> > >     L = L / (args.bptt * args.batch_size)
> > >     with amp.scale_loss(L, trainer) as scaled_loss:
> > >         mx.autograd.backward(scaled_loss)
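For reference, the initialization this snippet assumes would look roughly like
the following, based on the contrib AMP tutorial in [2]; build_model is a
placeholder for the GRU construction in train.py and the optimizer settings are
illustrative only:

    import mxnet as mx
    from mxnet import gluon
    from mxnet.contrib import amp

    amp.init()                          # call before constructing the network
    model = build_model()               # placeholder for the GRU setup in train.py
    model.initialize(mx.init.Xavier(), ctx=context)
    trainer = gluon.Trainer(model.collect_params(), 'sgd',
                            {'learning_rate': args.lr})
    amp.init_trainer(trainer)           # enables dynamic loss scaling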
> > >
> > > 
> > > [1] https://github.com/apache/incubator-mxnet/blob/master/example/gluon/word_language_model/train.py
> > >
> > > [2] https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/amp.html
> > >
> > > Thanks, Naveen
> > >
> >
>

