IssacCheng opened a new issue #15738: AMP results in higher loss
URL: https://github.com/apache/incubator-mxnet/issues/15738

After enabling AMP training for this sample GluonNLP model (https://github.com/dmlc/gluon-nlp/compare/master...ChengXianbing:master), we found that the loss is higher than when training without AMP. (We also enabled AMP in our own language model; there, not only is the loss higher, but training throughput also drops from 30445.52 samples/s to 25603.86 samples/s.) Could you please help out?

```
$ python word_language_model.py --model gru --emsize 64 --nhid 128 --lr 1.0 --epochs 3 --bptt 10 --tied --nlayers 1 --test_mode --gpu 0
[Epoch 0] throughput 25758.38 samples/s
[Epoch 0] time cost 0.44s, valid loss 10.12, valid ppl 24864.71,lr 1.00
[Epoch 0] test loss 10.15, test ppl 25633.89
[Epoch 1] throughput 71637.50 samples/s
[Epoch 1] time cost 0.36s, valid loss 9.06, valid ppl 8631.06,lr 1.00
[Epoch 1] test loss 9.21, test ppl 10037.64
[Epoch 2] throughput 66607.71 samples/s
[Epoch 2] time cost 0.30s, valid loss 8.27, valid ppl 3919.10,lr 1.00
[Epoch 2] test loss 8.45, test ppl 4673.45
Total training throughput 14113.57 samples/s
Best validation loss 8.27, val ppl 3919.10
Best test loss 8.45, test ppl 4673.45
Total time cost 1.82s

$ python word_language_model.py --model gru --emsize 64 --nhid 128 --lr 1.0 --epochs 3 --bptt 10 --tied --nlayers 1 --test_mode --amp_training --gpu 0
[Epoch 0] throughput 36236.71 samples/s
[Epoch 0] time cost 0.44s, valid loss 10.41, valid ppl 33323.06,lr 1.00
[Epoch 0] test loss 10.41, test ppl 33251.08
[Epoch 1] throughput 64450.66 samples/s
[Epoch 1] time cost 0.32s, valid loss 10.41, valid ppl 33322.87,lr 1.00
[Epoch 1] test loss 10.41, test ppl 33250.94
[Epoch 2] throughput 63184.95 samples/s
[Epoch 2] time cost 0.33s, valid loss 10.41, valid ppl 33322.68,lr 1.00
[Epoch 2] test loss 10.41, test ppl 33250.76
Total training throughput 14169.86 samples/s
Best validation loss 10.41, val ppl 33322.68
Best test loss 10.41, test ppl 33250.76
Total time cost 1.82s
```

Here is a training log snippet from training our own language model:

```
$ python train.py .... --amp_training --gpu 0
[Epoch 0]: ?| [3442/?, loss=6.95, ppl=1040.65]
[Epoch 0]: throughput 41170.57 samples/s
[Epoch 0]: time cost 2.82s, valid loss 6.91, valid ppl 999.74
[Epoch 0]: test loss 6.91, test ppl 999.82
[Epoch 1]: ?| [3439/?, loss=6.94, ppl=1035.88]
[Epoch 1]: throughput 42611.66 samples/s
[Epoch 1]: time cost 2.74s, valid loss 6.90, valid ppl 995.40
[Epoch 1]: test loss 6.90, test ppl 995.63
[Epoch 2]: ?| [3443/?, loss=6.94, ppl=1031.34]
[Epoch 2]: throughput 43192.13 samples/s
[Epoch 2]: time cost 2.69s, valid loss 6.90, valid ppl 991.09
[Epoch 2]: test loss 6.90, test ppl 991.45
Total training throughput 25603.86 samples/s
Best test loss 6.90, test ppl 991.45

$ python train.py .... --gpu 0
[Epoch 0]: ?| [3442/?, loss=5.22, ppl=184.47]
[Epoch 0]: throughput 44378.70 samples/s
[Epoch 0]: time cost 2.59s, valid loss 4.39, valid ppl 80.54
[Epoch 0]: test loss 4.51, test ppl 91.15
[Epoch 1]: ?| [3439/?, loss=4.49, ppl=89.51]
[Epoch 1]: throughput 52796.93 samples/s
[Epoch 1]: time cost 2.19s, valid loss 4.15, valid ppl 63.30
[Epoch 1]: test loss 4.29, test ppl 72.85
[Epoch 2]: ?| [3443/?, loss=4.28, ppl=72.37]
[Epoch 2]: throughput 50600.25 samples/s
[Epoch 2]: time cost 2.29s, valid loss 3.99, valid ppl 54.14,lr 1
[Epoch 2]: test loss 4.14, test ppl 62.88
Total training throughput 30445.52 samples/s
Best test loss 4.14, test ppl 62.88
```

```
$ nvidia-smi -q

==============NVSMI LOG==============

Timestamp                 : Fri Aug 2 18:38:22 2019
Driver Version            : 418.67
CUDA Version              : 10.1

Attached GPUs             : 1
GPU 00000000:00:1E.0
    Product Name          : Tesla V100-SXM2-16GB
    Product Brand         : Tesla
    Display Mode          : Enabled
    Display Active        : Disabled
    Persistence Mode      : Disabled
    Accounting Mode       : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current           : N/A
        Pending           : N/A
    Serial Number         : 0322917091773
    GPU UUID              : GPU-e9a16ab2-2c86-8a0b-1126-8511d8165cd5
    Minor Number          : 0
    VBIOS Version         : 88.00.4F.00.09
    MultiGPU Board        : No
    Board ID              : 0x1e
    GPU Part Number       : 900-2G503-0000-000
    Inforom Version
        Image Version     : G503.0201.00.03
        OEM Object        : 1.1
        ECC Object        : 5.0
        Power Management Object : N/A
    GPU Operation Mode
        Current           : N/A
        Pending           : N/A
    GPU Virtualization Mode
        Virtualization mode : Pass-Through
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus               : 0x00
        Device            : 0x1E
        Domain            : 0x0000
        Device Id         : 0x1DB110DE
        Bus Id            : 00000000:00:1E.0
        Sub System Id     : 0x121210DE
        GPU Link Info
            PCIe Generation
                Max       : 3
                Current   : 3
            Link Width
                Max       : 16x
                Current   : 16x
        Bridge Chip
            Type          : N/A
            Firmware      : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput     : 0 KB/s
        Rx Throughput     : 0 KB/s
    Fan Speed             : N/A
    Performance State     : P0
    Clocks Throttle Reasons
        Idle              : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap      : Not Active
        HW Slowdown       : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost        : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total             : 16130 MiB
        Used              : 0 MiB
        Free              : 16130 MiB
    BAR1 Memory Usage
        Total             : 16384 MiB
        Used              : 2 MiB
        Free              : 16382 MiB
    Compute Mode          : Default
    Utilization
        Gpu               : 4 %
        Memory            : 0 %
        Encoder           : 0 %
        Decoder           : 0 %
    Encoder Stats
        Active Sessions   : 0
        Average FPS       : 0
        Average Latency   : 0
    FBC Stats
        Active Sessions   : 0
        Average FPS       : 0
        Average Latency   : 0
    Ecc Mode
        Current           : Enabled
        Pending           : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache      : 0
                L2 Cache      : 0
                Texture Memory : N/A
                Texture Shared : N/A
                CBU           : N/A
                Total         : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache      : 0
                L2 Cache      : 0
                Texture Memory : N/A
                Texture Shared : N/A
                CBU           : 0
                Total         : 0
        Aggregate
            Single Bit
                Device Memory : 5
                Register File : 0
                L1 Cache      : 0
                L2 Cache      : 0
                Texture Memory : N/A
                Texture Shared : N/A
                CBU           : N/A
                Total         : 5
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache      : 0
                L2 Cache      : 0
                Texture Memory : N/A
                Texture Shared : N/A
                CBU           : 0
                Total         : 0
    Retired Pages
        Single Bit ECC    : 1
        Double Bit ECC    : 0
        Pending           : No
    Temperature
        GPU Current Temp  : 50 C
        GPU Shutdown Temp : 90 C
        GPU Slowdown Temp : 87 C
        GPU Max Operating Temp : 83 C
        Memory Current Temp : 45 C
        Memory Max Operating Temp : 85 C
    Power Readings
        Power Management  : Supported
        Power Draw        : 42.75 W
        Power Limit       : 300.00 W
        Default Power Limit : 300.00 W
        Enforced Power Limit : 300.00 W
        Min Power Limit   : 150.00 W
        Max Power Limit   : 300.00 W
    Clocks
        Graphics          : 1312 MHz
        SM                : 1312 MHz
        Memory            : 877 MHz
        Video             : 1177 MHz
    Applications Clocks
        Graphics          : 1312 MHz
        Memory            : 877 MHz
    Default Applications Clocks
        Graphics          : 1312 MHz
        Memory            : 877 MHz
    Max Clocks
        Graphics          : 1530 MHz
        SM                : 1530 MHz
        Memory            : 877 MHz
        Video             : 1372 MHz
    Max Customer Boost Clocks
        Graphics          : 1530 MHz
    Clock Policy
        Auto Boost        : N/A
        Auto Boost Default : N/A
    Processes             : None
```

mxnet version: `mxnet-cu101 1.5.0b20190711`