anirudhacharya opened a new pull request #14868: Revert "Improve cached_op performance for static mode"
URL: https://github.com/apache/incubator-mxnet/pull/14868
 
 
   Reverts apache/incubator-mxnet#14785
   
   This revert fixes https://github.com/apache/incubator-mxnet/issues/14864.
   
   Commit 369b66d0f10ba479ce96f78f7c838bd7bc41d951 caused a regression in BERT model training.
   
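   As the logs below show, with this commit the final evaluation `nsp_acc` drops from 100.0 to 55.2 and `mlm_acc` drops from 93.2 to 2.5, compared with commit 5dd9fa27d8bdd2a8677b7c275a494d17082c0e1c.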
   ```bash
   commit 369b66d0f10ba479ce96f78f7c838bd7bc41d951

   INFO:root:[step 1]  mlm_loss=1.65551  mlm_acc=48.11490  nsp_loss=0.43527  nsp_acc=81.250  throughput=1.6K tks/s  lr=0.0000020  time=2.32, latency=1161.4 ms/batch
   INFO:root:[step 3]  mlm_loss=7.84039  mlm_acc=2.22965  nsp_loss=0.73171  nsp_acc=43.103  throughput=2.2K tks/s  lr=0.0000060  time=2.73, latency=1364.9 ms/batch
   INFO:root:[step 5]  mlm_loss=7.80324  mlm_acc=2.68692  nsp_loss=0.73161  nsp_acc=42.308  throughput=2.4K tks/s  lr=0.0000100  time=2.43, latency=1217.0 ms/batch
   INFO:root:[step 7]  mlm_loss=7.69260  mlm_acc=1.55763  nsp_loss=0.71501  nsp_acc=46.552  throughput=2.4K tks/s  lr=0.0000140  time=2.64, latency=1320.8 ms/batch
   INFO:root:[step 9]  mlm_loss=7.72376  mlm_acc=2.29167  nsp_loss=0.73156  nsp_acc=37.931  throughput=2.4K tks/s  lr=0.0000180  time=2.70, latency=1350.2 ms/batch
   INFO:root:[step 11]  mlm_loss=7.62214  mlm_acc=2.19436  nsp_loss=0.69882  nsp_acc=51.724  throughput=2.4K tks/s  lr=0.0000090  time=2.65, latency=1322.5 ms/batch
   INFO:root:[step 13]  mlm_loss=7.49625  mlm_acc=2.46781  nsp_loss=0.72365  nsp_acc=43.103  throughput=2.3K tks/s  lr=0.0000070  time=2.70, latency=1347.8 ms/batch
   INFO:root:[step 15]  mlm_loss=7.47410  mlm_acc=2.18424  nsp_loss=0.71855  nsp_acc=39.062  throughput=2.4K tks/s  lr=0.0000050  time=2.92, latency=1458.7 ms/batch
   INFO:root:[step 17]  mlm_loss=7.30681  mlm_acc=2.56674  nsp_loss=0.68619  nsp_acc=53.448  throughput=2.4K tks/s  lr=0.0000030  time=2.70, latency=1348.9 ms/batch
   INFO:root:[step 19]  mlm_loss=7.61227  mlm_acc=1.75824  nsp_loss=0.71591  nsp_acc=44.828  throughput=2.2K tks/s  lr=0.0000010  time=2.76, latency=1380.5 ms/batch
   INFO:root:[step 20] Saving checkpoints to ckpt/0000020.params, ckpt/0000020.states.
   INFO:root:Train cost=45.5s
   INFO:root:Using evaluation data at out/*.npz
   INFO:root:[step 1]  mlm_loss=3.74667  mlm_acc=1.51515  nsp_loss=0.35332  nsp_acc=25.000  throughput=2.9K tks/s  lr=0.0000000  time=0.30, latency=149.5 ms/batch
   INFO:root:[step 3]  mlm_loss=7.30128  mlm_acc=3.28467  nsp_loss=0.68692  nsp_acc=62.500  throughput=5.0K tks/s  lr=0.0000000  time=0.37, latency=185.7 ms/batch
   INFO:root:[step 5]  mlm_loss=7.55211  mlm_acc=2.85714  nsp_loss=0.67706  nsp_acc=81.250  throughput=5.0K tks/s  lr=0.0000000  time=0.33, latency=162.8 ms/batch
   INFO:root:[step 7]  mlm_loss=7.07615  mlm_acc=2.29008  nsp_loss=0.69678  nsp_acc=43.750  throughput=5.4K tks/s  lr=0.0000000  time=0.32, latency=161.2 ms/batch
   INFO:root:mlm_loss=6.419  mlm_acc=2.5  nsp_loss=0.604  nsp_acc=55.2
   INFO:root:Eval cost=1.4s
   
   commit 5dd9fa27d8bdd2a8677b7c275a494d17082c0e1c

   INFO:root:[step 1]  mlm_loss=1.65551  mlm_acc=48.11490  nsp_loss=0.43527  nsp_acc=81.250  throughput=1.6K tks/s  lr=0.0000020  time=2.33, latency=1166.0 ms/batch
   INFO:root:[step 3]  mlm_loss=3.35410  mlm_acc=47.38016  nsp_loss=0.70400  nsp_acc=84.483  throughput=2.2K tks/s  lr=0.0000060  time=2.76, latency=1379.8 ms/batch
   INFO:root:[step 5]  mlm_loss=2.86958  mlm_acc=51.75234  nsp_loss=0.03236  nsp_acc=100.000  throughput=2.3K tks/s  lr=0.0000100  time=2.49, latency=1246.8 ms/batch
   INFO:root:[step 7]  mlm_loss=2.53454  mlm_acc=57.21703  nsp_loss=0.14932  nsp_acc=94.828  throughput=2.3K tks/s  lr=0.0000140  time=2.76, latency=1380.5 ms/batch
   INFO:root:[step 9]  mlm_loss=2.13252  mlm_acc=63.02083  nsp_loss=0.03085  nsp_acc=98.276  throughput=2.3K tks/s  lr=0.0000180  time=2.79, latency=1396.6 ms/batch
   INFO:root:[step 11]  mlm_loss=1.36580  mlm_acc=74.39916  nsp_loss=0.00306  nsp_acc=100.000  throughput=2.3K tks/s  lr=0.0000090  time=2.75, latency=1372.9 ms/batch
   INFO:root:[step 13]  mlm_loss=1.00501  mlm_acc=80.79399  nsp_loss=0.00274  nsp_acc=100.000  throughput=2.2K tks/s  lr=0.0000070  time=2.78, latency=1392.1 ms/batch
   INFO:root:[step 15]  mlm_loss=0.82224  mlm_acc=83.28585  nsp_loss=0.00181  nsp_acc=100.000  throughput=2.3K tks/s  lr=0.0000050  time=3.04, latency=1520.9 ms/batch
   INFO:root:[step 17]  mlm_loss=0.54528  mlm_acc=89.11704  nsp_loss=0.00083  nsp_acc=100.000  throughput=2.3K tks/s  lr=0.0000030  time=2.79, latency=1396.3 ms/batch
   INFO:root:[step 19]  mlm_loss=0.53212  mlm_acc=88.90110  nsp_loss=0.00087  nsp_acc=100.000  throughput=2.2K tks/s  lr=0.0000010  time=2.76, latency=1379.5 ms/batch
   INFO:root:[step 20] Saving checkpoints to ckpt/0000020.params, ckpt/0000020.states.
   INFO:root:Train cost=46.3s
   INFO:root:Using evaluation data at out/*.npz
   INFO:root:[step 1]  mlm_loss=0.08297  mlm_acc=97.72727  nsp_loss=0.00008  nsp_acc=100.000  throughput=2.9K tks/s  lr=0.0000000  time=0.30, latency=150.7 ms/batch
   INFO:root:[step 3]  mlm_loss=0.34548  mlm_acc=93.06569  nsp_loss=0.00016  nsp_acc=100.000  throughput=5.1K tks/s  lr=0.0000000  time=0.36, latency=180.2 ms/batch
   INFO:root:[step 5]  mlm_loss=0.34622  mlm_acc=92.24490  nsp_loss=0.00068  nsp_acc=100.000  throughput=5.0K tks/s  lr=0.0000000  time=0.33, latency=162.9 ms/batch
   INFO:root:[step 7]  mlm_loss=0.40680  mlm_acc=92.36641  nsp_loss=0.00018  nsp_acc=100.000  throughput=5.4K tks/s  lr=0.0000000  time=0.32, latency=161.7 ms/batch
   INFO:root:mlm_loss=0.295  mlm_acc=93.2  nsp_loss=0.000  nsp_acc=100.0
   INFO:root:Eval cost=1.4s
   ```
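
To make the gap between the two runs explicit, the final evaluation summary lines can be compared programmatically. The snippet below is only an illustrative sketch and is not part of the BERT training scripts: the two strings are the summary lines from the logs above with whitespace normalized, and `parse_metrics` and `metric_deltas` are hypothetical helper names introduced here.

```python
import re

# Final evaluation summary lines from the two logs above (whitespace normalized).
REGRESSED = "INFO:root:mlm_loss=6.419 mlm_acc=2.5 nsp_loss=0.604 nsp_acc=55.2"   # 369b66d
BASELINE = "INFO:root:mlm_loss=0.295 mlm_acc=93.2 nsp_loss=0.000 nsp_acc=100.0"  # 5dd9fa2

# Matches name=value pairs such as "nsp_acc=55.2".
METRIC = re.compile(r"(\w+)=([0-9.]+)")


def parse_metrics(line):
    """Return {metric_name: value} for every name=value pair in a log line."""
    return {name: float(value) for name, value in METRIC.findall(line)}


def metric_deltas(baseline_line, candidate_line):
    """Return candidate minus baseline for each metric present in both lines."""
    base = parse_metrics(baseline_line)
    cand = parse_metrics(candidate_line)
    return {name: cand[name] - base[name] for name in base if name in cand}


if __name__ == "__main__":
    for name, delta in metric_deltas(BASELINE, REGRESSED).items():
        print(f"{name}: {delta:+.3f}")
```

On these two lines, `nsp_acc` differs by 44.8 points and `mlm_acc` by 90.7 points, far more than run-to-run noise would explain, given that both runs log identical step 1 losses and accuracies.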
   
   I think it is best to revert #14785 for now, then revisit the original change and fix it.
