anirudhacharya opened a new pull request #14868: Revert "Improve cached_op performance for static mode"
URL: https://github.com/apache/incubator-mxnet/pull/14868

Reverts apache/incubator-mxnet#14785

This revert fixes the following issue: https://github.com/apache/incubator-mxnet/issues/14864

Commit 369b66d0f10ba479ce96f78f7c838bd7bc41d951 caused a regression in BERT model training.

As the logs below show, the commit caused the final evaluation `nsp_acc` to drop from 100.0 to 55.2.

```bash
commit 369b66d0f10ba479ce96f78f7c838bd7bc41d951
INFO:root:[step 1] mlm_loss=1.65551 mlm_acc=48.11490 nsp_loss=0.43527 nsp_acc=81.250 throughput=1.6K tks/s lr=0.0000020 time=2.32, latency=1161.4 ms/batch
INFO:root:[step 3] mlm_loss=7.84039 mlm_acc=2.22965 nsp_loss=0.73171 nsp_acc=43.103 throughput=2.2K tks/s lr=0.0000060 time=2.73, latency=1364.9 ms/batch
INFO:root:[step 5] mlm_loss=7.80324 mlm_acc=2.68692 nsp_loss=0.73161 nsp_acc=42.308 throughput=2.4K tks/s lr=0.0000100 time=2.43, latency=1217.0 ms/batch
INFO:root:[step 7] mlm_loss=7.69260 mlm_acc=1.55763 nsp_loss=0.71501 nsp_acc=46.552 throughput=2.4K tks/s lr=0.0000140 time=2.64, latency=1320.8 ms/batch
INFO:root:[step 9] mlm_loss=7.72376 mlm_acc=2.29167 nsp_loss=0.73156 nsp_acc=37.931 throughput=2.4K tks/s lr=0.0000180 time=2.70, latency=1350.2 ms/batch
INFO:root:[step 11] mlm_loss=7.62214 mlm_acc=2.19436 nsp_loss=0.69882 nsp_acc=51.724 throughput=2.4K tks/s lr=0.0000090 time=2.65, latency=1322.5 ms/batch
INFO:root:[step 13] mlm_loss=7.49625 mlm_acc=2.46781 nsp_loss=0.72365 nsp_acc=43.103 throughput=2.3K tks/s lr=0.0000070 time=2.70, latency=1347.8 ms/batch
INFO:root:[step 15] mlm_loss=7.47410 mlm_acc=2.18424 nsp_loss=0.71855 nsp_acc=39.062 throughput=2.4K tks/s lr=0.0000050 time=2.92, latency=1458.7 ms/batch
INFO:root:[step 17] mlm_loss=7.30681 mlm_acc=2.56674 nsp_loss=0.68619 nsp_acc=53.448 throughput=2.4K tks/s lr=0.0000030 time=2.70, latency=1348.9 ms/batch
INFO:root:[step 19] mlm_loss=7.61227 mlm_acc=1.75824 nsp_loss=0.71591 nsp_acc=44.828 throughput=2.2K tks/s lr=0.0000010 time=2.76, latency=1380.5 ms/batch
INFO:root:[step 20] Saving checkpoints to ckpt/0000020.params, ckpt/0000020.states.
INFO:root:Train cost=45.5s
INFO:root:Using evaluation data at out/*.npz
INFO:root:[step 1] mlm_loss=3.74667 mlm_acc=1.51515 nsp_loss=0.35332 nsp_acc=25.000 throughput=2.9K tks/s lr=0.0000000 time=0.30, latency=149.5 ms/batch
INFO:root:[step 3] mlm_loss=7.30128 mlm_acc=3.28467 nsp_loss=0.68692 nsp_acc=62.500 throughput=5.0K tks/s lr=0.0000000 time=0.37, latency=185.7 ms/batch
INFO:root:[step 5] mlm_loss=7.55211 mlm_acc=2.85714 nsp_loss=0.67706 nsp_acc=81.250 throughput=5.0K tks/s lr=0.0000000 time=0.33, latency=162.8 ms/batch
INFO:root:[step 7] mlm_loss=7.07615 mlm_acc=2.29008 nsp_loss=0.69678 nsp_acc=43.750 throughput=5.4K tks/s lr=0.0000000 time=0.32, latency=161.2 ms/batch
INFO:root:mlm_loss=6.419 mlm_acc=2.5 nsp_loss=0.604 nsp_acc=55.2
INFO:root:Eval cost=1.4s

commit 5dd9fa27d8bdd2a8677b7c275a494d17082c0e1c
INFO:root:[step 1] mlm_loss=1.65551 mlm_acc=48.11490 nsp_loss=0.43527 nsp_acc=81.250 throughput=1.6K tks/s lr=0.0000020 time=2.33, latency=1166.0 ms/batch
INFO:root:[step 3] mlm_loss=3.35410 mlm_acc=47.38016 nsp_loss=0.70400 nsp_acc=84.483 throughput=2.2K tks/s lr=0.0000060 time=2.76, latency=1379.8 ms/batch
INFO:root:[step 5] mlm_loss=2.86958 mlm_acc=51.75234 nsp_loss=0.03236 nsp_acc=100.000 throughput=2.3K tks/s lr=0.0000100 time=2.49, latency=1246.8 ms/batch
INFO:root:[step 7] mlm_loss=2.53454 mlm_acc=57.21703 nsp_loss=0.14932 nsp_acc=94.828 throughput=2.3K tks/s lr=0.0000140 time=2.76, latency=1380.5 ms/batch
INFO:root:[step 9] mlm_loss=2.13252 mlm_acc=63.02083 nsp_loss=0.03085 nsp_acc=98.276 throughput=2.3K tks/s lr=0.0000180 time=2.79, latency=1396.6 ms/batch
INFO:root:[step 11] mlm_loss=1.36580 mlm_acc=74.39916 nsp_loss=0.00306 nsp_acc=100.000 throughput=2.3K tks/s lr=0.0000090 time=2.75, latency=1372.9 ms/batch
INFO:root:[step 13] mlm_loss=1.00501 mlm_acc=80.79399 nsp_loss=0.00274 nsp_acc=100.000 throughput=2.2K tks/s lr=0.0000070 time=2.78, latency=1392.1 ms/batch
INFO:root:[step 15] mlm_loss=0.82224 mlm_acc=83.28585 nsp_loss=0.00181 nsp_acc=100.000 throughput=2.3K tks/s lr=0.0000050 time=3.04, latency=1520.9 ms/batch
INFO:root:[step 17] mlm_loss=0.54528 mlm_acc=89.11704 nsp_loss=0.00083 nsp_acc=100.000 throughput=2.3K tks/s lr=0.0000030 time=2.79, latency=1396.3 ms/batch
INFO:root:[step 19] mlm_loss=0.53212 mlm_acc=88.90110 nsp_loss=0.00087 nsp_acc=100.000 throughput=2.2K tks/s lr=0.0000010 time=2.76, latency=1379.5 ms/batch
INFO:root:[step 20] Saving checkpoints to ckpt/0000020.params, ckpt/0000020.states.
INFO:root:Train cost=46.3s
INFO:root:Using evaluation data at out/*.npz
INFO:root:[step 1] mlm_loss=0.08297 mlm_acc=97.72727 nsp_loss=0.00008 nsp_acc=100.000 throughput=2.9K tks/s lr=0.0000000 time=0.30, latency=150.7 ms/batch
INFO:root:[step 3] mlm_loss=0.34548 mlm_acc=93.06569 nsp_loss=0.00016 nsp_acc=100.000 throughput=5.1K tks/s lr=0.0000000 time=0.36, latency=180.2 ms/batch
INFO:root:[step 5] mlm_loss=0.34622 mlm_acc=92.24490 nsp_loss=0.00068 nsp_acc=100.000 throughput=5.0K tks/s lr=0.0000000 time=0.33, latency=162.9 ms/batch
INFO:root:[step 7] mlm_loss=0.40680 mlm_acc=92.36641 nsp_loss=0.00018 nsp_acc=100.000 throughput=5.4K tks/s lr=0.0000000 time=0.32, latency=161.7 ms/batch
INFO:root:mlm_loss=0.295 mlm_acc=93.2 nsp_loss=0.000 nsp_acc=100.0
INFO:root:Eval cost=1.4s
```

I think it might be good to revert this PR for now and then revisit the original PR and fix it.
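
For anyone re-checking the regression, here is a minimal sketch (not part of this PR or of the training script) of how the two final evaluation summary lines from the logs above could be compared programmatically. The helper names `parse_metrics` and `metric_drop` are hypothetical, and the regex simply assumes the `name=value` layout seen in those summary lines.

```python
import re

# Final evaluation summary lines copied verbatim from the two logs above,
# keyed by the commit each run was built from.
SUMMARIES = {
    "369b66d0f10ba479ce96f78f7c838bd7bc41d951":
        "INFO:root:mlm_loss=6.419 mlm_acc=2.5 nsp_loss=0.604 nsp_acc=55.2",
    "5dd9fa27d8bdd2a8677b7c275a494d17082c0e1c":
        "INFO:root:mlm_loss=0.295 mlm_acc=93.2 nsp_loss=0.000 nsp_acc=100.0",
}

# Assumption: every metric appears as `<name>=<unsigned decimal number>`.
METRIC_RE = re.compile(r"(\w+)=([0-9]+(?:\.[0-9]+)?)")


def parse_metrics(line):
    """Hypothetical helper: turn one summary log line into {metric: value}."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


def metric_drop(reference, candidate, name):
    """Hypothetical helper: how far `name` fell from reference to candidate."""
    return reference[name] - candidate[name]


regressed = parse_metrics(SUMMARIES["369b66d0f10ba479ce96f78f7c838bd7bc41d951"])
reference = parse_metrics(SUMMARIES["5dd9fa27d8bdd2a8677b7c275a494d17082c0e1c"])

for metric in ("nsp_acc", "mlm_acc"):
    drop = metric_drop(reference, regressed, metric)
    print(f"{metric}: {reference[metric]:.1f} -> {regressed[metric]:.1f} "
          f"(drop of {drop:.1f})")
```

The same parsing would also work on the per-step lines, since they use the same `name=value` layout; fields such as `throughput=1.6K` would be read as their numeric prefix only, so they would need separate handling.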
