barry-jin commented on pull request #19685: URL: https://github.com/apache/incubator-mxnet/pull/19685#issuecomment-780032238
After [benchmarking on GluonNLP](https://github.com/dmlc/gluon-nlp/tree/master/scripts/benchmarks), I observed some improvements in the single forward step. The average improvements are pasted below. (Each latency is the average over inputs with different `batch_size` and `sequence_length`.)

model | training latency without this PR (s) | training latency with this PR (s) | Improvement (s)
-- | -- | -- | --
google_en_uncased_bert_base | 0.09161326 | 0.09133351 | 0.00027974
google_en_uncased_bert_base | 0.3565172 | 0.35624171 | 0.000275489
google_en_uncased_bert_large | 0.91762223 | 0.9173615 | 0.000260731
google_albert_base_v2 | 0.38036531 | 0.38022336 | 0.00014195
google_albert_large_v2 | 0.74285129 | 0.74271887 | 0.000132424
google_albert_xlarge_v2 | 1.53808278 | 1.53795535 | 0.000127428
google_albert_xxlarge_v2 | 2.49918614 | 2.49904376 | 0.000142379
google_electra_small | 0.07791454 | 0.07770361 | 0.000210933
google_electra_base | 0.35639018 | 0.35617552 | 0.000214658
google_electra_large | 0.91575478 | 0.9154471 | 0.000307674
google_uncased_mobilebert | 0.1725719 | 0.17218696 | 0.000384942
fairseq_bart_base | 0.43927581 | 0.43899117 | 0.00028464
fairseq_bart_large | 0.70489126 | 0.70455636 | 0.0003349

I also compared training and inference time on a [real workload](https://github.com/dmlc/gluon-nlp/tree/master/scripts/question_answering): running the google_electra_small model on the SQuAD dataset gives the following results.
Training/Inferencing | Latency without this PR | Latency with this PR | Throughput without this PR (samples/s) | Throughput with this PR (samples/s)
-- | -- | -- | -- | --
Training | 1.59179 h | 1.48754 h | 70 | 75
Inferencing | 55.566 s | 55.41125 s | 216.35 | 216.96

Environment:

python_version | 3.6.9
-- | --
instance | g4dn.2x
system | Linux
cpu | x86_64
architecture | 64bit
fp16 | FALSE
cpu_ram_mb | 63622
use_gpu | TRUE
num_gpus | 1
gpu | Tesla T4
gpu_ram_mb | 15079
gpu_power_watts | 70
gpu_performance_state | 0

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]
