barry-jin commented on pull request #19685:
URL: https://github.com/apache/incubator-mxnet/pull/19685#issuecomment-780032238


   After [benchmarking on 
GluonNLP](https://github.com/dmlc/gluon-nlp/tree/master/scripts/benchmarks), I 
observed some improvements in the single forward step. The average 
improvements are listed below. (Each latency is the average over inputs with 
different batch_size and sequence_length.)
   
   
   model | training latency without this PR (s) | training latency with this PR 
(s) | Improvement (s)
   -- | -- | -- | --
   google_en_uncased_bert_base | 0.09161326 | 0.09133351 | 0.00027974
   google_en_uncased_bert_base | 0.3565172 | 0.35624171 | 0.000275489
   google_en_uncased_bert_large | 0.91762223 | 0.9173615 | 0.000260731
   google_albert_base_v2 | 0.38036531 | 0.38022336 | 0.00014195
   google_albert_large_v2 | 0.74285129 | 0.74271887 | 0.000132424
   google_albert_xlarge_v2 | 1.53808278 | 1.53795535 | 0.000127428
   google_albert_xxlarge_v2 | 2.49918614 | 2.49904376 | 0.000142379
   google_electra_small | 0.07791454 | 0.07770361 | 0.000210933
   google_electra_base | 0.35639018 | 0.35617552 | 0.000214658
   google_electra_large | 0.91575478 | 0.9154471 | 0.000307674
   google_uncased_mobilebert | 0.1725719 | 0.17218696 | 0.000384942
   fairseq_bart_base | 0.43927581 | 0.43899117 | 0.00028464
   fairseq_bart_large | 0.70489126 | 0.70455636 | 0.0003349
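
   A minimal sketch of how such per-model averages can be derived: the latency values below are hypothetical placeholders (not taken from the benchmark runs above), keyed by (batch_size, sequence_length), and the reported improvement is simply the difference of the two means.

   ```python
   import statistics

   # Hypothetical per-configuration latencies in seconds, one entry per
   # (batch_size, sequence_length) input shape, with and without the PR.
   latency_without = {(1, 128): 0.0921, (8, 128): 0.0910, (8, 384): 0.0917}
   latency_with    = {(1, 128): 0.0918, (8, 128): 0.0907, (8, 384): 0.0915}

   # The table reports the mean over all input configurations; the
   # "Improvement" column is the difference of those means.
   avg_without = statistics.mean(latency_without.values())
   avg_with = statistics.mean(latency_with.values())
   improvement = avg_without - avg_with
   print(f"avg latency without PR: {avg_without:.8f} s")
   print(f"avg latency with PR:    {avg_with:.8f} s")
   print(f"improvement:            {improvement:.8f} s")
   ```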
   
   
   I have also compared the training and inference time on a [real 
workload](https://github.com/dmlc/gluon-nlp/tree/master/scripts/question_answering): 
running the google_electra_small model on the SQuAD dataset gives the 
following results.
   
   Training/Inferencing | Latency without this PR | Latency with this PR | 
Throughput without this PR (samples/s) | Throughput with this PR (samples/s)
   -- | -- | -- | -- | --
   Training | 1.59179 h | 1.48754 h | 70 | 75
   Inferencing | 55.566 s | 55.41125 s | 216.35 | 216.96
   
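
   For context, the training row of the table translates into the following relative improvements, computed only from the figures above:

   ```python
   # Back-of-the-envelope relative improvements from the SQuAD training run.
   train_without_h, train_with_h = 1.59179, 1.48754  # wall-clock hours
   time_reduction_pct = (train_without_h - train_with_h) / train_without_h * 100

   tput_without, tput_with = 70, 75  # training throughput (samples/s)
   tput_gain_pct = (tput_with - tput_without) / tput_without * 100

   print(f"training time reduction: {time_reduction_pct:.2f}%")
   print(f"throughput gain:         {tput_gain_pct:.2f}%")
   ```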
   
   
   Environment
   
   python_version | 3.6.9
   -- | --
   instance | g4dn.2x
   system | Linux
   cpu | x86_64
   architecture | 64bit
   fp16 | FALSE
   cpu_ram_mb | 63622
   use_gpu | TRUE
   num_gpus | 1
   gpu | Tesla T4
   gpu_ram_mb | 15079
   gpu_power_watts | 70
   gpu_performance_state | 0
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
