zixuanweeei commented on issue #18001: [MKLDNN] Support quantized rnn
URL: https://github.com/apache/incubator-mxnet/pull/18001#issuecomment-612741705

> what's the performance?

We have verified the accuracy and performance using a pre-trained language model provided by gluon-nlp ([a link](https://gluon-nlp.mxnet.io/examples/language_model/language_model.html#Using-a-pre-trained-AWD-LSTM-language-model)).

### Accuracy (PPL, lower is better)

| | FP32 | INT8 |
|---- |---- |---- |
| Validation dataset | 68.80 | 69.24 |
| Test dataset | 65.72 | 66.14 |

The INT8 accuracy is very close to that of FP32.

### Performance

#### Profiler Dump of the FP32 End-to-End Run

| Name | Total Count | Time (ms) | Min Time (ms) | Max Time (ms) | Avg Time (ms) |
|---------------------------:|------------:|----------:|---------:|-------------:|--------------:|
| log_softmax | 350 | 10968.93 | 31.09 | 31.54 | 31.34 |
| RNN | 1050 | **5664.45** | 3.13 | 7.37 | 5.39 |
| _sg_mkldnn_fully_connected | 350 | 2630.26 | 7.40 | 7.78 | 7.52 |
| _rnn_param_concat | 1050 | 2392.41 | 0.94 | 3.73 | 2.28 |
| Reshape | 4200 | 775.83 | 0.01 | 0.64 | 0.18 |
| DeleteVariable | 3856 | 185.39 | 0.00 | 0.53 | 0.05 |
| CopyCPU2CPU | 2450 | 48.89 | 0.01 | 0.05 | 0.02 |
| Embedding | 350 | 21.29 | 0.06 | 0.08 | 0.06 |
| WaitForVar | 2800 | 12.85 | 0.00 | 0.02 | 0.00 |
| mean | 350 | 9.26 | 0.02 | 0.05 | 0.03 |
| Dropout | 1400 | 8.38 | 0.00 | 0.01 | 0.01 |
| sum | 350 | 6.85 | 0.02 | 0.04 | 0.02 |
| pick | 350 | 6.55 | 0.02 | 0.03 | 0.02 |
| _mul_scalar | 350 | 3.56 | 0.01 | 0.02 | 0.01 |
| _zeros | 6 | 0.16 | 0.01 | 0.07 | 0.03 |
| Total | | **22735.04** | | | |

#### Profiler Dump of the INT8 End-to-End Run

| Name | Total Count | Time (ms) | Min Time (ms) | Max Time (ms) | Avg Time (ms) |
|-------------------:|-----------:|-----------:|---------------:|---------------:|---------------:|
| log_softmax | 350 | 10805.84 | 30.72 | 35.89 | 30.87 |
| _contrib_quantized_rnn | 1050 | **2857.42** | 1.52 | 3.81 | 2.72 |
| _rnn_param_concat | 1050 | 2375.36 | 0.83 | 5.93 | 2.26 |
| _contrib_quantize_asym | 1050 | 1580.61 | 0.55 | 4.87 | 1.51 |
| _sg_mkldnn_fully_connected | 350 | 1559.83 | 4.42 | 4.65 | 4.46 |
| Reshape | 4200 | 762.71 | 0.01 | 0.66 | 0.18 |
| DeleteVariable | 3856 | 131.79 | 0.00 | 0.44 | 0.03 |
| CopyCPU2CPU | 2450 | 48.68 | 0.01 | 0.06 | 0.02 |
| Embedding | 350 | 21.03 | 0.06 | 0.08 | 0.06 |
| WaitForVar | 2796 | 12.34 | 0.00 | 0.02 | 0.00 |
| _contrib_quantize_v2 | 350 | 11.29 | 0.03 | 0.06 | 0.03 |
| mean | 350 | 9.17 | 0.02 | 0.15 | 0.03 |
| Dropout | 1400 | 8.31 | 0.00 | 0.01 | 0.01 |
| sum | 350 | 6.63 | 0.02 | 0.04 | 0.02 |
| pick | 350 | 6.22 | 0.02 | 0.03 | 0.02 |
| _mul_scalar | 350 | 3.67 | 0.01 | 0.03 | 0.01 |
| _zeros | 6 | 0.11 | 0.01 | 0.07 | 0.02 |
| Total | | **20201.01** | | | |

End-to-end latency improved by only ~1.1x (22735.04 ms vs. 20201.01 ms), which is not that good. However, `_contrib_quantized_rnn` itself is ~2.0x faster than `RNN` (5664.45 ms vs. 2857.42 ms). Since `RNN` accounts for only ~25% of the total time, while `log_softmax` alone takes ~48%, the operator-level speedup of `_contrib_quantized_rnn` is diluted at the end-to-end level. In addition, `_contrib_quantize_asym` performs poorly and needs further optimization (WIP). Besides, the quantization flow of LSTM only moves the GEMM operations into INT8 calculation. The rest, such as the gate additions, bias additions, and element-wise activations, remains in FP32. So the speedup of `_contrib_quantized_rnn` cannot reach the expected 3~4x.
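The dilution effect above follows directly from Amdahl's law. A minimal sketch using only the profiler totals from the tables (it ignores the extra cost added by `_contrib_quantize_asym`, so it slightly overestimates the measured ~1.1x):

```python
# Amdahl's-law estimate of end-to-end speedup from the profiler dumps above.
fp32_total = 22735.04   # ms, FP32 end-to-end total
rnn_fp32 = 5664.45      # ms spent in RNN (FP32)
rnn_int8 = 2857.42      # ms spent in _contrib_quantized_rnn (INT8)

f = rnn_fp32 / fp32_total   # fraction of time in the accelerated op, ~25%
s = rnn_fp32 / rnn_int8     # operator-level speedup, ~2.0x

# Only the RNN fraction shrinks; everything else (log_softmax alone is ~48%)
# stays at its FP32 cost.
end_to_end = 1.0 / ((1.0 - f) + f / s)
print(f"op speedup: {s:.2f}x, time fraction: {f:.1%}, "
      f"predicted end-to-end: {end_to_end:.2f}x")
```

The prediction (~1.14x) is close to the measured 22735.04 / 20201.01 ≈ 1.13x; the small gap is roughly the quantization overhead that the simple model leaves out.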
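The last point — only the GEMMs run in INT8 while the gate math stays FP32 — can be sketched for a single simplified LSTM step. All names here are hypothetical illustrations, not the actual MKL-DNN kernels, which fuse these stages differently:

```python
import numpy as np

def int8_matmul(x_q, w_q, x_scale, w_scale):
    # INT8 GEMM accumulating in int32, then dequantized back to FP32.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) / (x_scale * w_scale)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_q, h_q, c, w_x_q, w_h_q, b, scales):
    """One LSTM step with quantized inputs/weights (q ~= x * scale)."""
    x_s, wx_s, h_s, wh_s = scales
    # INT8 part: the two GEMMs (x @ Wx and h @ Wh) dominate the FLOPs.
    gates = int8_matmul(x_q, w_x_q, x_s, wx_s) + int8_matmul(h_q, w_h_q, h_s, wh_s)
    # FP32 part: bias add, gate split, activations, element-wise state update.
    i, f, g, o = np.split(gates + b, 4, axis=-1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

# Tiny usage example: input dim 3, hidden dim 4 (so 4*4 = 16 gate columns).
rng = np.random.default_rng(0)
x_q = rng.integers(-128, 127, size=(1, 3), dtype=np.int8)
h_q = rng.integers(-128, 127, size=(1, 4), dtype=np.int8)
w_x_q = rng.integers(-128, 127, size=(3, 16), dtype=np.int8)
w_h_q = rng.integers(-128, 127, size=(4, 16), dtype=np.int8)
b = np.zeros(16, dtype=np.float32)
c = np.zeros((1, 4), dtype=np.float32)
h_new, c_new = lstm_step(x_q, h_q, c, w_x_q, w_h_q, b, (40.0, 40.0, 40.0, 40.0))
print(h_new.shape)
```

Since the FP32 gate math above is a fixed per-step cost that quantization does not touch, the achievable speedup is bounded below the 3~4x one might expect from the GEMMs alone.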
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
