pengzhao-intel edited a comment on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers
URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-590146883

@kpuatamazon We have tested the performance on our local machines, on both pre-VNNI and VNNI CPUs, for the data conversion and computation (FC layer) parts. Thanks to @ElaineBao for the data measurements.

In general, intgemm provides better data-conversion performance with fewer cores (1 core), which may help on client CPUs. However, the computation part, and data conversion with more cores, are still slower than the current integration. Meanwhile, the supported shapes are limited to dimensions divisible by 4.

**Detailed data below**

- FC layer runtime of the current MXNet OP vs. intgemm, using 28 cores in 1 socket. Note: smaller is better in the tables below. Overall, the current MXNet OP is still faster in most cases on both generations of CPU.

| CLX8280 VNNI (ms/call) | BS=1 | BS=8 | BS=16 | BS=32 | BS=64 | BS=128 |
| -- | -- | -- | -- | -- | -- | -- |
| _sg_mkldnn_fully_connected | 0.03 | 0.08 | 0.09 | 0.10 | 0.11 | 0.12 |
| _contrib_intgemm_fully_connected | 0.04 | 0.13 | 0.24 | 0.45 | 0.86 | 1.71 |

| SKX8180 (non-VNNI) (ms/call) | BS=1 | BS=8 | BS=16 | BS=32 | BS=64 | BS=128 |
| -- | -- | -- | -- | -- | -- | -- |
| _sg_mkldnn_fully_connected | 0.03 | 0.07 | 0.09 | 0.11 | 0.13 | 0.15 |
| _contrib_intgemm_fully_connected | 0.04 | 0.13 | 0.24 | 0.46 | 0.90 | 1.78 |

- Data conversion from FP32 to INT8. The 1-core results show intgemm has an advantage in data conversion.

| shape | quantize_v2 | intgemm | quantize_v2_fit | intgemm_fit | Speedup without calibration |
| -- | -- | -- | -- | -- | -- |
| (128, 128) | 0.062 | 0.0224 | 0.0675 | 0.0442 | 1.5x |
| (256, 256) | 0.0794 | 0.026 | 0.1264 | 0.0577 | 2.2x |
| (512, 512) | 0.1615 | 0.0863 | 0.4025 | 0.1757 | 2.3x |
| (1024, 1024) | 0.3422 | 0.2485 | 1.3097 | 0.5153 | 2.5x |
| (2048, 2048) | 1.0238 | 0.8415 | 4.7436 | 1.7523 | 2.7x |
| (8, 4096) | 0.0666 | 0.0235 | 0.0845 | 0.0485 | 1.7x |
| (4096, 8) | 0.0657 | 0.0236 | 0.0841 | 0.0482 | 1.7x |

But with more cores (28 cores), the current OP is slightly better.

| shape | quantize_v2 | intgemm | quantize_v2_fit | intgemm_fit | Speedup without calibration |
| -- | -- | -- | -- | -- | -- |
| (128, 128) | 0.0757 | 0.0356 | 0.0767 | 0.0704 | 1.1x |
| (256, 256) | 0.0773 | 0.0408 | 0.084 | 0.0907 | 0.9x |
| (512, 512) | 0.0798 | 0.0949 | 0.1083 | 0.2163 | 0.5x |
| (1024, 1024) | 0.0892 | 0.244 | 0.1645 | 0.5415 | 0.3x |
| (2048, 2048) | 0.1441 | 0.76 | 0.3507 | 1.769 | 0.2x |
| (8, 4096) | 0.0752 | 0.037 | 0.0798 | 0.0766 | 1.0x |
| (4096, 8) | 0.0755 | 0.0372 | 0.0795 | 0.0771 | 1.0x |
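For readers who want to reproduce numbers of this kind, below is a minimal sketch of how per-call latency could be measured from Python. This is not the script used for the tables above: it assumes an MXNet 1.x CPU build where `mx.nd.contrib.quantize_v2` is available, and the intgemm operators from this PR would be timed the same way once the PR is built in. Core count would be controlled via `OMP_NUM_THREADS`.

```python
# Minimal per-call latency sketch (assumption: MXNet 1.x CPU build).
# Timings are wall-clock averages and will not match the tables exactly.
import time
import mxnet as mx

def time_op(fn, warmup=10, iters=100):
    """Return average milliseconds per call for an NDArray op invocation."""
    for _ in range(warmup):
        fn()
    mx.nd.waitall()                  # finish warm-up work before timing
    start = time.time()
    for _ in range(iters):
        fn()
    mx.nd.waitall()                  # block until all queued ops complete
    return (time.time() - start) * 1000.0 / iters

shape = (1024, 1024)
data = mx.nd.random.uniform(-1, 1, shape=shape)

# quantize_v2 with an explicit calibration range; whether the range is
# passed or computed by the op corresponds to the measurement variant
# (my reading of the "_fit" columns above, which may differ).
ms = time_op(lambda: mx.nd.contrib.quantize_v2(
    data, min_calib_range=-1.0, max_calib_range=1.0, out_type='int8'))
print('quantize_v2 %s: %.4f ms/call' % (shape, ms))
```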
