pengzhao-intel edited a comment on issue #17559: [MXNET-1446] Quantization: intgemm matrix multiply wrappers
URL: https://github.com/apache/incubator-mxnet/pull/17559#issuecomment-590146883
 
 
   @kpuatamazon We have tested the performance on our local machines, on CPUs with and without the VNNI feature, for both the data conversion and the computation part (FC layer).
   
   Thanks to @ElaineBao for the data measurements.
   
   In general, intgemm provides better performance for data conversion with fewer cores (1 core), which may help on client CPUs. However, the computation part, and data conversion with more cores, are still slower than the current integration. Meanwhile, the supported shapes are limited to dimensions divisible by 4.
   
   **Detailed data below**
   
   - FC layer runtime: current MXNet OP vs. intgemm, using 28 cores on 1 socket. A minimal timing sketch follows the two tables below.
   Note: smaller is better in the tables below.
   The current MXNet OP is still faster in most cases on both generations of CPU.
   
   | CLX8280 VNNI (ms/call) | BS=1 | BS=8 | BS=16 | BS=32 | BS=64 | BS=128 |
   | -- | -- | -- | -- | -- | -- | -- |
   | _sg_mkldnn_fully_connected | 0.03 | 0.08 | 0.09 | 0.10 | 0.11 | 0.12 |
   | _contrib_intgemm_fully_connected | 0.04 | 0.13 | 0.24 | 0.45 | 0.86 | 1.71 |
   
   | SKX8180 (non-VNNI) (ms/call) | BS=1 | BS=8 | BS=16 | BS=32 | BS=64 | BS=128 |
   | -- | -- | -- | -- | -- | -- | -- |
   | _sg_mkldnn_fully_connected | 0.03 | 0.07 | 0.09 | 0.11 | 0.13 | 0.15 |
   | _contrib_intgemm_fully_connected | 0.04 | 0.13 | 0.24 | 0.46 | 0.90 | 1.78 |
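   For reference, here is a minimal sketch of how per-call latency like the above can be collected; it is not the exact benchmark script. Plain float32 `FullyConnected` is used as a stand-in, because `_sg_mkldnn_fully_connected` and `_contrib_intgemm_fully_connected` only exist after the quantization/fusion passes rewrite the graph, and the 1024x1024 weight shape and iteration counts are assumptions for illustration only.

   ```python
   # Minimal per-call timing sketch (stand-in methodology, not the actual benchmark).
   import time
   import mxnet as mx

   def time_fc(batch_size, in_dim=1024, out_dim=1024, warmup=10, iters=100):
       data = mx.nd.random.uniform(shape=(batch_size, in_dim))
       weight = mx.nd.random.uniform(shape=(out_dim, in_dim))
       bias = mx.nd.random.uniform(shape=(out_dim,))
       for _ in range(warmup):
           mx.nd.FullyConnected(data=data, weight=weight, bias=bias, num_hidden=out_dim)
       mx.nd.waitall()                    # NDArray ops are asynchronous
       start = time.time()
       for _ in range(iters):
           mx.nd.FullyConnected(data=data, weight=weight, bias=bias, num_hidden=out_dim)
       mx.nd.waitall()                    # flush the queue before stopping the clock
       return (time.time() - start) / iters * 1000.0   # ms/call

   for bs in [1, 8, 16, 32, 64, 128]:
       print('BS=%d: %.3f ms/call' % (bs, time_fc(bs)))
   ```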
   
   - Data conversion from FP32 to INT8
   
   The 1-core results show that intgemm has an advantage in data conversion (a reproduction sketch follows the table).
   
   | shape | quantize_v2 | intgemm | quantize_v2_fit | intgemm_fit | Speedup without calib |
   | -- | -- | -- | -- | -- | -- |
   | (128, 128) | 0.062 | 0.0224 | 0.0675 | 0.0442 | 1.5x |
   | (256, 256) | 0.0794 | 0.026 | 0.1264 | 0.0577 | 2.2x |
   | (512, 512) | 0.1615 | 0.0863 | 0.4025 | 0.1757 | 2.3x |
   | (1024, 1024) | 0.3422 | 0.2485 | 1.3097 | 0.5153 | 2.5x |
   | (2048, 2048) | 1.0238 | 0.8415 | 4.7436 | 1.7523 | 2.7x |
   | (8, 4096) | 0.0666 | 0.0235 | 0.0845 | 0.0485 | 1.7x |
   | (4096, 8) | 0.0657 | 0.0236 | 0.0841 | 0.0482 | 1.7x |
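   For reproducibility, a timing loop along the lines of the sketch below can produce per-shape numbers like those in these tables. Assumptions: the `_fit` columns are runs where the operator computes the min/max ranges itself (no calibration ranges supplied), and the values are per call. The intgemm columns would use the corresponding data-preparation operator added in this PR; its exact call is in the PR diff, so it is not guessed at here.

   ```python
   # Sketch: time FP32 -> INT8 conversion via quantize_v2, with and without calibration ranges.
   import time
   import mxnet as mx

   def time_quantize(shape, calibrated, warmup=10, iters=100):
       data = mx.nd.random.uniform(-1.0, 1.0, shape=shape)
       kwargs = {'out_type': 'int8'}
       if calibrated:
           # pre-computed calibration ranges (the non-"_fit" case, by assumption)
           kwargs.update(min_calib_range=-1.0, max_calib_range=1.0)
       for _ in range(warmup):
           mx.nd.contrib.quantize_v2(data, **kwargs)
       mx.nd.waitall()
       start = time.time()
       for _ in range(iters):
           mx.nd.contrib.quantize_v2(data, **kwargs)
       mx.nd.waitall()
       return (time.time() - start) / iters * 1000.0   # ms/call

   for shape in [(128, 128), (256, 256), (512, 512), (1024, 1024),
                 (2048, 2048), (8, 4096), (4096, 8)]:
       print(shape, time_quantize(shape, True), time_quantize(shape, False))
   ```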
   
   But with more cores (28 cores), the current OP is slightly better.
   
   | shape | quantize_v2 | intgemm | quantize_v2_fit | intgemm_fit | Speedup without calib |
   | -- | -- | -- | -- | -- | -- |
   | (128, 128) | 0.0757 | 0.0356 | 0.0767 | 0.0704 | 1.1x |
   | (256, 256) | 0.0773 | 0.0408 | 0.084 | 0.0907 | 0.9x |
   | (512, 512) | 0.0798 | 0.0949 | 0.1083 | 0.2163 | 0.5x |
   | (1024, 1024) | 0.0892 | 0.244 | 0.1645 | 0.5415 | 0.3x |
   | (2048, 2048) | 0.1441 | 0.76 | 0.3507 | 1.769 | 0.2x |
   | (8, 4096) | 0.0752 | 0.037 | 0.0798 | 0.0766 | 1.0x |
   | (4096, 8) | 0.0755 | 0.0372 | 0.0795 | 0.0771 | 1.0x |
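   One more note on methodology: the 1-core vs. 28-core comparison presumably comes from restricting the OpenMP thread pool. A common way to do that (an assumption here, not necessarily how these exact numbers were collected) is to set `OMP_NUM_THREADS` before MXNet and MKL-DNN initialize:

   ```python
   # Assumed setup for the 1-core runs: cap OpenMP threads before importing MXNet.
   # Exporting the variable in the launching shell is the more reliable option.
   import os
   os.environ['OMP_NUM_THREADS'] = '1'
   import mxnet as mx  # imported only after the environment variable is set
   ```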
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
