ChaiBapchya commented on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-629526785


   Can confirm that this issue is specific to AVX512 kernels.
   Tried this on a c5.xl instance. Relevant `lscpu` output:
   ```
   Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
   Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
   ```
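The AVX512 subsets relevant to this issue can be picked out of the `Flags` line programmatically. A minimal sketch, assuming Linux-style flag names as printed by `lscpu`; `avx512_flags` is a hypothetical helper written for illustration and is not part of MXNet or oneDNN:

```python
def avx512_flags(flags_line):
    """Return the sorted AVX512 feature flags found in an lscpu-style flags string."""
    # Drop an optional "Flags:" label, then split the rest on whitespace.
    _, sep, rest = flags_line.partition(":")
    tokens = (rest if sep else flags_line).split()
    return sorted(t for t in tokens if t.startswith("avx512"))


# Shortened excerpt of the flags reported for the Xeon Platinum 8124M.
sample = "Flags: fma avx f16c avx2 avx512f avx512dq avx512cd avx512bw avx512vl"
print(avx512_flags(sample))
```

An empty result would mean the machine has no AVX512 support, in which case the AVX512 kernels discussed here would not be selected.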
   ## Results
   ### Default [Slower]
   ```
   
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.133789
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.132812
   ```
   ```
   [{'FullyConnected': [
   {'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 
'num_hidden': 512}, 'avg_time_FullyConnected': 0.10202302001744101, 
'p50_time_FullyConnected': 0.10086749989568489, 'p90_time_FullyConnected': 
0.10658760029400582, 'p99_time_FullyConnected': 0.13521948004836298}, 
   {'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 
'num_hidden': 512}, 'avg_time_FullyConnected': 0.10642346004715364, 
'p50_time_FullyConnected': 0.09991750016524747, 'p90_time_FullyConnected': 
0.10565369971118344, 'p99_time_FullyConnected': 0.2586996700802042}, 
   {'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 
'num_hidden': 1536}, 'avg_time_FullyConnected': 0.16890607999812346, 
'p50_time_FullyConnected': 0.16431500012004108, 'p90_time_FullyConnected': 
0.1781331999154645, 'p99_time_FullyConnected': 0.2831235897247094}, 
   {'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 
'num_hidden': 2048}, 'avg_time_FullyConnected': 0.20140223995440465, 
'p50_time_FullyConnected': 0.19778950013460417, 'p90_time_FullyConnected': 
0.20401089991537447, 'p99_time_FullyConnected': 0.3063294199228036}, 
   {'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 
'num_hidden': 512}, 'avg_time_FullyConnected': 0.21596427998701984, 
'p50_time_FullyConnected': 0.2096700000038254, 'p90_time_FullyConnected': 
0.21819640001012885, 'p99_time_FullyConnected': 0.3412436299549877}]}]
   ```
   ### MKL Workaround [Faster]
   ```
   MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b1405e840,512) 21.25us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:18
   dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0378418
   MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b14061c00,512) 20.94us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:18
   dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0371094
   ```
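The `gemm:jit` and `gemm:blas` entries can be compared directly by parsing the verbose lines. A minimal sketch, assuming each `dnnl_verbose` entry sits on a single line (the mail archive wraps them) and that the last comma-separated field is the execution time, which I understand to be in milliseconds; `parse_dnnl_verbose` is a hypothetical helper, not a oneDNN API:

```python
def parse_dnnl_verbose(line):
    """Split one dnnl_verbose 'exec' line into the fields used in this comparison."""
    fields = line.strip().split(",")
    if len(fields) < 6 or fields[0] != "dnnl_verbose" or fields[1] != "exec":
        raise ValueError("not a dnnl_verbose exec line")
    return {
        "primitive": fields[3],    # e.g. inner_product
        "impl": fields[4],         # e.g. gemm:jit or gemm:blas
        "problem": fields[-2],     # e.g. mb5ic2048oc512
        "time": float(fields[-1])  # execution time as reported (assumed ms)
    }


# Memory descriptors shared by both log lines quoted in this comment.
descs = ("src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 "
         "bia_undef::undef::f0 dst_f32::blocked:ab:f0")
jit = parse_dnnl_verbose(
    "dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,"
    + descs + ",,,mb5ic2048oc512,0.133789")
blas = parse_dnnl_verbose(
    "dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,"
    + descs + ",,,mb5ic2048oc512,0.0378418")

ratio = jit["time"] / blas["time"]
print(jit["impl"], blas["impl"], round(ratio, 2))
```

For the first pair of log lines this puts the `gemm:jit` path at roughly 3.5 times the `gemm:blas` time for the same `mb5ic2048oc512` problem.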
   ```
   [{'FullyConnected': [
   {'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 
'num_hidden': 512}, 'avg_time_FullyConnected': 0.11772135999308375, 
'p50_time_FullyConnected': 0.1149684999290912, 'p90_time_FullyConnected': 
0.1244978000613628, 'p99_time_FullyConnected': 0.14825501980340045}, 
   {'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 
'num_hidden': 512}, 'avg_time_FullyConnected': 0.120828840035756, 
'p50_time_FullyConnected': 0.11370450010872446, 'p90_time_FullyConnected': 
0.12752780021401122, 'p99_time_FullyConnected': 0.2412066401620902}, 
   {'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 
'num_hidden': 1536}, 'avg_time_FullyConnected': 0.13385597998421872, 
'p50_time_FullyConnected': 0.12600750005731243, 'p90_time_FullyConnected': 
0.14806160011175962, 'p99_time_FullyConnected': 0.2509373301927551}, 
   {'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 
'num_hidden': 2048}, 'avg_time_FullyConnected': 0.14175208003507578, 
'p50_time_FullyConnected': 0.1372545000322134, 'p90_time_FullyConnected': 
0.14401020002878798, 'p99_time_FullyConnected': 0.2423993399725075}, 
   {'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 
'num_hidden': 512}, 'avg_time_FullyConnected': 0.143890859962994, 
'p50_time_FullyConnected': 0.1397979999637755, 'p90_time_FullyConnected': 
0.14637689982919258, 'p99_time_FullyConnected': 0.22678783964693117}]}]
   ```
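To make the two benchmark tables easier to compare, the per-shape ratio of average times can be computed as below. This is only a sketch: the numbers are the `avg_time_FullyConnected` values copied from the two result blocks in this comment, and the dictionary layout is my own.

```python
# avg_time_FullyConnected keyed by (data shape, weight shape).
default_avg = {
    ((4, 512), (512, 512)): 0.10202302001744101,
    ((5, 512), (512, 512)): 0.10642346004715364,
    ((5, 512), (1536, 512)): 0.16890607999812346,
    ((5, 512), (2048, 512)): 0.20140223995440465,
    ((5, 2048), (512, 2048)): 0.21596427998701984,
}
mkl_avg = {
    ((4, 512), (512, 512)): 0.11772135999308375,
    ((5, 512), (512, 512)): 0.120828840035756,
    ((5, 512), (1536, 512)): 0.13385597998421872,
    ((5, 512), (2048, 512)): 0.14175208003507578,
    ((5, 2048), (512, 2048)): 0.143890859962994,
}

# A ratio above 1 means the MKL workaround was faster for that shape.
speedup = {shape: default_avg[shape] / mkl_avg[shape] for shape in default_avg}
for (data, weight), ratio in speedup.items():
    print(f"data={data} weight={weight}: {ratio:.2f}x")
```

On these numbers the workaround helps most for the larger weight matrices, while the two 512x512 cases come out slightly slower with it.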
   
   To reproduce:
   https://gist.github.com/ChaiBapchya/a849cfd566b8114e695454850b48077b
   https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad#file-benchmark_intel_mkl-py

