ChaiBapchya commented on issue #17980:
URL:
https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-629526785
I can confirm that this issue is specific to the AVX512 kernels.
I tried this on a c5.xl instance:
```
$ lscpu
Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf
tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic
movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm
abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep
bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb
avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
```
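For anyone comparing against another machine, a quick way to check whether a host exposes AVX512 is to filter the `Flags:` line of `lscpu`. This is only a minimal sketch: the helper name `avx512_flags` is hypothetical, and the sample string is abbreviated from the output above.

```python
def avx512_flags(flags_line: str) -> list:
    """Return the AVX512-related flags found in an lscpu 'Flags:' line."""
    # Drop the optional "Flags:" label, then split the rest on whitespace.
    _, _, rest = flags_line.partition(":")
    tokens = (rest if rest else flags_line).split()
    return [t for t in tokens if t.startswith("avx512")]


# Abbreviated from the lscpu output above.
sample = "Flags: fpu vme avx avx2 avx512f avx512dq rdseed avx512cd avx512bw avx512vl"
print(avx512_flags(sample))
```

An empty list would suggest that the AVX512 code path is not what is being exercised on that host.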
## Results
### Default [Slower]
```
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.133789
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.132812
```
```
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True,
'num_hidden': 512}, 'avg_time_FullyConnected': 0.10202302001744101,
'p50_time_FullyConnected': 0.10086749989568489, 'p90_time_FullyConnected':
0.10658760029400582, 'p99_time_FullyConnected': 0.13521948004836298},
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True,
'num_hidden': 512}, 'avg_time_FullyConnected': 0.10642346004715364,
'p50_time_FullyConnected': 0.09991750016524747, 'p90_time_FullyConnected':
0.10565369971118344, 'p99_time_FullyConnected': 0.2586996700802042},
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True,
'num_hidden': 1536}, 'avg_time_FullyConnected': 0.16890607999812346,
'p50_time_FullyConnected': 0.16431500012004108, 'p90_time_FullyConnected':
0.1781331999154645, 'p99_time_FullyConnected': 0.2831235897247094},
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True,
'num_hidden': 2048}, 'avg_time_FullyConnected': 0.20140223995440465,
'p50_time_FullyConnected': 0.19778950013460417, 'p90_time_FullyConnected':
0.20401089991537447, 'p99_time_FullyConnected': 0.3063294199228036},
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True,
'num_hidden': 512}, 'avg_time_FullyConnected': 0.21596427998701984,
'p50_time_FullyConnected': 0.2096700000038254, 'p90_time_FullyConnected':
0.21819640001012885, 'p99_time_FullyConnected': 0.3412436299549877}]}]
```
### MKL Workaround [Faster]
```
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b1405e840,512) 21.25us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:18
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0378418
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b14061c00,512) 20.94us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:18
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0371094
```
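To compare the `gemm:jit` and `gemm:blas` lines without eyeballing them, the verbose output can be split on commas. The sketch below assumes the field layout seen in the logs above (implementation name in the fifth field, problem descriptor and time in the last two) and that the time is reported in milliseconds. `parse_dnnl_exec` is a hypothetical helper, not part of DNNL.

```python
def parse_dnnl_exec(line: str) -> dict:
    """Split one dnnl_verbose 'exec' line into the fields of interest."""
    fields = line.strip().split(",")
    if fields[:2] != ["dnnl_verbose", "exec"]:
        raise ValueError("not a dnnl_verbose exec line")
    return {
        "primitive": fields[3],
        "impl": fields[4],
        "problem": fields[-2],
        "time": float(fields[-1]),  # assumed to be milliseconds
    }


# Memory-descriptor field shared by both log lines above.
MDS = ("src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 "
       "bia_undef::undef::f0 dst_f32::blocked:ab:f0")

jit = parse_dnnl_exec(
    "dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,"
    + MDS + ",,,mb5ic2048oc512,0.133789")
blas = parse_dnnl_exec(
    "dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,"
    + MDS + ",,,mb5ic2048oc512,0.0378418")

ratio = jit["time"] / blas["time"]
print(f"{jit['impl']} vs {blas['impl']} on {jit['problem']}: {ratio:.2f}x")
```

This only compares single log lines; averaging over many lines would give a steadier number.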
```
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True,
'num_hidden': 512}, 'avg_time_FullyConnected': 0.11772135999308375,
'p50_time_FullyConnected': 0.1149684999290912, 'p90_time_FullyConnected':
0.1244978000613628, 'p99_time_FullyConnected': 0.14825501980340045},
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True,
'num_hidden': 512}, 'avg_time_FullyConnected': 0.120828840035756,
'p50_time_FullyConnected': 0.11370450010872446, 'p90_time_FullyConnected':
0.12752780021401122, 'p99_time_FullyConnected': 0.2412066401620902},
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True,
'num_hidden': 1536}, 'avg_time_FullyConnected': 0.13385597998421872,
'p50_time_FullyConnected': 0.12600750005731243, 'p90_time_FullyConnected':
0.14806160011175962, 'p99_time_FullyConnected': 0.2509373301927551},
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True,
'num_hidden': 2048}, 'avg_time_FullyConnected': 0.14175208003507578,
'p50_time_FullyConnected': 0.1372545000322134, 'p90_time_FullyConnected':
0.14401020002878798, 'p99_time_FullyConnected': 0.2423993399725075},
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True,
'num_hidden': 512}, 'avg_time_FullyConnected': 0.143890859962994,
'p50_time_FullyConnected': 0.1397979999637755, 'p90_time_FullyConnected':
0.14637689982919258, 'p99_time_FullyConnected': 0.22678783964693117}]}]
```
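The two result dumps are easier to compare as per-shape ratios. The sketch below copies the `avg_time_FullyConnected` values from the outputs above, keyed by (data shape, weight shape). The `speedups` helper is hypothetical and is not part of the benchmark scripts.

```python
# avg_time_FullyConnected values copied from the two result dumps above,
# keyed by (data shape, weight shape).
default = {
    ((4, 512), (512, 512)): 0.10202302001744101,
    ((5, 512), (512, 512)): 0.10642346004715364,
    ((5, 512), (1536, 512)): 0.16890607999812346,
    ((5, 512), (2048, 512)): 0.20140223995440465,
    ((5, 2048), (512, 2048)): 0.21596427998701984,
}
mkl = {
    ((4, 512), (512, 512)): 0.11772135999308375,
    ((5, 512), (512, 512)): 0.120828840035756,
    ((5, 512), (1536, 512)): 0.13385597998421872,
    ((5, 512), (2048, 512)): 0.14175208003507578,
    ((5, 2048), (512, 2048)): 0.143890859962994,
}


def speedups(baseline: dict, candidate: dict) -> dict:
    """Ratio baseline/candidate per shape; values > 1 mean candidate is faster."""
    return {shape: baseline[shape] / candidate[shape] for shape in baseline}


for shape, ratio in speedups(default, mkl).items():
    print(shape, f"{ratio:.2f}x")
```

On these averages the two 512x512 weight cases come out slightly below 1, so the workaround is a bit slower there, while the three larger weight shapes come out above 1.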
To reproduce:
https://gist.github.com/ChaiBapchya/a849cfd566b8114e695454850b48077b
https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad#file-benchmark_intel_mkl-py