cbalint13 opened a new pull request, #15918:
URL: https://github.com/apache/tvm/pull/15918
This PR enhances x86 SIMD (u)int8 coverage for the dense and conv2d operators.
It extends the current SIMD support with avx2 & ssse3 variants, and adds a new
set of non-overflowing SIMD methods.
---
#### Changes:
[x86][TOPI]
* Extends the current ```fast-math``` set (the overflowing one) with
```avx2``` and ```ssse3``` variants.
* Adds a new, precise non-overflowing set supporting:
```avx512```, ```avx2``` and ```ssse3```.
[TIR][LLVM]
* Adds new TIR ops, mapped to LLVM intrinsics, for type conversions:
```zextend```, ```sextend```, ```truncate```.
* Adds new TIR ops, mapped to LLVM intrinsics, for vector manipulation:
```vectorpermute```, ```vectorshuffle```.
* Enables the TIR op ```atomic_add```, mapped to the proper LLVM intrinsic.
[TE]
* Introduces a new ```ArrayIntImm``` expression, holding an immediate list of
integer constants.
[Target]
* Introduces a flag ```-keys=cpu,fast-math``` to switch from the precise SIMD
set (the default) to the overflowing SIMD set.
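The intended semantics of the new conversion and permute ops can be sketched in NumPy. This is purely illustrative: the function names mirror the TIR op names above, but the Python signatures here are hypothetical (only ```vectorpermute```'s form appears later in the Notes), and real codegen happens via LLVM intrinsics.

```python
import numpy as np

def zextend(v, dtype):
    """Zero-extend each lane to a wider type (cf. x86 vpmovzx*)."""
    assert v.dtype.kind == "u", "zero-extension applies to unsigned lanes"
    return v.astype(dtype)

def sextend(v, dtype):
    """Sign-extend each lane to a wider type (cf. x86 vpmovsx*)."""
    assert v.dtype.kind == "i", "sign-extension applies to signed lanes"
    return v.astype(dtype)

def truncate(v, dtype):
    """Narrow each lane, keeping only the low bits."""
    return v.astype(dtype)

def vectorpermute(v, idx):
    """Reorder lanes by an immediate integer index list."""
    return v[np.asarray(idx)]

data = np.array([250, 3, 7, 200], dtype=np.uint8)
print(zextend(data, np.int16))            # lanes widened, values preserved
print(vectorpermute(data, [0, 2, 1, 3]))  # lanes reordered
```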
#### Performance
Benchmarks for the new avx2 & ssse3 ```fast``` vs. ```precise``` SIMD sets:
```
$ python3 tests/python/contrib/test_gemm_acc32_simd.py
Task tensorized: {True } [llvm -mcpu=ivybridge ],
running time: 3.655 ms, 587.58 Gops/s
Task tensorized: {True } [llvm -mcpu=ivybridge -keys=cpu,fast-math ],
running time: 3.678 ms, 583.86 Gops/s
Task tensorized: {True } [llvm -mcpu=haswell ],
running time: 3.708 ms, 579.09 Gops/s
Task tensorized: {True } [llvm -mcpu=haswell -keys=cpu,fast-math ],
running time: 3.668 ms, 585.52 Gops/s
Task tensorized: {False} [llvm -mcpu=ivybridge ],
running time: 41.152 ms, 52.18 Gops/s
Task tensorized: {False} [llvm -mcpu=haswell ],
running time: 41.194 ms, 52.13 Gops/s
```
#### Notes
* Precise (non ```fast-math```) mode is now the default.
* The x86 ```amx``` and ```vnni``` schedules remain unchanged; their specific
intrinsics never overflow.
* ```zextend```, ```sextend``` and ```truncate``` lower on x86 to single
specialized instructions, e.g. ```punpcklwd```.
* ```vectorpermute``` and ```vectorshuffle``` likewise lower on x86 to the
appropriate single specialized instruction.
* ```ArrayIntImm``` serves the new ops, e.g. ```tvm.tir.vectorpermute("int32x8",
quad_reduction, [0, 1, 4, 5, 2, 3, 6, 7])```.
* The ```fast-math``` mode will always warn the user:
```Using `fast-math` may overflow, make sure ranges for either data is
[0,128] or weight is [-64,+64]```
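The warning above reflects ```pmaddubsw```'s int16 accumulation: it computes `a0*b0 + a1*b1` per u8/s8 pair with saturation, so unrestricted inputs can exceed the int16 range, while either suggested restriction keeps every pair-sum in bounds. A sketch of the arithmetic (plain Python, no TVM required):

```python
# pmaddubsw computes a0*b0 + a1*b1 per u8 x s8 pair, saturated to int16.
I16_MAX, I16_MIN = 32767, -32768

def pair_sum(a0, a1, b0, b1):
    return a0 * b0 + a1 * b1

# Worst case with full u8 data and full s8 weights: exceeds int16.
worst = pair_sum(255, 255, 127, 127)
print(worst, worst > I16_MAX)               # would saturate, losing precision

# Weights restricted to [-64, +64]: extremes fit in int16.
safe_hi = pair_sum(255, 255, 64, 64)        # positive extreme
safe_lo = pair_sum(255, 255, -64, -64)      # negative extreme

# Data restricted to [0, 128]: extremes also fit in int16.
safe_hi2 = pair_sum(128, 128, 127, 127)     # positive extreme
safe_lo2 = pair_sum(128, 128, -128, -128)   # exactly int16 min
print(safe_hi, safe_lo, safe_hi2, safe_lo2)
```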
#### Samples
Lowering results for the ```ssse3``` case, showing the innermost loop.
The ```precise``` one:
```
0000000000001e90 <tvmgen_default_fused_nn_contrib_dense_pack_compute_>:
1e90: c4 e2 79 18 16 vbroadcastss (%rsi),%xmm2
1e95: c5 f9 ef c0 vpxor %xmm0,%xmm0,%xmm0
1e99: c5 e9 68 d8 vpunpckhbw %xmm0,%xmm2,%xmm3
1e9d: c4 e2 79 20 4a 08 vpmovsxbw 0x8(%rdx),%xmm1
1ea3: c4 e2 79 30 e2 vpmovzxbw %xmm2,%xmm4
1ea8: c4 e2 79 20 12 vpmovsxbw (%rdx),%xmm2
1ead: c5 d9 f5 e2 vpmaddwd %xmm2,%xmm4,%xmm4
1eb1: c5 e1 f5 d9 vpmaddwd %xmm1,%xmm3,%xmm3
1eb5: c4 e2 59 02 eb vphaddd %xmm3,%xmm4,%xmm5
{...}
define internal fastcc void
@tvmgen_default_fused_nn_contrib_dense_pack_compute {
entry:
{...}
%3 = load i32, ptr %1, align 64, !tbaa !310
%4 = insertelement <4 x i32> undef, i32 %3, i64 0
%5 = bitcast <4 x i32> %4 to <16 x i8>
%6 = shufflevector <16 x i8> %5, <16 x i8> poison, <16 x i32>
<i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3,
i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
%7 = zext <16 x i8> %6 to <16 x i16>
%8 = shufflevector <16 x i16> %7, <16 x i16> poison, <8 x i32>
<i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%9 = load <16 x i8>, ptr %2, align 64, !tbaa !312
%10 = sext <16 x i8> %9 to <16 x i16>
%11 = shufflevector <16 x i16> %10, <16 x i16> poison, <8 x i32>
<i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%12 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %8, <8 x i16>
%11)
%13 = shufflevector <16 x i16> %7, <16 x i16> poison, <8 x i32>
<i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%14 = shufflevector <16 x i16> %10, <16 x i16> poison, <8 x i32>
<i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%15 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %13, <8 x i16>
%14)
%16 = tail call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %12, <4 x
i32> %15)
%17 = getelementptr inbounds i8, ptr %1, i64 4
{...}
```
The ```fast-math``` one:
```
0000000000001e90 <tvmgen_default_fused_nn_contrib_dense_pack_compute_>:
1e90: c4 e2 79 18 06 vbroadcastss (%rsi),%xmm0
1e95: c5 f9 6f 12 vmovdqa (%rdx),%xmm2
1e99: c5 f9 6f 4a 10 vmovdqa 0x10(%rdx),%xmm1
1e9e: c4 e2 79 04 da vpmaddubsw %xmm2,%xmm0,%xmm3
1ea3: c4 e2 79 18 05 e4 11 vbroadcastss 0x11e4(%rip),%xmm0
# 3090 <_fini+0x620>
1eaa: 00 00
1eac: c5 e1 f5 d8 vpmaddwd %xmm0,%xmm3,%xmm3
define internal fastcc void
@tvmgen_default_fused_nn_contrib_dense_pack_compute_ {
entry:
{...}
%3 = load i32, ptr %1, align 64, !tbaa !310
%4 = insertelement <4 x i32> undef, i32 %3, i64 0
%5 = bitcast <4 x i32> %4 to <16 x i8>
%6 = shufflevector <16 x i8> %5, <16 x i8> poison, <16 x i32>
<i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3,
i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
%7 = load <16 x i8>, ptr %2, align 64, !tbaa !312
%8 = tail call <8 x i16> @llvm.x86.ssse3.pmadd.ub.sw.128(<16 x i8> %6, <16
x i8> %7)
%9 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %8, <8 x i16>
<i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>)
%10 = getelementptr inbounds i8, ptr %1, i64 4
{...}
```
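The two lowerings above can be modeled in NumPy: the precise path widens u8/s8 to int16 before the ```pmaddwd```-style pair sums (exact in int32), while the fast path does the ```pmaddubsw``` pair sum in saturating int16 first. This is a rough behavioral model, not the actual codegen:

```python
import numpy as np

def pmaddwd(a16, b16):
    """int16 pairwise multiply-add into int32 (cannot overflow)."""
    p = a16.astype(np.int32) * b16.astype(np.int32)
    return p[0::2] + p[1::2]

def pmaddubsw(u8, s8):
    """u8 x s8 pairwise multiply-add, saturated to int16."""
    p = u8.astype(np.int32) * s8.astype(np.int32)
    return np.clip(p[0::2] + p[1::2], -32768, 32767).astype(np.int16)

data = np.full(16, 255, dtype=np.uint8)    # adversarial full-range inputs
weight = np.full(16, 127, dtype=np.int8)

# Precise path: zero/sign-extend to int16 first, then pmaddwd.
precise_total = int(pmaddwd(data.astype(np.int16),
                            weight.astype(np.int16)).sum())

# Fast path: pmaddubsw (may saturate), then pmaddwd against ones.
fast_total = int(pmaddwd(pmaddubsw(data, weight).astype(np.int16),
                         np.ones(8, dtype=np.int16)).sum())

exact = int(data.astype(np.int64) @ weight.astype(np.int64))
print(precise_total == exact, fast_total == exact)
```

On in-range inputs (per the warning in the Notes) both paths agree; here the fast path saturates and diverges, which is why precise is the default.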
A compact full x86 SIMD instruction table is available
[here](https://www.officedaytime.com/simd512e/).
This work follows some suggestions from Intel's [oneDNN int8
notes](https://oneapi-src.github.io/oneDNN/v1.4/dev_guide_int8_computations.html).
#### Next
(WiP) This work will be extended to metaschedule auto-tensorization.
(WiP) Will try to bring ```int4``` (not native) support using the best
possible SIMD type conversions.
---
Cc: @masahi , @anijain2305, @jianyuh, @Qianshui-Jiang, @kparzysz-quic ,
@junrushao , @tqchen , @elvin-n , @vvchernov , @echuraev
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]