cbalint13 opened a new pull request, #15918:
URL: https://github.com/apache/tvm/pull/15918
This PR enhances x86 SIMD (u)int8 coverage for the dense and conv2d operators.
It extends the current SIMD support with avx2 & ssse3 variants, and adds a new
set of non-overflowing SIMD methods.
---
#### Changes:
[x86][TOPI]
* Extends the current ```fast-math``` set (the overflowing one) with
```avx2``` and ```ssse3``` variants.
* Adds a new, precise non-overflowing set supporting:
```avx512```, ```avx2``` and ```ssse3```.
[TIR][LLVM]
* Adds new TIR ops, mapped to LLVM intrinsics, for type conversions:
```zextend```, ```sextend```, ```truncate```.
* Adds new TIR ops, mapped to LLVM intrinsics, for vector manipulation:
```vectorpermute```, ```vectorshuffle```.
* Enables the TIR op ```atomic_add```, mapped to the proper LLVM intrinsic.
[TE]
* Introduces a new ```ArrayIntImm``` expression, holding an immediate list of
integer constants.
[Target]
* Introduces a flag ```-keys=cpu,fast-math``` to switch from the precise SIMD
set (the default) to the overflowing SIMD set.
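The intended semantics of the new conversion and permute ops can be sketched in NumPy. This is purely illustrative: the function names mirror the TIR op names above, but the Python signatures here are hypothetical (only ```vectorpermute```'s form appears later in the Notes), and real codegen happens via LLVM intrinsics.

```python
import numpy as np

def zextend(v, dtype):
    """Zero-extend each lane to a wider type (cf. x86 vpmovzx*)."""
    assert v.dtype.kind == "u", "zero-extension applies to unsigned lanes"
    return v.astype(dtype)

def sextend(v, dtype):
    """Sign-extend each lane to a wider type (cf. x86 vpmovsx*)."""
    assert v.dtype.kind == "i", "sign-extension applies to signed lanes"
    return v.astype(dtype)

def truncate(v, dtype):
    """Narrow each lane, keeping only the low bits."""
    return v.astype(dtype)

def vectorpermute(v, idx):
    """Reorder lanes by an immediate integer index list."""
    return v[np.asarray(idx)]

data = np.array([250, 3, 7, 200], dtype=np.uint8)
print(zextend(data, np.int16))            # lanes widened, values preserved
print(vectorpermute(data, [0, 2, 1, 3]))  # lanes reordered
```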
#### Performance
Benchmarks for the new avx2 & ssse3 ```fast``` vs. ```precise``` SIMD sets:
```
$ python3 tests/python/contrib/test_gemm_acc32_simd.py
Task tensorized: {True } [llvm -mcpu=ivybridge ],
running time: 3.655 ms, 587.58 Gops/s
Task tensorized: {True } [llvm -mcpu=ivybridge -keys=cpu,fast-math ],
running time: 3.678 ms, 583.86 Gops/s
Task tensorized: {True } [llvm -mcpu=haswell ],
running time: 3.708 ms, 579.09 Gops/s
Task tensorized: {True } [llvm -mcpu=haswell -keys=cpu,fast-math ],
running time: 3.668 ms, 585.52 Gops/s
Task tensorized: {False} [llvm -mcpu=ivybridge ],
running time: 41.152 ms, 52.18 Gops/s
Task tensorized: {False} [llvm -mcpu=haswell ],
running time: 41.194 ms, 52.13 Gops/s
```
#### Notes
* Precise (non ```fast-math```) mode is now the default.
* The x86 ```amx``` and ```vnni``` schedules remain unchanged; their specific
intrinsics never overflow.
* ```zextend```, ```sextend``` and ```truncate``` lower on x86 to single
specialized instructions, e.g. ```punpcklwd```.
* ```vectorpermute``` and ```vectorshuffle``` likewise lower on x86 to the
appropriate single specialized instruction.
* ```ArrayIntImm``` serves the new ops, e.g. ```tvm.tir.vectorpermute("int32x8",
quad_reduction, [0, 1, 4, 5, 2, 3, 6, 7])```.
* The ```fast-math``` mode will always warn the user:
```Using `fast-math` may overflow, make sure ranges for either data is
[0,128] or weight is [-64,+64]```
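The warning above reflects ```pmaddubsw```'s int16 accumulation: it computes `a0*b0 + a1*b1` per u8/s8 pair with saturation, so unrestricted inputs can exceed the int16 range, while either suggested restriction keeps every pair-sum in bounds. A sketch of the arithmetic (plain Python, no TVM required):

```python
# pmaddubsw computes a0*b0 + a1*b1 per u8 x s8 pair, saturated to int16.
I16_MAX, I16_MIN = 32767, -32768

def pair_sum(a0, a1, b0, b1):
    return a0 * b0 + a1 * b1

# Worst case with full u8 data and full s8 weights: exceeds int16.
worst = pair_sum(255, 255, 127, 127)
print(worst, worst > I16_MAX)               # would saturate, losing precision

# Weights restricted to [-64, +64]: extremes fit in int16.
safe_hi = pair_sum(255, 255, 64, 64)        # positive extreme
safe_lo = pair_sum(255, 255, -64, -64)      # negative extreme

# Data restricted to [0, 128]: extremes also fit in int16.
safe_hi2 = pair_sum(128, 128, 127, 127)     # positive extreme
safe_lo2 = pair_sum(128, 128, -128, -128)   # exactly int16 min
print(safe_hi, safe_lo, safe_hi2, safe_lo2)
```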
#### Samples
Lowering results for the ```ssse3``` case, showing the innermost loop.
The ```precise``` one:
```
0000000000001e90 <tvmgen_default_fused_nn_contrib_dense_pack_compute_>:
1e90: c4 e2 79 18 16 vbroadcastss (%rsi),%xmm2
1e95: c5 f9 ef c0 vpxor %xmm0,%xmm0,%xmm0
1e99: c5 e9 68 d8 vpunpckhbw %xmm0,%xmm2,%xmm3
1e9d: c4 e2 79 20 4a 08 vpmovsxbw 0x8(%rdx),%xmm1
1ea3: c4 e2 79 30 e2 vpmovzxbw %xmm2,%xmm4
1ea8: c4 e2 79 20 12 vpmovsxbw (%rdx),%xmm2
1ead: c5 d9 f5 e2 vpmaddwd %xmm2,%xmm4,%xmm4
1eb1: c5 e1 f5 d9 vpmaddwd %xmm1,%xmm3,%xmm3
1eb5: c4 e2 59 02 eb vphaddd %xmm3,%xmm4,%xmm5
{...}
define internal fastcc void
@tvmgen_default_fused_nn_contrib_dense_pack_compute {
entry:
{...}
%3 = load i32, ptr %1, align 64, !tbaa !310
%4 = insertelement <4 x i32> undef, i32 %3, i64 0
%5 = bitcast <4 x i32> %4 to <16 x i8>
%6 = shufflevector <16 x i8> %5, <16 x i8> poison, <16 x i32>
<i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3,
i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
%7 = zext <16 x i8> %6 to <16 x i16>
%8 = shufflevector <16 x i16> %7, <16 x i16> poison, <8 x i32>
<i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%9 = load <16 x i8>, ptr %2, align 64, !tbaa !312
%10 = sext <16 x i8> %9 to <16 x i16>
%11 = shufflevector <16 x i16> %10, <16 x i16> poison, <8 x i32>
<i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%12 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %8, <8 x i16>
%11)
%13 = shufflevector <16 x i16> %7, <16 x i16> poison, <8 x i32>
<i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%14 = shufflevector <16 x i16> %10, <16 x i16> poison, <8 x i32>
<i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%15 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %13, <8 x i16>
%14)
%16 = tail call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %12, <4 x
i32> %15)
%17 = getelementptr inbounds i8, ptr %1, i64 4
{...}
```
The ```fast-math``` one:
```
0000000000001e90 <tvmgen_default_fused_nn_contrib_dense_pack_compute_>:
1e90: c4 e2 79 18 06 vbroadcastss (%rsi),%xmm0
1e95: c5 f9 6f 12 vmovdqa (%rdx),%xmm2
1e99: c5 f9 6f 4a 10 vmovdqa 0x10(%rdx),%xmm1
1e9e: c4 e2 79 04 da vpmaddubsw %xmm2,%xmm0,%xmm3
1ea3: c4 e2 79 18 05 e4 11 vbroadcastss 0x11e4(%rip),%xmm0
# 3090 <_fini+0x620>
1eaa: 00 00
1eac: c5 e1 f5 d8 vpmaddwd %xmm0,%xmm3,%xmm3
define internal fastcc void
@tvmgen_default_fused_nn_contrib_dense_pack_compute_ {
entry:
{...}
%3 = load i32, ptr %1, align 64, !tbaa !310
%4 = insertelement <4 x i32> undef, i32 %3, i64 0
%5 = bitcast <4 x i32> %4 to <16 x i8>
%6 = shufflevector <16 x i8> %5, <16 x i8> poison, <16 x i32>
<i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3,
i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
%7 = load <16 x i8>, ptr %2, align 64, !tbaa !312
%8 = tail call <8 x i16> @llvm.x86.ssse3.pmadd.ub.sw.128(<16 x i8> %6, <16
x i8> %7)
%9 = tail call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %8, <8 x i16>
<i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>)
%10 = getelementptr inbounds i8, ptr %1, i64 4
{...}
```
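The two lowerings above can be modeled in NumPy: the precise path widens u8/s8 to int16 before the ```pmaddwd```-style pair sums (exact in int32), while the fast path does the ```pmaddubsw``` pair sum in saturating int16 first. This is a rough behavioral model, not the actual codegen:

```python
import numpy as np

def pmaddwd(a16, b16):
    """int16 pairwise multiply-add into int32 (cannot overflow)."""
    p = a16.astype(np.int32) * b16.astype(np.int32)
    return p[0::2] + p[1::2]

def pmaddubsw(u8, s8):
    """u8 x s8 pairwise multiply-add, saturated to int16."""
    p = u8.astype(np.int32) * s8.astype(np.int32)
    return np.clip(p[0::2] + p[1::2], -32768, 32767).astype(np.int16)

data = np.full(16, 255, dtype=np.uint8)    # adversarial full-range inputs
weight = np.full(16, 127, dtype=np.int8)

# Precise path: zero/sign-extend to int16 first, then pmaddwd.
precise_total = int(pmaddwd(data.astype(np.int16),
                            weight.astype(np.int16)).sum())

# Fast path: pmaddubsw (may saturate), then pmaddwd against ones.
fast_total = int(pmaddwd(pmaddubsw(data, weight).astype(np.int16),
                         np.ones(8, dtype=np.int16)).sum())

exact = int(data.astype(np.int64) @ weight.astype(np.int64))
print(precise_total == exact, fast_total == exact)
```

On in-range inputs (per the warning in the Notes) both paths agree; here the fast path saturates and diverges, which is why precise is the default.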
A compact full x86 SIMD instruction table is available
[here](https://www.officedaytime.com/simd512e/).
This work follows some suggestions from Intel's [oneDNN int8
notes](https://oneapi-src.github.io/oneDNN/v1.4/dev_guide_int8_computations.html).
#### Next
(WiP) This work will be extended to metaschedule auto-tensorization.
(WiP) Will try to bring ```int4``` (not native) support using the best
possible SIMD type conversions.
---
Cc: @masahi , @anijain2305, @jianyuh, @Qianshui-Jiang, @kparzysz-quic ,
@junrushao , @tqchen , @elvin-n , @vvchernov , @echuraev
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]