cbalint13 commented on PR #15918:
URL: https://github.com/apache/tvm/pull/15918#issuecomment-1764265200

   
   > Very interesting PR!
   
   Thank you @ekalda!
   
   > > Introduce new `ArrayIntImm` expression for small immediate list of 
integer constants.
   
   * The main issue is that, within LLVM, some x86 (and other arch) instructions are not exposed at all.
   * So I had to look into adding ```zextend```, ```sextend``` and ```truncate```, plus ```vectorpermute``` and ```vectorshuffle```, instead.
   
   The good point is that these are lowered to exactly what is needed (even a single insn, optimal) for the target arch (x86 here).
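   For intuition, here is a minimal NumPy sketch (my own illustration, not TVM code; the array values are made up) of what the added ```zextend```, ```sextend``` and ```vectorpermute``` ops mean lane-wise:

   ```python
   import numpy as np

   # zextend("int16x32", x): zero-extend each uint8 lane to int16
   x = np.array([250, 3, 7, 130], dtype=np.uint8)
   z = x.astype(np.int16)               # stays [250, 3, 7, 130]

   # sextend("int16x32", y): sign-extend each int8 lane to int16
   y = np.array([-6, 3, -1, 100], dtype=np.int8)
   s = y.astype(np.int16)               # stays [-6, 3, -1, 100]

   # vectorpermute(dtype, v, idx): reorder lanes by a constant index list,
   # i.e. what an LLVM shufflevector with a constant mask does
   v = np.array([10, 11, 12, 13, 14, 15, 16, 17], dtype=np.int32)
   p = v[[0, 1, 4, 5, 2, 3, 6, 7]]      # [10, 11, 14, 15, 12, 13, 16, 17]
   ```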
   
   
   > Can you not use `AllocateConst`? It's a bit awkward to introduce a whole 
new TIR node to support a small set of intrinsics.
   
   Hmm, no: ```AllocateConst``` looked a bit too complicated to me.
   * There are already ```{Float,Int,String}Imm```, so why not ```ArrayIntImm```?
   * See the [simple usage](https://github.com/apache/tvm/blob/dc1b62916f68bf7b7f417eb4ea79bc75e83ec1a5/python/tvm/topi/x86/tensor_intrin.py#L180-L182) from Python: ```tir.vectorpermute("int32x8", whatever_vector, [0, 1, 4, 5, 2, 3, 6, 7])```
   * See the lowered IR (```llvm -mcpu=haswell```) with the ```{..} T.ArrayIntImm([0, 1, 4, 5, 2, 3, 6, 7]) {...}``` node:
   ```
   @I.ir_module
   class Module:
       @T.prim_func
       def tvmgen_default_fused_nn_contrib_dense_pack(
              p0: T.Buffer((4, 8), "uint8"), 
              p1: T.Buffer((1, 2, 8, 4), "int8"), 
              compute: T.Buffer((4, 8), "int32")):
           T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": 
T.bool(True)})
           for i_inner in range(4):
             compute_1 = T.Buffer((32,), "int32", data=compute.data)
             compute_1[i_inner * 8:i_inner * 8 + 8] = T.Broadcast(0, 8)
             for k_outer in range(2):
               p0_1 = T.Buffer((32,), "uint8", data=p0.data)
               p1_1 = T.Buffer((64,), "int8", data=p1.data)
                compute_1[i_inner * 8:i_inner * 8 + 8] = T.vectorpermute(
                    "int32x8",
                    T.call_llvm_pure_intrin(
                        "int32x8", "llvm.x86.avx2.phadd.d", T.uint32(2),
                        T.call_llvm_pure_intrin(
                            "int32x8", "llvm.x86.avx2.pmadd.wd", T.uint32(2),
                            T.vectorlow("void", T.zextend("int16x32", T.reinterpret("int8x32",
                                T.Broadcast(T.reinterpret("int32",
                                    p0_1[i_inner * 8 + k_outer * 4:i_inner * 8 + k_outer * 4 + 4]), 8)))),
                            T.vectorlow("void", T.sextend("int16x32",
                                p1_1[k_outer * 32:k_outer * 32 + 32]))),
                        T.call_llvm_pure_intrin(
                            "int32x8", "llvm.x86.avx2.pmadd.wd", T.uint32(2),
                            T.vectorhigh("void", T.zextend("int16x32", T.reinterpret("int8x32",
                                T.Broadcast(T.reinterpret("int32",
                                    p0_1[i_inner * 8 + k_outer * 4:i_inner * 8 + k_outer * 4 + 4]), 8)))),
                            T.vectorhigh("void", T.sextend("int16x32",
                                p1_1[k_outer * 32:k_outer * 32 + 32])))),
                    T.ArrayIntImm([0, 1, 4, 5, 2, 3, 6, 7])) + compute_1[i_inner * 8:i_inner * 8 + 8]
   ```
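   To see why exactly the ```T.ArrayIntImm([0, 1, 4, 5, 2, 3, 6, 7])``` mask appears, here is a hedged NumPy model (my own sketch, not TVM code; input values are arbitrary illustrations): ```vpmaddwd``` multiplies adjacent int16 lanes and adds each pair into int32, while ```vphaddd``` adds adjacent int32 pairs per 128-bit half, which interleaves the two operands, so the constant-mask permute is needed to restore sequential order:

   ```python
   import numpy as np

   # Sketch of vpmaddwd: multiply adjacent int16 lanes, add each pair into int32.
   def pmaddwd(a16, b16):
       prod = a16.astype(np.int32) * b16.astype(np.int32)
       return prod[0::2] + prod[1::2]

   # Sketch of vphaddd on 256-bit vectors: pairwise adds, per 128-bit half:
   # result = [a0+a1, a2+a3, b0+b1, b2+b3, a4+a5, a6+a7, b4+b5, b6+b7]
   def phaddd(a32, b32):
       lo = np.concatenate([a32[0:4:2] + a32[1:4:2], b32[0:4:2] + b32[1:4:2]])
       hi = np.concatenate([a32[4:8:2] + a32[5:8:2], b32[4:8:2] + b32[5:8:2]])
       return np.concatenate([lo, hi])

   # Two made-up pmaddwd results standing in for the low/high halves above
   a = pmaddwd(np.arange(16, dtype=np.int16), np.ones(16, dtype=np.int16))
   b = pmaddwd(np.arange(16, 32, dtype=np.int16), np.ones(16, dtype=np.int16))

   r = phaddd(a, b)                     # halves of a and b are interleaved here
   out = r[[0, 1, 4, 5, 2, 3, 6, 7]]    # the ArrayIntImm mask undoes the interleave
   ```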
   
   > 
   > > The `call_llvm_pure_intrin` & `call_llvm_intrin` now holds instruction 
`StringImm` instead of `IntImm` abstract.
   > 
   > I like this a lot, would make the TIR much easier to reason about!
   
   @ekalda 
   
   The work here (x86) is a pseudo kind of "scalable vector", having __m128, __m256, __m512 but "hand unrolled".
   I also follow your RFC on scalable vectors; I am interested in similar ideas for the riscv64 "v" extension.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
