addmisol wrote:

For dot4 with <4 x i8> arguments:

The AMDGPU ABI unpacks <4 x i8> vectors into four separate i32 registers when
they are passed as function arguments. By the time the DAG combiner sees the
code, it looks like:
  v0 = byte0 (as i32)
  v1 = byte1 (as i32)
  v2 = byte2 (as i32)
  v3 = byte3 (as i32)

As a result, the packed-byte pattern that v_dot4_u32_u8 needs is lost.
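
To make the layout difference concrete, here is a minimal Python sketch (not LLVM code). The helper names `pack_u8x4` and `unpack_u8x4` are hypothetical, and placing byte0 in the low 8 bits is an assumption made for illustration. It shows the single packed word that a dot4 operand would hold, versus the four separate values listed above:

```python
def pack_u8x4(b0, b1, b2, b3):
    """Pack four byte values into one 32-bit word.

    Assumption for illustration: byte0 occupies the low 8 bits.
    """
    for b in (b0, b1, b2, b3):
        if not 0 <= b <= 0xFF:
            raise ValueError("each element must fit in 8 bits")
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)


def unpack_u8x4(word):
    """Split a 32-bit word into four byte values, one per 'register'."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


packed = pack_u8x4(1, 2, 3, 4)    # one packed i32 operand
separate = unpack_u8x4(packed)    # four separate values, as in v0..v3
```

Once the elements live in four separate registers, a matcher has to recognise the equivalent of `pack_u8x4` being rebuilt before it can form a packed operand.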

However, this case works. When the bytes are loaded from memory as a packed
i32, dot4 is selected correctly:
  v_dot4_u32_u8 v1, v1, v2, s0        ; non-saturating
  v_dot4_u32_u8 v1, v1, v2, s0 clamp  ; saturating
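
For reference, the two forms can be modelled as below. This is a sketch of my understanding of the instruction (sum of four unsigned byte products plus a 32-bit accumulator, with `clamp` saturating at the unsigned 32-bit maximum instead of wrapping), not something taken from the patch. The function name `dot4_u32_u8` and its parameters are illustrative only:

```python
U32_MAX = 0xFFFFFFFF


def dot4_u32_u8(a, b, acc, clamp=False):
    """Reference model: dot product of four packed u8 lanes plus acc.

    a, b: 32-bit words, each holding four unsigned bytes.
    acc:  unsigned 32-bit accumulator.
    clamp=False wraps modulo 2**32; clamp=True saturates at U32_MAX.
    """
    total = acc & U32_MAX
    for i in range(4):
        lane_a = (a >> (8 * i)) & 0xFF
        lane_b = (b >> (8 * i)) & 0xFF
        total += lane_a * lane_b
    if clamp:
        return min(total, U32_MAX)
    return total & U32_MAX


wrapped = dot4_u32_u8(0xFFFFFFFF, 0xFFFFFFFF, U32_MAX)                # non-saturating
saturated = dot4_u32_u8(0xFFFFFFFF, 0xFFFFFFFF, U32_MAX, clamp=True)  # saturating
```

The two results differ only when the sum overflows 32 bits, which is the case the `clamp` modifier exists to handle.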

So, in summary:
  - All dot2 patterns are fixed.
  - dot4 patterns work when the bytes come from memory (packed).
  - dot4 with <4 x i8> function arguments is not matched; this is an ABI
    limitation.

https://github.com/llvm/llvm-project/pull/187945
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
