addmisol wrote:

For dot4 with <4 x i8> args:
The AMDGPU ABI unpacks <4 x i8> vectors into four separate i32 registers when they are passed as function arguments. By the time the DAG combiner sees the code, it looks like:

    v0 = byte0 (as i32)
    v1 = byte1 (as i32)
    v2 = byte2 (as i32)
    v3 = byte3 (as i32)

The packed byte pattern needed for v_dot4_u32_u8 is lost.

This case does work, however: when the bytes are loaded from memory as a packed i32, dot4 is selected correctly:

    v_dot4_u32_u8 v1, v1, v2, s0        ; non-saturating
    v_dot4_u32_u8 v1, v1, v2, s0 clamp  ; saturating

So:

- All dot2 patterns are fixed.
- dot4 patterns work when the bytes come from memory (packed).
- dot4 with <4 x i8> function arguments is an ABI limitation.

https://github.com/llvm/llvm-project/pull/187945
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
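For readers less familiar with the packed form, here is a rough scalar sketch in C++ of what a packed unsigned 4x8-bit dot product computes. It is only a reference model, not code from the PR: the helper names `pack4` and `dot4_u32_u8_model` are hypothetical, and the wrap-around behaviour without clamp and the saturation to the 32-bit maximum with clamp are my assumptions about the instruction's semantics.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: pack four bytes into one 32-bit word, byte0 in the
// low 8 bits. This is the "packed i32" layout the dot4 pattern expects.
std::uint32_t pack4(std::uint8_t b0, std::uint8_t b1,
                    std::uint8_t b2, std::uint8_t b3) {
    return static_cast<std::uint32_t>(b0)
         | (static_cast<std::uint32_t>(b1) << 8)
         | (static_cast<std::uint32_t>(b2) << 16)
         | (static_cast<std::uint32_t>(b3) << 24);
}

// Hypothetical scalar model: sum of four unsigned byte products plus a
// 32-bit accumulator. Assumption: without clamp the result wraps modulo
// 2^32; with clamp it saturates at the largest uint32_t value.
std::uint32_t dot4_u32_u8_model(std::uint32_t a, std::uint32_t b,
                                std::uint32_t acc, bool clamp) {
    std::uint64_t sum = acc;  // 64-bit so overflow can be detected
    for (int i = 0; i < 4; ++i) {
        std::uint64_t x = (a >> (8 * i)) & 0xFFu;  // i-th byte of a
        std::uint64_t y = (b >> (8 * i)) & 0xFFu;  // i-th byte of b
        sum += x * y;
    }
    const std::uint64_t max32 = std::numeric_limits<std::uint32_t>::max();
    if (clamp && sum > max32) {
        return static_cast<std::uint32_t>(max32);
    }
    return static_cast<std::uint32_t>(sum);  // truncates (wraps) if needed
}
```

In the function-argument case described above, each byte arrives in its own i32 register, so something equivalent to `pack4` would have to be applied to the four values before a single packed dot4 could be used.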
