https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113112
Bug ID: 113112
Summary: RISC-V: Dynamic LMUL feature stabilization for GCC-14 release
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: juzhe.zhong at rivai dot ai
Target Milestone: ---

Created attachment 56922
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56922&action=edit
dynamic LMUL fail case

Hi, as we know, the dynamic LMUL feature is supported but not yet stable. As
far as I know, full coverage testing shows only these 2 execution FAILs:

FAIL: gcc.dg/pr30957-1.c execution test
FAIL: gcc.dg/signbit-5.c execution test

They are not real failures; the tests just need to be adjusted.

I have tested on the K230 and other hardware, and it turns out we get over 30%
performance improvement (compared with the default LMUL = M1) on various
benchmarks when we select a reasonably large LMUL that causes no additional
register spills.

However, I also find that some benchmarks show a significant performance drop
(compared with the default LMUL = M1) when using dynamic LMUL. I am pretty
sure this is because we pick a wrong, too-large LMUL (LMUL > 1), which causes
additional register spills and therefore bad performance in such situations.

For example:

#include <stdint-gcc.h>
#define N 40
int a[N];

__attribute__ ((noinline)) int
foo (int n)
{
  int i, j;
  int sum, x;
  for (i = 0; i < n; i++)
    {
      sum = 0;
      for (j = 0; j < n; j++)
	sum += (i + j);
      a[i] = sum;
    }
  return 0;
}

Compiled with:
-march=rv64gcv -mabi=lp64d -O3 --param riscv-autovec-lmul=dynamic
--param riscv-autovec-preference=fixed-vlmax

ASM:

foo:
	ble	a0,zero,.L11
	lui	a2,%hi(.LANCHOR0)
	addi	sp,sp,-128
	addi	a2,a2,%lo(.LANCHOR0)
	mv	a1,a0
	vsetvli	a6,zero,e32,m8,ta,ma
	vid.v	v8
	vs8r.v	v8,0(sp)
.L3:
	vl8re32.v	v16,0(sp)
	vsetvli	a4,a1,e8,m2,ta,ma
	li	a3,0
	vsetvli	a5,zero,e32,m8,ta,ma
	vmv8r.v	v0,v16
	vmv.v.x	v8,a4
	vmv.v.i	v24,0
	vadd.vv	v8,v16,v8
	vmv8r.v	v16,v24
	vs8r.v	v8,0(sp)
.L4:
	addiw	a3,a3,1
	vadd.vv	v8,v0,v16
	vadd.vi	v16,v16,1
	vadd.vv	v24,v24,v8
	bne	a0,a3,.L4
	vsetvli	zero,a4,e32,m8,ta,ma
	sub	a1,a1,a4
	vse32.v	v24,0(a2)
	slli	a4,a4,2
	add	a2,a2,a4
	bne	a1,zero,.L3
	li	a0,0
	addi	sp,sp,128
	jr	ra
.L11:
	li	a0,0
	ret

As we can see, we pick LMUL = 8 and then spill: with LMUL = 8 there are only
four register groups (v0, v8, v16, v24), which is not enough for the live
vector values in this loop, so a group is spilled to the stack
(vs8r.v / vl8re32.v around .L3).

This case was found by the following check I added into the mov pattern:

if (known_gt (GET_MODE_SIZE (mode), BYTES_PER_RISCV_VECTOR)
    && riscv_autovec_lmul == RVV_DYNAMIC
    && lra_in_progress)
  gcc_unreachable ();

The attachment lists the cases where we pick an incorrectly large LMUL that
causes additional spills.

I will work on this issue in the following days.
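
For reference, here is a minimal standalone sketch of the kind of
register-pressure check the dynamic LMUL selection needs, written as plain C
rather than against the actual GCC cost model; the function names, the
live-value estimate, and the pressure model are assumptions for illustration
only, not the real implementation:

/* Standalone sketch (not GCC internals): pick the largest LMUL in
   {1, 2, 4, 8} such that the estimated number of simultaneously live
   vector values still fits in the 32 RVV architectural registers once
   each value occupies LMUL registers.  choose_dynamic_lmul and the
   live-value count are hypothetical names used only for this sketch.  */

#include <stdio.h>

#define NUM_VECTOR_REGS 32

static int
choose_dynamic_lmul (int max_live_vector_values)
{
  /* Try the largest grouping first; shrink until no spill is implied.  */
  for (int lmul = 8; lmul > 1; lmul /= 2)
    if (max_live_vector_values * lmul <= NUM_VECTOR_REGS)
      return lmul;
  return 1; /* Default: LMUL = M1 always fits.  */
}

int
main (void)
{
  /* The loop above keeps roughly 5 vector values live at once
     (v0, v8, v16, v24 plus the spilled temporary), so LMUL = 8 needs
     5 * 8 = 40 > 32 registers and must spill, while LMUL = 4
     (5 * 4 = 20 <= 32) would not.  */
  printf ("live=5 -> LMUL=%d\n", choose_dynamic_lmul (5));
  printf ("live=3 -> LMUL=%d\n", choose_dynamic_lmul (3));
  return 0;
}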