[Bug middle-end/113474] RISC-V: Fail to use vmerge.vim for constant vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113474

JuzheZhong changed: Resolution: --- -> FIXED; Status: UNCONFIRMED -> RESOLVED

--- Comment #4 from JuzheZhong ---
Fixed.
[Bug target/115093] RISC-V Vector ICE in extract_insn: unrecognizable insn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115093

JuzheZhong changed: CC: added juzhe.zhong at rivai dot ai

--- Comment #1 from JuzheZhong ---
I think it's fixed by:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=44e7855e4e817a7f5a1e332cd95e780e57052dba

Confirmed on Compiler Explorer: https://godbolt.org/z/qf5GzoKre
[Bug c/115104] RISC-V: GCC-14 can combine vsext+vadd -> vwadd but Trunk GCC (GCC 15) Failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115104

--- Comment #1 from JuzheZhong ---
I wonder whether the RIVOS CI has already found which commit caused this regression?
[Bug c/115104] New: RISC-V: GCC-14 can combine vsext+vadd -> vwadd but Trunk GCC (GCC 15) Failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115104

Bug ID: 115104
Summary: RISC-V: GCC-14 can combine vsext+vadd -> vwadd but Trunk GCC (GCC 15) Failed
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: juzhe.zhong at rivai dot ai
Target Milestone: ---

I notice the following regressions in testing (each FAIL is repeated across several test configurations):

FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvfwadd\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwadd\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwaddu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvfwsub\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsub\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsubu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-3.c scan-assembler-times \\tvfwmul\\.vv 8
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-3.c scan-assembler-times \\tvwmul\\.vv 12
[Bug c/115068] New: RISC-V: Illegal instruction of vfwadd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115068

Bug ID: 115068
Summary: RISC-V: Illegal instruction of vfwadd
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: juzhe.zhong at rivai dot ai
Target Milestone: ---

#include <riscv_vector.h>
#include <stddef.h>

vfloat64m8_t test_vfwadd_wf_f64m8_m(vbool8_t vm, vfloat64m8_t vs2, float rs1, size_t vl) {
    return __riscv_vfwadd_wf_f64m8_m(vm, vs2, rs1, vl);
}

char global_memory[1024];
void *fake_memory = (void *)global_memory;

int main() {
    asm volatile("fence" ::: "memory");
    vfloat64m8_t vfwadd_wf_f64m8_m_vd = test_vfwadd_wf_f64m8_m(
        __riscv_vreinterpret_v_i8m1_b8(__riscv_vundefined_i8m1()),
        __riscv_vundefined_f64m8(), 1.0, __riscv_vsetvlmax_e64m8());
    asm volatile("" :: "vr"(vfwadd_wf_f64m8_m_vd) : "memory");
    return 0;
}

https://compiler-explorer.com/z/rq7K33zE5

main:
        fence
        lui     a5,%hi(.LC0)
        flw     fa5,%lo(.LC0)(a5)
        vsetvli a5,zero,e32,m4,ta,ma
        vfwadd.wf v0,v8,fa5,v0.t    ---> vd should not be v0.
        li      a0,0
        ret
[Bug target/114988] RISC-V: ICE in intrinsic __riscv_vfwsub_wf_f32mf2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114988

--- Comment #2 from JuzheZhong ---
Li Pan is going to work on it.

Hi, Kito and Jeff. Can this fix be backported to GCC-14?
[Bug c/114988] RISC-V: ICE in intrinsic __riscv_vfwsub_wf_f32mf2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114988

--- Comment #1 from JuzheZhong ---
Ideally, it should be reported as (-march=rv64gc): https://godbolt.org/z/3P76YEb9s

<source>: In function 'test_vfwsub_wf_f32mf2':
<source>:4:15: error: return type 'vfloat32mf2_t' requires the V ISA extension
    4 | vfloat32mf2_t test_vfwsub_wf_f32mf2(vfloat32mf2_t vs2, _Float16 rs1, size_t vl) {
      |               ^
Compiler returned: 1
[Bug c/114988] New: RISC-V: ICE in intrinsic __riscv_vfwsub_wf_f32mf2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114988

Bug ID: 114988
Summary: RISC-V: ICE in intrinsic __riscv_vfwsub_wf_f32mf2
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: juzhe.zhong at rivai dot ai
Target Milestone: ---

https://godbolt.org/z/ncxrx3fK9

#include <riscv_vector.h>
#include <stddef.h>

vfloat32mf2_t test_vfwsub_wf_f32mf2(vfloat32mf2_t vs2, _Float16 rs1, size_t vl) {
    return __riscv_vfwsub_wf_f32mf2(vs2, rs1, vl);
}

With -march=rv64gcv -O3:

:6:1: error: unrecognizable insn:
    6 | }
      | ^
(insn 8 5 12 2 (set (reg:RVVMF2SF 134 [ ])
        (if_then_else:RVVMF2SF
            (unspec:RVVMF64BI [
                    (const_vector:RVVMF64BI repeat [ (const_int 1 [0x1]) ])
                    (reg/v:DI 137 [ vl ])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (const_int 7 [0x7])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                    (reg:SI 69 frm)
                ] UNSPEC_VPREDICATE)
            (minus:RVVMF2SF (reg/v:RVVMF2SF 135 [ vs2 ])
                (float_extend:RVVMF2SF
                    (vec_duplicate:RVVMF4HF (reg/v:HF 136 [ rs1 ]))))
            (unspec:RVVMF2SF [ (reg:DI 0 zero) ] UNSPEC_VUNDEF))) "":5:10 -1
     (nil))

FP16 vectors need zvfh, so such an intrinsic should be reported as an illegal intrinsic in the frontend instead of causing an ICE.

With -march=rv64gcv_zvfh it compiles:

        vsetvli zero,a0,e16,mf4,ta,ma
        vfwsub.wf v8,v8,fa0
        ret
[Bug target/114887] RISC-V: expect M8 but M4 generated with dynamic LMUL for TSVC s319
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114887

--- Comment #2 from JuzheZhong ---
I think the analysis here is too conservative:

note: _1: type = float, start = 1, end = 6
note: _5: type = float, start = 6, end = 8
note: _3: type = float, start = 3, end = 7
note: _4: type = float, start = 5, end = 6
note: _2: type = float, start = 2, end = 3
note: _28: type = float, start = 7, end = 9
note: sum_18: type = real_t, start = 9, end = 9
note: sum_26: type = real_t, start = 0, end = 9

The variables live at point 6 should be:
1. _1
2. _3
3. _4
4. sum_26

So there are 4 variables in total, and each variable occupies 8 registers at LMUL = 8. The total number of live registers should be 4 * 8 = 32, which is OK for picking LMUL = 8.
[Bug target/114887] RISC-V: expect M8 but M4 generated with dynamic LMUL for TSVC s319
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114887

--- Comment #1 from JuzheZhong ---
The "vect" cost model analysis: https://godbolt.org/z/qbqzon8x1

note: Maximum lmul = 8, At most 40 number of live V_REG at program point 6 for bb 3

It seems that we count one more variable at program point 6?
[Bug target/114639] [riscv] ICE in create_pre_exit, at mode-switching.cc:451
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114639

--- Comment #18 from JuzheZhong ---
(In reply to Li Pan from comment #17)
> According to the V ABI, it looks like the asm code tries to save/restore the
> callee-saved registers when there is a call in the function body.
>
> | Name    | ABI Mnemonic | Meaning                | Preserved across calls?
> |---------|--------------|------------------------|------------------------
> | v0      |              | Argument register      | No
> | v1-v7   |              | Callee-saved registers | Yes
> | v8-v23  |              | Argument registers     | No
> | v24-v31 |              | Callee-saved registers | Yes

I see. https://godbolt.org/z/7bx1EEdGn
When we use 44 instead of get_vl (), the load/store instructions are gone.
[Bug target/114639] [riscv] ICE in create_pre_exit, at mode-switching.cc:451
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114639

--- Comment #16 from JuzheZhong ---
This issue is not fully fixed: the patch only fixes the ICE, and there is a codegen regression left: https://godbolt.org/z/4nvxeqb6K

Terrible codegen:

test(__rvv_uint64m4_t):
        addi    sp,sp,-16
        csrr    t0,vlenb
        sd      ra,8(sp)
        sub     sp,sp,t0
        vs1r.v  v1,0(sp)
        sub     sp,sp,t0
        vs1r.v  v2,0(sp)
        sub     sp,sp,t0
        vs1r.v  v3,0(sp)
        sub     sp,sp,t0
        vs1r.v  v4,0(sp)
        sub     sp,sp,t0
        vs1r.v  v5,0(sp)
        sub     sp,sp,t0
        vs1r.v  v6,0(sp)
        sub     sp,sp,t0
        vs1r.v  v7,0(sp)
        sub     sp,sp,t0
        vs1r.v  v24,0(sp)
        sub     sp,sp,t0
        vs1r.v  v25,0(sp)
        sub     sp,sp,t0
        vs1r.v  v26,0(sp)
        sub     sp,sp,t0
        vs1r.v  v27,0(sp)
        sub     sp,sp,t0
        vs1r.v  v28,0(sp)
        sub     sp,sp,t0
        vs1r.v  v29,0(sp)
        sub     sp,sp,t0
        vs1r.v  v30,0(sp)
        sub     sp,sp,t0
        csrr    t0,vlenb
        slli    t1,t0,2
        vs1r.v  v31,0(sp)
        sub     sp,sp,t1
        vs4r.v  v8,0(sp)
        call    get_vl()
        csrr    t0,vlenb
        slli    t1,t0,2
        vl4re64.v v8,0(sp)
        csrr    t0,vlenb
        add     sp,sp,t1
        vl1re64.v v31,0(sp)
        add     sp,sp,t0
        vl1re64.v v30,0(sp)
        add     sp,sp,t0
        vl1re64.v v29,0(sp)
        add     sp,sp,t0
        vl1re64.v v28,0(sp)
        add     sp,sp,t0
        vl1re64.v v27,0(sp)
        add     sp,sp,t0
        vl1re64.v v26,0(sp)
        add     sp,sp,t0
        vl1re64.v v25,0(sp)
        add     sp,sp,t0
        vl1re64.v v24,0(sp)
        add     sp,sp,t0
        vl1re64.v v7,0(sp)
        add     sp,sp,t0
        vl1re64.v v6,0(sp)
        add     sp,sp,t0
        vl1re64.v v5,0(sp)
        add     sp,sp,t0
        vl1re64.v v4,0(sp)
        add     sp,sp,t0
        vl1re64.v v3,0(sp)
        add     sp,sp,t0
        vl1re64.v v2,0(sp)
        add     sp,sp,t0
        vl1re64.v v1,0(sp)
        add     sp,sp,t0
        ld      ra,8(sp)
        vsetvli zero,a0,e64,m4,ta,ma
        vmsne.vi v0,v8,0
        addi    sp,sp,16
        jr      ra
[Bug target/114809] [RISC-V RVV] Counting elements might be simpler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809

JuzheZhong changed: CC: added juzhe.zhong at rivai dot ai

--- Comment #3 from JuzheZhong ---
As for the missed peephole optimization, I noticed it a long time ago and already filed a PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113014
Such issues will be gone after Richard Sandiford @arm merges the late-combine pass in GCC 15.

Also, GCC supports dynamic LMUL optimization with -mrvv-max-lmul=dynamic: https://godbolt.org/z/646nYoKbv

ASM:

count_chars(char const*, unsigned long, char):
        beq     a1,zero,.L4
        vsetvli a4,zero,e8,m1,ta,ma
        vmv.v.x v1,a2
        vsetvli zero,zero,e64,m8,ta,ma
        vmv.v.i v8,0
.L3:
        vsetvli a5,a1,e8,m1,ta,ma
        vle8.v  v0,0(a0)
        sub     a1,a1,a5
        add     a0,a0,a5
        vmseq.vv v0,v0,v1
        vsetvli zero,zero,e64,m8,tu,mu
        vadd.vi v8,v8,1,v0.t
        bne     a1,zero,.L3
        vsetvli a5,zero,e64,m8,ta,ma
        li      a4,0
        vmv.s.x v1,a4
        vredsum.vs v8,v8,v1
        vmv.x.s a0,v8
        ret
.L4:
        li      a0,0
        ret

GCC picks LMUL = 8, since it doesn't cause additional register spilling given the program's register pressure.
[Bug target/114714] [RISC-V][RVV] ICE: insn does not satisfy its constraints (postreload)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114714

JuzheZhong changed: CC: added juzhe.zhong at rivai dot ai

--- Comment #6 from JuzheZhong ---
(In reply to Robin Dapp from comment #5)
> Did anybody do some further investigation here? Juzhe messaged me that this
> PR is the original reason for the reversal but I don't yet understand why
> the register filters don't encompass the full semantics of RVV overlap.
>
> I looked into the test case and what happens is that, in order to determine
> the validity of the alternatives, riscv_get_v_regno_alignment is first being
> called with an M2 mode. Our destination is actually a
> (subreg:RVVM2SI (reg:RVVM4SI ...) 0), though. I suppose lra/reload check
> whether a non-subreg destination also works and hands us a (reg:RVVM4SI ...)
> as operand[0]. We pass this to riscv_get_v_regno_alignment which, for an
> LMUL4 mode, returns 4, thus wrongly enabling the W42 alternatives.
> A W42 alternative permits hard regs % 4 == 2, which causes us to eventually
> choose vr2 as destination and source. Once the constraints are actually
> checked we have a mismatch as none of the alternatives work.
>
> Now I'm not at all sure how lra/reload use operand[0] here but this can
> surely be found out. A quick and dirty hack (attached) that checks the
> insn's destination mode instead of operand[0]'s mode gets rid of the ICE
> and doesn't cause regressions.
>
> I suppose we're too far ahead with the reversal already but I'd really have
> preferred more details. Maybe somebody has had an in-depth look but it just
> wasn't posted yet?
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -6034,6 +6034,22 @@ riscv_get_v_regno_alignment (machine_mode mode)
>    return lmul;
>  }
>
> +int
> +riscv_get_dest_alignment (rtx_insn *insn, rtx operand)
> +{
> +  const_rtx set = 0;
> +  if (GET_CODE (PATTERN (insn)) == SET)
> +    {
> +      set = PATTERN (insn);
> +      rtx op = SET_DEST (set);
> +      return riscv_get_v_regno_alignment (GET_MODE (op));
> +    }
> +  else
> +    {
> +      return riscv_get_v_regno_alignment (GET_MODE (operand));
> +    }
> +}
> +
>  /* Define ASM_OUTPUT_OPCODE to do anything special before
>     emitting an opcode. */
>  const char *
> diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
> index ce1ee6b9c5e..5113daf2ac7 100644
> --- a/gcc/config/riscv/riscv.md
> +++ b/gcc/config/riscv/riscv.md
> @@ -550,15 +550,15 @@ (define_attr "group_overlap_valid" "no,yes"
>          (const_string "yes")
>
>     (and (eq_attr "group_overlap" "W21")
> -        (match_test "riscv_get_v_regno_alignment (GET_MODE (operands[0])) != 2"))
> +        (match_test "riscv_get_dest_alignment (insn, operands[0]) != 2"))
>     (const_string "no")
>
>     (and (eq_attr "group_overlap" "W42")
> -        (match_test "riscv_get_v_regno_alignment (GET_MODE (operands[0])) != 4"))
> +        (match_test "riscv_get_dest_alignment (insn, operands[0]) != 4"))
>     (const_string "no")
>
>     (and (eq_attr "group_overlap" "W84")
> -        (match_test "riscv_get_v_regno_alignment (GET_MODE (operands[0])) != 8"))
> +        (match_test "riscv_get_dest_alignment (insn, operands[0]) != 8"))
>     (const_string "no")

This hack looks good to me. But we already reverted multiple patches (sorry for that), and I think we eventually need to revert those reverts and support register-group overlap in another, optimal way (extending the constraints for RVV in IRA/LRA).
[Bug tree-optimization/114749] [13 Regression] RISC-V rv64gcv ICE: in vectorizable_load, at tree-vect-stmts.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114749

--- Comment #4 from JuzheZhong ---
Hi, Patrick. It seems that Richard didn't include the testcase in the patch. Could you send a patch to add the testcase for the RISC-V port? Thanks.
[Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spills with -fschedule-insns
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

JuzheZhong changed: CC: added juzhe.zhong at rivai dot ai

--- Comment #5 from JuzheZhong ---
Did you try another scheduler (-fselective-scheduling) to see whether the spill issues still exist?
[Bug target/114686] Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686

JuzheZhong changed: CC: added juzhe.zhong at rivai dot ai

--- Comment #2 from JuzheZhong ---
CCing RISC-V folks who may be interested in it.

Yeah, I agree with making dynamic LMUL the default. I suggested it a long time ago; however, almost all other RISC-V folks disagreed.

Here is data from Li Pan @intel:
https://github.com/Incarnation-p-lee/Incarnation-p-lee/blob/master/performance/coremark-pro/coremark-pro_in_k230_evb.png

Auto-vectorizing coremark-pro with both LLVM and GCC (at all LMULs), it turns out dynamic LMUL is beneficial.

>> The vrgather.vv instruction should be exempt from that, because an LMUL=8
>> vrgather.vv is way more powerful than eight LMUL=1 vrgather.vv instructions,
>> and thus disproportionately complex to implement. When you don't need to
>> cross lanes, it's possible to unroll LMUL=1 vrgathers manually, instead of
>> choosing a higher LMUL.

Agreed. For some instructions like vrgather, we shouldn't pick a large LMUL even when the register pressure of the program is OK. We can treat large-LMUL vrgather as expensive in the dynamic LMUL cost model and optimize this in GCC-15.

>> vcompress.vm doesn't scale linearly with LMUL on the XuanTie chips either,
>> but a better implementation is conceivable, because the work can be better
>> distributed/subdivided. GCC currently doesn't seem to generate vcompress.vm
>> via auto-vectorization anyway: https://godbolt.org/z/Mb5Kba865

GCC may generate vcompress in auto-vectorization; your case is one where GCC failed to vectorize at all, which we will likely address in GCC-15. Here are some cases where GCC does generate vcompress: https://godbolt.org/z/5GKh4eM7z
[Bug target/114639] [riscv] ICE in create_pre_exit, at mode-switching.cc:451
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114639

--- Comment #6 from JuzheZhong ---
It is definitely a regression: https://compiler-explorer.com/z/e68x5sT9h

GCC 13.2 is OK, but GCC 14 ICEs. I think you should bisect first.
[Bug tree-optimization/114476] [13/14 Regression] wrong code with -fwrapv -O3 -fno-vect-cost-model (and -march=armv9-a+sve2 on aarch64 and -march=rv64gcv on riscv)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114476

--- Comment #7 from JuzheZhong ---
Hi, Robin. Will you fix this bug?
[Bug target/114506] RISC-V: expect M8 but M4 generated with dynamic LMUL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114506

JuzheZhong changed: CC: added juzhe.zhong at rivai dot ai

--- Comment #4 from JuzheZhong ---
(In reply to Andrew Pinski from comment #2)
> Using -fno-vect-cost-model forces the use of M8 though.
>
> I have no idea what this cost model is trying to prove here.

We shouldn't force M8. We support a dynamic LMUL cost model that heuristically analyzes vector register pressure at the SSA level, so that we can pick the optimal LMUL. This PR shows that the RVV dynamic LMUL cost model unexpectedly picks LMUL 4 instead of LMUL 8, so we should adjust the dynamic LMUL cost model to fix this issue.
[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #19 from JuzheZhong ---
I think it's better to add pr114396.c to the generic vect testsuite instead of the x86 target tests, since the bug doesn't only happen on x86.
[Bug tree-optimization/113281] [11/12/13 Regression] Latent wrong code due to vectorization of shift reduction and missing promotions since r9-1590
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281

--- Comment #28 from JuzheZhong ---
The original cost model I wrote worked for all cases, but with some middle-end changes the cost model now fails. I don't have time to figure out what's going on here. Robin may be interested in it.
[Bug middle-end/114109] x264 satd vectorization vs LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #3 from JuzheZhong ---
(In reply to Robin Dapp from comment #2)
> It is vectorized with a higher zvl, e.g. zvl512b, refer
> https://godbolt.org/z/vbfjYn5Kd.

OK, I see. But Clang generates many slide instructions, which are expensive on real hardware, and vluxei64 is also expensive. I am not sure which is better. It should be tested on real RISC-V hardware to evaluate the performance, rather than simply compared via dynamic instruction counts on Spike/QEMU.
[Bug middle-end/114109] x264 satd vectorization vs LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #1 from JuzheZhong ---
It seems RISC-V Clang didn't vectorize it? https://godbolt.org/z/G4han6vM3
[Bug target/113913] [14] RISC-V: suboptimal code gen for intrinsic vcreate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113913

--- Comment #2 from JuzheZhong ---
It's a known issue that we are trying to fix in GCC-15. My colleague Lehua is taking care of it. CCing Lehua.
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #16 from JuzheZhong ---
The FMA is generated in the widening_mul pass:

Before widening_mul (fab1):

  _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
  _6 = _5 * 1.229982236431605997495353221893310546875e-1;
  _8 = _4 + _6;

After widening_mul:

  _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
  _8 = .FMA (_5, 1.229982236431605997495353221893310546875e-1, _4);

I think it's obvious: widening_mul chooses to transform the latter two statements,

  _6 = _5 * 1.229982236431605997495353221893310546875e-1;
  _8 = _4 + _6;

into:

  _8 = .FMA (_5, 1.229982236431605997495353221893310546875e-1, _4);

without any re-association.
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #15 from JuzheZhong ---
(In reply to rguent...@suse.de from comment #14)
> On Wed, 7 Feb 2024, juzhe.zhong at rivai dot ai wrote:
>
> > Ok. I found the optimized tree:
> >
> >   _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
> >   _8 = .FMA (_5, 1.229982236431605997495353221893310546875e-1, _4);
> >
> > Let CST0 = 3.33314829616256247390992939472198486328125e-1,
> > CST1 = 1.229982236431605997495353221893310546875e-1.
> >
> > The expression is equivalent to the following:
> >
> >   _5 = CST0 - _4;
> >   _8 = _5 * CST1 + _4;
> >
> > That is:
> >
> >   _8 = (CST0 - _4) * CST1 + _4;
> >
> > So, we should be able to re-associate it like Clang:
> >
> >   _8 = CST0 * CST1 - _4 * CST1 + _4; ---> _8 = CST0 * CST1 + _4 * (1 - CST1);
> >
> > Since both CST0 * CST1 and 1 - CST1 can be pre-computed during compilation
> > time, let CST2 = CST0 * CST1 and CST3 = 1 - CST1; then we can re-associate
> > as Clang does:
> >
> >   _8 = FMA (_4, CST3, CST2).
> >
> > Any suggestions for this re-association? Is match.pd the right place to do it?
>
> You need to look at the IL before we do .FMA forming, specifically
> before/after the late reassoc pass. The pass applying match.pd
> patterns everywhere is forwprop.
>
> I also wonder which compilation flags you are using (note clang
> has different defaults for example for -ftrapping-math)

Both GCC and Clang are using -Ofast -ffast-math.
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #13 from JuzheZhong ---
Ok. I found the optimized tree:

  _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
  _8 = .FMA (_5, 1.229982236431605997495353221893310546875e-1, _4);

Let CST0 = 3.33314829616256247390992939472198486328125e-1,
CST1 = 1.229982236431605997495353221893310546875e-1.

The expression is equivalent to the following:

  _5 = CST0 - _4;
  _8 = _5 * CST1 + _4;

That is:

  _8 = (CST0 - _4) * CST1 + _4;

So, we should be able to re-associate it like Clang:

  _8 = CST0 * CST1 - _4 * CST1 + _4; ---> _8 = CST0 * CST1 + _4 * (1 - CST1);

Since both CST0 * CST1 and 1 - CST1 can be pre-computed during compilation time, let CST2 = CST0 * CST1 and CST3 = 1 - CST1; then we can re-associate as Clang does:

  _8 = FMA (_4, CST3, CST2).

Any suggestions for this re-association? Is match.pd the right place to do it? Thanks.
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #12 from JuzheZhong ---
Ok. I found that even without vectorization, GCC is worse than Clang: https://godbolt.org/z/addr54Gc6

GCC (15 instructions inside the loop):

        fld     fa3,0(a0)
        fld     fa5,8(a0)
        fld     fa1,16(a0)
        fsub.d  fa4,ft2,fa3
        addi    a0,a0,160
        fadd.d  fa5,fa5,fa1
        addi    a1,a1,160
        addi    a5,a5,160
        fmadd.d fa4,fa4,fa2,fa3
        fnmsub.d fa5,fa5,ft1,ft0
        fsd     fa4,-160(a1)
        fld     fa4,-152(a0)
        fadd.d  fa4,fa4,fa0
        fmadd.d fa5,fa5,fa2,fa4
        fsd     fa5,-160(a5)

Clang (13 instructions inside the loop):

        fld     fa1, -8(a0)
        fld     fa0, 0(a0)
        fld     ft0, 8(a0)
        fmadd.d fa1, fa1, fa4, fa5
        fsd     fa1, 0(a1)
        fld     fa1, 0(a0)
        fadd.d  fa0, ft0, fa0
        fmadd.d fa0, fa0, fa2, fa3
        fadd.d  fa1, fa0, fa1
        add     a4, a1, a3
        fsd     fa1, -376(a4)
        addi    a1, a1, 160
        addi    a0, a0, 160

The critical thing is that GCC has

        fsub.d  fa4,ft2,fa3
        fadd.d  fa5,fa5,fa1
        fmadd.d fa4,fa4,fa2,fa3
        fnmsub.d fa5,fa5,ft1,ft0
        fadd.d  fa4,fa4,fa0
        fmadd.d fa5,fa5,fa2,fa4

that is, 6 floating-point operations, while Clang has

        fmadd.d fa1, fa1, fa4, fa5
        fadd.d  fa0, ft0, fa0
        fmadd.d fa0, fa0, fa2, fa3
        fadd.d  fa1, fa0, fa1

that is, 4. The 2 extra floating-point operations are very critical to performance, since double-precision floating-point operations are usually costly in real hardware.
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #11 from JuzheZhong ---
Hi, I think this RVV compiler codegen is the optimal codegen we want for RVV: https://repo.hca.bsc.es/epic/z/P6QXCc

.LBB0_5:                          # %vector.body
        sub     a4, t0, a3
        vsetvli t1, a4, e64, m1, ta, mu
        mul     a2, a3, t2
        add     a5, t3, a2
        vlse64.v v8, (a5), t2
        add     a4, a6, a2
        vlse64.v v9, (a4), t2
        add     a4, a0, a2
        vlse64.v v10, (a4), t2
        vfadd.vv v8, v8, v9
        vfmul.vf v8, v8, fa5
        vfadd.vf v9, v10, fa4
        vfmadd.vf v9, fa3, v10
        vlse64.v v10, (a5), t2
        add     a4, a1, a2
        vsse64.v v9, (a4), t2
        vfadd.vf v8, v8, fa2
        vfmadd.vf v8, fa3, v10
        vfadd.vf v8, v8, fa1
        add     a2, a2, a7
        add     a3, a3, t1
        vsse64.v v8, (a2), t2
        bne     a3, t0, .LBB0_5
[Bug tree-optimization/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #22 from JuzheZhong ---
I have done the following experiment:

diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
index bf017137260..8c36cc63d3b 100644
--- a/gcc/tree-ssa-loop-ivcanon.cc
+++ b/gcc/tree-ssa-loop-ivcanon.cc
@@ -1260,6 +1260,39 @@ canonicalize_loop_induction_variables (class loop *loop,
       may_be_zero = false;
     }

+  if (!exit)
+    {
+      auto_vec<edge> exits = get_loop_exit_edges (loop);
+      exit = exits[0];
+      class tree_niter_desc desc1;
+      class tree_niter_desc desc2;
+      if (number_of_iterations_exit (loop, exits[0], &desc1, false)
+          && number_of_iterations_exit (loop, exits[1], &desc2, false))
+        {
+          niter = fold_build2 (MIN_EXPR, unsigned_type_node, desc1.niter,
+                               desc2.niter);
+          create_canonical_iv (loop, exit, niter);
+          gcond *cond_stmt;
+          class nb_iter_bound *elt;
+          for (elt = loop->bounds; elt; elt = elt->next)
+            {
+              if (elt->is_exit
+                  && !wi::ltu_p (loop->nb_iterations_upper_bound, elt->bound))
+                {
+                  cond_stmt = as_a <gcond *> (elt->stmt);
+                  break;
+                }
+            }
+          if (exits[1]->flags & EDGE_TRUE_VALUE)
+            gimple_cond_make_false (cond_stmt);
+          else
+            gimple_cond_make_true (cond_stmt);
+          update_stmt (cond_stmt);
+          return false;
+        }
+    }
+

I know the check is wrong; it is just for the experiment. Then:

  [local count: 69202658]:
  _21 = (unsigned int) N_13(D);
  _22 = MIN_EXPR <_21, 1001>;   ---> use MIN_EXPR as the check
  _23 = _22 + 1;
  goto ; [100.00%]

  [local count: 1014686025]:
  _1 = (long unsigned int) i_9;
  _2 = _1 * 4;
  _3 = a_14(D) + _2;
  _4 = *_3;
  _5 = b_15(D) + _2;
  _6 = *_5;
  _7 = c_16(D) + _2;
  _8 = _4 + _6;
  *_7 = _8;
  if (0 != 0)
    goto ; [1.00%]
  else
    goto ; [99.00%]

  [local count: 1004539166]:
  i_18 = i_9 + 1;

  [local count: 1073741824]:
  # i_9 = PHI <0(2), i_18(4)>
  # ivtmp_19 = PHI <_23(2), ivtmp_20(4)>
  ivtmp_20 = ivtmp_19 - 1;
  if (ivtmp_20 != 0)
    goto ; [94.50%]
  else
    goto ; [5.50%]

  [local count: 69202658]:
  return;

Then it can vectorize. I am not sure whether this is the right place to put the code.
[Bug target/113608] RISC-V: Vector spills after enabling vector abi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113608

--- Comment #2 from JuzheZhong ---

vuint16m2_t vadd(vuint16m2_t a, vuint8m1_t b) {
    int vl = __riscv_vsetvlmax_e8m1();
    vuint16m2_t c = __riscv_vzext_vf2_u16m2(b, vl);
    return __riscv_vadd_vv_u16m2(a, c, vl);
}
[Bug tree-optimization/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #21 from JuzheZhong ---
Hi, Richard. I looked into ivcanon and found:

  /* If the loop has more than one exit, try checking all of them
     for # of iterations determinable through scev.  */
  if (!exit)
    niter = find_loop_niter (loop, &exit);

In find_loop_niter, we iterate over the 2 exit edges:

1. bb 5 -> bb 6 with niter = (unsigned int) N_13(D).
2. bb 3 -> bb 6 with niter = 1001.

It just skips niter = (unsigned int) N_13(D) in:

  if (!integer_zerop (desc.may_be_zero))
    continue;

so find_loop_niter returns 1001, skipping (unsigned int) N_13(D). Should it return MIN (1001, (unsigned int) N_13(D))?

I prefer to fix it in ivcanon, since I believe that would be more elegant than fixing it in the loop splitter. I am still investigating; any guidance will be really appreciated. Thanks.
[Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #11 from JuzheZhong ---
Hi, Tamar. We are interested in supporting saturating and rounding arithmetic. We may need to support the scalar patterns first. Do you have any suggestions? Or are you already working on it? Thanks.
[Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #10 from JuzheZhong ---
Hi, Tamar. We are interested in supporting saturating and rounding arithmetic. We may need to support the scalar patterns first. Do you have any suggestions? Or are you already working on it? Thanks.
[Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492 --- Comment #9 from JuzheZhong --- Ok. After investigating LLVM:

Before the loop vectorizer:

  %cond12 = tail call i32 @llvm.usub.sat.i32(i32 %conv5, i32 %wsize)
  %conv13 = trunc i32 %cond12 to i16

After the loop vectorizer:

  %10 = call <16 x i32> @llvm.usub.sat.v16i32(<16 x i32> %9, <16 x i32> %broadcast.splat)
  %11 = trunc <16 x i32> %10 to <16 x i16>

I think GCC can follow this approach, that is, first recognize the scalar saturation, then let the loop vectorizer turn it into the vector saturation.
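For context, a sketch of the scalar idiom such a pattern recognizer would have to match (the exact GCC pattern, if added, may differ): the branched and branchless spellings of unsigned saturating subtraction that Clang folds to llvm.usub.sat.

```c
#include <stdint.h>

/* Unsigned saturating subtraction: the result clamps at 0 instead of
   wrapping.  A scalar pattern matcher would canonicalize forms like
   these to one internal representation, which the loop vectorizer
   could then map to an instruction such as vssubu.vv on RVV.  */
static uint32_t usat_sub_branch (uint32_t a, uint32_t b)
{
  return a > b ? a - b : 0;
}

static uint32_t usat_sub_branchless (uint32_t a, uint32_t b)
{
  uint32_t d = a - b;               /* wraps around if a < b        */
  return d & -(uint32_t)(a >= b);   /* mask result to 0 on wrap     */
}
```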
[Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492 --- Comment #8 from JuzheZhong --- Missing saturation vectorization made RVV Clang 20% better than RVV GCC in a recent benchmark evaluation, in the coremark-pro zip-test. I believe other targets should be the same. I wonder how we should start to support it. Or has somebody already started on it?
[Bug c/113695] RISC-V: Sources with different EEW must use different registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113695 --- Comment #1 from JuzheZhong --- Since both operands are input operands, an early-clobber "&" constraint cannot help.
[Bug c/113695] New: RISC-V: Sources with different EEW must use different registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113695 Bug ID: 113695 Summary: RISC-V: Sources with different EEW must use different registers Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: juzhe.zhong at rivai dot ai Target Milestone: ---

As in this LLVM PR: https://github.com/llvm/llvm-project/issues/80099

RVV ISA: A vector register cannot be used to provide source operands with more than one EEW for a single instruction. A mask register source is considered to have EEW=1 for this constraint. An encoding that would result in the same vector register being read with two or more different EEWs, including when the vector register appears at different positions within two or more vector register groups, is reserved.

#include <stdint.h>
#include <riscv_vector.h>

void foo(vuint64m2_t colidx, uint32_t* base_addr, size_t vl) {
    vuint32m1_t values = __riscv_vget_v_u32m2_u32m1(__riscv_vreinterpret_v_u64m2_u32m2(colidx), 0);
    __riscv_vsuxei64_v_u32m1(base_addr, colidx, values, vl);
}

foo:
    vsetvli zero,a1,e32,m1,ta,ma
    vsuxei64.v v8,(a0),v8
    ret

This is incorrect: those 2 input operands with different EEW must not be the same register (v8). The current GCC RTL machine description and constraints do not allow us to fix it. Even though it is a bug, I think we can only revisit it in GCC-15.
[Bug tree-optimization/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #19 from JuzheZhong --- The loop is: bb 3 -> bb 4 -> bb 5, with latch edges back from bb 4 and bb 5. The condition in bb 3 is if (i_21 == 1001). The condition in bb 4 is if (N_13(D) > i_18). Looking into lsplit: this loop doesn't satisfy the check of:

  if (split_loop (loop) || split_loop_on_cond (loop))

split_loop_on_cond tries to split a loop whose condition is loop-invariant. However, neither bb 3's nor bb 4's condition is loop-invariant. I wonder whether we should add a new kind of loop splitter, like:

diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc
index 04215fe7937..a4081b9b6f5 100644
--- a/gcc/tree-ssa-loop-split.cc
+++ b/gcc/tree-ssa-loop-split.cc
@@ -1769,7 +1769,8 @@ tree_ssa_split_loops (void)
       if (optimize_loop_for_size_p (loop))
         continue;

-      if (split_loop (loop) || split_loop_on_cond (loop))
+      if (split_loop (loop) || split_loop_on_cond (loop)
+          || split_loop_for_early_break (loop))
         {
           /* Mark our containing loop as having had some split inner
              loops.  */
           loop_outer (loop)->aux = loop;
[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395 --- Comment #18 from JuzheZhong --- (In reply to rguent...@suse.de from comment #17)
> On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> >
> > --- Comment #16 from JuzheZhong ---
> > (In reply to rguent...@suse.de from comment #15)
> > > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > >
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > >
> > > > --- Comment #14 from JuzheZhong ---
> > > > Thanks Richard.
> > > >
> > > > It seems that we can't fix this issue for now. Is that right ?
> > > >
> > > > If I understand correctly, do you mean we should wait after SLP
> > > > representations are finished and then revisit this PR?
> > >
> > > Yes.
> >
> > It seems to be a big refactor work.
>
> It's not too bad if people wouldn't continue to add features not
> implementing SLP ...
>
> > I wonder I can do anything to help with SLP representations ?
>
> I hope to get back to this before stage1 re-opens and will post
> another request for testing. It's really mostly going to be making
> sure all paths have coverage which means testing all the various
> architectures - I can only easily test x86. There's a branch
> I worked on last year, refs/users/rguenth/heads/vect-force-slp,
> which I use to hunt down cases not supporting SLP (it's a bit
> overeager to trigger, and it has known holes so it's not really
> a good starting point yet for folks to try other archs).

Ok. It seems that you are almost done with that, but it needs more testing on various targets. So, if I want to work on optimizing vectorization (starting with TSVC), I should avoid touching cases that fail to vectorize due to data reference/dependence analysis (e.g. this PR's case, s116), and avoid adding new features to the loop vectorizer, e.g. min/max reduction with index (s315), so as not to make your SLP refactoring work heavier. Am I right?
[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395 --- Comment #16 from JuzheZhong --- (In reply to rguent...@suse.de from comment #15)
> On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> >
> > --- Comment #14 from JuzheZhong ---
> > Thanks Richard.
> >
> > It seems that we can't fix this issue for now. Is that right ?
> >
> > If I understand correctly, do you mean we should wait after SLP
> > representations are finished and then revisit this PR?
>
> Yes.

It seems to be a big refactoring job. I wonder whether I can do anything to help with the SLP representations?
[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395 --- Comment #14 from JuzheZhong --- Thanks Richard. It seems that we can't fix this issue for now. Is that right? If I understand correctly, do you mean we should wait until the SLP representations are finished and then revisit this PR?
[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395 --- Comment #12 from JuzheZhong --- OK. It seems it has a data dependency issue:

  missed: not vectorized, possible dependence between data-refs a[i_15] and a[_4]

  a[i_15] = _3;   <- STMT 1
  _4 = i_15 + 2;
  _5 = a[_4];     <- STMT 2

STMT 2 should not depend on STMT 1, but it is recognized as a dependency in vect_analyze_data_ref_dependence. Is it reasonable to fix it in vect_analyze_data_ref_dependence?
[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395 --- Comment #11 from JuzheZhong --- It seems that we should first fix this case (which Richard gave), which I think is not a SCEV or value-numbering issue:

double a[1024];
void foo ()
{
  for (int i = 0; i < 1022; i += 2)
    {
      double tem = a[i+1];
      a[i] = tem * a[i];
      a[i+1] = a[i+2] * tem;
    }
}

auto.c:13:21: missed: couldn't vectorize loop
auto.c:15:14: missed: not vectorized: no vectype for stmt: tem_10 = a[_1]; scalar_type: double
[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395 --- Comment #10 from JuzheZhong --- I think the root cause is that we consider i_16 and _1 to alias due to scalar evolution:

  (get_scalar_evolution (scalar = i_16) (scalar_evolution = {0, +, 2}_1))
  (get_scalar_evolution (scalar = _1) (scalar_evolution = {1, +, 2}_1))

even though I don't fully understand what that means.

diff --git a/gcc/tree-scalar-evolution.cc b/gcc/tree-scalar-evolution.cc
index 25e3130e2f1..2df6de67043 100644
--- a/gcc/tree-scalar-evolution.cc
+++ b/gcc/tree-scalar-evolution.cc
@@ -553,7 +553,7 @@ get_scalar_evolution (basic_block instantiated_below, tree scalar)
       if (SSA_NAME_IS_DEFAULT_DEF (scalar))
         res = scalar;
       else
-        res = *find_var_scev_info (instantiated_below, scalar);
+        res = scalar;
       break;
     case REAL_CST:

Ah... I tried an ugly hack, which is definitely wrong (just for the experiment), in scalar evolution. Then we can vectorize it:

foo:
        lui     a1,%hi(a)
        addi    a1,a1,%lo(a)
        li      a2,511
        li      a3,0
        vsetivli        zero,2,e64,m1,ta,ma
.L2:
        addiw   a5,a3,1
        slli    a5,a5,3
        add     a5,a1,a5
        fld     fa5,0(a5)
        slli    a4,a3,3
        add     a4,a1,a4
        vlse64.v        v2,0(a4),zero
        vle64.v v1,0(a5)
        vfslide1down.vf v2,v2,fa5
        addiw   a2,a2,-1
        vfmul.vv        v1,v1,v2
        vse64.v v1,0(a4)
        addiw   a3,a3,2
        bne     a2,zero,.L2
        ret

I think we can add some simple memory access index recognition, but I don't know where to add this recognition. Would you mind giving me some more hints? Thanks.
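As an aside, the two evolutions in that dump are easy to decode. A chrec {base, +, step}_1 is an affine induction variable of loop 1, and here the two chrecs generate the even and the odd indices, so the accesses can never touch the same element. A minimal brute-force check of that claim (illustrative C, not GCC code):

```c
#include <assert.h>

/* Value of the chrec {base, +, step} at iteration k.  */
static int chrec_value (int base, int step, int k)
{
  return base + step * k;
}

/* Do {b1,+,s}_1 and {b2,+,s}_1 ever produce the same index within
   the first n iterations?  For the dump above, {0,+,2} generates the
   even indices and {1,+,2} the odd ones, so a[i_16] and a[_1] are
   provably disjoint.  */
static int chrecs_overlap (int b1, int b2, int s, int n)
{
  for (int k = 0; k < n; k++)
    for (int j = 0; j < n; j++)
      if (chrec_value (b1, s, k) == chrec_value (b2, s, j))
        return 1;
  return 0;
}
```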
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 JuzheZhong changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #34 from JuzheZhong --- Fixed.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #20 from JuzheZhong --- (In reply to Robin Dapp from comment #19)
> What seems odd to me is that in fre5 we simplify
>
>   _429 = .COND_SHL (mask_patt_205.47_276, vect_cst__262, vect_cst__262, { 0, ... });
>   vect_prephitmp_129.51_282 = _429;
>   vect_iftmp.55_287 = VEC_COND_EXPR vect_prephitmp_129.51_282, vect_cst__262>;
>
> to
>
>   Applying pattern match.pd:9607, gimple-match-10.cc:3817
>   gimple_simplified to vect_iftmp.55_287 = .COND_SHL (mask_patt_205.47_276, vect_cst__262, vect_cst__262, { 0, ... });
>
> so fold
>
>   vec_cond (mask209, prephitmp129, vect_cst262)
>   with prephitmp129 = cond_shl (mask205, vect_cst262, vect_cst262, 0)
>
> into
>
>   cond_shl = (mask205, vect_cst262, vect_cst262, 0)?
>
> That doesn't look valid to me because the vec_cond's else value
> (vect_cst262) gets lost. Wouldn't such a simplification have a conditional
> else value? Like !mask1 ? else1 : else2 instead of else2 unconditionally?

Does ARM SVE have the same issue too? I think we should be using the same folding optimization as ARM SVE.
[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395 --- Comment #8 from JuzheZhong --- Hi, Richard. Now I have found the time for GCC vectorization optimization. I see this case:

  _2 = a[_1];
  ...
  a[i_16] = _4;
  ...
  _7 = a[_1];   ---> This load should be eliminated, reusing _2.

Am I right? Could you guide me as to which pass should do this CSE optimization? Thanks.
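A plain-C sketch of the redundancy in question (illustrative names, not the actual test case): the second load can reuse the first only if the intervening store provably does not alias it. In GCC, this kind of redundant-load elimination is normally the job of the value-numbering passes (FRE/PRE), which have to establish the non-aliasing first.

```c
#include <assert.h>

/* Before CSE: two loads of a[j] separated by a store to a[i],
   mirroring _2 = a[_1]; a[i_16] = _4; _7 = a[_1]; above.  */
static int sum_with_reload (int *a, int i, int j, int v)
{
  int t1 = a[j];   /* _2 = a[_1]                                    */
  a[i] = v;        /* a[i_16] = _4                                  */
  int t2 = a[j];   /* _7 = a[_1], redundant iff i != j is provable  */
  return t1 + t2;
}

/* After CSE, valid only once i != j is known (which the even/odd
   index strides in the s116 loop would guarantee).  */
static int sum_cse (int *a, int i, int j, int v)
{
  int t1 = a[j];
  a[i] = v;
  return t1 + t1;
}
```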
[Bug middle-end/113166] RISC-V: Redundant move instructions in RVV intrinsic codes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113166 --- Comment #3 from JuzheZhong ---

#include <riscv_vector.h>
#include <cstdint>

template inline vuint8m1_t tail_load(void const* data);
template<> inline vuint8m1_t tail_load(void const* data) {
    uint64_t const* ptr64 = reinterpret_cast<uint64_t const*>(data);
#if 1
    const vuint64m1_t zero = __riscv_vmv_v_x_u64m1(0, __riscv_vsetvlmax_e64m1());
    vuint64m1_t v64 = __riscv_vslide1up(zero, *ptr64, __riscv_vsetvlmax_e64m1());
    return __riscv_vreinterpret_u8m1(v64);
#elif 1
    vuint64m1_t v64 = __riscv_vmv_s_x_u64m1(*ptr64, 1);
    const vuint64m1_t zero = __riscv_vmv_v_x_u64m1(0, __riscv_vsetvlmax_e64m1());
    v64 = __riscv_vslideup(v64, zero, 1, __riscv_vsetvlmax_e8m1());
    return __riscv_vreinterpret_u8m1(v64);
#elif 1
    vuint64m1_t v64 = __riscv_vle64_v_u64m1(ptr64, 1);
    const vuint64m1_t zero = __riscv_vmv_v_x_u64m1(0, __riscv_vsetvlmax_e64m1());
    v64 = __riscv_vslideup(v64, zero, 1, __riscv_vsetvlmax_e8m1());
    return __riscv_vreinterpret_u8m1(v64);
#else
    vuint8m1_t v = __riscv_vreinterpret_u8m1(__riscv_vle64_v_u64m1(ptr64, 1));
    const vuint8m1_t zero = __riscv_vmv_v_x_u8m1(0, __riscv_vsetvlmax_e8m1());
    return __riscv_vslideup(v, zero, sizeof(uint64_t), __riscv_vsetvlmax_e8m1());
#endif
}

vuint8m1_t test2(uint64_t data) {
    return tail_load();
}

GCC ASM:

test2(unsigned long):
        vsetvli a5,zero,e64,m1,ta,ma
        vmv.v.i v8,0
        vmv1r.v v9,v8
        vslide1up.vx    v8,v9,a0
        ret

LLVM ASM:

test2(unsigned long):                   # @test2(unsigned long)
        vsetvli a1, zero, e64, m1, ta, ma
        vmv.v.i v9, 0
        vslide1up.vx    v8, v9, a0
        ret
[Bug c/113666] New: RISC-V: Cost model test regression due to recent middle-end loop vectorizer changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113666 Bug ID: 113666 Summary: RISC-V: Cost model test regression due to recent middle-end loop vectorizer changes Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: juzhe.zhong at rivai dot ai Target Milestone: ---

FAIL: gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c scan-assembler-not vset
FAIL: gcc.dg/vect/costmodel/riscv/rvv/pr113281-2.c scan-assembler-not vset
FAIL: gcc.dg/vect/costmodel/riscv/rvv/pr113281-5.c scan-assembler-not vset

unsigned char a;

int main() {
  short b = a = 0;
  for (; a != 19; a++)
    if (a)
      b = 32872 >> a;

  if (b == 0)
    return 0;
  else
    return 1;
}

-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize

We expect:

        lui     a5,%hi(a)
        li      a4,19
        sb      a4,%lo(a)(a5)
        li      a0,0
        ret

However, we now have:

        vsetvli a5,zero,e8,mf4,ta,ma
        li      a6,17
        li      a3,32768
        vid.v   v2
        addiw   a3,a3,104
        vadd.vx v2,v2,a6
        lui     a1,%hi(a)
        vsetvli zero,zero,e32,m1,ta,ma
        li      a0,19
        vmv.v.x v1,a3
        vzext.vf4       v3,v2
        sb      a0,%lo(a)(a1)
        vsra.vv v1,v1,v3
        vsetvli zero,zero,e16,mf2,ta,ma
        vncvt.x.x.w     v1,v1
        vslidedown.vi   v1,v1,1
        vmv.x.s a0,v1
        snez    a0,a0
        ret

I guess it is caused by this commit: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=1a8261e047f7a2c2b0afb95716f7615cba718cd1

I don't know how to fix the RISC-V backend cost model to recover this, since the current scalar_to_vec cost is already very high; it makes no sense to keep raising it.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #15 from JuzheZhong --- Hi, Robin. I tried disabling vec_extract, and then the case passed.

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index 3b32369f68c..b61b886ef3d 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1386,7 +1386,7 @@
         (match_operand:V_VLS 1 "register_operand")
         (parallel [(match_operand 2 "nonmemory_operand")])))]
-  "TARGET_VECTOR"
+  "0"
 {
   /* Element extraction can be done by sliding down the requested element
      to index 0 and then v(f)mv.[xf].s it to a scalar register.  */

I am not so familiar with the vec_extract stuff; could you take a look at it?
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #13 from JuzheZhong --- Ok. I found a regression between rvv-next and trunk. I believe it is GCC-12 vs GCC-14.

rvv-next:

...
.L11:
        li      t1,31
        mv      a2,a1
        bleu    a7,t1,.L12
        bne     a6,zero,.L13
        li      a5,1
        subw    a5,a5,a3
        andi    a5,a5,0xff
        vsetvli a4,zero,e64,m1,ta,mu
        vmv.v.i v24,0
        vmv.v.x v27,a1
        vmv1r.v v26,v24
.L14:
        vsetvli a3,a5,e64,m1,tu,mu
        sub     a5,a5,a3
        vmsne.vi        v0,v27,0
        vmerge.vim      v25,v26,1,v0
        vadd.vv v24,v24,v25
        bne     a5,zero,.L14
        vsetvli a5,zero,e64,m1,ta,mu
        vmv.s.x v25,zero
        li      a3,0
        vredsum.vs      v25,v24,v25
        vmv.x.s a5,v25
        j       .L17
...

Trunk RVV GCC:

.L8:
        lui     a0,%hi(h)
        lb      a4,%lo(h)(a0)
        bgt     a4,zero,.L37
        lui     a5,%hi(f)
        lh      t1,%lo(f)(a5)
        lui     a3,%hi(g)
        addi    a3,a3,%lo(g)
        lw      a6,4(a3)
        not     a1,a6
        slliw   a5,t1,3
        srai    a1,a1,63
        subw    a5,a5,t1
        lw      a7,32(a3)
        and     a1,a6,a1
        addiw   a2,a5,1
        bne     a7,zero,.L13
        bne     t1,zero,.L14
        mv      a5,a6
        blt     a6,zero,.L44
.L15:
        li      a3,31
        sext.w  a2,a5
        bleu    a6,a3,.L16
        li      a3,1
.L20:
        addiw   a5,a4,1
        bgt     a6,zero,.L45
        slliw   a4,a5,24
        sraiw   a4,a4,24
        bne     a4,a3,.L20
        li      a5,0
        li      a2,0
        j       .L19
.L37:
        lui     a5,%hi(c)
.L11:
        lw      a0,%lo(c)(a5)
        addi    a0,a0,-6
        snez    a0,a0
        ret

I don't think it will affect the correctness, but these are interesting observations.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #11 from JuzheZhong --- (In reply to Robin Dapp from comment #10)
> The compile farm machine I'm using doesn't have SVE.
> Compiling with -march=armv8-a -O3 pr113607.c -fno-vect-cost-model and
> running it returns 0 (i.e. ok).
>
> pr113607.c:35:5: note: vectorized 3 loops in function.

Ok. Thanks. I just checked rvv-next, which has similar vectorized IR to upstream RVV GCC, but rvv-next returns 0. I will investigate the difference between them.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #9 from JuzheZhong --- Hi, Robin. Could you try this case on the latest ARM SVE, with -march=armv8-a+sve -O3 -fno-vect-cost-model? I want to make sure first that it is not a middle-end bug. The RVV vectorized IR is the same as ARM SVE's. Thanks.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #8 from JuzheZhong --- Ok. I can reproduce it too. I am gonna work on fixing it. Thanks.
[Bug c/113608] New: RISC-V: Vector spills after enabling vector abi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113608 Bug ID: 113608 Summary: RISC-V: Vector spills after enabling vector abi Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: juzhe.zhong at rivai dot ai Target Milestone: ---

https://godbolt.org/z/srdd4qhdc

#include "riscv_vector.h"

vint32m8_t foo (int32_t *__restrict a, int32_t *__restrict b, int32_t *__restrict c,
                int32_t *__restrict a2, int32_t *__restrict b2, int32_t *__restrict c2,
                int32_t *__restrict a3, int32_t *__restrict b3, int32_t *__restrict c3,
                int32_t *__restrict a4, int32_t *__restrict b4, int32_t *__restrict c4,
                int32_t *__restrict a5, int32_t *__restrict b5, int32_t *__restrict c5,
                int32_t *__restrict d, int32_t *__restrict d2, int32_t *__restrict d3,
                int32_t *__restrict d4, int32_t *__restrict d5, int n, vint32m8_t vector)
{
  for (int i = 0; i < n; i++)
    {
      a[i] = b[i] + c[i];
      b5[i] = b[i] + c[i];
      a2[i] = b2[i] + c2[i];
      a3[i] = b3[i] + c3[i];
      a4[i] = b4[i] + c4[i];
      a5[i] = a[i] + a4[i];
      d2[i] = a2[i] + c2[i];
      d3[i] = a3[i] + c3[i];
      d4[i] = a4[i] + c4[i];
      d5[i] = a[i] + a4[i];
      a[i] = a5[i] + b5[i] + a[i];
      c2[i] = a[i] + c[i];
      c3[i] = b5[i] * a5[i];
      c4[i] = a2[i] * a3[i];
      c5[i] = b5[i] * a2[i];
      c[i] = a[i] + c3[i];
      c2[i] = a[i] + c4[i];
      a5[i] = a[i] + a4[i];
      a[i] = a[i] + b5[i]
             + a[i] * a2[i] * a3[i] * a4[i] * a5[i] * c[i] * c2[i] * c3[i]
               * c4[i] * c5[i] * d[i] * d2[i] * d3[i] * d4[i] * d5[i];
    }
  return vector;
}

This case will have vector spills after enabling the default vector ABI.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #3 from JuzheZhong --- I tried running your case with trunk GCC on SPIKE and still couldn't reproduce this issue.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #2 from JuzheZhong --- I can't reproduce this issue. Could you test it with this patch applied? https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643934.html
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #1 from JuzheZhong --- I can reproduce this issue. Could you test it with this patch applied? https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643934.html
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583 --- Comment #7 from JuzheZhong --- (In reply to rguent...@suse.de from comment #6) > On Thu, 25 Jan 2024, juzhe.zhong at rivai dot ai wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583 > > > > --- Comment #5 from JuzheZhong --- > > Both ICC and Clang X86 can vectorize SPEC 2017 lbm: > > > > https://godbolt.org/z/MjbTbYf1G > > > > But I am not sure X86 ICC is better or X86 Clang is better. > > gather/scatter are possibly slow (and gather now has that Intel > security issue). The reason is a "cost" one: > > t.c:47:21: note: ==> examining statement: _4 = *_3; > t.c:47:21: missed: no array mode for V8DF[20] > t.c:47:21: missed: no array mode for V8DF[20] > t.c:47:21: missed: the size of the group of accesses is not a power of 2 > or not equal to 3 > t.c:47:21: missed: not falling back to elementwise accesses > t.c:58:15: missed: not vectorized: relevant stmt not supported: _4 = > *_3; > t.c:47:21: missed: bad operation or unsupported loop bound. > > where we don't consider using gather because we have a known constant > stride (20). Since the stores are really scatters we don't attempt > to SLP either. > > Disabling the above heuristic we get this vectorized as well, avoiding > gather/scatter by manually implementing them and using a quite high > VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely > faster code in the end). But yes, I doubt that any of ICC or clang > vectorized codes are faster anywhere (but without specifying an > uarch you get some generic cost modelling applied). Maybe SPR doesn't > have the gather bug and it does have reasonable gather and scatter > (zen4 scatter sucks). 
>
> .L3:
>         vmovsd  952(%rax), %xmm0
>         vmovsd  -8(%rax), %xmm2
>         addq    $1280, %rsi
>         addq    $1280, %rax
>         vmovhpd -168(%rax), %xmm0, %xmm1
>         vmovhpd -1128(%rax), %xmm2, %xmm2
>         vmovsd  -648(%rax), %xmm0
>         vmovhpd -488(%rax), %xmm0, %xmm0
>         vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm0
>         vmovsd  -968(%rax), %xmm1
>         vmovhpd -808(%rax), %xmm1, %xmm1
>         vinsertf32x4    $0x1, %xmm1, %ymm2, %ymm2
>         vinsertf64x4    $0x1, %ymm0, %zmm2, %zmm2
>         vmovsd  -320(%rax), %xmm0
>         vmovhpd -160(%rax), %xmm0, %xmm1
>         vmovsd  -640(%rax), %xmm0
>         vmovhpd -480(%rax), %xmm0, %xmm0
>         vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm1
>         vmovsd  -960(%rax), %xmm0
>         vmovhpd -800(%rax), %xmm0, %xmm8
>         vmovsd  -1280(%rax), %xmm0
>         vmovhpd -1120(%rax), %xmm0, %xmm0
>         vinsertf32x4    $0x1, %xmm8, %ymm0, %ymm0
>         vinsertf64x4    $0x1, %ymm1, %zmm0, %zmm0
>         vmovsd  -312(%rax), %xmm1
>         vmovhpd -152(%rax), %xmm1, %xmm8
>         vmovsd  -632(%rax), %xmm1
>         vmovhpd -472(%rax), %xmm1, %xmm1
>         vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm8
>         vmovsd  -952(%rax), %xmm1
>         vmovhpd -792(%rax), %xmm1, %xmm9
>         vmovsd  -1272(%rax), %xmm1
>         vmovhpd -1112(%rax), %xmm1, %xmm1
>         vinsertf32x4    $0x1, %xmm9, %ymm1, %ymm1
>         vinsertf64x4    $0x1, %ymm8, %zmm1, %zmm1
>         vaddpd  %zmm1, %zmm0, %zmm0
>         vaddpd  %zmm7, %zmm2, %zmm1
>         vfnmadd132pd    %zmm3, %zmm2, %zmm1
>         vfmadd132pd     %zmm6, %zmm5, %zmm0
>         valignq $3, %ymm1, %ymm1, %ymm2
>         vmovlpd %xmm1, -1280(%rsi)
>         vextractf64x2   $1, %ymm1, %xmm8
>         vmovhpd %xmm1, -1120(%rsi)
>         vextractf64x4   $0x1, %zmm1, %ymm1
>         vmovlpd %xmm1, -640(%rsi)
>         vmovhpd %xmm1, -480(%rsi)
>         vmovsd  %xmm2, -800(%rsi)
>         vextractf64x2   $1, %ymm1, %xmm2
>         vmovsd  %xmm8, -960(%rsi)
>         valignq $3, %ymm1, %ymm1, %ymm1
>         vmovsd  %xmm2, -320(%rsi)
>         vmovsd  %xmm1, -160(%rsi)
>         vmovsd  -320(%rax), %xmm1
>         vmovhpd -160(%rax), %xmm1, %xmm2
>         vmovsd  -640(%rax), %xmm1
>         vmovhpd -480(%rax), %xmm1, %xmm1
>         vinsertf32x4    $0x1, %xmm2, %ymm1, %ymm2
>         vmovsd  -960(%rax), %xmm1
>         vmovhpd -800(%rax), %xmm1, %xmm8
>         vmovsd  -1280(%rax), %xmm1
>         vmovhpd -1120(%rax), %xmm1, %xmm1
>         vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm1
>         vinsertf64x4    $0x1, %ymm2, %zmm1, %zmm1
>         vfnmadd132pd    %zmm3, %zmm1, %zmm0
>         vaddpd  %zmm4, %zmm0, %zmm0
>         valignq $3, %ymm0, %ymm0, %ymm1
>         vmovlpd %xmm0, 14728(%rsi)
>         vextractf64x2   $1, %ymm0, %xmm2
>         vmovhpd %xmm0, 14888(%rsi)
>         vextractf64x4   $0x1, %zmm0, %ymm0
>         vmovlpd %xmm0, 15368(%rsi)
>         vmovhpd %xmm0, 15528(%rsi)
>         vmovsd  %xmm1, 15208(%rsi)
>         vextractf64x2   $1, %ymm0, %xmm1
>         vmovsd  %xmm2, 15048(%rsi)
>         valignq $3, %ymm0, %ymm0, %ymm0
>         vmovsd  %xmm1, 15688(%rsi)
>         vmovsd  %xmm0, 15848(%rsi)
>         cmpq    %rdx, %rsi
>         jne
[Bug target/113570] RISC-V: SPEC2017 549 fotonik3d miscompilation in autovec VLS 256 build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113570 --- Comment #5 from JuzheZhong --- It seems that we don't have any bugs in the current SPEC 2017 testing. So I strongly suggest the "full coverage" testing on SPEC 2017 which I mentioned in PR https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087:

-march=rv64gcv --param=riscv-autovec-lmul=m2
-march=rv64gcv --param=riscv-autovec-lmul=m4
-march=rv64gcv --param=riscv-autovec-lmul=m8
-march=rv64gcv --param=riscv-autovec-lmul=dynamic
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=dynamic
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=dynamic
-march=rv64gcv --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax

Could you trigger this testing?
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583 --- Comment #5 from JuzheZhong --- Both ICC and Clang on X86 can vectorize SPEC 2017 lbm: https://godbolt.org/z/MjbTbYf1G But I am not sure whether X86 ICC or X86 Clang is better.
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583 --- Comment #4 from JuzheZhong --- OK. Confirmed that X86 GCC fails to vectorize it, whereas X86 Clang can vectorize it: https://godbolt.org/z/EaTjGbPGW The X86 Clang and RISC-V Clang IR are the same:

  %12 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %11, i32 8, <8 x i1> , <8 x double> poison), !dbg !62
  %13 = or disjoint <8 x i64> %10, , !dbg !72
  %14 = getelementptr inbounds double, ptr %0, <8 x i64> %13, !dbg !72
  %15 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %14, i32 8, <8 x i1> , <8 x double> poison), !dbg !72
  %16 = or disjoint <8 x i64> %10, , !dbg !73
  %17 = getelementptr inbounds double, ptr %0, <8 x i64> %16, !dbg !73
  %18 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %17, i32 8, <8 x i1> , <8 x double> poison), !dbg !73
  %19 = fadd <8 x double> %15, %18, !dbg !74
  %20 = fmul <8 x double> %19, , !dbg !75
  %21 = fadd <8 x double> %12, , !dbg !76
  %22 = tail call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %21, <8 x double> , <8 x double> %12), !dbg !77
  %23 = getelementptr inbounds double, ptr %1, <8 x i64> %10, !dbg !77
  tail call void @llvm.masked.scatter.v8f64.v8p0(<8 x double> %22, <8 x ptr> %23, i32 8, <8 x i1> ), !dbg !78
  %24 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %14, i32 8, <8 x i1> , <8 x double> poison), !dbg !81
  %25 = fadd <8 x double> %20, , !dbg !82
  %26 = tail call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %25, <8 x double> , <8 x double> %24), !dbg !83
  %27 = fadd <8 x double> %26, , !dbg !84
  %28 = getelementptr double, <8 x ptr> %23, i64 2001, !dbg !84
  tail call void @llvm.masked.scatter.v8f64.v8p0(<8 x double> %27, <8 x ptr> %28, i32 8, <8 x i1> ), !dbg !85

Hi, Richard. Do you have any suggestions about this issue? Thanks.
[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087 --- Comment #44 from JuzheZhong --- (In reply to Patrick O'Neill from comment #43)
> (In reply to Patrick O'Neill from comment #42)
> > I kicked off a run roughly 10 hours ago with your memory-hog fix patch
> > applied to a1b2953924c451ce90a3fdce6841b63bf05f335f. I'll post the results
> > here when the runs complete. Thanks!
>
> No new failures!
>
> zvl128b:
> no fails!
>
> zvl256b:
> 549.fotonik3d (runtime) - pr113570 (looks like this fail is since I used
> -Ofast)

Thanks. Could you trigger full-coverage testing of SPEC with the following combinations of compile options:

-march=rv64gcv --param=riscv-autovec-lmul=m2
-march=rv64gcv --param=riscv-autovec-lmul=m4
-march=rv64gcv --param=riscv-autovec-lmul=m8
-march=rv64gcv --param=riscv-autovec-lmul=dynamic
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=dynamic
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=dynamic
-march=rv64gcv --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax

I believe these can be separate tasks assigned to multiple cores or threads to run simultaneously.
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583 --- Comment #3 from JuzheZhong --- Ok, I see. If we change NN to 8, then we can vectorize it with load_lanes/store_lanes with group size = 8: https://godbolt.org/z/doe9c3hfo We will use vlseg8e64, which is RVVM1DF[8] == RVVM1x8DFmode.

Here is the report:

/app/example.c:47:21: missed: no array mode for RVVM1DF[20]
/app/example.c:47:21: missed: no array mode for RVVM1DF[20]

I believe that if we enable vec_load_lanes/vec_store_lanes for RVVM1DF[20], which is RVVM1x20DF mode, then we can vectorize it. But that is not a reasonable or general way to do it. This code requires array size = 20; other code may require array size = 21, 22, 23, etc. The array size can be any number, so we can't leverage this approach for arbitrary array sizes.

So the idea is: first try vec_load_lanes/vec_store_lanes and check whether lanes vectorization is supported for the specific array size. If not, we should be able to lower the accesses into multiple gathers/scatters or strided load/stores.
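The proposed fallback can be sketched in plain C as a semantic model (this is illustrative only, not GCC internals): a "load-lanes" access with group size G reads field f of group i as src[i*G + f], and when no array mode exists for G (e.g. G = 20), the same access decomposes into G strided loads, one per field.

```c
#include <stddef.h>
#include <assert.h>

/* Hedged sketch: lower a load-lanes access of group size `g` into `g`
   strided loads.  Each inner loop reads src[f], src[f+g], src[f+2g], ...
   which maps to one strided vector load (constant byte stride
   g * sizeof(double)), regardless of whether an array mode exists. */
static void
load_lanes_via_strided (const double *src, double *lanes,
                        size_t groups, size_t g)
{
  for (size_t f = 0; f < g; f++)        /* one strided load per field */
    for (size_t i = 0; i < groups; i++) /* the strided access pattern */
      lanes[f * groups + i] = src[i * g + f];
}
```

Each inner loop is exactly the access pattern a strided load instruction implements, so the lowering works for any group size, not just the handful that have array modes.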
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583 --- Comment #1 from JuzheZhong --- It's interesting: among Clang targets, only RISC-V can vectorize it. I think there are 2 topics:

1. Support vectorization of this code in the loop vectorizer.
2. Transform gather/scatter into strided load/store for RISC-V.

For the 2nd topic, LLVM does it in a RISC-V target-specific lowering pass: RISC-V gather/scatter lowering (riscv-gather-scatter-lowering). This is the RISC-V LLVM backend code:

  if (II->getIntrinsicID() == Intrinsic::masked_gather)
    Call = Builder.CreateIntrinsic(
        Intrinsic::riscv_masked_strided_load,
        {DataType, BasePtr->getType(), Stride->getType()},
        {II->getArgOperand(3), BasePtr, Stride, II->getArgOperand(2)});
  else
    Call = Builder.CreateIntrinsic(
        Intrinsic::riscv_masked_strided_store,
        {DataType, BasePtr->getType(), Stride->getType()},
        {II->getArgOperand(0), BasePtr, Stride, II->getArgOperand(3)});

I previously tried to support strided load/store in the GCC loop vectorizer, but it seemed to be unacceptable. Maybe we can support strided load/stores by leveraging the LLVM approach?

Btw, LLVM's RISC-V gather/scatter lowering didn't do a perfect job here:

  vid.v v8
  vmul.vx v8, v8, a3
  vsoxei64.v v10, (s2), v14

This is an ordered indexed store, which is very costly in hardware. It should be an unordered indexed store or a strided store. Anyway, I think we should first investigate how to support vectorization of lbm in the loop vectorizer.
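The equivalence behind that lowering can be shown with a scalar model (illustrative C, not LLVM or GCC code): when a gather's index vector is the affine sequence vid * stride, as produced by the vid.v/vmul.vx pair above, the gather reads base[0], base[stride], base[2*stride], ... which is precisely a strided load.

```c
#include <stddef.h>
#include <assert.h>

/* Scalar model of an indexed (gather) load: out[i] = base[idx[i]]. */
static void
gather (const double *base, const size_t *idx, double *out, size_t vl)
{
  for (size_t i = 0; i < vl; i++)
    out[i] = base[idx[i]];
}

/* Scalar model of a strided load: out[i] = base[i * stride].  When
   idx[i] == i * stride, this yields the same result as gather(),
   which is why the indexed form can be replaced by the strided one. */
static void
strided_load (const double *base, size_t stride, double *out, size_t vl)
{
  for (size_t i = 0; i < vl; i++)
    out[i] = base[i * stride];
}
```

The strided form is preferable because the hardware needs no index vector at all, only a scalar stride register.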
[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087 --- Comment #41 from JuzheZhong --- Hi, Patrick. Could you trigger testing again based on the latest trunk GCC? We have a recent memory-hog fix patch: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=3132d2d36b4705bb762e61b1c8ca4da7c78a8321 I want to make sure it doesn't cause a regression on SPEC. I have tested it with the full-coverage GCC testsuite with no regressions, but I want to know about SPEC 2017.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #24 from JuzheZhong --- (In reply to Richard Biener from comment #19)
> (In reply to Richard Biener from comment #18)
> > (In reply to Tamar Christina from comment #17)
> > > Ok, bisected to
> > >
> > > g:2efe3a7de0107618397264017fb045f237764cc7 is the first bad commit
> > > commit 2efe3a7de0107618397264017fb045f237764cc7
> > > Author: Hao Liu
> > > Date: Wed Dec 6 14:52:19 2023 +0800
> > >
> > >     tree-optimization/112774: extend the SCEV CHREC tree with a
> > >     nonwrapping flag
> > >
> > > Before this commit we were unable to analyse the stride of the access.
> > > After this niters seems to estimate the loop trip count at 4 and after
> > > that the logs diverge enormously.
> >
> > Hum, but that's backward and would match to what I said in comment#2 - we
> > should get better code with that.
> >
> > Juzhe - when you revert the above ontop of trunk does the generated code
> > look better for Risc-V?
>
> It doesn't revert but you can do
>
> diff --git a/gcc/tree-scalar-evolution.cc b/gcc/tree-scalar-evolution.cc
> index 25e3130e2f1..7870c8d76fb 100644
> --- a/gcc/tree-scalar-evolution.cc
> +++ b/gcc/tree-scalar-evolution.cc
> @@ -2054,7 +2054,7 @@ analyze_scalar_evolution (class loop *loop, tree var)
>
>  void record_nonwrapping_chrec (tree chrec)
>  {
> -  CHREC_NOWRAP(chrec) = 1;
> +  CHREC_NOWRAP(chrec) = 0;
>
>    if (dump_file && (dump_flags & TDF_SCEV))
>    {

Hmmm. After experimenting, the codegen looks slightly better but still doesn't recover to the GCC-12 level. Btw, I compared the ARM SVE codegen, even with the cost model: https://godbolt.org/z/cKc1PG3dv I think the GCC 13.2 codegen is better than GCC trunk with the cost model.
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #14 from JuzheZhong --- I just tried again with both GCC-13.2 and GCC-14 with -fno-vect-cost-model: https://godbolt.org/z/enEG3qf5K GCC-14 requires a scalar epilogue loop, whereas GCC-13.2 doesn't. I believe it's not a cost model issue.
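For context on what "requires a scalar epilogue loop" means, here is a hand-written sketch of the shape involved (illustrative only, not the testcase from this PR): when the trip count is not a multiple of the vectorization factor, the vectorizer must emit a scalar tail, whereas a fully masked or length-controlled vector loop needs no tail at all.

```c
#include <stddef.h>
#include <assert.h>

/* Hedged sketch: vectorized main loop (VF = 4, here written as an
   unrolled scalar body) followed by the scalar epilogue that handles
   the remaining n % 4 iterations.  A masked/length-controlled loop
   would fold the tail into the vector body and drop the epilogue. */
static void
add_one (int *a, size_t n)
{
  size_t i = 0;
  for (; i + 4 <= n; i += 4)     /* vector body, 4 lanes per iteration */
    for (size_t l = 0; l < 4; l++)
      a[i + l] += 1;
  for (; i < n; i++)             /* scalar epilogue: n % 4 iterations  */
    a[i] += 1;
}
```

The regression being discussed is that GCC-14 falls back to the two-loop shape where GCC-13.2 managed the single tail-less form.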
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #12 from JuzheZhong --- (In reply to Richard Biener from comment #11)
> (In reply to Tamar Christina from comment #9)
> > There is a weird costing going on in the PHI nodes though:
> >
> > m_108 = PHI 1 times vector_stmt costs 0 in body
> > m_108 = PHI 2 times scalar_to_vec costs 0 in prologue
> >
> > they have collapsed to 0. which can't be right..
>
> Note this is likely because of the backend going wrong.
>
> bool
> vectorizable_phi (vec_info *,
>                   stmt_vec_info stmt_info, gimple **vec_stmt,
>                   slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> {
> ..
>
>   /* For single-argument PHIs assume coalescing which means zero cost
>      for the scalar and the vector PHIs.  This avoids artificially
>      favoring the vector path (but may pessimize it in some cases).  */
>   if (gimple_phi_num_args (as_a (stmt_info->stmt)) > 1)
>     record_stmt_cost (cost_vec, SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node),
>                       vector_stmt, stmt_info, vectype, 0, vect_body);
>
> You could check if we call this with sane values.

Do you mean it's a RISC-V backend cost model issue?
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #10 from JuzheZhong --- (In reply to Tamar Christina from comment #9)
> So on SVE the change is cost modelling.
>
> Bisect landed on g:33c2b70dbabc02788caabcbc66b7baeafeb95bcf which changed
> the compiler's defaults to using the new throughput matched cost modelling
> used be newer cores.
>
> It looks like this changes which mode the compiler picks for when using a
> fixed register size.
>
> This is because the new cost model (correctly) models the costs for FMAs
> and promotions.
>
> Before:
>
> array1[0][_1] 1 times scalar_load costs 1 in prologue
> (int) _2 1 times scalar_stmt costs 1 in prologue
>
> after:
>
> array1[0][_1] 1 times scalar_load costs 1 in prologue
> (int) _2 1 times scalar_stmt costs 0 in prologue
>
> and the cost goes from:
>
> Vector inside of loop cost: 125
>
> to
>
> Vector inside of loop cost: 83
>
> so far, nothing sticks out, and in fact the profitability for VNx4QI drops
> from
>
> Calculated minimum iters for profitability: 5
>
> to
>
> Calculated minimum iters for profitability: 3
>
> This causes a clash, as this is now exactly the same cost as VNx2QI which
> used to be what it preferred before.
>
> Which then leads it to pick the higher VF.
>
> In the end smaller VF shows:
>
> ;; Guessed iterations of loop 4 is 0.500488. New upper bound 1.
>
> and now we get:
>
> Vectorization factor 16 seems too large for profile prevoiusly believed to
> be consistent; reducing.
> ;; Guessed iterations of loop 4 is 0.500488. New upper bound 0.
> ;; Scaling loop 4 with scale 66.6% (guessed) to reach upper bound 0
>
> which I guess is the big difference.
>
> There is a weird costing going on in the PHI nodes though:
>
> m_108 = PHI 1 times vector_stmt costs 0 in body
> m_108 = PHI 2 times scalar_to_vec costs 0 in prologue
>
> they have collapsed to 0. which can't be right..

I don't think this change caused the regression, since the regression happens not only on ARM SVE but also on RVV. It should be a middle-end issue. I believe you'd better use -fno-vect-cost-model.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #31 from JuzheZhong ---

machine dep reorg : 403.69 ( 56%) 23.48 ( 93%) 427.17 ( 57%) 5290k ( 0%)

Confirmed: with RTL DF checking removed, LICM is no longer a compile-time hog. The VSETVL pass accounts for 56% of compile time. Even though I can't see the memory hog in GCC's -ftime-report, I can see 33G of memory usage in htop. Confirmed: both the compile-time hog and the memory hog are VSETVL pass issues. I will work on optimizing the compile time as well as the memory usage of the VSETVL pass.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #30 from JuzheZhong --- Ok. I believe m_avl_def_in && m_avl_def_out can be removed with a better algorithm; then the memory hog should be fixed soon. I am gonna rewrite avl_vl_unmodified_between_p and trigger full coverage testing, since it's going to be a big change there.
[Bug tree-optimization/113441] [13/14 Regression] Fail to fold the last element with multiple loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #8 from JuzheZhong --- I believe a change between Nov and Dec caused the regression, but I didn't continue the bisection. Hope this information can help with your bisection. Thanks.
[Bug tree-optimization/113441] [13/14 Regression] Fail to fold the last element with multiple loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #7 from JuzheZhong --- (In reply to Tamar Christina from comment #6)
> Hello,
>
> I can bisect it if you want. it should only take a few seconds.

Ok. Thanks a lot ... I spent 2 hours bisecting it manually but still couldn't locate the exact commit that causes the regression... It's great that you can bisect it easily.
[Bug tree-optimization/113441] [13/14 Regression] Fail to fold the last element with multiple loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #5 from JuzheZhong --- Confirmed at Nov 1: the regression is gone. https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=eac0917bd3d2ead4829d56c8f2769176087c7b3d This commit is ok; it has no regressions. Still bisecting manually.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #28 from JuzheZhong --- (In reply to Robin Dapp from comment #27)
> Following up on this:
>
> I'm seeing the same thing Patrick does. We create a lot of large non-sparse
> sbitmaps that amount to around 33G in total.
>
> I did local experiments replacing all sbitmaps that are not needed for LCM
> by regular bitmaps. Apart from output differences vs the original version
> the testsuite is unchanged.
>
> As expected, wrf now takes longer to compile, 8 mins vs 4ish mins before
> and we still use 2.7G of RAM for this single file (likely because of the
> remaining sbitmaps) compared to a max of 1.2ish G that the rest of the
> compilation uses.
>
> One possibility to get the best of both worlds would be to threshold based
> on num_bbs * num_exprs. Once we exceed it switch to the bitmap pass,
> otherwise keep sbitmaps for performance.
>
> Messaging with Juzhe offline, his best guess for the LICM time is that he
> enabled checking for dataflow which slows down this particular compilation
> by a lot. Therefore it doesn't look like a generic problem.

Thanks. I don't think replacing sbitmap is the best solution. Let me first disable the DF check and reproduce the 33G memory consumption on my local machine. I think the best way to optimize the memory consumption is to optimize the VSETVL pass algorithm and code. I have an idea for the optimization and am gonna work on it. Thanks for reporting.
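The num_bbs * num_exprs threshold Robin suggests is easy to reason about with a rough model (the formula below is illustrative, not measured from wrf): a dense sbitmap pays one bit per (basic block, expression) pair per dataflow property, regardless of how few bits are actually set.

```c
#include <stddef.h>
#include <assert.h>

/* Hedged model: bytes needed for dense per-block bitmaps in an
   LCM-style dataflow problem.  Each of `nprops` properties keeps one
   bit per (basic block, expression) pair, so the cost is independent
   of sparsity -- which is why huge functions like wrf blow up. */
static size_t
dense_bitmap_bytes (size_t num_bbs, size_t num_exprs, size_t nprops)
{
  size_t bytes_per_bb = (num_exprs + 7) / 8; /* round bits up to bytes */
  return nprops * num_bbs * bytes_per_bb;
}
```

For example, 100,000 blocks x 100,000 expressions x 8 properties is already ~10 GB of dense storage, while a sparse bitmap would only pay for the bits that are set, hence the proposed size-based switch between the two representations.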
[Bug target/113420] risc-v vector: ICE when using C compiler compile C++ RVV intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113420 JuzheZhong changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #2 from JuzheZhong --- Fixed.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #25 from JuzheZhong --- The RISC-V backend memory-hog issue is fixed, but the compile-time hog in LICM is still there, so keep this PR open.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #22 from JuzheZhong --- (In reply to Richard Biener from comment #21)
> I once tried to avoid df_reorganize_refs and/or optimize this with the
> blocks involved but failed.

I am considering whether we should disable LICM for RISC-V by default if vector is enabled, since a 10x compile-time explosion is really horrible.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #19 from JuzheZhong --- (In reply to JuzheZhong from comment #18)
> Hi, Robin.
>
> I have a fix patch for the memory hog:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643418.html
>
> I will commit it after the testing.
>
> But the compile-time hog still exists, which is the loop invariant motion
> pass.
>
> With -fno-move-loop-invariants, we become quite a lot faster.
>
> Could you take a look at it?

Note that with the default -march=rv64gcv_zvl256b -O3:

real 63m18.771s
user 60m19.036s
sys 2m59.787s

But with -march=rv64gcv_zvl256b -O3 -fno-move-loop-invariants:

real 6m52.984s
user 6m42.473s
sys 0m10.375s

10 times faster without loop invariant motion.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #18 from JuzheZhong --- Hi, Robin. I have a fix patch for the memory hog: https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643418.html I will commit it after the testing. But the compile-time hog still exists, which is the loop invariant motion pass. With -fno-move-loop-invariants, we become quite a lot faster. Could you take a look at it?
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #17 from JuzheZhong --- Ok. Confirmed: the original test goes from 33383M -> 4796k now.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #16 from JuzheZhong --- (In reply to Andrew Pinski from comment #15)
> (In reply to JuzheZhong from comment #14)
> > Oh. I know the reason now.
> >
> > The issue is not the RISC-V backend VSETVL pass.
> >
> > It's a memory bug in rtx_equal_p, I think.
>
> It is not rtx_equal_p but rather RVV_VLMAX which is defined as:
> riscv-protos.h:#define RVV_VLMAX gen_rtx_REG (Pmode, X0_REGNUM)
>
> Seems like you could cache that somewhere ...

Oh, that makes sense to me. Thank you so much. I think the memory-hog issue will be fixed soon, but the compile-time hog in loop invariant motion is still not fixed.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #14 from JuzheZhong --- Oh, I know the reason now. The issue is not the RISC-V backend VSETVL pass. It's a memory bug around rtx_equal_p, I think. We are calling rtx_equal_p, which is very costly. For example, has_nonvlmax_reg_avl calls rtx_equal_p. So I kept all the code unchanged and just replaced the comparison as follows:

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 93a1238a5ab..1c85c8ee3c6 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -4988,7 +4988,7 @@ nonvlmax_avl_type_p (rtx_insn *rinsn)
 bool
 vlmax_avl_p (rtx x)
 {
-  return x && rtx_equal_p (x, RVV_VLMAX);
+  return x && REG_P (x) && REGNO (x) == X0_REGNUM/*rtx_equal_p (x, RVV_VLMAX)*/;
 }

Using REGNO (x) == X0_REGNUM instead of rtx_equal_p, the memory-hog issue is gone: 939M -> 725k. So I am gonna send a patch to work around the rtx_equal_p issue which causes the memory hog.
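Pinski's observation in comment #15 can be illustrated with a tiny C model (illustrative only, not GCC code; the register-number value is an assumption): a macro that allocates a fresh object every time it is expanded makes each deep-equality check allocate garbage, while comparing the discriminating field directly needs no allocation at all.

```c
#include <stdlib.h>
#include <assert.h>

/* Hedged model of the bug.  `make_reg` stands in for gen_rtx_REG, and
   VLMAX_REGNO for X0_REGNUM (value assumed for illustration). */
struct reg { int regno; };

static long g_allocs; /* counts allocations to expose the difference */

static struct reg *
make_reg (int regno)
{
  g_allocs++;
  struct reg *r = malloc (sizeof *r);
  r->regno = regno;
  return r;
}

#define VLMAX_REGNO 0

/* Before: build a fresh reference object per call, then deep-compare,
   like rtx_equal_p (x, RVV_VLMAX) with RVV_VLMAX expanding to a new
   allocation each time. */
static int
vlmax_p_slow (const struct reg *x)
{
  struct reg *vlmax = make_reg (VLMAX_REGNO); /* allocation per call! */
  int eq = x && x->regno == vlmax->regno;
  free (vlmax);
  return eq;
}

/* After: compare the discriminating field directly -- no allocation,
   same answer. */
static int
vlmax_p_fast (const struct reg *x)
{
  return x && x->regno == VLMAX_REGNO;
}
```

Both predicates agree on every input; only the slow one bumps the allocation counter, which is exactly the per-query garbage the patch eliminates.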
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #13 from JuzheZhong --- So I think we should investigate why calling has_nonvlmax_reg_avl costs so much memory.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #12 from JuzheZhong --- Ok. Here is a simple fix which gives some hints:

diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index 2067073185f..ede818140dc 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -2719,10 +2719,11 @@ pre_vsetvl::compute_lcm_local_properties ()
   for (int i = 0; i < num_exprs; i += 1)
     {
       const vsetvl_info = *m_exprs[i];
-      if (!info.has_nonvlmax_reg_avl () && !info.has_vl ())
+      bool has_nonvlmax_reg_avl_p = info.has_nonvlmax_reg_avl ();
+      if (!has_nonvlmax_reg_avl_p && !info.has_vl ())
         continue;

-      if (info.has_nonvlmax_reg_avl ())
+      if (has_nonvlmax_reg_avl_p)
         {
           unsigned int regno;
           sbitmap_iterator sbi;
@@ -3556,7 +3557,7 @@ const pass_data pass_data_vsetvl = {
   RTL_PASS, /* type */
   "vsetvl", /* name */
   OPTGROUP_NONE, /* optinfo_flags */
-  TV_NONE, /* tv_id */
+  TV_MACH_DEP, /* tv_id */
   0, /* properties_required */
   0, /* properties_provided */
   0, /* properties_destroyed */

Memory usage goes from 931M -> 781M, a significant reduction. Note that I didn't change all the has_nonvlmax_reg_avl call sites; we have many places calling has_nonvlmax_reg_avl...
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #11 from JuzheZhong --- It should be compute_lcm_local_properties. The memory usage reduces by 50% after I remove this function. I am still investigating.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #10 from JuzheZhong --- No, it's not caused here. I removed the whole function compute_avl_def_data. The memory usage doesn't change.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #6 from JuzheZhong --- (In reply to Andrew Pinski from comment #5)
> Note "loop invariant motion" is the RTL based loop invariant motion pass.

So you mean it should still be a RISC-V issue, right?
[Bug tree-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #4 from JuzheZhong --- Also, the original file with -fno-move-loop-invariants reduces compile time from 60 minutes to 7 minutes:

real 7m12.528s
user 6m55.214s
sys 0m17.147s

machine dep reorg : 75.93 ( 18%) 14.23 ( 88%) 90.15 ( 21%) 33383M ( 95%)

The memory report is quite clear: it consumes 95% of memory. So I believe the VSETVL pass is not the main reason for the compile-time hog; it should be the loop invariant pass. But the VSETVL pass is the main reason for the memory hog. I am not familiar with the loop invariant pass. Can anyone help debug the compile-time hog of the loop invariant pass? Or should we disable the loop invariant pass by default for RISC-V?
[Bug tree-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #3 from JuzheZhong --- Ok. The reduced case: # 1 "module_first_rk_step_part1.fppized.f90" # 1 "" # 1 "" # 1 "module_first_rk_step_part1.fppized.f90" !WRF:MEDIATION_LAYER:SOLVER MODULE module_first_rk_step_part1 CONTAINS SUBROUTINE first_rk_step_part1 ( grid , config_flags & , moist , moist_tend & , chem , chem_tend& , tracer, tracer_tend & , scalar , scalar_tend & , fdda3d, fdda2d & , aerod& , ru_tendf, rv_tendf & , rw_tendf, t_tendf& , ph_tendf, mu_tendf & , tke_tend & , adapt_step_flag , curr_secs & , psim , psih , wspd , gz1oz0 , br , chklowq & , cu_act_flag , hol , th_phy & , pi_phy , p_phy , t_phy & , dz8w , p8w , t8w & , ids, ide, jds, jde, kds, kde & , ims, ime, jms, jme, kms, kme & , ips, ipe, jps, jpe, kps, kpe & , imsx,imex,jmsx,jmex,kmsx,kmex& , ipsx,ipex,jpsx,jpex,kpsx,kpex& , imsy,imey,jmsy,jmey,kmsy,kmey& , ipsy,ipey,jpsy,jpey,kpsy,kpey& , k_start , k_end & , f_flux & ) USE module_state_description USE module_model_constants USE module_domain, ONLY : domain, domain_clock_get, get_ijk_from_subgrid USE module_configure, ONLY : grid_config_rec_type, model_config_rec USE module_radiation_driver, ONLY : pre_radiation_driver, radiation_driver USE module_surface_driver, ONLY : surface_driver USE module_cumulus_driver, ONLY : cumulus_driver USE module_shallowcu_driver, ONLY : shallowcu_driver USE module_pbl_driver, ONLY : pbl_driver USE module_fr_fire_driver_wrf, ONLY : fire_driver_em_step USE module_fddagd_driver, ONLY : fddagd_driver USE module_em, ONLY : init_zero_tendency USE module_force_scm USE module_convtrans_prep USE module_big_step_utilities_em, ONLY : phy_prep use module_scalar_tables USE module_dm, ONLY : local_communicator, mytask, ntasks, ntasks_x, ntasks_y, local_communicator_periodic, wrf_dm_maxval USE module_comm_dm, ONLY : halo_em_phys_a_sub,halo_em_fdda_sfc_sub,halo_pwp_sub,halo_em_chem_e_3_sub, & halo_em_chem_e_5_sub, halo_em_hydro_noahmp_sub USE module_utility IMPLICIT NONE 
TYPE ( domain ), INTENT(INOUT) :: grid TYPE ( grid_config_rec_type ), INTENT(IN) :: config_flags TYPE(WRFU_Time):: currentTime INTEGER, INTENT(IN) :: ids, ide, jds, jde, kds, kde, & ims, ime, jms, jme, kms, kme, & ips, ipe, jps, jpe, kps, kpe, & imsx,imex,jmsx,jmex,kmsx,kmex,& ipsx,ipex,jpsx,jpex,kpsx,kpex,& imsy,imey,jmsy,jmey,kmsy,kmey,& ipsy,ipey,jpsy,jpey,kpsy,kpey LOGICAL ,INTENT(IN):: adapt_step_flag REAL, INTENT(IN) :: curr_secs REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_moist),INTENT(INOUT) :: moist REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_moist),INTENT(INOUT) :: moist_tend REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_chem),INTENT(INOUT) :: chem REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_chem),INTENT(INOUT) :: chem_tend REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_tracer),INTENT(INOUT) :: tracer REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_tracer),INTENT(INOUT) :: tracer_tend REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_scalar),INTENT(INOUT) :: scalar REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_scalar),INTENT(INOUT) :: scalar_tend REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_fdda3d),INTENT(INOUT) :: fdda3d REAL,DIMENSION(ims:ime,1:1,jms:jme,num_fdda2d),INTENT(INOUT) :: fdda2d REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_aerod),INTENT(INOUT) :: aerod REAL,DIMENSION(ims:ime,jms:jme), INTENT(INOUT) :: psim REAL,DIMENSION(ims:ime,jms:jme), INTENT(INOUT) :: psih REAL,DIMENSION(ims:ime,jms:jme), INTENT(INOUT) :: wspd REAL
[Bug tree-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #2 from JuzheZhong --- To build the attachment file, we need the following files from SPEC2017:

module_big_step_utilities_em.mod
module_cumulus_driver.mod
module_fddagd_driver.mod
module_model_constants.mod
module_shallowcu_driver.mod
module_comm_dm.mod
module_dm.mod
module_first_rk_step_part1.mod
module_pbl_driver.mod
module_state_description.mod
module_configure.mod
module_domain.mod
module_force_scm.mod
module_radiation_driver.mod
module_surface_driver.mod
module_convtrans_prep.mod
module_em.mod
module_fr_fire_driver_wrf.mod
module_scalar_tables.mod
module_utility.mod

But I failed to create an attachment for them since they are too big.
[Bug tree-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #1 from JuzheZhong --- Created attachment 57149
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57149&action=edit
spec2017 wrf
[Bug c/113495] New: RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

Bug ID: 113495
Summary: RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: juzhe.zhong at rivai dot ai
Target Milestone: ---

riscv64-unknown-linux-gnu-gfortran -march=rv64gcv_zvl256b -O3 -S -ftime-report

real 63m18.771s
user 60m19.036s
sys 2m59.787s

60+ minutes. After investigation, the time report shows 2 passes are critical:

loop invariant motion : 2600.28 ( 72%) 1.68 ( 1%) 2602.12 ( 69%) 2617k ( 0%)

Loop invariant motion consumes most of the time: 72%. The other is the VSETVL pass:

vsetvl: earliest_fuse_vsetvl_info : 438.26 ( 12%) 79.82 ( 47%) 518.08 ( 14%) 221807M ( 75%)
vsetvl: pre_global_vsetvl_info : 135.98 ( 4%) 31.71 ( 19%) 167.69 ( 4%) 71950M ( 24%)

Phase 2 and phase 3 of the VSETVL pass consume 16% of the time and 99% of the memory. I will look into the VSETVL pass issue, but I am not able to take care of the loop invariant issue.
[Bug middle-end/113166] RISC-V: Redundant move instructions in RVV intrinsic codes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113166 --- Comment #2 from JuzheZhong --- #include #if TO_16 # define uintOut_t uint16_t # define utf8_to_utf32_scalar utf8_to_utf16_scalar # define utf8_to_utf32_rvv utf8_to_utf16_rvv #else # define uintOut_t uint32_t #endif size_t utf8_to_utf32_scalar(char const *src, size_t count, uintOut_t *dest); size_t utf8_to_utf32_rvv(char const *src, size_t count, uintOut_t *dest) { size_t tail = 3; if (count < tail) return utf8_to_utf32_scalar(src, count, dest); /* validate first three bytes */ { size_t idx = tail; while (idx < count && (src[idx] >> 6) == 0b10) ++idx; uintOut_t buf[10]; if (idx > tail + 3 || !utf8_to_utf32_scalar(src, idx, buf)) return 0; } size_t n = count - tail; uintOut_t *destBeg = dest; static const uint64_t err1m[] = { 0x0202020202020202, 0x4915012180808080 }; static const uint64_t err2m[] = { 0xcbcbcb8b8383a3e7, 0xcbcbdbcbcbcbcbcb }; static const uint64_t err3m[] = { 0x0101010101010101, 0x01010101babaaee6 }; const vuint8m1_t err1tbl = __riscv_vreinterpret_v_u64m1_u8m1(__riscv_vle64_v_u64m1(err1m, 2)); const vuint8m1_t err2tbl = __riscv_vreinterpret_v_u64m1_u8m1(__riscv_vle64_v_u64m1(err2m, 2)); const vuint8m1_t err3tbl = __riscv_vreinterpret_v_u64m1_u8m1(__riscv_vle64_v_u64m1(err3m, 2)); const vuint8m2_t v64u8m2 = __riscv_vmv_v_x_u8m2(1<<6, __riscv_vsetvlmax_e8m2()); const size_t vl8m1 = __riscv_vsetvlmax_e8m1(); const size_t vl16m2 = __riscv_vsetvlmax_e16m2(); #if TO_16 size_t vl8m2 = __riscv_vsetvlmax_e8m2(); const vbool4_t m4odd = __riscv_vmsne_vx_u8m2_b4(__riscv_vand_vx_u8m2(__riscv_vid_v_u8m2(vl8m2), 1, vl8m2), 0, vl8m2); #endif for (size_t vl, vlOut; n > 0; n -= vl, src += vl, dest += vlOut) { vl = __riscv_vsetvl_e8m2(n); vuint8m2_t v0 = __riscv_vle8_v_u8m2((uint8_t const*)src, vl); uint64_t max = __riscv_vmv_x_s_u8m1_u8(__riscv_vredmaxu_vs_u8m2_u8m1(v0, __riscv_vmv_s_x_u8m1(0, vl), vl)); /* fast path: ASCII */ if (max < 0b1000) { vlOut = vl; #if TO_16 __riscv_vse16_v_u16m4(dest, 
__riscv_vzext_vf2_u16m4(v0, vlOut), vlOut); #else __riscv_vse32_v_u32m8(dest, __riscv_vzext_vf4_u32m8(v0, vlOut), vlOut); #endif continue; } /* see "Validating UTF-8 In Less Than One Instruction Per Byte" * https://arxiv.org/abs/2010.03090 */ vuint8m2_t v1 = __riscv_vslide1down_vx_u8m2(v0, src[vl+0], vl); vuint8m2_t v2 = __riscv_vslide1down_vx_u8m2(v1, src[vl+1], vl); vuint8m2_t v3 = __riscv_vslide1down_vx_u8m2(v2, src[vl+2], vl); vuint8m2_t s1 = __riscv_vreinterpret_v_u16m2_u8m2(__riscv_vsrl_vx_u16m2(__riscv_vreinterpret_v_u8m2_u16m2(v2), 4, vl16m2)); vuint8m2_t s3 = __riscv_vreinterpret_v_u16m2_u8m2(__riscv_vsrl_vx_u16m2(__riscv_vreinterpret_v_u8m2_u16m2(v3), 4, vl16m2)); vuint8m2_t idx2 = __riscv_vand_vx_u8m2(v2, 0xf, vl); vuint8m2_t idx1 = __riscv_vand_vx_u8m2(s1, 0xf, vl); vuint8m2_t idx3 = __riscv_vand_vx_u8m2(s3, 0xf, vl); #define VRGATHER_u8m1x2(tbl, idx) \ __riscv_vset_v_u8m1_u8m2(__riscv_vlmul_ext_v_u8m1_u8m2( \ __riscv_vrgather_vv_u8m1(tbl, __riscv_vget_v_u8m2_u8m1(idx, 0), vl8m1)), 1, \ __riscv_vrgather_vv_u8m1(tbl, __riscv_vget_v_u8m2_u8m1(idx, 1), vl8m1)); vuint8m2_t err1 = VRGATHER_u8m1x2(err1tbl, idx1); vuint8m2_t err2 = VRGATHER_u8m1x2(err2tbl, idx2); vuint8m2_t err3 = VRGATHER_u8m1x2(err3tbl, idx3); vuint8m2_t errs = __riscv_vand_vv_u8m2(__riscv_vand_vv_u8m2(err1, err2, vl), err3, vl); vbool4_t is_3 = __riscv_vmsgtu_vx_u8m2_b4(v1, 0b1110-1, vl); vbool4_t is_4 = __riscv_vmsgtu_vx_u8m2_b4(v0, 0b-1, vl); vbool4_t is_34 = __riscv_vmor_mm_b4(is_3, is_4, vl); vbool4_t err34 = __riscv_vmxor_mm_b4(is_34, __riscv_vmsgtu_vx_u8m2_b4(errs, 0b0111, vl), vl); vbool4_t errm = __riscv_vmor_mm_b4(__riscv_vmsgt_vx_i8m2_b4(__riscv_vreinterpret_v_u8m2_i8m2(errs), 0, vl), err34, vl); if (__riscv_vfirst_m_b4(errm , vl) >= 0) return 0; /* decoding */ /* mask of non continuation bytes */ vbool4_t m = __riscv_vmsne_vx_u8m2_b4(__riscv_vsrl_vx_u8m2(v0, 6, vl), 0b10, vl); vlOut = __riscv_vcpop_m_b4(m, vl); /* extract first and second bytes */
[Bug c/113474] RISC-V: Fail to use vmerge.vim for constant vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113474 --- Comment #2 from JuzheZhong --- Oh, it's a pretty simple fix. I am not sure whether the Richards will allow it since it's stage 4, but it's worth a try. Could you send a patch?
[Bug c/113474] New: RISC-V: Fail to use vmerge.vim for constant vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113474

Bug ID: 113474
Summary: RISC-V: Fail to use vmerge.vim for constant vector
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: juzhe.zhong at rivai dot ai
Target Milestone: ---

void foo (int n, int **__restrict a)
{
  int b;
  int c;
  int d;
  for (b = 0; b < n; b++)
    for (long e = 8; e > 0; e--)
      a[b][e] = a[b][e] == 15;
}

ASM:

foo:
	ble a0,zero,.L5
	slli a3,a0,3
	add a3,a1,a3
	vsetivli zero,4,e32,m1,ta,ma
	vmv.v.i v3,1          -> redundant
	vmv.v.i v2,0
.L3:
	ld a5,0(a1)
	addi a4,a5,4
	addi a5,a5,20
	vle32.v v1,0(a5)
	vle32.v v0,0(a4)
	vmseq.vi v0,v0,15
	vmerge.vvm v4,v2,v3,v0   -> It should be vmerge.vim
	vse32.v v4,0(a4)
	vmseq.vi v0,v1,15
	addi a1,a1,8
	vmerge.vvm v1,v2,v3,v0   -> It should be vmerge.vim
	vse32.v v1,0(a5)
	bne a1,a3,.L3
.L5:
	ret

It's odd that we can generate vmseq.vi but fail to generate vmerge.vim. Looking into the patterns of vcond_mask:

(define_insn_and_split "vcond_mask_"
  [(set (match_operand:V_VLS 0 "register_operand")
	(if_then_else:V_VLS
	  (match_operand: 3 "register_operand")
	  (match_operand:V_VLS 1 "nonmemory_operand")   -> relax the predicate
	  (match_operand:V_VLS 2 "register_operand")))]

Why doesn't GCC fold the const_vector into operand 1?
[Bug target/113429] RISC-V: SPEC2017 527 cam4 miscompilation in autovec VLA build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113429 --- Comment #10 from JuzheZhong --- I have committed the V3 patch with rebasing, since the V2 patch conflicted with trunk. I think you can use trunk GCC to validate CAM4 directly now.