[Bug middle-end/113474] RISC-V: Fail to use vmerge.vim for constant vector

2024-05-17 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113474

JuzheZhong  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #4 from JuzheZhong  ---
Fixed

[Bug target/115093] RISC-V Vector ICE in extract_insn: unrecognizable insn

2024-05-15 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115093

JuzheZhong  changed:

   What|Removed |Added

 CC||juzhe.zhong at rivai dot ai

--- Comment #1 from JuzheZhong  ---
I think it's fixed by:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=44e7855e4e817a7f5a1e332cd95e780e57052dba

Confirmed on compiler explorer:

https://godbolt.org/z/qf5GzoKre

[Bug c/115104] RISC-V: GCC-14 can combine vsext+vadd -> vwadd but Trunk GCC (GCC 15) Failed

2024-05-15 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115104

--- Comment #1 from JuzheZhong  ---
I wonder whether the RIVOS CI has already found which commit caused this regression?

[Bug c/115104] New: RISC-V: GCC-14 can combine vsext+vadd -> vwadd but Trunk GCC (GCC 15) Failed

2024-05-15 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115104

Bug ID: 115104
   Summary: RISC-V: GCC-14 can combine vsext+vadd -> vwadd but
Trunk GCC (GCC 15) Failed
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: juzhe.zhong at rivai dot ai
  Target Milestone: ---

I notice the following regressions in testing:

FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvfwadd\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvfwadd\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvfwadd\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvfwadd\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvfwadd\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvfwadd\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwadd\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwadd\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwadd\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwadd\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwadd\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwadd\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwaddu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwaddu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwaddu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwaddu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwaddu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c scan-assembler-times \\tvwaddu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvfwsub\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvfwsub\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvfwsub\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvfwsub\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvfwsub\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvfwsub\\.vv 6
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsub\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsub\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsub\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsub\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsub\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsub\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsubu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsubu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsubu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsubu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsubu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c scan-assembler-times \\tvwsubu\\.vv 9
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-3.c scan-assembler-times \\tvfwmul\\.vv 8
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-3.c scan-assembler-times \\tvfwmul\\.vv 8
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-3.c scan-assembler-times \\tvfwmul\\.vv 8
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-3.c scan-assembler-times \\tvfwmul\\.vv 8
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-3.c scan-assembler-times \\tvfwmul\\.vv 8
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-3.c scan-assembler-times \\tvfwmul\\.vv 8
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-3.c scan-assembler-times \\tvwmul\\.vv 12
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-3.c scan-assembler-times \\tvwmul\\.vv 12
FAIL: gcc.target/riscv/rvv/autovec/widen/widen-complicate-3.c

[Bug c/115068] New: RISC-V: Illegal instruction of vfwadd

2024-05-13 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115068

Bug ID: 115068
   Summary: RISC-V: Illegal instruction of vfwadd
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: juzhe.zhong at rivai dot ai
  Target Milestone: ---

#include <riscv_vector.h>
#include <stddef.h>

vfloat64m8_t test_vfwadd_wf_f64m8_m(vbool8_t vm, vfloat64m8_t vs2, float rs1,
size_t vl) {
  return __riscv_vfwadd_wf_f64m8_m(vm, vs2, rs1, vl);
}

char global_memory[1024];
void *fake_memory = (void *)global_memory;

int main ()
{
  asm volatile("fence":::"memory");
  vfloat64m8_t vfwadd_wf_f64m8_m_vd =
test_vfwadd_wf_f64m8_m(__riscv_vreinterpret_v_i8m1_b8(__riscv_vundefined_i8m1()),
__riscv_vundefined_f64m8(), 1.0, __riscv_vsetvlmax_e64m8());
  asm volatile(""::"vr"(vfwadd_wf_f64m8_m_vd):"memory");

  return 0;
}

https://compiler-explorer.com/z/rq7K33zE5

main:
fence
lui a5,%hi(.LC0)
flw fa5,%lo(.LC0)(a5)
vsetvli a5,zero,e32,m4,ta,ma
vfwadd.wf   v0,v8,fa5,v0.t ---> vd should not be v0.
li  a0,0
ret

[Bug target/114988] RISC-V: ICE in intrinsic __riscv_vfwsub_wf_f32mf2

2024-05-08 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114988

--- Comment #2 from JuzheZhong  ---
Li Pan is going to work on it.

Hi Kito and Jeff,

Can this fix be backported to GCC-14?

[Bug c/114988] RISC-V: ICE in intrinsic __riscv_vfwsub_wf_f32mf2

2024-05-08 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114988

--- Comment #1 from JuzheZhong  ---
Ideally, it should be reported as (-march=rv64gc):

https://godbolt.org/z/3P76YEb9s


<source>: In function 'test_vfwsub_wf_f32mf2':
<source>:4:15: error: return type 'vfloat32mf2_t' requires the V ISA extension
4 | vfloat32mf2_t test_vfwsub_wf_f32mf2(vfloat32mf2_t vs2, _Float16 rs1,
size_t vl) {
  |   ^
Compiler returned: 1

[Bug c/114988] New: RISC-V: ICE in intrinsic __riscv_vfwsub_wf_f32mf2

2024-05-08 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114988

Bug ID: 114988
   Summary: RISC-V: ICE in intrinsic __riscv_vfwsub_wf_f32mf2
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: juzhe.zhong at rivai dot ai
  Target Milestone: ---

https://godbolt.org/z/ncxrx3fK9

#include <riscv_vector.h>
#include <stddef.h>

vfloat32mf2_t test_vfwsub_wf_f32mf2(vfloat32mf2_t vs2, _Float16 rs1, size_t vl)
{
  return __riscv_vfwsub_wf_f32mf2(vs2, rs1, vl);
}

with -march=rv64gcv -O3:

<source>:6:1: error: unrecognizable insn:
6 | }
  | ^
(insn 8 5 12 2 (set (reg:RVVMF2SF 134 [  ])
(if_then_else:RVVMF2SF (unspec:RVVMF64BI [
(const_vector:RVVMF64BI repeat [
(const_int 1 [0x1])
])
(reg/v:DI 137 [ vl ])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(const_int 7 [0x7])
(reg:SI 66 vl)
(reg:SI 67 vtype)
(reg:SI 69 frm)
] UNSPEC_VPREDICATE)
(minus:RVVMF2SF (reg/v:RVVMF2SF 135 [ vs2 ])
(float_extend:RVVMF2SF (vec_duplicate:RVVMF4HF (reg/v:HF 136 [
rs1 ]
(unspec:RVVMF2SF [
(reg:DI 0 zero)
] UNSPEC_VUNDEF))) "":5:10 -1
 (nil))

FP16 vectors need zvfh, so such an intrinsic should be rejected in the frontend
as an illegal intrinsic instead of causing an ICE.

With -march=rv64gcv_zvfh it can be compiled:

vsetvli zero,a0,e16,mf4,ta,ma
vfwsub.wf   v8,v8,fa0
ret

[Bug target/114887] RISC-V: expect M8 but M4 generated with dynamic LMUL for TSVC s319

2024-04-29 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114887

--- Comment #2 from JuzheZhong  ---
I think there is a too conservative analysis here:

note:   _1: type = float, start = 1, end = 6
note:   _5: type = float, start = 6, end = 8
note:   _3: type = float, start = 3, end = 7
note:   _4: type = float, start = 5, end = 6
note:   _2: type = float, start = 2, end = 3
note:   _28: type = float, start = 7, end = 9
note:   sum_18: type = real_t, start = 9, end = 9
note:   sum_26: type = real_t, start = 0, end = 9

The variables live at point 6 should be:
1. _1
2. _3
3. _4
4. sum_26

So there are 4 variables in total, and each variable occupies 8 registers at LMUL = 8.

Then the total number of live registers should be 4 * 8 = 32, which is OK for picking LMUL = 8.
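The arithmetic above can be sketched as a tiny model (illustrative only, not the actual GCC cost-model code): RVV has 32 vector registers, at LMUL = 8 each variable occupies a group of 8 registers, so 4 live variables exactly fit.

```c
#include <assert.h>

/* Simplified model of the dynamic-LMUL register-pressure check: an LMUL
   is acceptable when every live variable can get a register group,
   i.e. live_vars * lmul <= 32 (the RVV vector register count).  */
static int lmul_fits(int live_vars, int lmul)
{
  return live_vars * lmul <= 32;
}
```

With the 4 variables live at point 6, 4 * 8 = 32 still fits at LMUL = 8; a fifth live variable would force a smaller LMUL.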

[Bug target/114887] RISC-V: expect M8 but M4 generated with dynamic LMUL for TSVC s319

2024-04-29 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114887

--- Comment #1 from JuzheZhong  ---
The "vect" cost model analysis:

https://godbolt.org/z/qbqzon8x1

note:   Maximum lmul = 8, At most 40 number of live V_REG at program point 6
for bb 3

It seems that we count one more variable at program point 6?

[Bug target/114639] [riscv] ICE in create_pre_exit, at mode-switching.cc:451

2024-04-28 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114639

--- Comment #18 from JuzheZhong  ---
(In reply to Li Pan from comment #17)
> According to the V abi, looks like the asm code tries to save/restore the
> callee-saved registers when there is a call in function body.
> 
> | Name| ABI Mnemonic | Meaning  | Preserved across
> calls?
> =
> 
> | v0  |  | Argument register| No
> | v1-v7   |  | Callee-saved registers   | Yes
> | v8-v23  |  | Argument registers   | No
> | v24-v31 |  | Callee-saved registers   | Yes

I see, https://godbolt.org/z/7bx1EEdGn
When we use 44 instead of get_vl (), the load/store instructions are gone.

[Bug target/114639] [riscv] ICE in create_pre_exit, at mode-switching.cc:451

2024-04-22 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114639

--- Comment #16 from JuzheZhong  ---
This issue is not fully fixed: the patch only fixed the ICE, but there is a
regression in the codegen:

https://godbolt.org/z/4nvxeqb6K

Terrible codegen:

test(__rvv_uint64m4_t):
addi sp,sp,-16
csrr t0,vlenb
sd  ra,8(sp)
sub sp,sp,t0
vs1r.v  v1,0(sp)
sub sp,sp,t0
vs1r.v  v2,0(sp)
sub sp,sp,t0
vs1r.v  v3,0(sp)
sub sp,sp,t0
vs1r.v  v4,0(sp)
sub sp,sp,t0
vs1r.v  v5,0(sp)
sub sp,sp,t0
vs1r.v  v6,0(sp)
sub sp,sp,t0
vs1r.v  v7,0(sp)
sub sp,sp,t0
vs1r.v  v24,0(sp)
sub sp,sp,t0
vs1r.v  v25,0(sp)
sub sp,sp,t0
vs1r.v  v26,0(sp)
sub sp,sp,t0
vs1r.v  v27,0(sp)
sub sp,sp,t0
vs1r.v  v28,0(sp)
sub sp,sp,t0
vs1r.v  v29,0(sp)
sub sp,sp,t0
vs1r.v  v30,0(sp)
sub sp,sp,t0
csrr t0,vlenb
slli t1,t0,2
vs1r.v  v31,0(sp)
sub sp,sp,t1
vs4r.v  v8,0(sp)
call get_vl()
csrr t0,vlenb
slli t1,t0,2
vl4re64.v   v8,0(sp)
csrr t0,vlenb
add sp,sp,t1
vl1re64.v   v31,0(sp)
add sp,sp,t0
vl1re64.v   v30,0(sp)
add sp,sp,t0
vl1re64.v   v29,0(sp)
add sp,sp,t0
vl1re64.v   v28,0(sp)
add sp,sp,t0
vl1re64.v   v27,0(sp)
add sp,sp,t0
vl1re64.v   v26,0(sp)
add sp,sp,t0
vl1re64.v   v25,0(sp)
add sp,sp,t0
vl1re64.v   v24,0(sp)
add sp,sp,t0
vl1re64.v   v7,0(sp)
add sp,sp,t0
vl1re64.v   v6,0(sp)
add sp,sp,t0
vl1re64.v   v5,0(sp)
add sp,sp,t0
vl1re64.v   v4,0(sp)
add sp,sp,t0
vl1re64.v   v3,0(sp)
add sp,sp,t0
vl1re64.v   v2,0(sp)
add sp,sp,t0
vl1re64.v   v1,0(sp)
add sp,sp,t0
ld  ra,8(sp)
vsetvli zero,a0,e64,m4,ta,ma
vmsne.vi v0,v8,0
addi sp,sp,16
jr  ra

[Bug target/114809] [RISC-V RVV] Counting elements might be simpler

2024-04-22 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809

JuzheZhong  changed:

   What|Removed |Added

 CC||juzhe.zhong at rivai dot ai

--- Comment #3 from JuzheZhong  ---
Regarding the missed peephole optimization: I noticed it a long time ago and
filed PR:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113014

This issue will be gone after Richard Sandiford @arm merges the late-combine
pass in GCC 15.

Also, GCC supports dynamic LMUL optimization with -mrvv-max-lmul=dynamic:

https://godbolt.org/z/646nYoKbv

ASM:

count_chars(char const*, unsigned long, char):
beq a1,zero,.L4
vsetvli a4,zero,e8,m1,ta,ma
vmv.v.x v1,a2
vsetvli zero,zero,e64,m8,ta,ma
vmv.v.i v8,0
.L3:
vsetvli a5,a1,e8,m1,ta,ma
vle8.v  v0,0(a0)
sub a1,a1,a5
add a0,a0,a5
vmseq.vv v0,v0,v1
vsetvli zero,zero,e64,m8,tu,mu
vadd.vi v8,v8,1,v0.t
bne a1,zero,.L3
vsetvli a5,zero,e64,m8,ta,ma
li  a4,0
vmv.s.x v1,a4
vredsum.vs  v8,v8,v1
vmv.x.s a0,v8
ret
.L4:
li  a0,0
ret

GCC picks LMUL = 8 since, given the register pressure of the program, it doesn't
cause additional register spilling.
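For reference, here is a plain scalar version of what the vectorized loop computes (the function name and signature are assumed from the demangled label in the asm above):

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for count_chars: count how many bytes in s[0..n)
   equal c.  The RVV loop above does the same with vmseq.vv producing a
   mask and a masked vadd.vi accumulating per-lane counts at LMUL = 8,
   followed by a final vredsum.vs reduction.  */
static long count_chars(const char *s, size_t n, char c)
{
  long count = 0;
  for (size_t i = 0; i < n; i++)
    count += (s[i] == c);
  return count;
}
```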

[Bug target/114714] [RISC-V][RVV] ICE: insn does not satisfy its constraints (postreload)

2024-04-22 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114714

JuzheZhong  changed:

   What|Removed |Added

 CC||juzhe.zhong at rivai dot ai

--- Comment #6 from JuzheZhong  ---
(In reply to Robin Dapp from comment #5)
> Did anybody do some further investigation here?  Juzhe messaged me that this
> PR is the original reason for the reversal but I don't yet understand why
> the register filters don't encompass the full semantics of RVV overlap.
> 
> I looked into the test case and what happens is that, in order to determine
> the validity of the alternatives, riscv_get_v_regno_alignment is first being
> called with an M2 mode.  Our destination is actually a (subreg:RVVM2SI
> (reg:RVVM4SI ...) 0), though.  I suppose lra/reload check whether a
> non-subreg destination also works and hands us a (reg:RVVM4SI ...) as
> operand[0].  We pass this to riscv_get_v_regno_alignment which, for an LMUL4
> mode, returns 4, thus wrongly enabling the W42 alternatives.
> A W42 alternative permits hard regs % 4 == 2, which causes us to eventually
> choose vr2 as destination and source.  Once the constraints are actually
> checked we have a mismatch as none of the alternatives work.
> 
> Now I'm not at all sure how lra/reload use operand[0] here but this can
> surely be found out.  A quick and dirty hack (attached) that checks the
> insn's destination mode instead of operand[0]'s mode gets rid of the ICE and
> doesn't cause regressions.
> 
> I suppose we're too far ahead with the reversal already but I'd really have
> preferred more details.  Maybe somebody has had in-depth look but it just
> wasn't posted yet?
> 
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -6034,6 +6034,22 @@ riscv_get_v_regno_alignment (machine_mode mode)
>return lmul;
>  }
>  
> +int
> +riscv_get_dest_alignment (rtx_insn *insn, rtx operand)
> +{
> +  const_rtx set = 0;
> +  if (GET_CODE (PATTERN (insn)) == SET)
> +{
> +  set = PATTERN (insn);
> +  rtx op = SET_DEST (set);
> +  return riscv_get_v_regno_alignment (GET_MODE (op));
> +}
> +  else
> +{
> +  return riscv_get_v_regno_alignment (GET_MODE (operand));
> +}
> +}
> +
>  /* Define ASM_OUTPUT_OPCODE to do anything special before
> emitting an opcode.  */
>  const char *
> diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
> index ce1ee6b9c5e..5113daf2ac7 100644
> --- a/gcc/config/riscv/riscv.md
> +++ b/gcc/config/riscv/riscv.md
> @@ -550,15 +550,15 @@ (define_attr "group_overlap_valid" "no,yes"
>   (const_string "yes")
>  
>   (and (eq_attr "group_overlap" "W21")
> - (match_test "riscv_get_v_regno_alignment (GET_MODE
> (operands[0])) != 2"))
> + (match_test "riscv_get_dest_alignment (insn, operands[0]) !=
> 2"))
>  (const_string "no")
>  
>   (and (eq_attr "group_overlap" "W42")
> - (match_test "riscv_get_v_regno_alignment (GET_MODE
> (operands[0])) != 4"))
> + (match_test "riscv_get_dest_alignment (insn, operands[0]) !=
> 4"))
>  (const_string "no")
>  
>   (and (eq_attr "group_overlap" "W84")
> - (match_test "riscv_get_v_regno_alignment (GET_MODE
> (operands[0])) != 8"))
> + (match_test "riscv_get_dest_alignment (insn, operands[0]) !=
> 8"))
>  (const_string "no")

This hack looks good to me. But we have already reverted multiple patches (sorry
for that).

And I think we will eventually need to revert them and support register-group
overlap in a more optimal way (extending the constraints for RVV in IRA/LRA).

[Bug tree-optimization/114749] [13 Regression] RISC-V rv64gcv ICE: in vectorizable_load, at tree-vect-stmts.cc

2024-04-17 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114749

--- Comment #4 from JuzheZhong  ---
Hi, Patrick.

It seems that Richard didn't append the testcase to the patch.
Could you send a patch adding the testcase for the RISC-V port?

Thanks.

[Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spillls with -fschedule-insns

2024-04-15 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

JuzheZhong  changed:

   What|Removed |Added

 CC||juzhe.zhong at rivai dot ai

--- Comment #5 from JuzheZhong  ---
Did you try another scheduler, e.g. -fselective-scheduling, to see whether the
spill issues still exist?

[Bug target/114686] Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension

2024-04-13 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686

JuzheZhong  changed:

   What|Removed |Added

 CC||juzhe.zhong at rivai dot ai

--- Comment #2 from JuzheZhong  ---
CCing RISC-V folks who may be interested in it.

Yeah, I agree with setting dynamic LMUL as the default; I mentioned it a long
time ago. However, almost all other RISC-V folks disagree with that.

Here is data from Li Pan@intel:
https://github.com/Incarnation-p-lee/Incarnation-p-lee/blob/master/performance/coremark-pro/coremark-pro_in_k230_evb.png

Doing auto-vectorization with both LLVM and GCC (all LMULs) on coremark-pro, it
turns out dynamic LMUL is beneficial.

>> The vrgather.vv instruction should be except from that, because an LMUL=8
>> vrgather.vv is way more powerful than eight LMUL=1 vrgather.vv instructions,
>> and thus disproportionately complex to implement. When you don't need to 
>> cross
>> lanes, it's possible to unrolling LMUL=1 vrgathers manually, instead of
>> choosing a higher LMUL.

Agreed. I think for some instructions like vrgather we shouldn't pick a large
LMUL even when the register pressure of the program is OK.
We can treat a large-LMUL vrgather as expensive in the dynamic LMUL cost model
and optimize it in GCC-15.

>> vcompress.vm doesn't scale linearly with LMUL on the XuanTie chips either, 
>> but
>> a better implementation is conceivable, because the work can be better
>> distributed/subdivided. GCC currently doesn't seem to generate vcompress.vm 
>> via
>> auto-vectorization anyway: https://godbolt.org/z/Mb5Kba865

GCC may generate vcompress in auto-vectorization; in your case GCC failed to
vectorize the loop at all, which we may optimize in GCC-15.
Here are some cases where GCC generates vcompress:

https://godbolt.org/z/5GKh4eM7z

[Bug target/114639] [riscv] ICE in create_pre_exit, at mode-switching.cc:451

2024-04-08 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114639

--- Comment #6 from JuzheZhong  ---
It is definitely a regression:

https://compiler-explorer.com/z/e68x5sT9h

GCC 13.2 is OK, but GCC 14 ICEs.

I think you should bisect first.

[Bug tree-optimization/114476] [13/14 Regression] wrong code with -fwrapv -O3 -fno-vect-cost-model (and -march=armv9-a+sve2 on aarch64 and -march=rv64gcv on riscv)

2024-04-02 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114476

--- Comment #7 from JuzheZhong  ---
Hi, Robin.

Will you fix this bug?

[Bug target/114506] RISC-V: expect M8 but M4 generated with dynamic LMUL

2024-03-28 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114506

JuzheZhong  changed:

   What|Removed |Added

 CC||juzhe.zhong at rivai dot ai

--- Comment #4 from JuzheZhong  ---
(In reply to Andrew Pinski from comment #2)
> Using -fno-vect-cost-model forces the use of M8 though.
> 
> I have no idea how this cost model is trying to prove here.

We shouldn't force M8.

We support a dynamic LMUL cost model that heuristically analyzes the vector
register pressure at the SSA level, so that we can pick the optimal LMUL.

This PR shows that the RVV dynamic LMUL cost model unexpectedly picks LMUL 4
instead of LMUL 8.

So we should adjust the dynamic LMUL cost model to fix this issue.

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-21 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #19 from JuzheZhong  ---
I think it's better to add pr114396.c to the generic vect testsuite instead of
as an x86 target test, since the bug does not only happen on x86.

[Bug tree-optimization/113281] [11/12/13 Regression] Latent wrong code due to vectorization of shift reduction and missing promotions since r9-1590

2024-03-13 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281

--- Comment #28 from JuzheZhong  ---
The original cost model I wrote worked for all cases, but with some middle-end
changes the cost model now fails.

I don't have time to figure out what's going on here.

Robin may be interested in it.

[Bug middle-end/114109] x264 satd vectorization vs LLVM

2024-02-26 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #3 from JuzheZhong  ---
(In reply to Robin Dapp from comment #2)
> It is vectorized with a higher zvl, e.g. zvl512b, refer
> https://godbolt.org/z/vbfjYn5Kd.

OK, I see. But Clang generates many slide instructions, which are expensive on
real hardware.

And vluxei64 is also expensive.

I am not sure which is better. It should be tested on real RISC-V hardware to
evaluate performance, rather than simply comparing dynamic instruction counts
on SPIKE/QEMU.

[Bug middle-end/114109] x264 satd vectorization vs LLVM

2024-02-26 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #1 from JuzheZhong  ---
It seems RISC-V Clang didn't vectorize it?

https://godbolt.org/z/G4han6vM3

[Bug target/113913] [14] RISC-V: suboptimal code gen for intrinsic vcreate

2024-02-16 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113913

--- Comment #2 from JuzheZhong  ---
It's a known issue that we are trying to fix in GCC-15.

My colleague Lehua is taking care of it.

CCing Lehua.

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-02-07 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #16 from JuzheZhong  ---
The FMA is generated in widening_mul PASS:

Before widening_mul (fab1):

  _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
  _6 = _5 * 1.229982236431605997495353221893310546875e-1;
  _8 = _4 + _6;

After widening_mul:

  _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
  _8 = .FMA (_5, 1.229982236431605997495353221893310546875e-1, _4);

I think it's obvious: widening_mul chooses to transform the latter 2 statements:

  _6 = _5 * 1.229982236431605997495353221893310546875e-1;
  _8 = _4 + _6;

into:

 _8 = .FMA (_5, 1.229982236431605997495353221893310546875e-1, _4);

without any re-association.

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-02-07 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #15 from JuzheZhong  ---
(In reply to rguent...@suse.de from comment #14)
> On Wed, 7 Feb 2024, juzhe.zhong at rivai dot ai wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> > 
> > --- Comment #13 from JuzheZhong  ---
> > Ok. I found the optimized tree:
> > 
> > 
> >   _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
> >   _8 = .FMA (_5, 1.229982236431605997495353221893310546875e-1, 
> > _4);
> > 
> > Let CST0 = 3.33314829616256247390992939472198486328125e-1,
> > CST1 = 1.229982236431605997495353221893310546875e-1
> > 
> > The expression is equivalent to the following:
> > 
> > _5 = CST0 - _4;
> > _8 = _5 * CST1 + _4;
> > 
> > That is:
> > 
> > _8 = (CST0 - _4) * CST1 + _4;
> > 
> > So, We should be able to re-associate it like Clang:
> > 
> > _8 = CST0 * CST1 - _4 * CST1 + _4; ---> _8 = CST0 * CST1 + _4 * (1 - CST1);
> > 
> > Since both CST0 * CST1 and 1 - CST1 can be pre-computed during compilation
> > time.
> > 
> > Let say CST2 = CST0 * CST1, CST3 = 1 - CST1, then we can re-associate as 
> > Clang:
> > 
> > _8 = FMA (_4, CST3, CST2).
> > 
> > Any suggestions for this re-association ?  Is match.pd the right place to 
> > do it
> > ?
> 
> You need to look at the IL before we do .FMA forming, specifically 
> before/after the late reassoc pass.  There pass applying match.pd
> patterns everywhere is forwprop.
> 
> I also wonder which compilation flags you are using (note clang
> has different defaults for example for -ftrapping-math)

Both GCC and Clang are using -Ofast -ffast-math.

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-02-06 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #13 from JuzheZhong  ---
Ok. I found the optimized tree:


  _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
  _8 = .FMA (_5, 1.229982236431605997495353221893310546875e-1, _4);

Let CST0 = 3.33314829616256247390992939472198486328125e-1,
CST1 = 1.229982236431605997495353221893310546875e-1

The expression is equivalent to the following:

_5 = CST0 - _4;
_8 = _5 * CST1 + _4;

That is:

_8 = (CST0 - _4) * CST1 + _4;

So, We should be able to re-associate it like Clang:

_8 = CST0 * CST1 - _4 * CST1 + _4; ---> _8 = CST0 * CST1 + _4 * (1 - CST1);

Since both CST0 * CST1 and 1 - CST1 can be pre-computed during compilation
time.

Let say CST2 = CST0 * CST1, CST3 = 1 - CST1, then we can re-associate as Clang:

_8 = FMA (_4, CST3, CST2).

Any suggestions for this re-association ?  Is match.pd the right place to do it
?

Thanks.

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-02-06 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #12 from JuzheZhong  ---
Ok. I found it even without vectorization:

GCC is worse than Clang:

https://godbolt.org/z/addr54Gc6

GCC (14 instructions inside the loop):

fld fa3,0(a0)
fld fa5,8(a0)
fld fa1,16(a0)
fsub.d  fa4,ft2,fa3
addi a0,a0,160
fadd.d  fa5,fa5,fa1
addi a1,a1,160
addi a5,a5,160
fmadd.d fa4,fa4,fa2,fa3
fnmsub.d fa5,fa5,ft1,ft0
fsd fa4,-160(a1)
fld fa4,-152(a0)
fadd.d  fa4,fa4,fa0
fmadd.d fa5,fa5,fa2,fa4
fsd fa5,-160(a5)

Clang (12 instructions inside the loop):

fld fa1, -8(a0)
fld fa0, 0(a0)
fld ft0, 8(a0)
fmadd.d fa1, fa1, fa4, fa5
fsd fa1, 0(a1)
fld fa1, 0(a0)
fadd.d  fa0, ft0, fa0
fmadd.d fa0, fa0, fa2, fa3
fadd.d  fa1, fa0, fa1
add a4, a1, a3
fsd fa1, -376(a4)
addi a1, a1, 160
addi a0, a0, 160

The critical thing is that:

GCC has 

fsub.d  fa4,ft2,fa3
fadd.d  fa5,fa5,fa1
fmadd.d fa4,fa4,fa2,fa3
fnmsub.dfa5,fa5,ft1,ft0
fadd.d  fa4,fa4,fa0
fmadd.d fa5,fa5,fa2,fa4

6 floating-point operations.

Clang has:

fmadd.d fa1, fa1, fa4, fa5
fadd.d  fa0, ft0, fa0
fmadd.d fa0, fa0, fa2, fa3
fadd.d  fa1, fa0, fa1

Clang has 4.

The 2 extra floating-point operations are very critical to performance, I think,
since double-precision floating-point operations are usually costly on real
hardware.

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-02-04 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #11 from JuzheZhong  ---
Hi, I think this RVV codegen is the optimal codegen we want for RVV:

https://repo.hca.bsc.es/epic/z/P6QXCc

.LBB0_5:# %vector.body
sub a4, t0, a3
vsetvli t1, a4, e64, m1, ta, mu
mul a2, a3, t2
add a5, t3, a2
vlse64.v v8, (a5), t2
add a4, a6, a2
vlse64.v v9, (a4), t2
add a4, a0, a2
vlse64.v v10, (a4), t2
vfadd.vv v8, v8, v9
vfmul.vf v8, v8, fa5
vfadd.vf v9, v10, fa4
vfmadd.vf v9, fa3, v10
vlse64.v v10, (a5), t2
add a4, a1, a2
vsse64.v v9, (a4), t2
vfadd.vf v8, v8, fa2
vfmadd.vf v8, fa3, v10
vfadd.vf v8, v8, fa1
add a2, a2, a7
add a3, a3, t1
vsse64.v v8, (a2), t2
bne a3, t0, .LBB0_5

[Bug tree-optimization/113134] gcc does not version loops with early break conditions that don't have side-effects

2024-02-02 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #22 from JuzheZhong  ---
I have done this following experiment.


diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
index bf017137260..8c36cc63d3b 100644
--- a/gcc/tree-ssa-loop-ivcanon.cc
+++ b/gcc/tree-ssa-loop-ivcanon.cc
@@ -1260,6 +1260,39 @@ canonicalize_loop_induction_variables (class loop *loop,
  may_be_zero = false;
}

+  if (!exit)
+   {
+ auto_vec<edge> exits = get_loop_exit_edges (loop);
+ exit = exits[0];
+ class tree_niter_desc desc1;
+ class tree_niter_desc desc2;
+ if (number_of_iterations_exit (loop, exits[0], &desc1, false)
+ && number_of_iterations_exit (loop, exits[1], &desc2, false))
+   {
+ niter = fold_build2 (MIN_EXPR, unsigned_type_node, desc1.niter,
+  desc2.niter);
+ create_canonical_iv (loop, exit, niter);
+ gcond *cond_stmt;
+ class nb_iter_bound *elt;
+ for (elt = loop->bounds; elt; elt = elt->next)
+   {
+ if (elt->is_exit
+ && !wi::ltu_p (loop->nb_iterations_upper_bound,
+elt->bound))
+   {
+ cond_stmt = as_a <gcond *> (elt->stmt);
+ break;
+   }
+   }
+ if (exits[1]->flags & EDGE_TRUE_VALUE)
+   gimple_cond_make_false (cond_stmt);
+ else
+   gimple_cond_make_true (cond_stmt);
+ update_stmt (cond_stmt);
+ return false;
+   }
+   }
+

I know the check is wrong; it is just an experiment. Then:

   [local count: 69202658]:
  _21 = (unsigned int) N_13(D);
  _22 = MIN_EXPR <_21, 1001>; ---> Use MIN_EXPR as the check.
  _23 = _22 + 1;
  goto ; [100.00%]

   [local count: 1014686025]:
  _1 = (long unsigned int) i_9;
  _2 = _1 * 4;
  _3 = a_14(D) + _2;
  _4 = *_3;
  _5 = b_15(D) + _2;
  _6 = *_5;
  _7 = c_16(D) + _2;
  _8 = _4 + _6;
  *_7 = _8;
  if (0 != 0)
goto ; [1.00%]
  else
goto ; [99.00%]

   [local count: 1004539166]:
  i_18 = i_9 + 1;

   [local count: 1073741824]:
  # i_9 = PHI <0(2), i_18(4)>
  # ivtmp_19 = PHI <_23(2), ivtmp_20(4)>
  ivtmp_20 = ivtmp_19 - 1;
  if (ivtmp_20 != 0)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 69202658]:
  return;

Then it can vectorize.

I am not sure whether it is the right place to put this code.
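For reference, the two-exit loop the dumps above describe can be sketched like this (a reconstruction from the GIMPLE, not necessarily the exact PR 113134 test case): one exit is bounded by N, the other by the constant 1001, which is why the canonical IV can use MIN (N, 1001) as in the experiment.

```c
#include <assert.h>

/* Sketch of a loop with two counted exits: the `i < N` exit gives
   niter = (unsigned) N, the early break gives niter = 1001.  */
static void
add_until (int *a, int *b, int *c, int N)
{
  for (int i = 0; i < N; i++)
    {
      c[i] = a[i] + b[i];
      if (i == 1000)   /* second counted exit: at most 1001 iterations */
        break;
    }
}
```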

[Bug target/113608] RISC-V: Vector spills after enabling vector abi

2024-02-01 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113608

--- Comment #2 from JuzheZhong  ---
vuint16m2_t vadd(vuint16m2_t a, vuint8m1_t b) {
int vl = __riscv_vsetvlmax_e8m1();
vuint16m2_t c = __riscv_vzext_vf2_u16m2(b, vl);
return __riscv_vadd_vv_u16m2(a, c, vl);
}

[Bug tree-optimization/113134] gcc does not version loops with early break conditions that don't have side-effects

2024-02-01 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #21 from JuzheZhong  ---
Hi, Richard. I looked into ivcanon.

I found that:

  /* If the loop has more than one exit, try checking all of them
 for # of iterations determinable through scev.  */
  if (!exit)
niter = find_loop_niter (loop, &exit);

In find_loop_niter, we iterate over 2 exit edges:

1. bb 5 -> bb 6 with niter = (unsigned int) N_13(D).
2. bb 3 -> bb 6 with niter = 1001.

It just skips niter = (unsigned int) N_13(D) in:
  if (!integer_zerop (desc.may_be_zero))
continue;

find_loop_niter (loop, &exit) returns 1001, skipping (unsigned int) N_13(D).

Shouldn't it return MIN (1001, (unsigned int) N_13(D)) instead?

I prefer to fix it in ivcanon since I believe that would be more elegant than
fixing it in the loop splitter.

I am still investigating; any guidance will be really appreciated.

Thanks.

[Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns

2024-02-01 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #11 from JuzheZhong  ---
Hi, Tamar.

We are interested in supporting saturating and rounding.

We may need to support scalar first.

Do you have any suggestions ?

Or you are already working on it?

Thanks.

[Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns

2024-02-01 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #10 from JuzheZhong  ---
Hi, Tamar.

We are interested in supporting saturating and rounding.

We may need to support scalar first.

Do you have any suggestions ?

Or you are already working on it?

Thanks.

[Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns

2024-02-01 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #9 from JuzheZhong  ---
Ok. After investigation of LLVM:

Before loop vectorizer:

  %cond12 = tail call i32 @llvm.usub.sat.i32(i32 %conv5, i32 %wsize)
  %conv13 = trunc i32 %cond12 to i16

After loop vectorizer:

  %10 = call <16 x i32> @llvm.usub.sat.v16i32(<16 x i32> %9, <16 x i32>
%broadcast.splat)
  %11 = trunc <16 x i32> %10 to <16 x i16>

I think GCC can follow this approach: first recognize the scalar saturation
pattern, then let the loop vectorizer turn it into the vector saturation
operation.

[Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns

2024-02-01 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #8 from JuzheZhong  ---
Missing saturation vectorization makes RVV Clang 20% faster than RVV GCC in a
recent benchmark evaluation.

This is in the CoreMark-PRO zip-test; I believe other targets should be the same.

I wonder how we should start supporting it, or whether somebody has already
started.

[Bug c/113695] RISC-V: Sources with different EEW must use different registers

2024-01-31 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113695

--- Comment #1 from JuzheZhong  ---
Since both operands are input operands, an early-clobber "&" constraint cannot
help.

[Bug c/113695] New: RISC-V: Sources with different EEW must use different registers

2024-01-31 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113695

Bug ID: 113695
   Summary: RISC-V: Sources with different EEW must use different
registers
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: juzhe.zhong at rivai dot ai
  Target Milestone: ---

As this PR in LLVM, https://github.com/llvm/llvm-project/issues/80099

RVV ISA:
A vector register cannot be used to provide source operands with more than one
EEW for a single instruction. A
mask register source is considered to have EEW=1 for this constraint. An
encoding that would result in the
same vector register being read with two or more different EEWs, including when
the vector register appears at
different positions within two or more vector register groups, is reserved.

#include <riscv_vector.h>
#include <stdint.h>

void foo(vuint64m2_t colidx, uint32_t* base_addr, size_t vl) {
  vuint32m1_t values =
__riscv_vget_v_u32m2_u32m1(__riscv_vreinterpret_v_u64m2_u32m2 (colidx), 0);
  __riscv_vsuxei64_v_u32m1(base_addr, colidx, values, vl);
}

foo:
vsetvli zero,a1,e32,m1,ta,ma
vsuxei64.v  v8,(a0),v8
ret

This is incorrect: the two input operands with different EEWs must not use the
same register (v8).

The current GCC RTL machine description and constraints do not allow us to fix
it. Even though it is a bug, I think we can only revisit it in GCC 15.

[Bug tree-optimization/113134] gcc does not version loops with early break conditions that don't have side-effects

2024-01-31 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #19 from JuzheZhong  ---

The loop is:

bb 3 -> bb 4 -> bb 5
  |   |__⬆
  |__⬆

The condition in bb 3 is if (i_21 == 1001).
The condition in bb 4 is if (N_13(D) > i_18).

Look into lsplit:
This loop doesn't satisfy the check of:
if (split_loop (loop) || split_loop_on_cond (loop))

split_loop_on_cond tries to split a loop whose condition is loop-invariant.
However, neither the condition in bb 3 nor the one in bb 4 is loop-invariant.

I wonder whether we should add a new kind of loop splitter like:

diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc
index 04215fe7937..a4081b9b6f5 100644
--- a/gcc/tree-ssa-loop-split.cc
+++ b/gcc/tree-ssa-loop-split.cc
@@ -1769,7 +1769,8 @@ tree_ssa_split_loops (void)
   if (optimize_loop_for_size_p (loop))
continue;

-  if (split_loop (loop) || split_loop_on_cond (loop))
+  if (split_loop (loop) || split_loop_on_cond (loop)
+ || split_loop_for_early_break (loop))
{
  /* Mark our containing loop as having had some split inner loops.  */
  loop_outer (loop)->aux = loop;

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-31 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #18 from JuzheZhong  ---
(In reply to rguent...@suse.de from comment #17)
> On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > 
> > --- Comment #16 from JuzheZhong  ---
> > (In reply to rguent...@suse.de from comment #15)
> > > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > > 
> > > > --- Comment #14 from JuzheZhong  ---
> > > > Thanks Richard.
> > > > 
> > > > It seems that we can't fix this issue for now. Is that right ?
> > > > 
> > > > If I understand correctly, do you mean we should wait after SLP 
> > > > representations
> > > > are finished and then revisit this PR?
> > > 
> > > Yes.
> > 
> > It seems to be a big refactor work.
> 
> It's not too bad if people wouldn't continue to add features not 
> implementing SLP ...
> 
> > I wonder I can do anything to help with SLP representations ?
> 
> I hope to get back to this before stage1 re-opens and will post
> another request for testing.  It's really mostly going to be making
> sure all paths have coverage which means testing all the various
> architectures - I can only easily test x86.  There's a branch
> I worked on last year, refs/users/rguenth/heads/vect-force-slp,
> which I use to hunt down cases not supporting SLP (it's a bit
> overeager to trigger, and it has known holes so it's not really
> a good starting point yet for folks to try other archs).

OK. It seems that you are almost done with that, but it needs more testing on
various targets.

So, if I want to work on optimizing vectorization (starting with TSVC),
I should avoid touching the cases that fail to vectorize due to data
reference/dependence analysis (e.g. this PR's case, s116),

and avoid adding new features to the loop vectorizer, e.g. min/max reduction
with index (s315),

so as not to make your SLP refactoring work heavier.

Am I right ?

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-31 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #16 from JuzheZhong  ---
(In reply to rguent...@suse.de from comment #15)
> On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > 
> > --- Comment #14 from JuzheZhong  ---
> > Thanks Richard.
> > 
> > It seems that we can't fix this issue for now. Is that right ?
> > 
> > If I understand correctly, do you mean we should wait after SLP 
> > representations
> > are finished and then revisit this PR?
> 
> Yes.

It seems to be a big refactoring job.

I wonder if I can do anything to help with the SLP representations?

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-31 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #14 from JuzheZhong  ---
Thanks Richard.

It seems that we can't fix this issue for now. Is that right ?

If I understand correctly, do you mean we should wait after SLP representations
are finished and then revisit this PR?

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-30 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #12 from JuzheZhong  ---
OK. It seems it has a data dependency issue:

missed:   not vectorized, possible dependence between data-refs a[i_15] and
a[_4]

a[i_15] = _3;  STMT 1
_4 = i_15 + 2;
_5 = a[_4];STMT 2

STMT2 should not depend on STMT1.

It's recognized as a dependency in vect_analyze_data_ref_dependence.

Is it reasonable to fix it in vect_analyze_data_ref_dependence ?
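The statements quoted above can be reduced to this shape (an illustrative sketch, not the exact s116 source): the store writes a[i] while the load reads a[i + 2], a forward distance of 2, so within one iteration STMT 2 never reads what STMT 1 wrote.

```c
#include <assert.h>

/* STMT 1 stores a[i]; STMT 2 loads a[i + 2].  The indices differ by 2,
   so the two references are independent within one iteration.  */
static double
step (double *a, int i)
{
  a[i] = a[i] * 2.0;   /* STMT 1: a[i_15] = _3 */
  return a[i + 2];     /* STMT 2: _5 = a[_4]   */
}
```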

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-30 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #11 from JuzheZhong  ---
It seems that we should first fix this case (which Richard gave), which I think
is not an SCEV or value-numbering issue:

double a[1024];
void foo ()
{
  for (int i = 0; i < 1022; i += 2)
{
  double tem = a[i+1];
  a[i] = tem * a[i];
  a[i+1] = a[i+2] * tem;
}
}

auto.c:13:21: missed: couldn't vectorize loop
auto.c:15:14: missed: not vectorized: no vectype for stmt: tem_10 = a[_1];
 scalar_type: double

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-30 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #10 from JuzheZhong  ---

I think the root cause is that we treat i_16 and _1 as aliasing, due to scalar
evolution:

(get_scalar_evolution 
  (scalar = i_16)
  (scalar_evolution = {0, +, 2}_1))

(get_scalar_evolution 
  (scalar = _1)
  (scalar_evolution = {1, +, 2}_1))

Even though I don't fully understand what it means.
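My reading of those CHRECs (an illustrative check, not GCC code): the store index follows {0, +, 2}_1 (even values) and the load index follows {1, +, 2}_1 (odd values), so the two references can never hit the same element.

```c
#include <assert.h>

/* Brute-force check that 0 + 2*k can never equal 1 + 2*j, i.e. the
   {0, +, 2} and {1, +, 2} index sequences are disjoint.  */
static int
chrecs_can_collide (int iters)
{
  for (int k = 0; k < iters; k++)
    for (int j = 0; j < iters; j++)
      if (0 + 2 * k == 1 + 2 * j)   /* even vs. odd: never equal */
        return 1;
  return 0;
}
```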

diff --git a/gcc/tree-scalar-evolution.cc b/gcc/tree-scalar-evolution.cc
index 25e3130e2f1..2df6de67043 100644
--- a/gcc/tree-scalar-evolution.cc
+++ b/gcc/tree-scalar-evolution.cc
@@ -553,7 +553,7 @@ get_scalar_evolution (basic_block instantiated_below, tree
scalar)
 if (SSA_NAME_IS_DEFAULT_DEF (scalar))
  res = scalar;
else
- res = *find_var_scev_info (instantiated_below, scalar);
+ res = scalar;
break;

   case REAL_CST:

Ah... I tried an ugly hack in scalar evolution which is definitely wrong (just
for an experiment).

Then, we can vectorize it:

foo:
lui a1,%hi(a)
addi    a1,a1,%lo(a)
li  a2,511
li  a3,0
vsetivli    zero,2,e64,m1,ta,ma
.L2:
addiw   a5,a3,1
slli    a5,a5,3
add a5,a1,a5
fld fa5,0(a5)
slli    a4,a3,3
add a4,a1,a4
vlse64.v    v2,0(a4),zero
vle64.v v1,0(a5)
vfslide1down.vf v2,v2,fa5
addiw   a2,a2,-1
vfmul.vv    v1,v1,v2
vse64.v v1,0(a4)
addiw   a3,a3,2
bne a2,zero,.L2
ret

I think we can add some simple memory access index recognition, but I don't
know where to add this recognition.

Would you mind giving me some more hints ?

Thanks.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-30 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

JuzheZhong  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #34 from JuzheZhong  ---
Fixed.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-30 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #20 from JuzheZhong  ---
(In reply to Robin Dapp from comment #19)
> What seems odd to me is that in fre5 we simplify
> 
>   _429 = .COND_SHL (mask_patt_205.47_276, vect_cst__262, vect_cst__262, { 0,
> ... });
>   vect_prephitmp_129.51_282 = _429;
>   vect_iftmp.55_287 = VEC_COND_EXPR  vect_prephitmp_129.51_282, vect_cst__262>;
> 
> to
> 
> Applying pattern match.pd:9607, gimple-match-10.cc:3817
> gimple_simplified to vect_iftmp.55_287 = .COND_SHL (mask_patt_205.47_276,
> vect_cst__262, vect_cst__262, { 0, ... });
> 
> so fold
> 
> vec_cond (mask209, prephitmp129, vect_cst262)
> with prephitmp129 = cond_shl (mask205, vect_cst262, vect_cst262, 0)
> 
> into
> cond_shl = (mask205, vect_cst262, vect_cst262, 0)?
> 
> That doesn't look valid to me because the vec_cond's else value
> (vect_cst262) gets lost.  Wouldn't such a simplification have a conditional
> else value?
> Like !mask1 ? else1 : else2 instead of else2 unconditionally?

Does ARM SVE have the same issue too? I think we should be using the same
folding optimization as ARM SVE.

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-30 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #8 from JuzheZhong  ---
Hi, Richard.

Now I have found time to work on GCC vectorization optimization.

I find this case:

  _2 = a[_1];
  ...
  a[i_16] = _4;
  ,,,
  _7 = a[_1];   ---> This load should be eliminated and reuse _2.

Am I right ?

Could you guide me which pass should do this CSE optimization ?

Thanks.
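The pattern above boils down to this C sketch (names and offsets are illustrative): if the intervening store provably cannot alias the load address, the second load can reuse the first value. In GCC this kind of load redundancy is normally removed by value numbering (FRE/PRE).

```c
#include <assert.h>

/* The second read of a[i + 1] is redundant: the store to a[i] cannot
   alias it, so the earlier loaded value t can be reused.  */
static double
second_load (double *a, int i)
{
  double t = a[i + 1];   /* _2 = a[_1]                  */
  a[i] = t * a[i];       /* a[i_16] = _4                */
  return a[i + 1] + t;   /* _7 = a[_1]: could reuse _2  */
}
```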

[Bug middle-end/113166] RISC-V: Redundant move instructions in RVV intrinsic codes

2024-01-30 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113166

--- Comment #3 from JuzheZhong  ---
#include <riscv_vector.h>
#include <stdint.h>

template 
inline vuint8m1_t tail_load(void const* data);

template<>
inline vuint8m1_t tail_load(void const* data) {
    uint64_t const* ptr64 = reinterpret_cast<uint64_t const*>(data);
#if 1
const vuint64m1_t zero = __riscv_vmv_v_x_u64m1(0,
__riscv_vsetvlmax_e64m1());
vuint64m1_t v64 = __riscv_vslide1up(zero, *ptr64,
__riscv_vsetvlmax_e64m1());
return __riscv_vreinterpret_u8m1(v64);
#elif 1
vuint64m1_t v64 = __riscv_vmv_s_x_u64m1(*ptr64, 1);
const vuint64m1_t zero = __riscv_vmv_v_x_u64m1(0,
__riscv_vsetvlmax_e64m1());
v64 = __riscv_vslideup(v64, zero, 1, __riscv_vsetvlmax_e8m1());
return __riscv_vreinterpret_u8m1(v64);
#elif 1
vuint64m1_t v64 = __riscv_vle64_v_u64m1(ptr64, 1);
const vuint64m1_t zero = __riscv_vmv_v_x_u64m1(0,
__riscv_vsetvlmax_e64m1());
v64 = __riscv_vslideup(v64, zero, 1, __riscv_vsetvlmax_e8m1());
return __riscv_vreinterpret_u8m1(v64);
#else
vuint8m1_t v = __riscv_vreinterpret_u8m1(__riscv_vle64_v_u64m1(ptr64, 1));
const vuint8m1_t zero = __riscv_vmv_v_x_u8m1(0, __riscv_vsetvlmax_e8m1());
return __riscv_vslideup(v, zero, sizeof(uint64_t),
__riscv_vsetvlmax_e8m1());
#endif
}

vuint8m1_t test2(uint64_t data) {
return tail_load();
}

GCC ASM:

test2(unsigned long):
vsetvli a5,zero,e64,m1,ta,ma
vmv.v.i v8,0
vmv1r.v v9,v8   
vslide1up.vx    v8,v9,a0
ret

LLVM ASM:

test2(unsigned long):  # @test2(unsigned long)
vsetvli a1, zero, e64, m1, ta, ma
vmv.v.i v9, 0
vslide1up.vx    v8, v9, a0
ret

[Bug c/113666] New: RISC-V: Cost model test regression due to recent middle-end loop vectorizer changes

2024-01-29 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113666

Bug ID: 113666
   Summary: RISC-V: Cost model test regression due to recent
middle-end loop vectorizer changes
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: juzhe.zhong at rivai dot ai
  Target Milestone: ---

FAIL: gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c scan-assembler-not vset
FAIL: gcc.dg/vect/costmodel/riscv/rvv/pr113281-2.c scan-assembler-not vset
FAIL: gcc.dg/vect/costmodel/riscv/rvv/pr113281-5.c scan-assembler-not vset

unsigned char a;

int main() {
  short b = a = 0;
  for (; a != 19; a++)
if (a)
  b = 32872 >> a;

  if (b == 0)
return 0;
  else
return 1;
}

-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize

We expect:

lui a5,%hi(a)
li  a4,19
sb  a4,%lo(a)(a5)
li  a0,0
ret

However, we now have:

vsetvli a5,zero,e8,mf4,ta,ma
li  a6,17
li  a3,32768
vid.v   v2
addiw   a3,a3,104
vadd.vx v2,v2,a6
lui a1,%hi(a)
vsetvli zero,zero,e32,m1,ta,ma
li  a0,19
vmv.v.x v1,a3
vzext.vf4   v3,v2
sb  a0,%lo(a)(a1)
vsra.vv v1,v1,v3
vsetvli zero,zero,e16,mf2,ta,ma
vncvt.x.x.w v1,v1
vslidedown.vi   v1,v1,1
vmv.x.s a0,v1
snez    a0,a0
ret

I guess it is caused by this commit:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=1a8261e047f7a2c2b0afb95716f7615cba718cd1

I don't know how to fix the RISC-V backend cost model to recover this, since
the current scalar_to_vec cost already seems very high; it makes no sense to
keep raising it.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-28 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #15 from JuzheZhong  ---
Hi, Robin.

I tried to disable vec_extract, then the case passed.

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index 3b32369f68c..b61b886ef3d 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1386,7 +1386,7 @@
(match_operand:V_VLS  1 "register_operand")
(parallel
 [(match_operand  2 "nonmemory_operand")])))]
-  "TARGET_VECTOR"
+  "0"
 {
   /* Element extraction can be done by sliding down the requested element
  to index 0 and then v(f)mv.[xf].s it to a scalar register.  */


I am not so familiar with the vec_extract stuff; could you take a look at it?

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-26 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #13 from JuzheZhong  ---
Ok. I found a regression between rvv-next and trunk.
I believe it is GCC-12 vs GCC-14:

rvv-next:
...
.L11:
li  t1,31
mv  a2,a1
bleu    a7,t1,.L12
bne a6,zero,.L13
li  a5,1
subw    a5,a5,a3
andi    a5,a5,0xff
vsetvli a4,zero,e64,m1,ta,mu
vmv.v.i v24,0
vmv.v.x v27,a1
vmv1r.v v26,v24
.L14:
vsetvli a3,a5,e64,m1,tu,mu
sub a5,a5,a3
vmsne.vi    v0,v27,0
vmerge.vim  v25,v26,1,v0
vadd.vv v24,v24,v25
bne a5,zero,.L14
vsetvli a5,zero,e64,m1,ta,mu
vmv.s.x v25,zero
li  a3,0
vredsum.vs  v25,v24,v25
vmv.x.s a5,v25
j   .L17
...

RVV trunk GCC:

.L8:
lui a0,%hi(h)
lb  a4,%lo(h)(a0)
bgt a4,zero,.L37
lui a5,%hi(f)
lh  t1,%lo(f)(a5)
lui a3,%hi(g)
addi    a3,a3,%lo(g)
lw  a6,4(a3)
not a1,a6
slliw   a5,t1,3
srai    a1,a1,63
subw    a5,a5,t1
lw  a7,32(a3)
and a1,a6,a1
addiw   a2,a5,1
bne a7,zero,.L13
bne t1,zero,.L14
mv  a5,a6
blt a6,zero,.L44
.L15:
li  a3,31
sext.w  a2,a5
bleu    a6,a3,.L16
li  a3,1
.L20:
addiw   a5,a4,1
bgt a6,zero,.L45
slliw   a4,a5,24
sraiw   a4,a4,24
bne a4,a3,.L20
li  a5,0
li  a2,0
j   .L19
.L37:
lui a5,%hi(c)
.L11:
lw  a0,%lo(c)(a5)
addi    a0,a0,-6
snez    a0,a0
ret

I don't think it affects correctness, but it's an interesting observation.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-26 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #11 from JuzheZhong  ---
(In reply to Robin Dapp from comment #10)
> The compile farm machine I'm using doesn't have SVE.
> Compiling with -march=armv8-a -O3 pr113607.c -fno-vect-cost-model and
> running it returns 0 (i.e. ok).
> 
> pr113607.c:35:5: note: vectorized 3 loops in function.

OK, thanks. I just checked rvv-next, which has similar vectorized IR to
upstream RVV GCC.

But rvv-next return 0.

I will investigate the difference between them.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-26 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #9 from JuzheZhong  ---
Hi, Robin.

Could you try this case on the latest ARM SVE
with -march=armv8-a+sve -O3 -fno-vect-cost-model ?

I first want to make sure it is not a middle-end bug.

The RVV vectorized IR is the same as ARM SVE's.

Thanks.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-26 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #8 from JuzheZhong  ---
Ok. I can reproduce it too.

I am gonna work on fixing it.

Thanks.

[Bug c/113608] New: RISC-V: Vector spills after enabling vector abi

2024-01-25 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113608

Bug ID: 113608
   Summary: RISC-V: Vector spills after enabling vector abi
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: juzhe.zhong at rivai dot ai
  Target Milestone: ---

https://godbolt.org/z/srdd4qhdc

#include "riscv_vector.h"

vint32m8_t
foo (int32_t *__restrict a, int32_t *__restrict b, int32_t *__restrict c,
 int32_t *__restrict a2, int32_t *__restrict b2, int32_t *__restrict c2,
 int32_t *__restrict a3, int32_t *__restrict b3, int32_t *__restrict c3,
 int32_t *__restrict a4, int32_t *__restrict b4, int32_t *__restrict c4,
 int32_t *__restrict a5, int32_t *__restrict b5, int32_t *__restrict c5,
 int32_t *__restrict d, int32_t *__restrict d2, int32_t *__restrict d3,
 int32_t *__restrict d4, int32_t *__restrict d5, int n, vint32m8_t vector)
{
  for (int i = 0; i < n; i++)
{
  a[i] = b[i] + c[i];
  b5[i] = b[i] + c[i];
  a2[i] = b2[i] + c2[i];
  a3[i] = b3[i] + c3[i];
  a4[i] = b4[i] + c4[i];
  a5[i] = a[i] + a4[i];
  d2[i] = a2[i] + c2[i];
  d3[i] = a3[i] + c3[i];
  d4[i] = a4[i] + c4[i];
  d5[i] = a[i] + a4[i];
  a[i] = a5[i] + b5[i] + a[i];

  c2[i] = a[i] + c[i];
  c3[i] = b5[i] * a5[i];
  c4[i] = a2[i] * a3[i];
  c5[i] = b5[i] * a2[i];
  c[i] = a[i] + c3[i];
  c2[i] = a[i] + c4[i];
  a5[i] = a[i] + a4[i];
  a[i] = a[i] + b5[i]
 + a[i] * a2[i] * a3[i] * a4[i] * a5[i] * c[i] * c2[i] * c3[i]
 * c4[i] * c5[i] * d[i] * d2[i] * d3[i] * d4[i] * d5[i];
}
return vector;
}

This case has vector spills after enabling the default vector ABI.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-25 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #3 from JuzheZhong  ---
I tried running your case with trunk GCC under Spike, and still couldn't
reproduce this issue.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-25 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #2 from JuzheZhong  ---
I can't reproduce this issue.

Could you test it with this patch applied ?

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643934.html

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-25 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #1 from JuzheZhong  ---
I can reproduce this issue.

Could you test it with this patch applied ?

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643934.html

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-01-25 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #7 from JuzheZhong  ---
(In reply to rguent...@suse.de from comment #6)
> On Thu, 25 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> > 
> > --- Comment #5 from JuzheZhong  ---
> > Both ICC and Clang X86 can vectorize SPEC 2017 lbm:
> > 
> > https://godbolt.org/z/MjbTbYf1G
> > 
> > But I am not sure X86 ICC is better or X86 Clang is better.
> 
> gather/scatter are possibly slow (and gather now has that Intel
> security issue).  The reason is a "cost" one:
> 
> t.c:47:21: note:   ==> examining statement: _4 = *_3;
> t.c:47:21: missed:   no array mode for V8DF[20]
> t.c:47:21: missed:   no array mode for V8DF[20]
> t.c:47:21: missed:   the size of the group of accesses is not a power of 2 
> or not equal to 3
> t.c:47:21: missed:   not falling back to elementwise accesses
> t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = 
> *_3;
> t.c:47:21: missed:  bad operation or unsupported loop bound.
> 
> where we don't consider using gather because we have a known constant
> stride (20).  Since the stores are really scatters we don't attempt
> to SLP either.
> 
> Disabling the above heuristic we get this vectorized as well, avoiding
> gather/scatter by manually implementing them and using a quite high
> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> faster code in the end).  But yes, I doubt that any of ICC or clang
> vectorized codes are faster anywhere (but without specifying an
> uarch you get some generic cost modelling applied).  Maybe SPR doesn't
> have the gather bug and it does have reasonable gather and scatter
> (zen4 scatter sucks).
> 
> .L3:
> vmovsd  952(%rax), %xmm0
> vmovsd  -8(%rax), %xmm2
> addq$1280, %rsi
> addq$1280, %rax
> vmovhpd -168(%rax), %xmm0, %xmm1
> vmovhpd -1128(%rax), %xmm2, %xmm2
> vmovsd  -648(%rax), %xmm0
> vmovhpd -488(%rax), %xmm0, %xmm0
> vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm0
> vmovsd  -968(%rax), %xmm1
> vmovhpd -808(%rax), %xmm1, %xmm1
> vinsertf32x4    $0x1, %xmm1, %ymm2, %ymm2
> vinsertf64x4    $0x1, %ymm0, %zmm2, %zmm2
> vmovsd  -320(%rax), %xmm0
> vmovhpd -160(%rax), %xmm0, %xmm1
> vmovsd  -640(%rax), %xmm0
> vmovhpd -480(%rax), %xmm0, %xmm0
> vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm1
> vmovsd  -960(%rax), %xmm0
> vmovhpd -800(%rax), %xmm0, %xmm8
> vmovsd  -1280(%rax), %xmm0
> vmovhpd -1120(%rax), %xmm0, %xmm0
> vinsertf32x4    $0x1, %xmm8, %ymm0, %ymm0
> vinsertf64x4    $0x1, %ymm1, %zmm0, %zmm0
> vmovsd  -312(%rax), %xmm1
> vmovhpd -152(%rax), %xmm1, %xmm8
> vmovsd  -632(%rax), %xmm1
> vmovhpd -472(%rax), %xmm1, %xmm1
> vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm8
> vmovsd  -952(%rax), %xmm1
> vmovhpd -792(%rax), %xmm1, %xmm9
> vmovsd  -1272(%rax), %xmm1
> vmovhpd -1112(%rax), %xmm1, %xmm1
> vinsertf32x4    $0x1, %xmm9, %ymm1, %ymm1
> vinsertf64x4    $0x1, %ymm8, %zmm1, %zmm1
> vaddpd  %zmm1, %zmm0, %zmm0
> vaddpd  %zmm7, %zmm2, %zmm1
> vfnmadd132pd    %zmm3, %zmm2, %zmm1
> vfmadd132pd %zmm6, %zmm5, %zmm0
> valignq $3, %ymm1, %ymm1, %ymm2
> vmovlpd %xmm1, -1280(%rsi)
> vextractf64x2   $1, %ymm1, %xmm8
> vmovhpd %xmm1, -1120(%rsi)
> vextractf64x4   $0x1, %zmm1, %ymm1
> vmovlpd %xmm1, -640(%rsi)
> vmovhpd %xmm1, -480(%rsi)
> vmovsd  %xmm2, -800(%rsi)
> vextractf64x2   $1, %ymm1, %xmm2
> vmovsd  %xmm8, -960(%rsi)
> valignq $3, %ymm1, %ymm1, %ymm1
> vmovsd  %xmm2, -320(%rsi)
> vmovsd  %xmm1, -160(%rsi)
> vmovsd  -320(%rax), %xmm1
> vmovhpd -160(%rax), %xmm1, %xmm2
> vmovsd  -640(%rax), %xmm1
> vmovhpd -480(%rax), %xmm1, %xmm1
> vinsertf32x4    $0x1, %xmm2, %ymm1, %ymm2
> vmovsd  -960(%rax), %xmm1
> vmovhpd -800(%rax), %xmm1, %xmm8
> vmovsd  -1280(%rax), %xmm1
> vmovhpd -1120(%rax), %xmm1, %xmm1
> vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm1
> vinsertf64x4    $0x1, %ymm2, %zmm1, %zmm1
> vfnmadd132pd    %zmm3, %zmm1, %zmm0
> vaddpd  %zmm4, %zmm0, %zmm0
> valignq $3, %ymm0, %ymm0, %ymm1
> vmovlpd %xmm0, 14728(%rsi)
> vextractf64x2   $1, %ymm0, %xmm2
> vmovhpd %xmm0, 14888(%rsi)
> vextractf64x4   $0x1, %zmm0, %ymm0
> vmovlpd %xmm0, 15368(%rsi)
> vmovhpd %xmm0, 15528(%rsi)
> vmovsd  %xmm1, 15208(%rsi)
> vextractf64x2   $1, %ymm0, %xmm1
> vmovsd  %xmm2, 15048(%rsi)
> valignq $3, %ymm0, %ymm0, %ymm0
> vmovsd  %xmm1, 15688(%rsi)
> vmovsd  %xmm0, 15848(%rsi)
> cmpq%rdx, %rsi
> jne
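The access shape behind the cost decision quoted above can be sketched like this (the stride of 20 doubles matches the "no array mode for V8DF[20]" message; the bounds and loop body are assumptions, not the actual 519.lbm kernel):

```c
#include <assert.h>

enum { CELLS = 64, Q = 20 };   /* Q = 20 doubles per grid cell */

/* Constant-stride loads and stores (stride Q): the vectorizer sees a
   known stride of 20 rather than a true gather/scatter.  */
static void
sweep (const double *src, double *dst)
{
  for (int i = 0; i < CELLS; i++)
    dst[Q * i] = src[Q * i] + src[Q * i + 1];
}
```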

[Bug target/113570] RISC-V: SPEC2017 549 fotonik3d miscompilation in autovec VLS 256 build

2024-01-24 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113570

--- Comment #5 from JuzheZhong  ---
It seems that we don't have any bugs in current SPEC 2017 testing.

So I strongly suggest "full coverage" testing on SPEC 2017 which I mentioned
in PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087

-march=rv64gcv --param=riscv-autovec-lmul=m2
-march=rv64gcv --param=riscv-autovec-lmul=m4
-march=rv64gcv --param=riscv-autovec-lmul=m8
-march=rv64gcv --param=riscv-autovec-lmul=dynamic

-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=dynamic

-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=dynamic

-march=rv64gcv --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m2
--param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m4
--param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m8
--param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=dynamic
--param=riscv-autovec-preference=fixed-vlmax

-march=rv64gcv_zvl256b --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m2
--param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m4
--param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m8
--param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=dynamic
--param=riscv-autovec-preference=fixed-vlmax

-march=rv64gcv_zvl512b --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m2
--param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m4
--param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m8
--param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=dynamic
--param=riscv-autovec-preference=fixed-vlmax

Could you trigger these test runs ?

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-01-24 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #5 from JuzheZhong  ---
Both ICC and Clang X86 can vectorize SPEC 2017 lbm:

https://godbolt.org/z/MjbTbYf1G

But I am not sure whether X86 ICC or X86 Clang is better.

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-01-24 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #4 from JuzheZhong  ---
OK. Confirmed that X86 GCC fails to vectorize it, whereas X86 Clang can vectorize it.

https://godbolt.org/z/EaTjGbPGW

The X86 Clang and RISC-V Clang IR are the same:

  %12 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %11,
i32 8, <8 x i1> , <8 x double> poison), !dbg !62
  %13 = or disjoint <8 x i64> %10, , !dbg !72
  %14 = getelementptr inbounds double, ptr %0, <8 x i64> %13, !dbg !72
  %15 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %14,
i32 8, <8 x i1> , <8 x double> poison), !dbg !72
  %16 = or disjoint <8 x i64> %10, , !dbg !73
  %17 = getelementptr inbounds double, ptr %0, <8 x i64> %16, !dbg !73
  %18 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %17,
i32 8, <8 x i1> , <8 x double> poison), !dbg !73
  %19 = fadd <8 x double> %15, %18, !dbg !74
  %20 = fmul <8 x double> %19, , !dbg !75
  %21 = fadd <8 x double> %12, , !dbg !76
  %22 = tail call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %21, <8 x
double> , <8 x double> %12), !dbg !77
  %23 = getelementptr inbounds double, ptr %1, <8 x i64> %10, !dbg !77
  tail call void @llvm.masked.scatter.v8f64.v8p0(<8 x double> %22, <8 x ptr>
%23, i32 8, <8 x i1> ), !dbg !78
  %24 = tail call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> %14,
i32 8, <8 x i1> , <8 x double> poison), !dbg !81
  %25 = fadd <8 x double> %20, , !dbg !82
  %26 = tail call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %25, <8 x
double> , <8 x double> %24), !dbg !83
  %27 = fadd <8 x double> %26, , !dbg !84
  %28 = getelementptr double, <8 x ptr> %23, i64 2001, !dbg !84
  tail call void @llvm.masked.scatter.v8f64.v8p0(<8 x double> %27, <8 x ptr>
%28, i32 8, <8 x i1> ), !dbg !85

Hi Richard, do you have any suggestions about this issue?
Thanks.

[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc

2024-01-24 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087

--- Comment #44 from JuzheZhong  ---
(In reply to Patrick O'Neill from comment #43)
> (In reply to Patrick O'Neill from comment #42)
> > I kicked off a run roughly 10 hours ago with your memory-hog fix patch
> > applied to a1b2953924c451ce90a3fdce6841b63bf05f335f. I'll post the results
> > here when the runs complete. Thanks!
> 
> No new failures!
> 
> zvl128b:
> no fails!
> 
> zvl256b:
> 549.fotonik3d (runtime) - pr113570 (looks like this fail is since I used
> -Ofast)

Thanks. Could you trigger full-coverage SPEC testing with the following combinations of compile options:


-march=rv64gcv --param=riscv-autovec-lmul=m2
-march=rv64gcv --param=riscv-autovec-lmul=m4
-march=rv64gcv --param=riscv-autovec-lmul=m8
-march=rv64gcv --param=riscv-autovec-lmul=dynamic

-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=dynamic

-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=dynamic

-march=rv64gcv --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax

-march=rv64gcv_zvl256b --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax

-march=rv64gcv_zvl512b --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax

I believe they can be run as separate tasks assigned to multiple cores or threads simultaneously.

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-01-24 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #3 from JuzheZhong  ---
Ok I see.

If we change NN to 8, then we can vectorize it with load_lanes/store_lanes with group size = 8:

https://godbolt.org/z/doe9c3hfo

We will use vlseg8e64, which is RVVM1DF[8] == RVVM1x8DF mode.

Here is the report:

/app/example.c:47:21: missed:   no array mode for RVVM1DF[20]
/app/example.c:47:21: missed:   no array mode for RVVM1DF[20]

I believe that if we enable vec_load_lanes/vec_store_lanes for RVVM1DF[20], which is RVVM1x20DF mode, then we can vectorize it.

But that is not a reasonable or general way to do it. This code requires array size = 20; other code may require array size = 21, 22, 23, etc. The array size can be any number, so we can't leverage this approach for arbitrary array sizes.

So the idea is to try vec_load_lanes/vec_store_lanes first, checking whether lanes vectorization is supported for the specific array size.

If not, we should be able to lower the accesses into multiple gather/scatters or strided load/stores.

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-01-24 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #1 from JuzheZhong  ---
Interestingly, with Clang only RISC-V can vectorize it.

I think there are 2 topics:

1. Support vectorization of this code in the loop vectorizer.
2. Transform gather/scatter into strided load/store for RISC-V.

For the 2nd topic: LLVM does it via a RISC-V target-specific lowering pass:

RISC-V gather/scatter lowering (riscv-gather-scatter-lowering)

This is the RISC-V LLVM backend codes:

  if (II->getIntrinsicID() == Intrinsic::masked_gather)
Call = Builder.CreateIntrinsic(
Intrinsic::riscv_masked_strided_load,
{DataType, BasePtr->getType(), Stride->getType()},
{II->getArgOperand(3), BasePtr, Stride, II->getArgOperand(2)});
  else
Call = Builder.CreateIntrinsic(
Intrinsic::riscv_masked_strided_store,
{DataType, BasePtr->getType(), Stride->getType()},
{II->getArgOperand(0), BasePtr, Stride, II->getArgOperand(3)});

I have tried to support strided load/store in the GCC loop vectorizer,
but it seemed to be unacceptable.  Maybe we can support strided load/stores
by leveraging the LLVM approach?

Btw, LLVM's RISC-V gather/scatter lowering doesn't do a perfect job here:

vid.v   v8
vmul.vx v8, v8, a3


vsoxei64.v  v10, (s2), v14

This is an ordered indexed store, which is very costly in hardware.
It should be an unordered indexed store or a strided store.

Anyway, I think we should first investigate how to support vectorization of lbm in the loop vectorizer.

[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc

2024-01-23 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087

--- Comment #41 from JuzheZhong  ---
Hi, Patrick.

Could you run the tests again based on the latest trunk GCC?

We have a recent memory-hog fix patch:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=3132d2d36b4705bb762e61b1c8ca4da7c78a8321

I want to make sure it doesn't cause a regression on SPEC.

I have tested it with full coverage GCC testsuite, no regression.

But I want to know about SPEC 2017.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop

2024-01-23 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #24 from JuzheZhong  ---
(In reply to Richard Biener from comment #19)
> (In reply to Richard Biener from comment #18)
> > (In reply to Tamar Christina from comment #17)
> > > Ok, bisected to
> > > 
> > > g:2efe3a7de0107618397264017fb045f237764cc7 is the first bad commit
> > > commit 2efe3a7de0107618397264017fb045f237764cc7
> > > Author: Hao Liu 
> > > Date:   Wed Dec 6 14:52:19 2023 +0800
> > > 
> > > tree-optimization/112774: extend the SCEV CHREC tree with a 
> > > nonwrapping
> > > flag
> > > 
> > > Before this commit we were unable to analyse the stride of the access.
> > > After this niters seems to estimate the loop trip count at 4 and after 
> > > that
> > > the logs diverge enormously.
> > 
> > Hum, but that's backward and would match to what I said in comment#2 - we
> > should get better code with that.
> > 
> > Juzhe - when you revert the above ontop of trunk does the generated code
> > look better for Risc-V?
> 
> It doesn't revert but you can do
> 
> diff --git a/gcc/tree-scalar-evolution.cc b/gcc/tree-scalar-evolution.cc
> index 25e3130e2f1..7870c8d76fb 100644
> --- a/gcc/tree-scalar-evolution.cc
> +++ b/gcc/tree-scalar-evolution.cc
> @@ -2054,7 +2054,7 @@ analyze_scalar_evolution (class loop *loop, tree var)
>  
>  void record_nonwrapping_chrec (tree chrec)
>  {
> -  CHREC_NOWRAP(chrec) = 1;
> +  CHREC_NOWRAP(chrec) = 0;
>  
>if (dump_file && (dump_flags & TDF_SCEV))
>  {

Hmmm. In my experiments, the codegen looks slightly better but still doesn't
recover back to the GCC 12 level.


Btw, I compared the ARM SVE codegen, even with the cost model:

https://godbolt.org/z/cKc1PG3dv

I think the GCC 13.2 codegen is better than GCC trunk's with the cost model.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop

2024-01-23 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #14 from JuzheZhong  ---
I just tried again both GCC-13.2 and GCC-14 with -fno-vect-cost-model.

https://godbolt.org/z/enEG3qf5K

GCC-14 requires a scalar epilogue loop, whereas GCC-13.2 doesn't.

I believe it's not a cost model issue.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop

2024-01-23 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #12 from JuzheZhong  ---
(In reply to Richard Biener from comment #11)
> (In reply to Tamar Christina from comment #9)
> > There is a weird costing going on in the PHI nodes though:
> > 
> > m_108 = PHI  1 times vector_stmt costs 0 in body 
> > m_108 = PHI  2 times scalar_to_vec costs 0 in prologue
> > 
> > they have collapsed to 0. which can't be right..
> 
> Note this is likely because of the backend going wrong.
> 
> bool
> vectorizable_phi (vec_info *,
>   stmt_vec_info stmt_info, gimple **vec_stmt,
>   slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> {
> ..
> 
>   /* For single-argument PHIs assume coalescing which means zero cost
>  for the scalar and the vector PHIs.  This avoids artificially
>  favoring the vector path (but may pessimize it in some cases).  */
>   if (gimple_phi_num_args (as_a <gphi *> (stmt_info->stmt)) > 1)
> record_stmt_cost (cost_vec, SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node),
>   vector_stmt, stmt_info, vectype, 0, vect_body);
> 
> You could check if we call this with sane values.

Do you mean it's a RISC-V backend cost model issue?

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop

2024-01-22 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #10 from JuzheZhong  ---
(In reply to Tamar Christina from comment #9)
> So on SVE the change is cost modelling.
> 
> Bisect landed on g:33c2b70dbabc02788caabcbc66b7baeafeb95bcf which changed
> the compiler's defaults to using the new throughput matched cost modelling
> used be newer cores.
> 
> It looks like this changes which mode the compiler picks for when using a
> fixed register size.
> 
> This is because the new cost model (correctly) models the costs for FMAs and
> promotions.
> 
> Before:
> 
> array1[0][_1] 1 times scalar_load costs 1 in prologue
> int) _2 1 times scalar_stmt costs 1 in prologue
> 
> after:
> 
> array1[0][_1] 1 times scalar_load costs 1 in prologue 
> (int) _2 1 times scalar_stmt costs 0 in prologue 
> 
> and the cost goes from:
> 
> Vector inside of loop cost: 125
> 
> to
> 
> Vector inside of loop cost: 83 
> 
> so far, nothing sticks out, and in fact the profitability for VNx4QI drops
> from
> 
> Calculated minimum iters for profitability: 5
> 
> to
> 
> Calculated minimum iters for profitability: 3
> 
> This causes a clash, as this is now exactly the same cost as VNx2QI which
> used to be what it preferred before.
> 
> Which then leads it to pick the higher VF.
> 
> In the end smaller VF shows:
> 
> ;; Guessed iterations of loop 4 is 0.500488. New upper bound 1.
> 
> and now we get:
> 
> Vectorization factor 16 seems too large for profile prevoiusly believed to
> be consistent; reducing.  
> ;; Guessed iterations of loop 4 is 0.500488. New upper bound 0.
> ;; Scaling loop 4 with scale 66.6% (guessed) to reach upper bound 0
> 
> which I guess is the big difference.
> 
> There is a weird costing going on in the PHI nodes though:
> 
> m_108 = PHI  1 times vector_stmt costs 0 in body 
> m_108 = PHI  2 times scalar_to_vec costs 0 in prologue
> 
> they have collapsed to 0. which can't be right..

I don't think this change caused the regression, since the regression happens
not only on ARM SVE but also on RVV.
It should be a middle-end issue.

I believe you'd better use -fno-vect-cost-model.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-22 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #31 from JuzheZhong  ---
machine dep reorg  : 403.69 ( 56%)  23.48 ( 93%) 427.17 ( 57%)  5290k (  0%)

Confirmed: after removing the RTL DF checking, LICM is no longer a compile-time hog.

The VSETVL pass accounts for 56% of compile time.

Even though I can't see the memory hog in GCC's -ftime-report, I can see 33G of
memory usage in htop.

Confirmed: both the compile-time hog and the memory hog are VSETVL pass issues.

I will work on optimizing the compile time as well as the memory usage of the VSETVL pass.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-22 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #30 from JuzheZhong  ---
Ok. I believe m_avl_def_in && m_avl_def_out can be removed with a better
algorithm.

Then the memory-hog should be fixed soon.

I am gonna rewrite avl_vl_unmodified_between_p and trigger full-coverage
testing, since it's going to be a big change there.

[Bug tree-optimization/113441] [13/14 Regression] Fail to fold the last element with multiple loop

2024-01-22 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #8 from JuzheZhong  ---
I believe a change between Nov and Dec causes the regression.

But I didn't continue the bisection.

Hope this information can help with your bisection.

Thanks.

[Bug tree-optimization/113441] [13/14 Regression] Fail to fold the last element with multiple loop

2024-01-22 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #7 from JuzheZhong  ---
(In reply to Tamar Christina from comment #6)
> Hello,
> 
> I can bisect it if you want. it should only take a few seconds.

Ok. Thanks a lot ...

I spent 2 hours bisecting it manually but still didn't locate the exact commit
that causes the regression...

It's great that you can bisect it easily.

[Bug tree-optimization/113441] [13/14 Regression] Fail to fold the last element with multiple loop

2024-01-22 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #5 from JuzheZhong  ---
Confirmed: as of Nov 1, the regression is gone.

https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=eac0917bd3d2ead4829d56c8f2769176087c7b3d

This commit is OK; it has no regressions.

Still bisecting manually.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-22 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #28 from JuzheZhong  ---
(In reply to Robin Dapp from comment #27)
> Following up on this:
> 
> I'm seeing the same thing Patrick does.  We create a lot of large non-sparse
> sbitmaps that amount to around 33G in total.
> 
> I did local experiments replacing all sbitmaps that are not needed for LCM
> by regular bitmaps.  Apart from output differences vs the original version
> the testsuite is unchanged.
> 
> As expected, wrf now takes longer to compiler, 8 mins vs 4ish mins before
> and we still use 2.7G of RAM for this single file (Likely because of the
> remaining sbitmaps) compared to a max of 1.2ish G that the rest of the
> commpilation uses.
> 
> One possibility to get the best of both worlds would be to threshold based
> on num_bbs * num_exprs.  Once we exceed it switch to the bitmap pass,
> otherwise keep sbitmaps for performance. 
> 
> Messaging with Juzhe offline, his best guess for the LICM time is that he
> enabled checking for dataflow which slows down this particular compilation
> by a lot.  Therefore it doesn't look like a generic problem.

Thanks. I don't think replacing sbitmap is the best solution.
Let me first disable the DF check and reproduce the 33G memory consumption on
my local machine.

I think the best way to optimize the memory consumption is to optimize the
VSETVL pass algorithm and code. I have an idea for the optimization, and
I am gonna work on it.

Thanks for reporting.

[Bug target/113420] risc-v vector: ICE when using C compiler compile C++ RVV intrinsics

2024-01-21 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113420

JuzheZhong  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #2 from JuzheZhong  ---
Fixed.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-19 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #25 from JuzheZhong  ---
RISC-V backend memory-hog issue is fixed.
But compile time hog in LICM still there, so keep this PR open.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-19 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #22 from JuzheZhong  ---
(In reply to Richard Biener from comment #21)
> I once tried to avoid df_reorganize_refs and/or optimize this with the
> blocks involved but failed.

I am considering whether we should disable LICM for RISC-V by default if vector
is enabled, since a 10x compile-time explosion is really horrible.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-19 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #19 from JuzheZhong  ---
(In reply to JuzheZhong from comment #18)
> Hi, Robin.
> 
> I have fixed patch for memory-hog:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643418.html
> 
> I will commit it after the testing.
> 
> But compile-time hog still exists which is loop invariant motion PASS.
> 
> with -fno-move-loop-invariants, we become quite faster.
> 
> Could you take a look at it ?

Note that with default -march=rv64gcv_zvl256b  -O3:
real63m18.771s
user60m19.036s
sys 2m59.787s

But with -march=rv64gcv_zvl256b -O3 -fno-move-loop-invariants:
real6m52.984s
user6m42.473s
sys 0m10.375s

10 times faster without loop invariant motion.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-19 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #18 from JuzheZhong  ---
Hi, Robin.

I have a fix patch for the memory hog:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643418.html

I will commit it after the testing.

But the compile-time hog still exists; it is in the loop invariant motion pass.

With -fno-move-loop-invariants, compilation becomes much faster.

Could you take a look at it?

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #17 from JuzheZhong  ---
Ok. Confirmed: the original test goes from 33383M to 4796k now.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #16 from JuzheZhong  ---
(In reply to Andrew Pinski from comment #15)
> (In reply to JuzheZhong from comment #14)
> > Oh. I known the reason now.
> > 
> > The issue is not RISC-V backend VSETVL PASS.
> > 
> > It's memory bug of rtx_equal_p I think.
> 
> 
> It is not rtx_equal_p but rather RVV_VLMAX which is defined as:
> riscv-protos.h:#define RVV_VLMAX gen_rtx_REG (Pmode, X0_REGNUM)
> 
> Seems like you could cache that somewhere ...

Oh, makes sense to me. Thank you so much.
I think the memory-hog issue will be fixed soon.

But the compile-time hog in loop invariant motion is still not fixed.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #14 from JuzheZhong  ---
Oh. I know the reason now.

The issue is not the RISC-V backend VSETVL pass.

It's a memory bug related to rtx_equal_p, I think.

We are calling rtx_equal_p, which is very costly.

For example, has_nonvlmax_reg_avl calls rtx_equal_p.

So I kept all the code unchanged and replaced the comparison as follows:

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 93a1238a5ab..1c85c8ee3c6 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -4988,7 +4988,7 @@ nonvlmax_avl_type_p (rtx_insn *rinsn)
 bool
 vlmax_avl_p (rtx x)
 {
-  return x && rtx_equal_p (x, RVV_VLMAX);
+  return x && REG_P (x) && REGNO (x) == X0_REGNUM /*rtx_equal_p (x, RVV_VLMAX)*/;
 }

Use REGNO (x) == X0_REGNUM instead of rtx_equal_p.

Memory-hog issue is gone:

939M -> 725k.

So I am gonna send a patch to work around the rtx_equal_p issue which causes
the memory hog.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #13 from JuzheZhong  ---
So I think we should investigate why calling has_nonvlmax_reg_avl costs so much
memory.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #12 from JuzheZhong  ---
Ok. Here is a simple fix which gives some hints:


diff --git a/gcc/config/riscv/riscv-vsetvl.cc
b/gcc/config/riscv/riscv-vsetvl.cc
index 2067073185f..ede818140dc 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -2719,10 +2719,11 @@ pre_vsetvl::compute_lcm_local_properties ()
  for (int i = 0; i < num_exprs; i += 1)
{
  const vsetvl_info &info = *m_exprs[i];
- if (!info.has_nonvlmax_reg_avl () && !info.has_vl ())
+ bool has_nonvlmax_reg_avl_p = info.has_nonvlmax_reg_avl ();
+ if (!has_nonvlmax_reg_avl_p && !info.has_vl ())
continue;

- if (info.has_nonvlmax_reg_avl ())
+ if (has_nonvlmax_reg_avl_p)
{
  unsigned int regno;
  sbitmap_iterator sbi;
@@ -3556,7 +3557,7 @@ const pass_data pass_data_vsetvl = {
   RTL_PASS, /* type */
   "vsetvl", /* name */
   OPTGROUP_NONE, /* optinfo_flags */
-  TV_NONE,  /* tv_id */
+  TV_MACH_DEP,  /* tv_id */
   0,/* properties_required */
   0,/* properties_provided */
   0,/* properties_destroyed */


Memory usage goes from 931M -> 781M; it is reduced significantly.

Note that I didn't change all has_nonvlmax_reg_avl calls; we have so many
places calling has_nonvlmax_reg_avl...

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #11 from JuzheZhong  ---
It should be compute_lcm_local_properties. The memory usage is reduced by 50%
after I remove this function. I am still investigating.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #10 from JuzheZhong  ---
No, it's not caused here. I removed the whole function compute_avl_def_data.

The memory usage doesn't change.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #6 from JuzheZhong  ---
(In reply to Andrew Pinski from comment #5)
> Note "loop invariant motion" is the RTL based loop invariant motion pass.

So you mean it should still be a RISC-V issue, right?

[Bug tree-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #4 from JuzheZhong  ---
Also, compiling the original file with -fno-move-loop-invariants reduces
compile time from 60 minutes to 7 minutes:

real7m12.528s
user6m55.214s
sys 0m17.147s


machine dep reorg  :  75.93 ( 18%)  14.23 ( 88%)  90.15 ( 21%) 33383M ( 95%)

The memory report is quite obvious (it consumes 95% of the memory).

So, I believe the VSETVL pass is not the main reason for the compile-time hog;
it should be the loop invariant pass.

But the VSETVL pass is the main reason for the memory hog.

I am not familiar with the loop invariant pass. Can anyone help debug the
compile-time hog of the loop invariant pass? Or should we disable the loop
invariant pass by default for RISC-V?

[Bug tree-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #3 from JuzheZhong  ---
Ok. The reduced case:

# 1 "module_first_rk_step_part1.fppized.f90"
# 1 ""
# 1 ""
# 1 "module_first_rk_step_part1.fppized.f90"
!WRF:MEDIATION_LAYER:SOLVER


MODULE module_first_rk_step_part1

CONTAINS

  SUBROUTINE first_rk_step_part1 (   grid , config_flags  &
 , moist , moist_tend   &
 , chem  , chem_tend&
 , tracer, tracer_tend  &
 , scalar , scalar_tend &
 , fdda3d, fdda2d   &
 , aerod&
 , ru_tendf, rv_tendf   &
 , rw_tendf, t_tendf&
 , ph_tendf, mu_tendf   &
 , tke_tend &
 , adapt_step_flag , curr_secs  &
 , psim , psih , wspd , gz1oz0 , br , chklowq &
 , cu_act_flag , hol , th_phy   &
 , pi_phy , p_phy , t_phy   &
 , dz8w , p8w , t8w &
 , ids, ide, jds, jde, kds, kde &
 , ims, ime, jms, jme, kms, kme &
 , ips, ipe, jps, jpe, kps, kpe &
 , imsx,imex,jmsx,jmex,kmsx,kmex&
 , ipsx,ipex,jpsx,jpex,kpsx,kpex&
 , imsy,imey,jmsy,jmey,kmsy,kmey&
 , ipsy,ipey,jpsy,jpey,kpsy,kpey&
 , k_start , k_end  &
 , f_flux   &
)
USE module_state_description
USE module_model_constants
USE module_domain, ONLY : domain, domain_clock_get, get_ijk_from_subgrid
USE module_configure, ONLY : grid_config_rec_type, model_config_rec
USE module_radiation_driver, ONLY : pre_radiation_driver, radiation_driver
USE module_surface_driver, ONLY : surface_driver
USE module_cumulus_driver, ONLY : cumulus_driver
USE module_shallowcu_driver, ONLY : shallowcu_driver
USE module_pbl_driver, ONLY : pbl_driver
USE module_fr_fire_driver_wrf, ONLY : fire_driver_em_step
USE module_fddagd_driver, ONLY : fddagd_driver
USE module_em, ONLY : init_zero_tendency
USE module_force_scm
USE module_convtrans_prep
USE module_big_step_utilities_em, ONLY : phy_prep
use module_scalar_tables
USE module_dm, ONLY : local_communicator, mytask, ntasks, ntasks_x,
ntasks_y, local_communicator_periodic, wrf_dm_maxval
USE module_comm_dm, ONLY :
halo_em_phys_a_sub,halo_em_fdda_sfc_sub,halo_pwp_sub,halo_em_chem_e_3_sub, &
halo_em_chem_e_5_sub, halo_em_hydro_noahmp_sub
USE module_utility
IMPLICIT NONE

TYPE ( domain ), INTENT(INOUT) :: grid
TYPE ( grid_config_rec_type ), INTENT(IN) :: config_flags
TYPE(WRFU_Time):: currentTime

INTEGER, INTENT(IN) :: ids, ide, jds, jde, kds, kde, &
   ims, ime, jms, jme, kms, kme, &
   ips, ipe, jps, jpe, kps, kpe, &
   imsx,imex,jmsx,jmex,kmsx,kmex,&
   ipsx,ipex,jpsx,jpex,kpsx,kpex,&
   imsy,imey,jmsy,jmey,kmsy,kmey,&
   ipsy,ipey,jpsy,jpey,kpsy,kpey


LOGICAL ,INTENT(IN):: adapt_step_flag
REAL, INTENT(IN)   :: curr_secs

REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_moist),INTENT(INOUT)   ::
moist
REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_moist),INTENT(INOUT)   ::
moist_tend
REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_chem),INTENT(INOUT)   ::
chem
REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_chem),INTENT(INOUT)   ::
chem_tend
REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_tracer),INTENT(INOUT)   ::
tracer
REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_tracer),INTENT(INOUT)   ::
tracer_tend
REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_scalar),INTENT(INOUT)   ::
scalar
REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_scalar),INTENT(INOUT)   ::
scalar_tend
REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_fdda3d),INTENT(INOUT)  ::
fdda3d
REAL,DIMENSION(ims:ime,1:1,jms:jme,num_fdda2d),INTENT(INOUT)  ::
fdda2d
REAL,DIMENSION(ims:ime,kms:kme,jms:jme,num_aerod),INTENT(INOUT)   ::
aerod
REAL,DIMENSION(ims:ime,jms:jme), INTENT(INOUT) :: psim
REAL,DIMENSION(ims:ime,jms:jme), INTENT(INOUT) :: psih
REAL,DIMENSION(ims:ime,jms:jme), INTENT(INOUT) :: wspd
REAL

[Bug tree-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #2 from JuzheZhong  ---
To build the attachment file, we need the following files from SPEC2017:

module_big_step_utilities_em.mod  module_cumulus_driver.mod 
module_fddagd_driver.modmodule_model_constants.mod  
module_shallowcu_driver.mod
module_comm_dm.modmodule_dm.mod 
module_first_rk_step_part1.mod  module_pbl_driver.mod   
module_state_description.mod 
module_configure.mod  module_domain.mod 
module_force_scm.modmodule_radiation_driver.mod 
module_surface_driver.mod
module_convtrans_prep.mod module_em.mod 
module_fr_fire_driver_wrf.mod   module_scalar_tables.mod module_utility.mod

But I failed to create attachments for them since they are too big.

[Bug tree-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #1 from JuzheZhong  ---
Created attachment 57149
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57149&action=edit
spec2017 wrf

spec2017 wrf

[Bug c/113495] New: RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

Bug ID: 113495
   Summary: RISC-V: Time and memory awful consumption of SPEC2017
wrf benchmark
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: juzhe.zhong at rivai dot ai
  Target Milestone: ---

riscv64-unknown-linux-gnu-gfortran -march=rv64gcv_zvl256b -O3 -S -ftime-report

real    63m18.771s
user    60m19.036s
sys     2m59.787s

60+ minutes.

After investigation, the time report shows two passes are critical:

 loop invariant motion              :2600.28 ( 72%)   1.68 (  1%)2602.12 ( 69%)    2617k (  0%)

Loop invariant motion consumes most of the time: 72%.

The other is the VSETVL pass:

 vsetvl: earliest_fuse_vsetvl_info  : 438.26 ( 12%)  79.82 ( 47%) 518.08 ( 14%)  221807M ( 75%)
 vsetvl: pre_global_vsetvl_info     : 135.98 (  4%)  31.71 ( 19%) 167.69 (  4%)   71950M ( 24%)

Phases 2 and 3 of the VSETVL pass consume 16% of the time and 99% of the memory.

I will look into the VSETVL pass issue, but I am not able to take care of the
loop-invariant-motion issue.

[Bug middle-end/113166] RISC-V: Redundant move instructions in RVV intrinsic codes

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113166

--- Comment #2 from JuzheZhong  ---
#include <stdint.h>
#include <stddef.h>
#include <riscv_vector.h>


#if TO_16
# define uintOut_t uint16_t
# define utf8_to_utf32_scalar utf8_to_utf16_scalar
# define utf8_to_utf32_rvv utf8_to_utf16_rvv
#else
# define uintOut_t uint32_t
#endif


size_t utf8_to_utf32_scalar(char const *src, size_t count, uintOut_t *dest);

size_t
utf8_to_utf32_rvv(char const *src, size_t count, uintOut_t *dest)
{
    size_t tail = 3;
    if (count < tail) return utf8_to_utf32_scalar(src, count, dest);

    /* validate first three bytes */
    {
        size_t idx = tail;
        while (idx < count && (src[idx] >> 6) == 0b10)
            ++idx;
        uintOut_t buf[10];
        if (idx > tail + 3 || !utf8_to_utf32_scalar(src, idx, buf))
            return 0;
    }

    size_t n = count - tail;
    uintOut_t *destBeg = dest;

    static const uint64_t err1m[] = { 0x0202020202020202, 0x4915012180808080 };
    static const uint64_t err2m[] = { 0xcbcbcb8b8383a3e7, 0xcbcbdbcbcbcbcbcb };
    static const uint64_t err3m[] = { 0x0101010101010101, 0x01010101babaaee6 };

    const vuint8m1_t err1tbl =
        __riscv_vreinterpret_v_u64m1_u8m1(__riscv_vle64_v_u64m1(err1m, 2));
    const vuint8m1_t err2tbl =
        __riscv_vreinterpret_v_u64m1_u8m1(__riscv_vle64_v_u64m1(err2m, 2));
    const vuint8m1_t err3tbl =
        __riscv_vreinterpret_v_u64m1_u8m1(__riscv_vle64_v_u64m1(err3m, 2));

    const vuint8m2_t v64u8m2 = __riscv_vmv_v_x_u8m2(1<<6, __riscv_vsetvlmax_e8m2());

    const size_t vl8m1 = __riscv_vsetvlmax_e8m1();
    const size_t vl16m2 = __riscv_vsetvlmax_e16m2();

#if TO_16
    size_t vl8m2 = __riscv_vsetvlmax_e8m2();
    const vbool4_t m4odd =
        __riscv_vmsne_vx_u8m2_b4(__riscv_vand_vx_u8m2(__riscv_vid_v_u8m2(vl8m2), 1, vl8m2), 0, vl8m2);
#endif

    for (size_t vl, vlOut; n > 0; n -= vl, src += vl, dest += vlOut) {

        vl = __riscv_vsetvl_e8m2(n);

        vuint8m2_t v0 = __riscv_vle8_v_u8m2((uint8_t const*)src, vl);
        uint64_t max = __riscv_vmv_x_s_u8m1_u8(
            __riscv_vredmaxu_vs_u8m2_u8m1(v0, __riscv_vmv_s_x_u8m1(0, vl), vl));

        /* fast path: ASCII, all bytes < 0b10000000 */
        if (max < 0b10000000) {
            vlOut = vl;
#if TO_16
            __riscv_vse16_v_u16m4(dest, __riscv_vzext_vf2_u16m4(v0, vlOut), vlOut);
#else
            __riscv_vse32_v_u32m8(dest, __riscv_vzext_vf4_u32m8(v0, vlOut), vlOut);
#endif
            continue;
        }

        /* see "Validating UTF-8 In Less Than One Instruction Per Byte"
         * https://arxiv.org/abs/2010.03090 */
        vuint8m2_t v1 = __riscv_vslide1down_vx_u8m2(v0, src[vl+0], vl);
        vuint8m2_t v2 = __riscv_vslide1down_vx_u8m2(v1, src[vl+1], vl);
        vuint8m2_t v3 = __riscv_vslide1down_vx_u8m2(v2, src[vl+2], vl);

        vuint8m2_t s1 = __riscv_vreinterpret_v_u16m2_u8m2(
            __riscv_vsrl_vx_u16m2(__riscv_vreinterpret_v_u8m2_u16m2(v2), 4, vl16m2));
        vuint8m2_t s3 = __riscv_vreinterpret_v_u16m2_u8m2(
            __riscv_vsrl_vx_u16m2(__riscv_vreinterpret_v_u8m2_u16m2(v3), 4, vl16m2));

        vuint8m2_t idx2 = __riscv_vand_vx_u8m2(v2, 0xf, vl);
        vuint8m2_t idx1 = __riscv_vand_vx_u8m2(s1, 0xf, vl);
        vuint8m2_t idx3 = __riscv_vand_vx_u8m2(s3, 0xf, vl);

#define VRGATHER_u8m1x2(tbl, idx) \
        __riscv_vset_v_u8m1_u8m2(__riscv_vlmul_ext_v_u8m1_u8m2( \
            __riscv_vrgather_vv_u8m1(tbl, __riscv_vget_v_u8m2_u8m1(idx, 0), vl8m1)), 1, \
            __riscv_vrgather_vv_u8m1(tbl, __riscv_vget_v_u8m2_u8m1(idx, 1), vl8m1))

        vuint8m2_t err1 = VRGATHER_u8m1x2(err1tbl, idx1);
        vuint8m2_t err2 = VRGATHER_u8m1x2(err2tbl, idx2);
        vuint8m2_t err3 = VRGATHER_u8m1x2(err3tbl, idx3);
        vuint8m2_t errs = __riscv_vand_vv_u8m2(__riscv_vand_vv_u8m2(err1, err2, vl), err3, vl);

        /* 3-byte leads are >= 0b11100000, 4-byte leads >= 0b11110000 */
        vbool4_t is_3  = __riscv_vmsgtu_vx_u8m2_b4(v1, 0b11100000-1, vl);
        vbool4_t is_4  = __riscv_vmsgtu_vx_u8m2_b4(v0, 0b11110000-1, vl);
        vbool4_t is_34 = __riscv_vmor_mm_b4(is_3, is_4, vl);
        vbool4_t err34 = __riscv_vmxor_mm_b4(is_34,
            __riscv_vmsgtu_vx_u8m2_b4(errs, 0b01111111, vl), vl);
        vbool4_t errm  = __riscv_vmor_mm_b4(
            __riscv_vmsgt_vx_i8m2_b4(__riscv_vreinterpret_v_u8m2_i8m2(errs), 0, vl),
            err34, vl);
        if (__riscv_vfirst_m_b4(errm, vl) >= 0)
            return 0;

        /* decoding */

        /* mask of non continuation bytes */
        vbool4_t m = __riscv_vmsne_vx_u8m2_b4(__riscv_vsrl_vx_u8m2(v0, 6, vl), 0b10, vl);
        vlOut = __riscv_vcpop_m_b4(m, vl);

        /* extract first and second bytes */
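
The quoted kernel relies on utf8_to_utf32_scalar to validate the head of the
buffer, and the digest cuts off before the rest of the decode loop. For
orientation only, here is a minimal scalar UTF-8 to UTF-32 decoder; it is a
hypothetical sketch (the name utf8_to_utf32_ref and its structure are mine,
not the scalar routine from the original report), but it matches the contract
the vector code assumes: return the number of code points written, or 0 on
invalid input.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar reference: decode UTF-8 into UTF-32 code points.
 * Returns the number of code points written, or 0 on invalid input. */
size_t utf8_to_utf32_ref(const char *src, size_t count, uint32_t *dest)
{
    const uint8_t *s = (const uint8_t *)src;
    size_t out = 0;
    for (size_t i = 0; i < count; ) {
        uint32_t cp;
        size_t len;
        if (s[i] < 0x80)              { cp = s[i];        len = 1; }
        else if ((s[i] >> 5) == 0x6)  { cp = s[i] & 0x1f; len = 2; }
        else if ((s[i] >> 4) == 0xe)  { cp = s[i] & 0x0f; len = 3; }
        else if ((s[i] >> 3) == 0x1e) { cp = s[i] & 0x07; len = 4; }
        else return 0;                /* stray continuation or invalid lead */
        if (i + len > count) return 0;
        for (size_t j = 1; j < len; ++j) {
            if ((s[i+j] >> 6) != 0x2) return 0;  /* expect 0b10xxxxxx */
            cp = (cp << 6) | (s[i+j] & 0x3f);
        }
        /* reject overlong forms, surrogates and out-of-range values */
        static const uint32_t minv[] = { 0, 0, 0x80, 0x800, 0x10000 };
        if (cp < minv[len] || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
            return 0;
        dest[out++] = cp;
        i += len;
    }
    return out;
}
```

Overlong encodings, surrogates, and out-of-range values are rejected, which is
the same class of errors the vectorized error tables above check for.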

[Bug c/113474] RISC-V: Fail to use vmerge.vim for constant vector

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113474

--- Comment #2 from JuzheZhong  ---
Oh, it's a pretty simple fix. I am not sure whether the Richards will allow it
since we're in stage 4, but it's worth a try.

Could you send a patch?

[Bug c/113474] New: RISC-V: Fail to use vmerge.vim for constant vector

2024-01-18 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113474

Bug ID: 113474
   Summary: RISC-V: Fail to use vmerge.vim for constant vector
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: juzhe.zhong at rivai dot ai
  Target Milestone: ---

void
foo (int n, int **__restrict a)
{
  int b;
  int c;
  int d;
  for (b = 0; b < n; b++)
for (long e = 8; e > 0; e--)
  a[b][e] = a[b][e] == 15;
}

ASM:

foo:
        ble     a0,zero,.L5
        slli    a3,a0,3
        add     a3,a1,a3
        vsetivli        zero,4,e32,m1,ta,ma
        vmv.v.i v3,1              -> redundant
        vmv.v.i v2,0
.L3:
        ld      a5,0(a1)
        addi    a4,a5,4
        addi    a5,a5,20
        vle32.v v1,0(a5)
        vle32.v v0,0(a4)
        vmseq.vi        v0,v0,15
        vmerge.vvm      v4,v2,v3,v0  ----> It should be vmerge.vim
        vse32.v v4,0(a4)
        vmseq.vi        v0,v1,15
        addi    a1,a1,8
        vmerge.vvm      v1,v2,v3,v0  ----> It should be vmerge.vim
        vse32.v v1,0(a5)
        bne     a1,a3,.L3
.L5:
        ret


It's odd that we can generate vmseq.vi but fail to generate vmerge.vim.

Looking into the pattern of vcond_mask:

(define_insn_and_split "vcond_mask_<mode>"
  [(set (match_operand:V_VLS 0 "register_operand")
        (if_then_else:V_VLS
          (match_operand:<VM> 3 "register_operand")
          (match_operand:V_VLS 1 "nonmemory_operand")   -> relax the predicate
          (match_operand:V_VLS 2 "register_operand")))]


Why doesn't GCC fold the const_vector into operand 1?
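
For context, vcond_mask is a per-element select. A scalar model (illustrative
only; this is not GCC internals code) shows why a constant splat in operand 1
maps naturally onto vmerge.vim: when the selected value is a compile-time
constant in the 5-bit immediate range [-16, 15], it can be encoded directly in
the instruction instead of being materialized with vmv.v.i.

```c
#include <stdbool.h>
#include <stddef.h>

/* Scalar model of vcond_mask with a splat in operand 1:
 * r[i] = mask[i] ? a : b[i].
 * If 'a' is a constant in [-16, 15], RVV can encode it as the 5-bit
 * immediate of vmerge.vim rather than loading it into a register. */
void vcond_mask_splat(int *r, const bool *mask, int a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        r[i] = mask[i] ? a : b[i];
}
```

In the testcase above the two splats are 1 and 0, both well inside the
immediate range, so folding the const_vector into operand 1 would let the
backend drop the vmv.v.i v3,1 entirely.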

[Bug target/113429] RISC-V: SPEC2017 527 cam4 miscompilation in autovec VLA build

2024-01-17 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113429

--- Comment #10 from JuzheZhong  ---
I have commit V3 patch with rebasing since V2 patch conflicts with the trunk.

I think you can use trunk GCC validate CAM4 directly now.
