[Bug target/98596] registers not reused on RISC-V

2023-09-12 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98596

Vineet Gupta  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED
 CC||vineetg at gcc dot gnu.org

--- Comment #3 from Vineet Gupta  ---
This is fixed with the following commit (and will make it into gcc-14)

commit b41d7eb0e14785ff0ad6e6922cbd4c880e680bf9
Author: Vineet Gupta 
Date:   Mon Aug 7 13:45:29 2023 -0700

RISC-V: Enable Hoist to GCSE simple constants

Hoist want_to_gcse_p () calls rtx_cost () to compute max distance for
hoist candidates. For a simple const (say 6 which needs separate insn "LI 6")
backend currently returns 0, causing Hoist to bail and elide GCSE.

Note that constants requiring more than 1 insn to setup were working
fine since riscv_rtx_costs () was returning non-zero (although that
itself might need refining: see bugzilla 111139).

To keep testsuite parity, some V tests need updating which started failing
in the new costing regime.
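
The bail-out described in the commit message can be illustrated with a small
model. This is only a sketch, not the code in gcse.cc: the function names are
made up for illustration and the ratio of 10 is an assumed tuning value. Only
COSTS_N_INSNS (N) expanding to N * 4 mirrors GCC's own definition.

```c
#include <assert.h>

/* Mirrors GCC's COSTS_N_INSNS: the cost of N typical instructions.  */
#define COSTS_N_INSNS(n) ((n) * 4)

/* Hypothetical helper: how far (in insns) a candidate may be hoisted,
   scaled from its rtx cost.  RATIO is an assumed tuning parameter.  */
int
model_max_hoist_distance (int cost, int ratio)
{
  assert (cost >= 0 && ratio >= 0);
  return (ratio * cost) / 10;
}

/* Hypothetical helper: a candidate whose distance is zero is rejected,
   which is what happens when the backend reports a cost of 0.  */
int
model_want_to_hoist (int cost, int ratio)
{
  return model_max_hoist_distance (cost, ratio) != 0;
}
```

With a reported cost of 0 the model rejects the constant; with
COSTS_N_INSNS (1) it becomes a hoist candidate.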

[Bug target/111311] RISC-V regression testsuite errors with --param=riscv-autovec-preference=scalable

2023-11-02 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111311

--- Comment #9 from Vineet Gupta  ---
(In reply to Patrick O'Neill from comment #8)
> Updated regression list using r14-5070-g4ea36076d66 on rv64gcv:
> 
> Failure list from:
> https://github.com/patrick-rivos/gcc-postcommit-ci/issues/109

And just for completeness, we have this as starting point of investigation.

linux: rv64gc lp64d medlow            34/17   13/430/5
linux: rv64gcv lp64d medlow multilib  83/52   13/430/11

[Bug target/111311] RISC-V regression testsuite errors with --param=riscv-autovec-preference=scalable

2023-11-02 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111311

--- Comment #11 from Vineet Gupta  ---
(In reply to Robin Dapp from comment #10)
> As a general remark:  Some of those are present on other backends as well,
> some have been introduced by recent common-code changes and some are bogus
> test prerequisites or checks.

Is it possible to do this identification (so we can at least mark them xfail or
some such)? A lot of folks working on middle-end know this for certain, but for
the mere mortals every test failure seems just the same and equally important
:-)

[Bug tree-optimization/111791] RISC-V: Strange loop vectorizaion on popcount function

2023-10-18 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111791

--- Comment #5 from Vineet Gupta  ---
(In reply to Robin Dapp from comment #4)

> Analyzing loop at pr111791.c:8
> pr111791.c:8:25: note:  === analyze_loop_nest ===
> pr111791.c:8:25: note:   === vect_analyze_loop_form ===
> pr111791.c:8:25: note:=== get_loop_niters ===
> Matching expression match.pd:1919, generic-match-8.cc:27
> Applying pattern match.pd:1975, generic-match-2.cc:4670
> Matching expression match.pd:2707, generic-match-4.cc:36
> Matching expression match.pd:2710, generic-match-3.cc:53
> Matching expression match.pd:2717, generic-match-2.cc:23
> Matching expression match.pd:2707, generic-match-4.cc:36
> Matching expression match.pd:2710, generic-match-3.cc:53
> Matching expression match.pd:2717, generic-match-2.cc:23
> Matching expression match.pd:2707, generic-match-4.cc:36
> Matching expression match.pd:2710, generic-match-3.cc:53
> Matching expression match.pd:2717, generic-match-2.cc:23
> Matching expression match.pd:148, generic-match-10.cc:27
> Matching expression match.pd:148, generic-match-10.cc:27
> Applying pattern match.pd:4519, generic-match-4.cc:2923
> Applying pattern match.pd:201, generic-match-4.cc:3103
> Applying pattern match.pd:3393, generic-match-2.cc:182
> pr111791.c:8:25: note:   Symbolic number of iterations is (unsigned intD.4)
> __builtin_popcountlD.1952 (value_4(D))

Curious, how did you get this debug output - is this just one of -fdump-tree-?

[Bug target/111466] RISC-V: redundant sign extensions despite ABI guarantees

2023-09-28 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111466

Vineet Gupta  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2023-09-28
 Ever confirmed|0   |1

--- Comment #3 from Vineet Gupta  ---
(In reply to Vineet Gupta from comment #1)

> #2. At Expand time there's an explicit sign_extend for the incoming function
> arg which is not needed per RISC-V ABI. Not generating these to begin with
> will require less fixup needs in REE and/or CSE.
> 
> (insn 3 2 4 2 (set (reg/v:DI 141 [ n ])
> (reg:DI 11 a1 [ n ]))
> 
> (insn 12 6 13 2 (set (reg:DI 138 [ n.1_15 ])
> (sign_extend:DI (subreg/u:SI (reg/v:DI 141 [ n ]) 0)))

Robin and I debugged this at GNU Cauldron and he narrowed it down to subreg
promoted flag being cleared out which in turn causes the sign extend to be
generated. As a hack if the flag is restored the sign extend goes away. The
only issue is that flag clearing was introduced 30 years ago, albeit w/o any
additional commentary and/or test.

   commit 506980397227045212375e2dd2a1ae68a1afd481
   Author: Richard Kenner 
   Date:   Fri Jul 8 18:22:46 1994 -0400

   (expand_expr, case CONVERT_EXPR): If changing signedness and we have a
   promoted SUBREG, clear the promotion flag.

   From-SVN: r7686

Interestingly, reverting this change survives the rv64gc testsuite w/o any
additional failures, so this seems to work at least for RISC-V, but may not on
other arches/ABIs.

I've posted an RFC for people familiar with the code to chime in on this approach
[1]

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631641.html

[Bug target/111466] RISC-V: redundant sign extensions despite ABI guarantees

2023-09-28 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111466

--- Comment #2 from Vineet Gupta  ---
(In reply to Vineet Gupta from comment #1)

> #1. REE reports failure as "missing definition(s)".
> 
> This is because function args don't have an explicit def, they are just
> there.
> 
> Cannot eliminate extension:
> (insn 12 6 13 2 (set (reg:DI 16 a6 [orig:138 n.1_15 ] [138])
> (sign_extend:DI (reg:SI 11 a1 [orig:141 n ] [141])))  {extendsidi2}
>  (nil))
>  because of missing definition(s)

For addressing missing definition(s) there are a couple of approaches:

#1a. Try to use Ajit Agarwal's REE updates [1] which are supposed to use
defined ABI interfaces and identify incoming args or return values. 
  - however even the latest v8 series doesn't properly address the review
comments - it hard codes the {ZERO,SIGN}_EXTEND in REE w/o actually querying
the ABI
  - requires both src and dest hard regs be the same which is often not the
case. 
  - But we can certainly use some concepts from this patch.


#1b. Jeff suggested [2][3] inserting dummy sign_extend in REE for the
function args, which could be eliminated by REE.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630935.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630899.html
[3] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631543.html

[Bug target/111466] RISC-V: redundant sign extensions despite ABI guarantees

2023-09-27 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111466

--- Comment #1 from Vineet Gupta  ---
So there are various aspects to tackling this issue.

#1. REE reports failure as "missing definition(s)".

This is because function args don't have an explicit def, they are just there.

Cannot eliminate extension:
(insn 12 6 13 2 (set (reg:DI 16 a6 [orig:138 n.1_15 ] [138])
(sign_extend:DI (reg:SI 11 a1 [orig:141 n ] [141])))  {extendsidi2}
 (nil))
 because of missing definition(s)

#2. At Expand time there's an explicit sign_extend for the incoming function
arg which is not needed per RISC-V ABI. Not generating these to begin with will
require less fixup needs in REE and/or CSE.

(insn 3 2 4 2 (set (reg/v:DI 141 [ n ])
(reg:DI 11 a1 [ n ]))

(insn 12 6 13 2 (set (reg:DI 138 [ n.1_15 ])
(sign_extend:DI (subreg/u:SI (reg/v:DI 141 [ n ]) 0)))
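
The ABI guarantee behind point #2 can be sketched in C. This is a model under
stated assumptions, not compiler code: 64-bit registers are represented as
int64_t, the function names are hypothetical, and the uint32_t to int32_t
conversion relies on the usual two's-complement behaviour (implementation
defined in ISO C, modular on GCC).

```c
#include <assert.h>
#include <stdint.h>

/* Model of sext.w: sign-extend the low 32 bits of a 64-bit register.  */
int64_t
model_sext_w (int64_t reg)
{
  return (int64_t) (int32_t) (uint32_t) reg;
}

/* Model of how the RV64 psABI passes a 32-bit int: the caller already
   holds it in the register as a proper sign extension.  */
int64_t
model_abi_int_arg (int32_t n)
{
  return (int64_t) n;
}

/* Returns 1 if sign-extending an ABI-conforming argument again is a no-op.  */
int
model_sext_is_redundant (int32_t n)
{
  int64_t a1 = model_abi_int_arg (n);
  return model_sext_w (a1) == a1;
}

/* Small self-check over a few boundary values.  */
void
check_abi_sext_examples (void)
{
  assert (model_sext_is_redundant (0));
  assert (model_sext_is_redundant (-1));
  assert (model_sext_is_redundant (INT32_MAX));
  assert (model_sext_is_redundant (INT32_MIN));
}
```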

[Bug rtl-optimization/111467] New: REE failing to eliminate redundant extension due to multiple reaching def(s)

2023-09-18 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111467

Bug ID: 111467
   Summary: REE failing to eliminate redundant extension due to
multiple reaching def(s)
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: vineetg at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
CC: jeffreyalaw at gmail dot com, jivanhakobyan9 at gmail dot com,
kito at gcc dot gnu.org, palmer at gcc dot gnu.org
  Target Milestone: ---

For the trivial test case below (credit goes to Palmer for mentioning this
almost 2 years ago).

int
foo6(int a, int b)
{
  return a > b ? a : b;
}

-O2 -march=rv64gc

foo6:
mv  a5,a1
bge a1,a0,.L5
mv  a5,a0
.L5:
sext.w  a0,a5
ret

REE fails to eliminate the sign extension due to multiple reaching definitions
constraint.

I don't know how involved relaxing the constraint is, or what its runtime cost
would be, so opening this PR to investigate.

FWIW a zba build generates a max insn, eliminating the sext.w, but the vanilla
case shows where things can possibly be improved.
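
Why the sext.w is redundant here can be sketched with a register-level model
of the assembly above. The assumptions: 64-bit registers are modelled as
int64_t, both incoming arguments are already sign-extended as the RV64 ABI
requires, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of sext.w on a 64-bit register.  */
int64_t
model_sext32 (int64_t reg)
{
  return (int64_t) (int32_t) (uint32_t) reg;
}

/* Model of foo6 as currently generated, including the final sext.w.  */
int64_t
model_foo6_with_sext (int64_t a0, int64_t a1)
{
  int64_t a5 = a1;              /* mv  a5,a1     (first definition)  */
  if (!(a1 >= a0))              /* bge a1,a0,.L5                     */
    a5 = a0;                    /* mv  a5,a0     (second definition) */
  return model_sext32 (a5);     /* sext.w a0,a5                      */
}

/* Same sequence with the sext.w dropped.  */
int64_t
model_foo6_without_sext (int64_t a0, int64_t a1)
{
  int64_t a5 = a1;
  if (!(a1 >= a0))
    a5 = a0;
  return a5;
}

/* Both reaching definitions copy an already sign-extended value, so the
   two variants agree for every pair of ABI-conforming inputs.  */
void
check_foo6_examples (void)
{
  const int32_t v[] = { INT32_MIN, -7, 0, 3, INT32_MAX };
  for (int i = 0; i < 5; i++)
    for (int j = 0; j < 5; j++)
      assert (model_foo6_with_sext (v[i], v[j])
              == model_foo6_without_sext (v[i], v[j]));
}
```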

[Bug target/111466] New: RISC-V: redundant sign extensions despite ABI guarantees

2023-09-18 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111466

Bug ID: 111466
   Summary: RISC-V: redundant sign extensions despite ABI
guarantees
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: vineetg at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
CC: aagarwa at gcc dot gnu.org, jeffreyalaw at gmail dot com,
jivanhakobyan9 at gmail dot com, kito at gcc dot gnu.org,
palmer at gcc dot gnu.org
  Target Milestone: ---

Consider the test below:

int foo(int unused, int n, unsigned y, unsigned delta){
  int s = 0;
  unsigned int x = 0;// if int, sext elided
  for (;x [...]
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630811.html

[Bug rtl-optimization/111467] REE failing to eliminate redundant extension due to multiple reaching def(s)

2023-09-18 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111467

--- Comment #1 from Vineet Gupta  ---
(insn 8 4 11 2 (set (reg:SI 15 a5 [orig:137 b ] [137])  <--- DEF #1
(reg:SI 11 a1 [orig:136 b ] [136])) "max.c":12:20 207 {*movsi_internal}
 (nil))

(jump_insn 11 8 22 2 (set (pc)
(if_then_else (ge (reg/v:DI 11 a1 [orig:136 b ] [136])
(reg/v:DI 10 a0 [orig:135 a ] [135]))
(label_ref 13)
(pc))) "max.c":12:20 273 {*branchdi}
 (int_list:REG_BR_PROB 536870916 (nil))
 -> 13)

(insn 12 22 13 3 (set (reg:SI 15 a5 [orig:137 b ] [137])<--- DEF #2
(reg:SI 10 a0 [orig:135 a ] [135])) "max.c":12:20 207 {*movsi_internal}
 (nil))

(code_label 13 12 23 4 2 (nil) [1 uses])

(insn 19 14 20 4 (set (reg/i:DI 10 a0)  <-- USE: Multiple reaching DEFs
(sign_extend:DI (reg:SI 15 a5 [orig:137 b ] [137]))) "max.c":13:1 122
{extendsidi2}
 (nil))

[Bug target/109279] RISC-V: complex constants synthesized should be improved

2023-10-06 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109279

--- Comment #18 from Vineet Gupta  ---
(In reply to Vineet Gupta from comment #17)
> (In reply to Vineet Gupta from comment #16)
> > > Which is what this produces:
> > > ```
> > > long long f(void)
> > > {
> > >   unsigned t = 16843009;
> > >   long long t1 = t;
> > >   long long t2 = ((unsigned long long )t) << 32;
> > >   asm("":"+r"(t1));
> > >   return t1 | t2;
> > > }
> > > ```

> > li      a0,16842752
> > addi    a0,a0,257
> > li      a5,16842752
> > slli    a0,a0,32
> > addi    a5,a5,257
> > or      a0,a5,a0
> > ret
> 
> This is again IRA inflicted pain (similar to [PR110748]). 
> IRA seems to be undoing split1 since we have 2 insn sequences to synthesize
> the constant pieces. This explains why the problem got exacerbated with
> commit 0530254413f8 ("riscv: relax splitter restrictions for creating
> pseudos") since now different regs are used to create parts of const, vs 1
> reg being repeatedly used for assembling a const (fooling IRA's equivalent
> replacement logic).

After commit 
2023-08-18 a047513c9222 RISC-V: Enable pressure-aware scheduling by
default.  

the test above has improved.

li      a5,16842752
addi    a5,a5,257
mv      a0,a5
slli    a5,a5,32
or      a0,a0,a5
ret
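
As a sanity check on the arithmetic, the improved sequence can be replayed in
C. This is just a model of the five instructions with registers as uint64_t;
the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Replays the improved sequence step by step.  */
uint64_t
model_synthesize_const (void)
{
  uint64_t a5 = 16842752;   /* li   a5,16842752  -> 0x01010000 */
  a5 += 257;                /* addi a5,a5,257    -> 0x01010101 */
  uint64_t a0 = a5;         /* mv   a0,a5                      */
  a5 <<= 32;                /* slli a5,a5,32                   */
  a0 |= a5;                 /* or   a0,a0,a5                   */
  return a0;
}

/* The result is the constant the test case builds from t1 | t2.  */
void
check_synthesized_const (void)
{
  assert (model_synthesize_const () == 0x0101010101010101ULL);
}
```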

[Bug target/109279] RISC-V: complex constants synthesized should be improved

2023-10-06 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109279

--- Comment #19 from Vineet Gupta  ---
FWIW with today's change, splitter is now hidden from IRA, but we are still
getting the extraneous mv.

2023-10-06 c1bc7513b1d7 RISC-V: const: hide mvconst splitter from IRA

[Bug target/111139] New: RISC-V: improve scalar constants cost model

2023-08-24 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111139

Bug ID: 111139
   Summary: RISC-V: improve scalar constants cost model
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
CC: jeffreyalaw at gmail dot com, kito.cheng at gmail dot com,
palmer at dabbelt dot com
  Target Milestone: ---

The current const cost determination in riscv_rtx_costs () and its children
such as riscv_const_insns () needs improvements.

1. There are likely some inaccuracies in various heuristics.
2. Those heuristics are distributed in a bunch of places and would be better
consolidated.
3. We need to make const cost cpu/uarch tunable as hardware widgets like macro
fusions could amortize multi-insn const costs.


Some of the heuristics to cleanup/revisit:

1a. riscv_rtx_costs () returns 1 insn even if riscv_const_insns () returns > 1.

    case CONST:
      if ((cost = riscv_const_insns (x)) > 0)
        {
          if (cost == 1 && outer_code == SET)
            *total = COSTS_N_INSNS (1);
          else if (outer_code == SET || GET_MODE (x) == VOIDmode)
            *total = COSTS_N_INSNS (1);
        }

1b. riscv_const_insns () caps the const cost to 4 even if it is higher, with the
intent to force const pool. RV backend in general no longer favors const pools for
large constants since 2e886eef7f2b5a ("RISC-V: Produce better code with complex
constants [PR95632] [PR106602]"). This heuristic needs to be revisited.

1c. riscv_split_integer_cost () adds 2 to initial cost computed.

[Bug target/111139] RISC-V: improve scalar constants cost model

2023-08-25 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111139

--- Comment #2 from Vineet Gupta  ---
Test case to help drive some of this:

unsigned long long f5(unsigned long long i)
{
  return i * 0x0202020202020202ULL;
}

[Bug target/110748] RISC-V: optimize store of DF 0.0

2023-08-15 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748

--- Comment #16 from Vineet Gupta  ---
(In reply to Vineet Gupta from comment #15)

> On the branch devel/vineetg/optim-double-const-m0 I have double -0.0 working.
> 
> znd:
> li      a5,-1
> slli    a5,a5,63
> sd      a5,0(a0)
> ret
> 
> There's currently an ICE for zbs
> 
> IRA is undoing the split so the insn with const_int 0x8000_
> doesn't exist for final pass.
> 
> expand
> --
> (insn 6 3 0 2 (set (mem:DF (reg:DI 135)
> (const_double:DF -0.0 [-0x0.0p+0])) {*movdf_hardfloat_rv64}
> 
> split1
> -
> (insn 10 3 11 2 (set (reg:DI 136)
> (const_int [0x8000]))
> 
> (insn 11 10 0 2 (set (mem:DF (reg:DI 135)
> (subreg:DF (reg:DI 136) 0))
> 
> ira
> 
> (insn 11 9 12 2 (set (mem:DF (reg:DI 135)
> (const_double:DF -0.0 [-0x0.0p+0])) {*movdf_hardfloat_rv64}

So IRA is doing the equivalent replacement for a register which is referenced
exactly twice: set once and used once, w/o any reg pressure considerations [1].

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-August/627212.html

There seems to be no easy way around it.

[Bug target/109279] RISC-V: complex constants synthesized should be improved

2023-08-15 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109279

--- Comment #17 from Vineet Gupta  ---
(In reply to Vineet Gupta from comment #16)
> > Which is what this produces:
> > ```
> > long long f(void)
> > {
> >   unsigned t = 16843009;
> >   long long t1 = t;
> >   long long t2 = ((unsigned long long )t) << 32;
> >   asm("":"+r"(t1));
> >   return t1 | t2;
> > }
> > ```

> 
>   li      a0,16842752
>   addi    a0,a0,257
>   li      a5,16842752
>   slli    a0,a0,32
>   addi    a5,a5,257
>   or      a0,a5,a0
>   ret

This is again IRA inflicted pain (similar to [PR110748]). 
IRA seems to be undoing split1 since we have 2 insn sequences to synthesize the
constant pieces. This explains why the problem got exacerbated with commit
0530254413f8 ("riscv: relax splitter restrictions for creating pseudos") since
now different regs are used to create parts of const, vs 1 reg being repeatedly
used for assembling a const (fooling IRA's equivalent replacement logic).

[Bug target/112447] risc-v regression: FAIL: gcc.c-torture/execute/memset-3.c -O3

2023-11-08 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112447

--- Comment #4 from Vineet Gupta  ---
Created attachment 56541
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56541&action=edit
asm output nok

[Bug target/112447] risc-v regression: FAIL: gcc.c-torture/execute/memset-3.c -O3

2023-11-08 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112447

--- Comment #3 from Vineet Gupta  ---
Created attachment 56540
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56540&action=edit
asm output ok

[Bug target/112447] New: risc-v regression: FAIL: gcc.c-torture/execute/memset-3.c -O3

2023-11-08 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112447

Bug ID: 112447
   Summary: risc-v regression: FAIL:
gcc.c-torture/execute/memset-3.c   -O3
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: vineetg at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
CC: jeffreyalaw at gmail dot com, juzhe.zhong at rivai dot ai,
lehua.ding at rivai dot ai, rdapp at gcc dot gnu.org
  Target Milestone: ---

As reported in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111311#c8

we have following execute failures on trunk.

=== gcc: Unexpected fails for rv64gcv lp64d medlow ===
FAIL: gcc.c-torture/execute/memset-3.c   -O3 -g  execution test

The issue is an extraneous VSETVLI instruction (with wrong SEW) being generated
which creates wrong fill pattern for memset.

```
main:

[...]

.L36:  ; 2. loop start for @off 0 
vse8.v  v1,0(t3)
vse8.v  v1,0(t6)
vse8.v  v1,0(s1)
vse8.v  v3,0(a5)
...
; loop epilogue
li  a7,15
beq a4,a7,.L171
vsetvli zero,zero,e32,m2,ta,ma   <--- wrong
j   .L36
```

vsetvli pass dumps:

```
Phase 3: Reduce global vsetvl infos. 

  Compute LCM insert and delete data:

  Expr[2]: VALID (insn 2847, bb 3)
Demand fields: demand_sew_lmul demand_avl
SEW=8, VLMUL=mf2, RATIO=16, MAX_SEW=64
TAIL_POLICY=agnostic, MASK_POLICY=agnostic
AVL=(const_int 8 [0x8])
VL=(nil)

VSETVL infos after phase 3

  bb 3:
probability: always (guessed)
Header vsetvl info:VALID (insn 2847, bb 3) (deleted)  <---
  Demand fields: demand_sew_lmul demand_avl
  SEW=8, VLMUL=mf2, RATIO=16, MAX_SEW=64
  TAIL_POLICY=agnostic, MASK_POLICY=agnostic
  AVL=(const_int 8 [0x8])
  VL=(nil)
```

So it seems LCM is deleting the valid VSETVLI insn which later causes Phase 4
to insert a different/incorrect one.

When I revert the following commit, the issue goes away.

 2023-10-18 f0e28d8c1371 RISC-V: Fix failed hoist in LICM of vmv.v.x instruction

This at least tells us the cause of the issue; the next step is to fix it.
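
How a wrong SEW corrupts a memset fill pattern can be shown with a byte-level
model of a scalar-to-vector splat. This is only a sketch: model_splat is a
hypothetical helper that assumes little-endian element layout, and it does not
try to reproduce the exact register contents of the failing run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of vmv.v.x followed by a byte-wise store: fill DST
   with VALUE replicated as little-endian elements of SEW_BITS bits.  */
void
model_splat (uint8_t *dst, size_t nbytes, uint64_t value, unsigned sew_bits)
{
  assert (sew_bits == 8 || sew_bits == 16 || sew_bits == 32 || sew_bits == 64);
  size_t elt_bytes = sew_bits / 8;
  for (size_t i = 0; i < nbytes; i++)
    {
      /* Position of this byte inside its element selects which byte of
         VALUE lands here.  */
      size_t byte_in_elt = i % elt_bytes;
      dst[i] = (uint8_t) (value >> (8 * byte_in_elt));
    }
}
```

Splatting 0x41 with SEW=8 fills every byte with 0x41, which is what memset
needs. With SEW=32 only every fourth byte is 0x41 and the rest are zero.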

[Bug target/112447] risc-v regression: FAIL: gcc.c-torture/execute/memset-3.c -O3

2023-11-08 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112447

--- Comment #2 from Vineet Gupta  ---
Created attachment 56539
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56539&action=edit
manually reduced src

[Bug target/111311] RISC-V regression testsuite errors with --param=riscv-autovec-preference=scalable

2023-11-08 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111311

--- Comment #16 from Vineet Gupta  ---
(In reply to Patrick O'Neill from comment #8)
> Updated regression list using r14-5070-g4ea36076d66 on rv64gcv:
> 
> === gcc: Unexpected fails for rv64gcv lp64d medlow ===
> FAIL: gcc.c-torture/execute/memset-3.c   -O3 -fomit-frame-pointer
> -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
> FAIL: gcc.c-torture/execute/memset-3.c   -O3 -g  execution test

memset-3 failure tracked separately:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112447

[Bug target/112447] risc-v regression: FAIL: gcc.c-torture/execute/memset-3.c -O3

2023-11-08 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112447

--- Comment #6 from Vineet Gupta  ---
I have debugged this by single stepping in qemu 

when the test fails (first loop for offset 0, iteration 8)

The last VSETVLI is this one, 

   0x10a3e   0d107057  vsetvli  zero,zero,e32,m2,ta,ma
   0x10a42   j  0x10666

We eventually hit a vmv.v.x, which creates an invalid pattern due to e32.

   (gdb) info reg vtype
   vtype  0xd1  209 # SEW = 010’b / e32, LMUL = 001’b / m2
   (gdb) info reg vl
   vl 0x8   8
   (gdb) info reg a0
   a0 0x41  65

   vmv.v.x  v2,a0

  (gdb) info reg v2
  v2 {q = {0x41004100410041} 
  (gdb) info reg v3
  v2 {q = {0x41004100410041}
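
The vtype value reported by gdb can be decoded by hand. The sketch below
follows the vtype field layout of the ratified RVV 1.0 specification (vlmul
in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7); the struct and
function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of the low eight bits of vtype.  */
struct vtype_fields
{
  unsigned vlmul;     /* bits 2:0, 1 encodes LMUL=2 (m2)  */
  unsigned vsew;      /* bits 5:3, 2 encodes SEW=32 (e32) */
  unsigned vta;       /* bit 6, tail agnostic             */
  unsigned vma;       /* bit 7, mask agnostic             */
  unsigned sew_bits;  /* 8 << vsew, for vsew in 0..3      */
};

struct vtype_fields
decode_vtype (uint64_t vtype)
{
  struct vtype_fields f;
  f.vlmul = (unsigned) (vtype & 0x7);
  f.vsew = (unsigned) ((vtype >> 3) & 0x7);
  f.vta = (unsigned) ((vtype >> 6) & 0x1);
  f.vma = (unsigned) ((vtype >> 7) & 0x1);
  f.sew_bits = 8u << f.vsew;
  return f;
}

/* 0xd1 decodes to e32,m2,ta,ma, matching the offending vsetvli.  */
void
check_vtype_0xd1 (void)
{
  struct vtype_fields f = decode_vtype (0xd1);
  assert (f.sew_bits == 32 && f.vlmul == 1 && f.vta == 1 && f.vma == 1);
}
```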

[Bug target/112447] risc-v regression: FAIL: gcc.c-torture/execute/memset-3.c -O3

2023-11-08 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112447

Vineet Gupta  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2023-11-08
 Status|UNCONFIRMED |ASSIGNED

--- Comment #9 from Vineet Gupta  ---
(In reply to JuzheZhong from comment #7)
> Oh. I missed it:
> 
>   vmv.v.x v2,s0
>   vse8.v  v2,0(a5)
> 
> Leave it to me today. It should be simple fix.
> 
> Thanks for report it.

Can I request you to let me continue to debug and fix it? I want to familiarize
myself with the vsetvl pass and this seems like a good opportunity to do so
considering you think the fix is not hard.

[Bug target/109574] New: RISC-V: gcc.dg/pr90838.c failing due to extra ANDI 127 on releases/gcc-13

2023-04-20 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109574

Bug ID: 109574
   Summary: RISC-V: gcc.dg/pr90838.c failing due to extra ANDI 127
on releases/gcc-13
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
  Target Milestone: ---

int ctz3 (unsigned x)
{
  static int table[32] =
{
  0, 1, 2,24, 3,19, 6,25, 22, 4,20,10,16, 7,12,26,
  31,23,18, 5,21, 9,15,11,30,17, 8,14,29,13,28,27
};

  if (x == 0) return 32;
  x = (x & -x) * 0x04D7651F;
  return table[x >> 27];
}

riscv64-unknown-linux-gnu-gcc -O2 -march=rv64gc_zbb

Before

ctz3:
ctzw    a0,a0
ret

Now

ctz3:
ctzw    a0,a0
andi    a0,a0,127
ret

Bisected this to c23a9c87cc62bd177 ("Some additional zero-extension related
optimizations in simplify-rtx.")
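
The extra ANDI is redundant because a 32-bit count of trailing zeros is always
in the range 0..32, so masking with 127 cannot change it. A minimal sketch of
that argument follows. The function names are hypothetical, and model_ctz32
returns 32 for a zero input to match the convention of the ctz3 test case.

```c
#include <assert.h>
#include <stdint.h>

/* Portable count of trailing zeros for a 32-bit value; 32 for x == 0.  */
int
model_ctz32 (uint32_t x)
{
  int n = 0;
  if (x == 0)
    return 32;
  while ((x & 1u) == 0)
    {
      x >>= 1;
      n++;
    }
  return n;
}

/* Returns 1 if masking the result with 127 (the extra ANDI) is a no-op.  */
int
model_andi_127_is_noop (uint32_t x)
{
  int r = model_ctz32 (x);
  assert (r >= 0 && r <= 32);
  return (r & 127) == r;
}
```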

[Bug tree-optimization/106888] [RISCV] Negative optimization that excess andi instructions are generated in gcc.dg/pr90838.c

2023-04-20 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106888

--- Comment #3 from Vineet Gupta  ---
Debugging of ctz3 case

The insns of interest are:

insn_cost 4 for 6: r74:SI=ctz(r73:DI#0)
  REG_DEAD r73:DI
insn_cost 4 for 7: r72:DI=sign_extend(r74:SI)
  REG_DEAD r74:SI

Before the commit in question, combine is able to mush them

allowing combination of insns 6 and 7
original costs 4 + 4 = 8
replacement cost 8
deferring deletion of insn with uid = 6.
modifying insn i3 7: r72:DI=sign_extend(ctz(r76:DI#0))

With the commit in question, it takes the newly introduced code path

combine_simplify_rtx

   simplify_context::simplify_unary_operation_1
  case SIGN_EXTEND
+ if (val_signbit_known_clear_p
+ simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op));

   return expand_compound_operation (x);  // x is ZERO_EXTEND now

[Bug tree-optimization/106888] [RISCV] Negative optimization that excess andi instructions are generated in gcc.dg/pr90838.c

2023-04-21 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106888

--- Comment #9 from Vineet Gupta  ---
(In reply to Jeffrey A. Law from comment #6)
> Comment on attachment 54905 [details]
> proposed patch
> 
> So that's a subset of what we've done.  We initially thought that was going
> to be enough to solve this class of problems.   But it's actually deeper
> than just having a zero_extension variant of this pattern. 

Yeah it seems adding a new define_insn with zero_extend is not enough (nor is
the more elegant any_extend to existing "*disi2")

Thing is at expand time, we have gimple CTZ expand to ctz+sign_extend, so
adding zero_extend won't really help ?

(insn 6 3 7 2 (set (reg:SI 74)
(ctz:SI (subreg/s/u:SI (reg/v:DI 73 [ x ]) 0))) "pr90838-red.c":11:15
-1
 (nil))
(insn 7 6 8 2 (set (reg:DI 72 [  ])
(sign_extend:DI (reg:SI 74))) "pr90838-red.c":11:15 -1
 (nil))


> I'll officially submit the zero_extension pattern and the match.pd bits. 
> The other pattern we wrote is fugly and I'd like to look at it one more time.

But that other pattern is needed for combine to fuse them together.

[Bug tree-optimization/106888] [RISCV] Negative optimization that excess andi instructions are generated in gcc.dg/pr90838.c

2023-04-21 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106888

--- Comment #7 from Vineet Gupta  ---
(In reply to Roger Sayle from comment #5)
> Created attachment 54905 [details]
> proposed patch
> 
> This patch should fix this problem, by adding another pattern the machine
> description to also recognize zero_extend of clz/ctz/pcnt, matching the
> current pattern that only matches sign_extend.  Clearly for SI operands, the
> result must always be 0..32, so sign extension and zero extension are
> equivalent, and the zero extension is perhaps (now) the preferred canonical
> form.

Thx for the patch Roger, but as Jeff noted, it alone is not enough and
generates same extra ANDI. Would you have expected combine to recog() the new
pattern ?

[Bug target/110748] RISC-V: optimize store of DF 0.0

2023-07-20 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748

--- Comment #3 from Vineet Gupta  ---
Indeed the constraint already exists

(define_insn "*movdf_hardfloat_rv64"
 [(set (match_operand:DF 0 "nonimmediate_operand" "=f,f,f,m,m,*f,*r, *r,*r,*m")
                                                            ^^
   (match_operand:DF 1 "move_operand" " f,G,m,f,G,*r,*f,*r*G,*m,*r")
                                                ^^
 )]

At expand time: gen_movdf() -> riscv_legitimize_move forces a reg, as
reg_or_0_operand () returns false.

Breakpoint 7, riscv_legitimize_move (mode=E_DFmode, dest=0x76db9af8,
src=0x76c0c050) at ../../gcc/gcc/config/riscv/riscv.cc:2162
2162    {

(gdb) call debug_rtx(dest)
(mem:DF (reg/v/f:DI 134 [ d ]) [1 *d_2(D)+0 S8 A64])
(gdb) call debug_rtx(src)
(const_double:DF 0.0 [0x0.0p+0])

2232    if (!register_operand (dest, mode) && !reg_or_0_operand (src, mode))
(gdb) n
2257    reg = force_reg (mode, src);
(gdb) 
2258    riscv_emit_move (dest, reg);
(gdb) 
2259    return true;

While for int 0, reg_or_0_operand returns true.

[Bug target/110748] RISC-V: optimize store of DF 0.0

2023-07-20 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748

Vineet Gupta  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2023-07-20
 Status|UNCONFIRMED |ASSIGNED

[Bug target/110748] New: optimize store of DF 0.0

2023-07-20 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748

Bug ID: 110748
   Summary: optimize store of DF 0.0
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
  Target Milestone: ---
Target: RISC-V

Currently a store of int 0 is optimized by using reg x0.

void zi(int *i) {*i = 0;}

-O2 -march=rv64gc

  sw  zero,0(a0)
  ret

However a store of positive DF 0.0 generates 2 insns.

void zd(double *d) { *d = 0.0;  }

  fmv.d.x fa5,zero
  fsd fa5,0(a0)
  ret

Since +0.0 is all zero bits, this could be generated as an int store
   sd zero, 0(a0) 

This is 1 less insn and avoids the FPU thus overall a win.

This came up when discussing an ICE in a newly proposed pass f-m-o by Manolis.
Turns out that it could be an independent optimization opportunity [1].

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624935.html
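
The claim that +0.0 is all zero bits can be checked with a short C sketch. It
assumes double is IEEE 754 binary64, which holds on RISC-V; the function name
is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store an all-zero 64-bit integer into the storage of a double, which is
   what a single integer store of x0 would do.  */
void
model_store_zero_as_int (double *d)
{
  uint64_t zero_bits = 0;
  assert (sizeof (*d) == sizeof (zero_bits));
  memcpy (d, &zero_bits, sizeof (zero_bits));
}
```

After the call the double compares equal to 0.0 and its bit pattern matches
that of the literal 0.0.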

[Bug target/110748] RISC-V: optimize store of DF 0.0

2023-07-20 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748

--- Comment #8 from Vineet Gupta  ---
(In reply to Jeffrey A. Law from comment #5)
> I'd bet it's const_0_operand not allowing CONST_DOUBLE.

Correct.

> The question is what unintended side effects we'd have if we allowed
> CONST_DOUBLE 0.0 in const_0_operand.

Exactly. I had the same concern. 
I do have a hack which creates a new predicate and that seems to do the trick.

+(define_predicate "const_0hf_operand"
+  (and (match_code "const_double")
+   (match_test "op == CONST0_RTX (GET_MODE (op))")))
+
+(define_predicate "reg_or_0_operand_inc_hf"
+  (ior (match_operand 0 "reg_or_0_operand")
+   (match_operand 0 "const_0hf_operand")))

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc

-  if (!register_operand (dest, mode) && !reg_or_0_operand (src, mode))
+  if (!register_operand (dest, mode) && !reg_or_0_operand_inc_hf (src, mode))
 {

And it seems to be generating the desired int 0 for double 0.0.

However to Kito's point, this indeed works in gcc 12 so I first need to bisect
what regressed it in 13.

[Bug target/110748] RISC-V: optimize store of DF 0.0

2023-07-20 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748

--- Comment #9 from Vineet Gupta  ---
(In reply to Vineet Gupta from comment #8)
> (In reply to Jeffrey A. Law from comment #5)
> > I'd bet it's const_0_operand not allowing CONST_DOUBLE.
> 
> Correct.
> 
> > The question is what unintended side effects we'd have if we allowed
> > CONST_DOUBLE 0.0 in const_0_operand.
> 
> Exactly. I had the same concern. 

[...]

> However to Kito's point, this indeed works in gcc 12 so I first need to
> bisect what regressed it in 13.

The mystery is solved. Guess what: it was my change ef85d150b5963 ("RISC-V:
Enable TARGET_SUPPORTS_WIDE_INT") in gcc-13 cycle which made the booboo.

+* config/riscv/predicates.md (const_0_operand): Remove
+const_double.

And I don't recall why I did that part. But I guess reinstating it won't
be that radical, since it was in tree for a while. I'll throw it at full
testsuite to see if there are any fallouts.

[Bug rtl-optimization/110423] Redundant constants not getting eliminated on RISCV.

2023-07-07 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110423

--- Comment #3 from Vineet Gupta  ---
(In reply to Jeffrey A. Law from comment #2)

> This is derived heavily from Click's work in the 90s. 
> This would happen in gimple most likely, though I guess one could do it in
> RTL if they have a high pain threshold.

If a gimple pass, it won't help catch the late reload induced
rematerializations, which is seen on a lot of SPEC workloads, e.g. cactu for
stack addressing. Although I guess Manolis' fold const offset pass patch would
help things a bit.

> Click's paper is much more general, but the same concepts apply.  His paper
> doesn't cover anything like bifurcating the graph (thus allowing multiple
> constant loads in an effort to reduce undesired speculation or register
> allocation conflicts).
> 
> We might be able to get away with this precisely because these are constant
> loads and thus subject to rematerialization later if register pressure is
> high.
> 
> https://courses.cs.washington.edu/courses/cse501/06wi/reading/click-pldi95.
> pdf

The prospect of implementing Cliff's Global Value Numbering is very exciting,
however I would like to start small. Started digging into gcse.cc Hoist pass,
granted this is still pre-reload. It seems Hoist has some global redundancy
elimination capabilities for constants, added by Maxim Kuvyrkov back in 2010. I
need to see what it can and can not do.

[Bug target/110748] RISC-V: optimize store of DF 0.0

2023-07-28 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748

--- Comment #15 from Vineet Gupta  ---
(In reply to Vineet Gupta from comment #12)
> > void znd(double *d) { *d = -0.0; }
> > void znf(float *f)  { *f = -0.0; }

We need 3 sets of changes to get const -0.0 working:

1. Allow expand to generate set mem const_double -0.0
   - rtx cost adj so compress_float_constant () doesn't force_const_mem ()
   - riscv_legitimize_move to allow -0.0 and not force a reg

2. Allow subsequent passes to recog() this set mem const_double -0.0 
   - Beef up "*movdf_hardfloat_rv64" with additional condition check for -0.0 
   - Add a new constraint for -0.0

3. Add a splitter (for split1, not combine) to generate the int reg

On the branch devel/vineetg/optim-double-const-m0 I have double -0.0 working.

znd:
li      a5,-1
slli    a5,a5,63
sd      a5,0(a0)
ret

There's currently an ICE for zbs

IRA is undoing the split so the insn with const_int 0x8000_ doesn't
exist for final pass.

expand
--
(insn 6 3 0 2 (set (mem:DF (reg:DI 135)
(const_double:DF -0.0 [-0x0.0p+0])) {*movdf_hardfloat_rv64}

split1
-
(insn 10 3 11 2 (set (reg:DI 136)
(const_int [0x8000]))

(insn 11 10 0 2 (set (mem:DF (reg:DI 135)
(subreg:DF (reg:DI 136) 0))

ira

(insn 11 9 12 2 (set (mem:DF (reg:DI 135)
(const_double:DF -0.0 [-0x0.0p+0])) {*movdf_hardfloat_rv64}

[Bug target/110748] RISC-V: optimize store of DF 0.0

2023-07-21 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748

--- Comment #10 from Vineet Gupta  ---
The fix for handling +0.0 is posted to list - really trivial.

https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625217.html

[Bug target/110748] RISC-V: optimize store of DF 0.0

2023-07-21 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748

--- Comment #11 from Vineet Gupta  ---
There's a variation which can be optimized as well and seems non trivial to
implement

Now it is the negative constant -0.0 to be stored to mem. In bit notation this
has a single sign bit set thus can be optimized using a bseti if rv64gc_zbs.

void znd(double *d) { *d = -0.0; }
void znf(float *f)  { *f = -0.0; }

llvm optimizes these to

znd(double*):
bseti   a1, zero, 63
sd  a1, 0(a0)
ret

znf(float*):
lui a1, 0x8
sw  a1, 0(a0)
ret

While gcc resorts to constant pool for both

lui a5,%hi(.LANCHOR0)
fld fa5,%lo(.LANCHOR0)(a5)
fsd fa5,0(a0)
ret
.set    .LANCHOR0,. + 0
.LC0:
.word   0
.word   -2147483648
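
To make the bit-level argument concrete, here is a small host-side sketch (not part of the bug report; the helper names are made up for illustration). It checks that -0.0 as a double is just bit 63 set, which is what "bseti a1, zero, 63" produces and what the constant pool words above (.word 0, .word -2147483648) spell out, and that -0.0f is bit 31 set. It assumes IEEE-754 binary64/binary32 for double/float.

```c
#include <stdint.h>
#include <string.h>

/* Raw IEEE-754 bits of a double; memcpy avoids strict-aliasing problems. */
uint64_t double_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Raw IEEE-754 bits of a float. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Value produced by "bseti rd, zero, 63": zero with only bit 63 set. */
uint64_t bseti_zero_63(void)
{
    return UINT64_C(1) << 63;
}
```

So the store of -0.0 only needs an integer register holding a single set bit; no FP load from the constant pool is required in principle.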

[Bug target/110748] RISC-V: optimize store of DF 0.0

2023-07-21 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748

--- Comment #12 from Vineet Gupta  ---
> void znd(double *d) { *d = -0.0; }
> void znf(float *f)  { *f = -0.0; }

The rough approach I'm thinking of is to 

(1) Introduce a constraint for -0.0 and perhaps a predicate as well for
"*movdf_hardfloat_rv64". That way df expander can elide the const pool.

(2) Add a combiner pattern to recog a set of -0.0 to bit set, which would be
automatically optim to BSETI if zbs is passed.

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2023-08-04 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

Vineet Gupta  changed:

   What|Removed |Added

 CC||vineetg at gcc dot gnu.org

--- Comment #11 from Vineet Gupta  ---
Revisited this with gcc-13.

The reduced test case no longer shows the extraneous LI 4096 (although the full
test still does). 

The key here is -funroll-loops which is needed for original issue to show as
well.

This came with the middle-end update:

commit 19295e8607da2f743368fe6f5708146616aafa91
Author: Richard Biener 
Date:   Mon Oct 24 09:51:32 2022 +0200

tree-optimization/100756 - niter analysis and folding

niter analysis, specifically the part trying to simplify the computed
maybe_zero condition against the loop header copying condition, is
confused by us now simplifying

  _15 = n_8(D) * 4;
  if (_15 > 0)

to

  _15 = n_8(D) * 4;
  if (n_8(D) > 0)

which is perfectly sound at the point we do this transform.  One
solution might be to involve ranger in this simplification, another
is to be more aggressive when expanding expressions - the condition
we try to simplify is _15 > 0, so all we need is expanding that
to n_8(D) * 4 > 0.

[Bug rtl-optimization/110423] Redundant constants not getting eliminated on RISCV.

2023-06-26 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110423

--- Comment #1 from Vineet Gupta  ---
Created attachment 55402
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55402&action=edit
test case (but needs reverting upstream 6508d5e5a1a)

[Bug rtl-optimization/110423] New: Redundant constants not getting eliminated on RISCV.

2023-06-26 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110423

Bug ID: 110423
   Summary: Redundant constants not getting eliminated on RISCV.
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
  Target Milestone: ---

Redundant constants, across basic blocks, don't seem to be eliminated robustly
by gcc. I'd reported this last year [1] and this is just recapturing that info
as a bugzilla PR.

[1] https://gcc.gnu.org/pipermail/gcc/2022-October/239645.html

When analyzing coremark build for RISC-V, noticed redundant constants 
not being eliminated. While this is a recurrent issue with RV, this 
specific instance is not unique to RV as I can trigger similar output on 
aarch64 with -fno-if-conversion, hence something which could be 
addressed in common passes.

-O3 -march=rv64gc_zba_zbb

crcu8:
xor     a3,a0,a1
andi    a3,a3,1
srli    a4,a0,1
srli    a5,a1,1
beq     a3,zero,.L2

li      a3,-24576   # 0x_A000
addi    a3,a3,1     # 0x_A001
xor     a5,a5,a3
zext.h  a5,a5

.L2:
xor     a4,a4,a5
andi    a4,a4,1
srli    a3,a0,2
srli    a5,a5,1
beq     a4,zero,.L3

li      a4,-24576   # 0x_A000
addi    a4,a4,1     # 0x_A001
xor     a5,a5,a4
zext.h  a5,a5

.L3:
xor     a3,a3,a5
andi    a3,a3,1
srli    a4,a0,3
srli    a5,a5,1
beq     a3,zero,.L4

li      a3,-24576   # 0x_A000
addi    a3,a3,1     # 0x_A001
[...]

.L8:
andi    a3,a5,1
srli    a0,a5,1
beq     a3,a4,.L9
li      a5,-24576   # 0x_A000
addi    a5,a5,1     # 0x_A001
xor     a0,a0,a5
slli    a0,a0,48
srli    a0,a0,48
.L9:
ret


cse can't handle this: as explained by Jeff in [2] EBB can have jumps out but
not jumps in, which misses the cfg paths needed to be traversed to find the
equivalents.

[2] https://gcc.gnu.org/pipermail/gcc/2022-October/239646.html

Note that since gcc 13.1, this specific test generates different code since the
match.pd change 6508d5e5a1a ("match.pd: rewrite select to branchless
expression") now removes the branches and the arithmetic needing the large
const. 
But this test is very convenient, so I'm continuing to use it and just revert
the match.pd change in my local gcc build.
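
For readers without the attachment, a function of roughly the following shape produces this kind of code. This is only a sketch modelled on CoreMark's crcu8 (a bitwise CRC-16 step), written in a simplified equivalent form; it is not the attached test case. Once the 8-iteration loop is fully unrolled, every conditional block needs the same 0xA001 constant, which is what gets rematerialized per basic block in the listing above.

```c
#include <stdint.h>

/* Bitwise CRC-16 update for one input byte (reflected polynomial 0xA001).
   Simplified form of CoreMark's crcu8; an approximation, not the attachment. */
uint16_t crcu8(uint8_t data, uint16_t crc)
{
    for (int i = 0; i < 8; i++) {
        /* low bit of data XOR low bit of crc decides whether to apply the poly */
        uint8_t x16 = (uint8_t)((data & 1) ^ (crc & 1));
        data >>= 1;
        crc >>= 1;
        if (x16)
            crc ^= 0xA001;  /* the large constant seen in every block */
    }
    return crc;
}
```

Each "if (x16)" body becomes its own basic block after unrolling, so the constant is needed in eight places that no single extended basic block covers.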

[Bug target/109279] RISC-V: complex constants synthesized should be improved

2023-05-19 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109279

--- Comment #16 from Vineet Gupta  ---
> Which is what this produces:
> ```
> long long f(void)
> {
>   unsigned t = 16843009;
>   long long t1 = t;
>   long long t2 = ((unsigned long long )t) << 32;
>   asm("":"+r"(t1));
>   return t1 | t2;
> }
> ```
> I suspect: 0x0080402010080400ULL should be done as two 32bit with a shift/or
> added too. Will definitely improve complex constants forming too.
> 
> Right now the backend does (const<<16+const)<<16+const... which is just so
> bad.

Umm this testcase is a different problem. It used to generate the same output
but no longer after g2e886eef7f2b5a and the other related updates:
g0530254413f8 and gc104ef4b5eb1.

For the test above, the low and high words are created independently and then
stitched.

260r.dfinit

# lower word

(insn 6 2 7 2 (set (reg:DI 138)
(const_int [0x101]))  {*movdi_64bit}
(insn 7 6 8 2 (set (reg:DI 137)
(plus:DI (reg:DI 138)
(const_int [0x101]))) {adddi3}
 (expr_list:REG_EQUAL (const_int [0x1010101]) )
(insn 5 8 9 2 (set (reg/v:DI 134 [ t1 ])
(reg:DI 136 [ t1 ])) {*movdi_64bit}

# upper word (created independently, no reuse from prior values)

(insn 9 5 10 2 (set (reg:DI 141)
(const_int [0x101]))  {*movdi_64bit}
(insn 10 9 11 2 (set (reg:DI 142)
(plus:DI (reg:DI 141)
(const_int [0x101]))) {adddi3}
(insn 11 10 12 2 (set (reg:DI 140)
(ashift:DI (reg:DI 142)
(const_int 32 [0x20]))) {ashldi3}
(expr_list:REG_EQUAL (const_int [0x1010101]))

# stitch them
(insn 12 11 13 2 (set (reg:DI 139)
(ior:DI (reg/v:DI 134 [ t1 ])
(reg:DI 140))) "const2.c":7:13 99 {iordi3}


cse1 matches the new "*mvconst_internal" pattern independently on each of them 

(insn 7 6 8 2 (set (reg:DI 137)
(const_int [0x1010101])) {*mvconst_internal}
(expr_list:REG_EQUAL (const_int [0x1010101])))

(insn 11 10 12 2 (set (reg:DI 140)
(const_int [0x1010101_])) {*mvconst_internal}
(expr_list:REG_EQUAL (const_int   
[0x1010101_]) ))

This ultimately gets in the way, as otherwise it would find the equivalent reg
across the 2 snippets and reuse reg.

It is interesting that due to same pattern, split1 undoes what cse1 did so in
theory cse2 (?) could redo it. Anyhow needs to be investigated. But ATM we
have the following codegen for the aforementioned test which clearly needs more
work.

li      a0,16842752
addi    a0,a0,257
li      a5,16842752
slli    a0,a0,32
addi    a5,a5,257
or      a0,a5,a0
ret
ret

[Bug target/109279] RISC-V: complex constants synthesized should be improved

2023-05-19 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109279

Vineet Gupta  changed:

   What|Removed |Added

 CC||vineetg at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #15 from Vineet Gupta  ---
(In reply to Andrew Pinski from comment #6)
> Take:
> long long f(void)
> {
>   return 0x0101010101010101ull;
> }
> 
>  CUT 
> This should be done as:
> li      a0,16842752
> addi    a0,a0,257
> slli    a1,a0,32
> or      a0,a0,a1

I committed a fix today which gets us to exactly that [1]

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618948.html
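
As a sanity check of the 4-insn sequence quoted above, the same arithmetic can be written out in C. This is just an illustrative sketch (the function name is made up): build the 32-bit half once, then shift and OR it with itself.

```c
#include <stdint.h>

/* Mirror of the quoted sequence, one statement per instruction. */
uint64_t synth_0101_pattern(void)
{
    uint64_t lo = 16842752;   /* li   a0,16842752   -> 0x1010000 */
    lo += 257;                /* addi a0,a0,257     -> 0x1010101 */
    uint64_t hi = lo << 32;   /* slli a1,a0,32 */
    return lo | hi;           /* or   a0,a0,a1 */
}
```

The point is that the upper and lower words are the same value, so the lower word's register can be reused instead of synthesizing the constant twice.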

[Bug target/113570] RISC-V: SPEC2017 549 fotonik3d miscompilation in autovec VLS 256 build

2024-01-23 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113570

--- Comment #1 from Vineet Gupta  ---
This one is a headache as we don't know where the problem is, and it takes
~7hr for a QEMU run to finish.

The good thing is that there is a comparison point, as the VLA build works fine.

(1). bloat-o-meter (from Linux kernel) to diff the VLS (nok) and VLA (ok)
builds.

  Function                                       old     new   delta
  init_                                         6722    6752     +30
  __huygens_mod_MOD_huygense                   17078   17990    +912
  __huygens_mod_MOD_huygensh                   14412   15614   +1202
  __huygens_mod_MOD_uin.isra                    2922    2944     +22
  __material_mod_MOD_mat_updatee                4228    4272     +44

  __mur_mod_MOD_mur_init                        9054    9162    +108
  __mur_mod_MOD_mur_storee                      2546    2446    -100
  __mur_mod_MOD_mur_updatee                    10124   10354    +230

  __pec_mod_MOD_pec_init                        8522    8046    -476

  __plane_source_mod_MOD_plane_source_init      6942    7072    +130

  __power_mod_MOD___copy_power_mod_Powertyp       14      26     +12
  __power_mod_MOD_power_dft                     1280    1156    -124
  __power_mod_MOD_power_init                    9830    9994    +164
  __power_mod_MOD_power_print                   2304    1556    -748

  __upml_mod_MOD_upml_allocate.isra            19384   19614    +230
  __upml_mod_MOD_upml_init                     25564   26178    +614
  __upml_mod_MOD_upml_set_eps_arrays.isra       5406    6356    +950
  __upml_mod_MOD_upml_updatee                  32516   27130   -5386
  __upml_mod_MOD_upml_updatee_simple           36112   29612   -6500
  __upml_mod_MOD_upml_updateh                  15962   15992     +30

  writeout_                                    10856   11002    +146

(2). Assuming the issue is in one of those above (which may not be true),
manually rebuild, changing build flags to VLA, one module at a time, relink,
rerun qemu and compare output.

   - This resulted in power.fppized.f90 as the culprit

(3) Manually split up the power module into multiple files - one function at a
time and do the same exercise to identify the function.

[Bug target/113570] New: RISC-V: SPEC2017 549 fotonik3d miscompilation in autovec VLS 256 build

2024-01-23 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113570

Bug ID: 113570
   Summary: RISC-V: SPEC2017 549 fotonik3d miscompilation in
autovec VLS 256 build
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
CC: juzhe.zhong at rivai dot ai, kito.cheng at gmail dot com,
law at gcc dot gnu.org, palmer at dabbelt dot com,
patrick at rivosinc dot com, rdapp at gcc dot gnu.org
  Target Milestone: ---

fotonik3d runs to completion BUT output results fail vs. reference output.

specperl specdiff  -m -l 10 --abstol 1e-27 --reltol 1e-10 --obiwan
--floatcompare  benchspec/CPU/549.fotonik3d_r/data/refrate/output/pscyee.out
pscyee.out

0646:   -1.91273086037953E-17, -1.46491401919706E-15,
-1.91273086057460E-17, -1.46491401919687E-15,
^
0668:   -1.91251317582607E-17, -1.42348205527085E-15,
-1.91251317602571E-17, -1.42348205527068E-15,
^
0690:   -1.91228927083786E-17, -1.38431570180230E-15,
-1.91228927104223E-17, -1.38431570180212E-15,
^
0712:   -1.91205914533895E-17, -1.34723370999236E-15,
-1.91205914554988E-17, -1.34723370999214E-15,
^
0734:   -1.91182279925531E-17, -1.31207366208699E-15,
-1.91182279947287E-17, -1.31207366208678E-15,
^
0756:   -1.91158023250692E-17, -1.27868958770083E-15,
-1.91158023272966E-17, -1.27868958770060E-15,
^
0778:   -1.91133144501623E-17, -1.24694993327743E-15,
-1.91133144524329E-17, -1.24694993327713E-15,
^
0800:   -1.91107643669701E-17, -1.21673582512221E-15,
-1.91107643693121E-17, -1.21673582512195E-15,
^
0822:   -1.91081520746626E-17, -1.18793957765398E-15,
-1.91081520769832E-17, -1.18793957765367E-15,
^
0844:   -1.91054775723835E-17, -1.16046340741478E-15,
-1.91054775748087E-17, -1.16046340741452E-15,
^

Build flags: -Ofast -fno-lto -static -march=rv64gcv_zvl256b_zba_zbb_zbs_zicond
-ftree-vectorize --param=riscv-autovec-preference=scalable
-fallow-argument-mismatch -fmax-stack-var-size=65536

QEMU_CPU=rv64,v=true,vlen=256,vext_spec=v1.0,Zve32f=true,Zve64f=true,zba=true,zbb=true,zbs=true,zicond=true
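
To see why specdiff flags the first column but not the second, here is a rough sketch of a tolerance check. It assumes that a value passes if it is within either the absolute or the relative tolerance; that is my reading of the --abstol/--reltol options, not something taken from the specdiff sources, and the function name is made up.

```c
#include <stdbool.h>

/* Assumed rule: accept if |ref - got| is within abstol OR within reltol * |ref|. */
bool within_tol(double ref, double got, double abstol, double reltol)
{
    double diff = ref - got;
    if (diff < 0)
        diff = -diff;
    double mag = ref < 0 ? -ref : ref;
    if (diff <= abstol)
        return true;
    return diff <= reltol * mag;
}
```

With the values from line 0646, the first pair differs by about 1.95e-27 (relative error about 1.02e-10), so it misses both 1e-27 and 1e-10, while the second pair differs by about 1.9e-28 and passes on abstol. So the miscompare is tiny but real under the reference tolerances.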

[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc

2023-12-22 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087

--- Comment #18 from Vineet Gupta  ---
(In reply to JuzheZhong from comment #17)
> PLCT told me they passed with zvl256b.
> 
> I always run SPEC with FIXED-VLMAX since we always care about peak
> performance
> on our board.

Sure we all have our preferred peak performance configs. But the compiler needs
to work for all vendors' configs. So as a test, can you try a scalable build
run at your end to at least see if you can see those issues ?

[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc

2023-12-22 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087

--- Comment #16 from Vineet Gupta  ---
(In reply to JuzheZhong from comment #15)
> Currently, we don't have much run FAIL and ICE left in full coverage testing.
> 
> I suspect it is very corner case in SPEC.
> 
> You don't have to debug it. Just need to give me a preprocessed source file.
> 
> Like this:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110560
> 
> You can see google highway folks attachment is very big but I still can fix
> the issue as long as you can give me some sources that I can reproduce the
> issues.

As I mentioned already these are runtime failure mismatches, so we don't know
where the issue is and thus no reduced test case. 

FWIW I could/would have debugged the gcc code if I had a reduced test.

So we need to dig down into the guts of the benchmark and see where the output is
generated, checkpoint, and so on.

The other approach is to try to "defeature" autovec and see if that can point to
broad areas (in backend/middle-end) where the issue could be.
e.g.
  - simple vs. lazy vsetvl
  - disabling reductions etc.

BTW I'm surprised you are not seeing these as there is nothing rivos specific
here. Are you running the full SPEC suite, including Fortran / Float workloads?

[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc

2023-12-22 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087

Vineet Gupta  changed:

   What|Removed |Added

 CC||vineetg at gcc dot gnu.org

--- Comment #13 from Vineet Gupta  ---
(In reply to JuzheZhong from comment #12)
> (In reply to Patrick O'Neill from comment #11)
> > (In reply to Patrick O'Neill from comment #10)
> > > I've kicked off 2 spec runs (zvl 128 and 256) using r14-6765-g4d9e0f3f211.
> > > I'll let you know the results when they finish.
> > 
> > My terminal crashed - so these are partial results:
> > zvl256: 3 runtime failures
> > 531.deepsjeng
> > ???
> > ???
> > 
> > zvl128: 1 runtime failure
> > 527.cam4_r
> > 
> > If I had to guess I would say the 2 ??? fails are the existing 521/549.
> 
> You mean those 2 cases are still failing?
> Do you have any ideas to locate those FAIL and extract them as a simple case?

> zvl128 / no vl: 1 runtime failure
> 527.cam4_r

Yes this still remains. It is hard to debug (for me at least) as this is
fortran.

However this goes away if simple_vsetvl is used (with -Ofast for rest of
build) - using [1]

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-December/641342.html

[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc

2023-12-22 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087

--- Comment #14 from Vineet Gupta  ---
(In reply to Vineet Gupta from comment #13)
> (In reply to JuzheZhong from comment #12)
> > (In reply to Patrick O'Neill from comment #11)
> > > (In reply to Patrick O'Neill from comment #10)
> > > > I've kicked off 2 spec runs (zvl 128 and 256) using 
> > > > r14-6765-g4d9e0f3f211.
> > > > I'll let you know the results when they finish.
> > > 
> > > My terminal crashed - so these are partial results:
> > > zvl256: 3 runtime failures
> > > 531.deepsjeng
> > > ???
> > > ???

At least 549.fotonik3d runtime failure with vl256 remains even with
simple_vsetvl.

[Bug target/105733] riscv: Poor codegen for large stack frames

2023-12-15 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105733

Vineet Gupta  changed:

   What|Removed |Added

 CC||vineetg at gcc dot gnu.org

--- Comment #4 from Vineet Gupta  ---
There have been good improvements in gcc codegen, especially with the commit below.

commit 6619b3d4c15cd754798b1048c67f3806bbcc2e6d
Author: Jivan Hakobyan 
Date:   Wed Aug 23 14:10:30 2023 -0600

Improve quality of code from LRA register elimination

This is primarily Jivan's work, I'm mostly responsible for the write-up and
coordinating with Vlad on a few questions.

On targets with limitations on immediates usable in arithmetic
instructions,
LRA's register elimination phase can construct fairly poor code.

 Tip W/o commit 6619b3d4c    | With 6619b3d4c
                             |
foo:                         | foo:
    li      t0,-4096         |     li      t0,-4096
    addi    t0,t0,2032       |     addi    t0,t0,2032
    li      a5,0             |
    li      a4,0             |
    add     sp,sp,t0         |     add     sp,sp,t0
    add     a4,a4,a5         |
    add     a5,a4,sp         |     add     a5,a5,a0
    add     a5,a5,a0         |
    li      t0,4096          |     li      t0,4096
    sb      zero,0(a5)       |     sb      zero,0(a5)
    addi    t0,t0,-2032      |     addi    t0,t0,-2032
    add     sp,sp,t0         |     add     sp,sp,t0
    jr      ra               |     jr      ra

We still have the weird LUI 4096 based constant construction. I have a patch to
avoid 4096 for certain ranges  [-4096,-2049] or [2048,4094] (cribbed from
llvm).
e.g. 2064 = 2047 + 17 and we could potentially "spread" the 2 parts over 2 adds
to SP, avoiding the LUI. However if a const costs more than 1 insn, gcc wants
to force it in a register rather than split the add operation into 2 adds with
the split constants.

expand_binop
  expand_binop_directly
   avoid_expensive_constant

/* X is to be used in mode MODE as operand OPN to BINOPTAB.  If we're
   optimizing, and if the operand is a constant that costs more than
   1 instruction, force the constant into a register and return that
   register.  Return X otherwise.  UNSIGNEDP says whether X is unsigned.  */
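
The "spread over 2 adds" idea above can be sketched as follows. This is only an illustration of the arithmetic, not the actual patch; the struct and function names are hypothetical. It relies on ADDI taking a 12-bit signed immediate, i.e. [-2048, 2047], which is where the [-4096,-2049] and [2048,4094] ranges come from.

```c
#include <stdbool.h>

#define ADDI_MIN (-2048)   /* smallest 12-bit signed immediate */
#define ADDI_MAX 2047      /* largest 12-bit signed immediate */

/* Result of trying to split an SP adjustment into two ADDI immediates. */
struct imm_split {
    bool ok;       /* true if offset needs, and fits in, exactly two ADDIs */
    long first;    /* immediate for the first addi */
    long second;   /* immediate for the second addi */
};

static bool fits_addi(long v)
{
    return v >= ADDI_MIN && v <= ADDI_MAX;
}

struct imm_split split_sp_offset(long offset)
{
    struct imm_split r = { false, 0, 0 };
    if (fits_addi(offset))
        return r;                      /* a single addi already suffices */
    long a = offset > 0 ? ADDI_MAX : ADDI_MIN;
    long b = offset - a;
    if (!fits_addi(b))
        return r;                      /* out of range: needs LUI/LI + add */
    r.ok = true;
    r.first = a;
    r.second = b;
    return r;
}
```

For example 2064 splits as 2047 + 17, while 4095 does not fit (it would need 2047 + 2048), matching the upper bound of 4094 mentioned above.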

[Bug target/112817] RISC-V: RVV: provide a preprocessor macro for VLS codegen

2024-01-08 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112817

--- Comment #9 from Vineet Gupta  ---
(In reply to JuzheZhong from comment #5)
> Support VLS codegen with -mrvv-vector-bits and attribute is reasonable to be
> landed on GCC-14.
> 
> Could you first implement -mrvv-vector-bits feature ?
> 
> I have support it in rvv-next, but I don't have time to migrate that into
> trunk GCC.

I presume you are referring to https://github.com/riscv-collab/riscv-gcc.git
and #riscv-gcc-rvv-next

I don't see the attribute support. Is it called something else there ?

I was looking for a new entry in gcc/c-family/c-attribs.cc, or would it be
somewhere else?

[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc

2024-01-10 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087

Vineet Gupta  changed:

   What|Removed |Added

 CC||vineetg at gcc dot gnu.org

--- Comment #33 from Vineet Gupta  ---
cam4 failure is a bug in vsetvl pass which I'm debugging atm.
An erroneous vsetvl insn is getting generated, clobbering a live register used
subsequently in a V insn.

[Bug target/113429] New: RISC-V: SPEC2017 527 cam4 miscompilation in autovec VLA build

2024-01-16 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113429

Bug ID: 113429
   Summary: RISC-V: SPEC2017 527 cam4 miscompilation in autovec
VLA build
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
CC: juzhe.zhong at rivai dot ai, kito.cheng at gmail dot com,
law at gcc dot gnu.org, rdapp at gcc dot gnu.org
  Target Milestone: ---

cam4 gets a runtime segv early on due to VSETVLI clobbering a reg (issue
obviously goes away with simple-vsetvl)

002abc76 <__zm_conv_MOD_closure.constprop.0.isra.0>:
...
...
  2acebe:   mv      s2,a5
  2acec0:   sll     a4,s2,0x3
  2acec4:   li      a3,32
  2acec8:   add     a5,sp,224   <--- a5 is some address on stack
  2aceca:   bgeu    a4,a3,2ad2b6
  2acece:   li      a3,16
  2aced0:   bgeu    a4,a3,2ad294
...
...
  2ad294:   vsetvli a5,s2,e8,mf4,ta,ma  <--- BUG here as a5 clobbered
  2ad298:   vsetivli zero,8,e8,mf2,ta,ma
  2ad29c:   add     a3,a5,8
  2ad2a0:   vmv.v.i v1,0
  2ad2a4:   vse8.v  v1,(a5) <--- SEGV
  2ad2a8:   vse8.v  v1,(a3)
  2ad2ac:   add     a4,a4,-16
  2ad2ae:   li      a3,8
  2ad2b0:   bltu    a4,a3,2aceda
  2ad2b4:   j       2ad282
  2ad2b6:   vsetivli zero,8,e8,mf2,ta,ma
  2ad2ba:   add     a3,a5,8
  2ad2be:   vmv.v.i v1,0
  2ad2c2:   vse8.v  v1,(a5)
  2ad2c6:   vse8.v  v1,(a3)
  2ad2ca:   add     a3,a5,16
  2ad2ce:   vse8.v  v1,(a3)
  2ad2d2:   add     a3,a5,24
  2ad2d6:   vse8.v  v1,(a3)
  2ad2da:   add     a4,a4,-32
  2ad2dc:   li      a3,16
  2ad2de:   bltu    a4,a3,2aced4
  2ad2e2:   j       2ad294 <__zm_conv_MOD_closure.constprop.0.isra.0+0x161e>

[Bug target/113429] RISC-V: SPEC2017 527 cam4 miscompilation in autovec VLA build

2024-01-16 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113429

--- Comment #1 from Vineet Gupta  ---
Created attachment 57107
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57107&action=edit
Reduced cam4 test

[Bug target/113429] RISC-V: SPEC2017 527 cam4 miscompilation in autovec VLA build

2024-01-16 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113429

--- Comment #2 from Vineet Gupta  ---
Here's my analysis as to what's going on in the vsetvl pass.

Reduced Test with annotated BBs.

.globl  __a_MOD_f
.type   __a_MOD_f, @function
__a_MOD_f:

...
ble     s1,zero,.L49
slli    a4,s1,3
li      a3,32
addi    a5,sp,48
bgeu    a4,a3,.L67

   <--- BB 14

li      a3,16
bgeu    a4,a3,.L68

   <--- BB 16

.L36:
li      a3,8
bgeu    a4,a3,.L69

   <--- BB 18

.L37:
vsetvli a5,s1,e8,mf4,ta,ma  <--- (2) rtl insn 440
li      a3,-2147483648
...
...

   <--- BB 17

.L69:
vsetvli a5,s1,e8,mf4,ta,ma
vsetivli zero,8,e8,mf2,ta,ma
vmv.v.i v1,0
vse8.v  v1,0(a5)
j       .L37

   <--- BB 15  (BUG manifests in BB 15)

.L68:
vsetvli a5,s1,e8,mf4,ta,ma   <--- (1) rtl insn 472 (copy of insn 440): clobbers a5 (BUG)
vsetivli zero,8,e8,mf2,ta,ma
addi    a3,a5,8
vmv.v.i v1,0                 <--- insn 88 (imp)
vse8.v  v1,0(a5)
vse8.v  v1,0(a3)
addi    a4,a4,-16
li      a3,8
bltu    a4,a3,.L37
j       .L69


The issue manifests in BB 15, but the issue is insn 440 making its way across
BBs.

The problem is introduced in Phase 2 (hack to disable phase 2 elides the
issue).


Phase 2: Lift up vsetvl info.

  Try lift up 0.
...
...
...

 Try lift up 2.

  Compute LCM earliest insert data:

  Expr[5]: VALID (insn 88, bb 14)
  Expr[6]: VALID (insn 88, bb 15)
  Expr[7]: VALID (insn 440, bb 16)

  earliest:
Edge(BB 14 -> BB 16): n_bits = 15, set = {7 }
Edge(BB 15 -> BB 16): n_bits = 15, set = {7 }
Edge(BB 16 -> BB 18): n_bits = 15, set = {9 }
Edge(BB 16 -> BB 17): n_bits = 15, set = {8 }

Fused global info result:

  Change BB 14 from:VALID (insn 88, BB 14)
 to (higher probability):VALID (insn 440, BB 16)  <--- likely issue ???

...

  Try lift up 3.

  Compute LCM earliest insert data:

  Expr[5]: VALID (insn 440, bb 14)
  Expr[6]: VALID (insn 88, bb 15)
  Expr[7]: VALID (insn 440, bb 16)

  earliest:
   Edge(bb 14 -> bb 16): n_bits = 15, set = {7 }
   Edge(bb 14 -> bb 15): n_bits = 15, set = {6 }
   Edge(bb 15 -> bb 16): n_bits = 15, set = {7 }


VSETVL infos after phase 2

  BB 14:
probability: 2.4% (guessed)
Header vsetvl info:VALID (insn 440, BB 14)
Footer vsetvl info:VALID (insn 440, BB 14)
  BB 15:
probability: 1.2% (guessed)
Header vsetvl info:VALID (insn 88, BB 15)   <-- seem OK pertains to VMV
insn
Footer vsetvl info:VALID (insn 88, BB 15)
insn 88 vsetvl info:VALID (insn 88, BB 15)
  BB 16:
probability: 2.4% (guessed)
Header vsetvl info:VALID (insn 440, BB 16)
Footer vsetvl info:VALID (insn 440, BB 16)

However...

Phase 4: Insert, modify and remove vsetvl insns.

  Insert vsetvl info before insn 88: VALID (insn 88, BB 15)  <--- OK VMV
Demand fields: demand_sew_lmul demand_avl
SEW=8, VLMUL=mf2, RATIO=16, MAX_SEW=64
TAIL_POLICY=agnostic, MASK_POLICY=agnostic
AVL=(const_int 8 [0x8])
VL=(nil)
scanning new insn with uid = 460.<--- OK: VSETVL of VMV
  Insert vsetvl insn before insn 88:
(insn 460 94 88 15 (parallel [
(set (reg:SI 66 vl)
(unspec:SI [
(const_int 8 [0x8]) repeated x2
(const_int 7 [0x7])
] UNSPEC_VSETVL))
(set (reg:SI 67 vtype)
(unspec:SI [
(const_int 8 [0x8])
(const_int 7 [0x7])
(const_int 1 [0x1]) repeated x2
] UNSPEC_VSETVL))
]) "cam4red.f90":96:18 discrim 2 -1
 (nil))


  Insert missed vsetvl info at edge (BB 14 -> BB 15): VALID (insn 440, BB 14)  
<-- BUG
Demand fields: demand_ratio_only demand_avl
SEW=8, VLMUL=mf4, RATIO=32, MAX_SEW=64
TAIL_POLICY=agnostic, MASK_POLICY=agnostic
AVL=(reg:DI 9 s1 [orig:138 _37 ] [138])
VL=(reg:DI 15 a5 [orig:140 _42 ] [140])
  Insert vsetvl insn 472:

[Bug target/113429] RISC-V: SPEC2017 527 cam4 miscompilation in autovec VLA build

2024-01-16 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113429

--- Comment #3 from Vineet Gupta  ---
The toggles used to build are

riscv64-unknown-linux-gnu-gfortran -c -o cam4red.o -I. -Iinclude
-Inetcdf/include -Ofast -fno-lto -static -march=rv64gcv_zba_zbb_zbs_zicond
-ftree-vectorize --param=riscv-autovec-preference=scalable
--param=vsetvl-strategy=optim -fallow-argument-mismatch
-fmax-stack-var-size=65536 cam4red.f90

[Bug target/113429] RISC-V: SPEC2017 527 cam4 miscompilation in autovec VLA build

2024-01-17 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113429

--- Comment #8 from Vineet Gupta  ---
Thx for the quick fix. I'll validate and commit !

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2024-01-18 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 113429, which changed state.

Bug 113429 Summary: RISC-V: SPEC2017 527 cam4 miscompilation in autovec VLA 
build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113429

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug target/113429] RISC-V: SPEC2017 527 cam4 miscompilation in autovec VLA build

2024-01-18 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113429

Vineet Gupta  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #11 from Vineet Gupta  ---
Verified works now.

[Bug target/112817] RISC-V: RVV: provide attribute riscv_rvv_vector_bits for VLS codegen

2024-01-11 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112817

--- Comment #13 from Vineet Gupta  ---
Yeah Greg from Rivos started working on it. He'll update here as he makes
progress.

[Bug target/112817] RISC-V: RVV: provide a preprocessor macro for VLS codegen

2023-12-01 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112817

Vineet Gupta  changed:

   What|Removed |Added

 CC||vineetg at gcc dot gnu.org

--- Comment #3 from Vineet Gupta  ---
I agree, but what xsimd does is not under our control. If someone wants to use
xsimd, for whatever reason, gcc should be usable the same way as llvm, and
certainly shouldn't be ruled out for lack of a trivial define.

[Bug target/112651] RISC-V Vector new option -mvect-lmul required to force LMUL values (rather than --param=riscv-autovec-lmul to hint at values)

2023-12-01 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112651

--- Comment #4 from Vineet Gupta  ---
(In reply to JuzheZhong from comment #3)
> The reason we use --param=riscv-autovec-lmul instead of -mvect-lmul which is
> not documented because we don't have ratifed compile option.
> 
> I have mentioned whether we should have -mrvv-vector-lmul but LLVM people
> object
> it.
> 
> https://github.com/riscv-non-isa/riscv-toolchain-conventions/issues/33

It seems the discussions back in March stalled due to things being too early.

But llvm and gcc seem to have diverged anyways for other toggles in the area. 
e.g. fixed length vec size is specified differently
(gcc:--param=riscv-autovec-preference=fixed-vlmax vs.
llvm:-mrvv-vector-bits=zvl) so we might as well switch gcc to -m way. This can
obviously only be done now, before this goes out in the wild in gcc-14 release.
I'd say even now it would be disruptive but ...

[Bug target/112817] New: RISC-V: RVV: provide a preprocessor macro for VLS codegen

2023-12-01 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112817

Bug ID: 112817
   Summary: RISC-V: RVV: provide a preprocessor macro for VLS
codegen
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
CC: ewlu at rivosinc dot com, juzhe.zhong at rivai dot ai,
rdapp at gcc dot gnu.org
  Target Milestone: ---

LLVM toggle for setting up fixed vector length using -mrvv-vector-bits=zvl
(which in turn derives VL from -march=...-vl256) also generates a preprocessor
define __riscv_v_fixed_vlen.

gcc doesn't, which is a bit of a pain for downstream projects such as xsimd.

Granted, the C-API document [1] doesn't specify this, but generation by llvm
and, more importantly, usage in downstream projects seem a good enough reason
to have it in gcc as well.

[1] https://github.com/riscv-non-isa/riscv-c-api-doc/blob/master/riscv-c-api.md
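
To illustrate why the missing define hurts, here is a minimal sketch of how a downstream library might consume it. Only the macro name `__riscv_v_fixed_vlen` comes from this report; the helper names `fixed_vlen_bits` and `lanes_u32` are hypothetical and not taken from xsimd.

```c
/* Vector register width in bits when the compiler advertises a fixed
   length, or 0 otherwise (scalable build, or a non-RISC-V target). */
static unsigned fixed_vlen_bits(void)
{
#if defined(__riscv_v_fixed_vlen)
    return (unsigned)__riscv_v_fixed_vlen;
#else
    return 0u;
#endif
}

/* Hypothetical helper: number of 32-bit lanes a library could assume.
   Returns 0 when no fixed length is known, so callers can fall back. */
static unsigned lanes_u32(void)
{
    return fixed_vlen_bits() / 32u;
}
```

Without the define, such code always takes the fallback path on gcc even when the build was configured for a fixed vector length.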

[Bug target/112817] RISC-V: RVV: provide a preprocessor macro for VLS codegen

2023-12-01 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112817

--- Comment #6 from Vineet Gupta  ---
(In reply to JuzheZhong from comment #5)

> Support VLS codegen with -mrvv-vector-bits and attribute is reasonable to be
> landed on GCC-14.

I don't think that is the requirement for this issue. Just defining the
preprocessor flag with the existing gcc toggle for VLS codegen should be enough,
as long as it generates the same macro as llvm.

[Bug target/112853] RISC-V: RVV: SPEC2017 525.x264 regression

2023-12-04 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853

--- Comment #2 from Vineet Gupta  ---
Bisected to

commit 97ddebb6b4f6b132b0a8072b26d030077b418963
Author: Juzhe-Zhong 
Date:   Thu Nov 23 18:55:03 2023 +0800

RISC-V: Refine some codes of riscv-v.cc[NFC]

This patch is NFC patch to refine unreasonable codes I notice.

Tested on zvl128b/zvl256b/zvl512b/zvl1024b no regression.

Committed.

gcc/ChangeLog:

* config/riscv/riscv-v.cc (emit_vlmax_gather_insn): Refine codes.
(emit_vlmax_masked_gather_mu_insn): Ditto.
(modulo_sel_indices): Ditto.
(expand_vec_perm): Ditto.
(shuffle_generic_patterns): Ditto.

[Bug target/112853] RISC-V: RVV: SPEC2017 525.x264 regression

2023-12-04 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853

--- Comment #1 from Vineet Gupta  ---
Currently bisecting.

The issue happens at an indexed load insn:

=> 0x6f656 :vluxei64.v  v2,(a3),v2

The src reg v2 is different in good vs. failing case

bad case
--
info reg v2
 b = {0xc0, 0xcf, 0xb6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x40, 0xc5, 0xb6, 0x0
}}

good case
-
 b = {0xc0, 0xcf, 0xb6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x40, 0xc5, 0xb6, 0x0,
0x0, 0x0, 0x0, 0x0, 0xc0, 0xcf, 0xb6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x40, 0xb0,
0xb6, 0x0, 0x0, 0x0, 0x0, 0x0}}
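
For readers less familiar with indexed loads, the following is a rough scalar model of an unordered indexed load with 64-bit indices. It assumes 32-bit data elements, which the comment above does not state, and the function name is made up for illustration. Each index element is a byte offset added to the base register, so a wrong element in the index register turns directly into a wild address.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar sketch: dst[i] = 32-bit value at (base + offsets[i]), for i < vl.
   Assumption: data element width is 32 bits; offsets are in bytes. */
static void gather_u32_by_byte_offset(uint32_t *dst, const unsigned char *base,
                                      const uint64_t *offsets, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        memcpy(&dst[i], base + offsets[i], sizeof dst[i]);
}
```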

[Bug target/112853] New: RISC-V: RVV: SPEC2017 525.x264 regression

2023-12-04 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853

Bug ID: 112853
   Summary: RISC-V: RVV: SPEC2017 525.x264 regression
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
CC: juzhe.zhong at rivai dot ai, patrick at rivosinc dot com,
rdapp at gcc dot gnu.org
  Target Milestone: ---

As of commit 3d104d93a701 ("ARC: Consistent use of whitespace in assembler
templates.") x264 build for RVV is segfaulting.

```
QEMU_CPU=rv64,vlen=256,zba=true,zbb=true,zbs=true,zicond=true,vext_spec=v1.0,Zve32f=true,Zve64f=true
qemu-riscv64  ./ldecod_r_base.rivos_rv64-m64 -i BuckBunny.264 -o BuckBunny.yuv

Setting Default Parameters...
Parsing Configfile decoder.cfg

- JM 17.1 (FRExt) -
--
 Input H.264 bitstream  : BuckBunny.264 
 Output decoded YUV : BuckBunny.yuv 
 Input reference file   : test_rec.yuv 
--
POC must = frame# or field# for SNRs to be correct
--
  Frame  POC  Pic#   QPSnrY SnrU SnrV   Y:U:V Time(ms)
--
 Input reference file   : test_rec.yuv does not exist 
  SNR values are not available
Segmentation fault (core dumped)
```

It was fine at my prev checkpoint: 
2023-11-22 6f59f959e751 hppa: Define MAX_FIXED_MODE_SIZE

Built with following flags

-Ofast -fno-lto -static -march=rv64gcv_zba_zbb_zbc_zbs_zicond -ftree-vectorize
--param=riscv-autovec-preference=scalable

[Bug target/112447] risc-v regression: FAIL: gcc.c-torture/execute/memset-3.c -O3

2023-11-15 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112447

Vineet Gupta  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #22 from Vineet Gupta  ---
Fixed for gcc-14.

[Bug target/111557] [RISC-V] The macro __riscv_unaligned_fast should be __riscv_misaligned_fast

2023-11-16 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111557

Vineet Gupta  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #2 from Vineet Gupta  ---
Fixed in gcc-14.

Not keeping old names for "compatibility sake" since the original change (also
gcc-14) has technically not shipped.

[Bug target/112447] risc-v regression: FAIL: gcc.c-torture/execute/memset-3.c -O3

2023-11-14 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112447

--- Comment #11 from Vineet Gupta  ---
As a hack I commented out set_delete() to see what the extraneous vsetvli
would have been.

```
  .L36:
   # bb 3: start of outer loop: off 0

vsetvli zero,zero,e8,mf2,ta,ma # insn 2915
vse8.v  v1,0(t3)   # insn 2874, bb 3
vse8.v  v1,0(t6)
vse8.v  v1,0(s0)
  ...

# bb 181

addi    a4,a4,1
li  a7,15
bne a4,a7,.L36 # insn 1082

   # bb 182: start of outer loop: off 1

vsetvli zero,zero,e32,mf2,ta,ma# insn 2919
vmv.x.s a3,v1  # insn 1858
vsetvli zero,zero,e16,mf2,ta,ma
sw  a3,8(sp)
vmv.x.s a3,v1
```

Essentially phase 2 is fusing vsetvl info for insn 2874 and insn 1858
But the fused info doesn't seem right. 

vsetvli zero,zero,e32,m2,ta,ma
j   .L36

Manually modifying it to orig value fixes the test.

vsetvli zero,zero,e8,mf2,ta,ma
j   .L36

Phase 2 logs

```
  Try lift up 1.

  earliest:
Edge(bb 0 -> bb 2): n_bits = 13, set = {0 }
Edge(bb 62 -> bb 63): n_bits = 13, set = {4 }
Edge(bb 180 -> bb 181): n_bits = 13, set = {8 }
Edge(bb 181 -> bb 3): n_bits = 13, set = {2 }

Fuse curr info since prev info compatible with it:
  prev_info: VALID (insn 1858, bb 181)   <-- due to Edge(bb 181 -> bb 3)
Demand fields: demand_sew_only
SEW=32, VLMUL=mf2, RATIO=64, MAX_SEW=64
TAIL_POLICY=agnostic, MASK_POLICY=agnostic
AVL=(nil)
VL=(nil)
  curr_info: VALID (insn 2874, bb 3)
Demand fields: demand_ratio_only demand_avl
SEW=8, VLMUL=mf2, RATIO=16, MAX_SEW=64
TAIL_POLICY=agnostic, MASK_POLICY=agnostic
AVL=(const_int 8 [0x8])
VL=(nil)

  prev_info after fused: VALID (insn 1858, bb 181)
Demand fields: demand_sew_lmul demand_avl
SEW=32, VLMUL=m2, RATIO=16, MAX_SEW=64
TAIL_POLICY=agnostic, MASK_POLICY=agnostic
AVL=(const_int 8 [0x8])
VL=(nil)
```

This fuse in turn is happening from 

DEF_SEW_LMUL_RULE (sew_only, ratio_only, sew_lmul,
   next_ratio_valid_for_prev_sew_p, always_false,
   modify_lmul_with_next_ratio)

I'm not really sure if the merge callback here is correct.
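
The numbers in the log can be reproduced with the SEW/LMUL ratio arithmetic. This is only a sketch of that arithmetic under the assumption RATIO = SEW / LMUL; the helper names are hypothetical and it is not the code of the merge callback itself.

```c
/* LMUL is kept in eighths so fractional values stay integral:
   mf8=1, mf4=2, mf2=4, m1=8, m2=16, m4=32, m8=64. */
static unsigned sew_lmul_ratio(unsigned sew, unsigned lmul_x8)
{
    return sew * 8u / lmul_x8;
}

/* LMUL (in eighths) that yields the given ratio for a given SEW. */
static unsigned lmul_x8_for_ratio(unsigned sew, unsigned ratio)
{
    return sew * 8u / ratio;
}
```

With these, SEW=8/mf2 gives RATIO=16 (curr_info) and SEW=32/mf2 gives RATIO=64 (prev_info). Keeping SEW=32 from prev while adopting RATIO=16 from curr forces LMUL to m2, which matches the fused info above. Whether that fusion is legitimate here is exactly the open question.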

[Bug target/112447] risc-v regression: FAIL: gcc.c-torture/execute/memset-3.c -O3

2023-11-14 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112447

--- Comment #15 from Vineet Gupta  ---
(In reply to JuzheZhong from comment #14)
> Let me give you some guide which helps you to dig into the problem.
> 
> First, reduce the case as follows:

Did your message get truncated, or did you press send too soon?

Because the reduced test you pasted is exactly what I uploaded to the bug and I
can't reduce it any further.

[Bug target/112447] risc-v regression: FAIL: gcc.c-torture/execute/memset-3.c -O3

2023-11-14 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112447

--- Comment #13 from Vineet Gupta  ---
Then I don't know where the problem actually is ?

[Bug target/113429] RISC-V: SPEC2017 527 cam4 miscompilation in autovec VLA build

2024-01-16 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113429

--- Comment #6 from Vineet Gupta  ---
Created attachment 57111
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57111&action=edit
additional modules

[Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spillls with -fschedule-insns

2024-04-16 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

Vineet Gupta  changed:

   What|Removed |Added

   Last reconfirmed|2024-04-15 00:00:00 |2024-4-16

--- Comment #9 from Vineet Gupta  ---
So I started with the reg being spilled (a1)

.L2:
beq a1,zero,.L5    # if j[1] == 0
li  a2,1
ble a6,s11,.L2     # if j[0] < 1
sd  a1,8(sp)       # spill (save)


.L3:   # inner loop start
   ...

blt a2,a6,.L3      # inner loop end

ld  a1,8(sp)       # spill (restore)
j   .L2

Next I zoomed into the inner loop, where a1 is used/clobbered with sched1 but
not without it, annotating the code with rudimentary def, use and dead markers.

-------------------------------------------------------------------------
-fschedule-insns (NOK)            | -fno-schedule-insns (OK)
-------------------------------------------------------------------------
1-def  ld    a5,%lo(u)(s0) #u, u  | 1-def   ld    a5,%lo(u)(t6)  # u, u
2-def  srliw a0,a5,16             | 2-def   srliw s10,a5,16
3-def  srli  a1,a5,32             | 1-use   sh    a5,%lo(_Z1sv)(a4)
1-use  sh    a5,%lo(_Z1sv)(a3)    | 2-dead  sh    s10,%lo(_Z1sv+2)(a4)
  ---insn1---                     | 3-def   srli  s10,a5,32
1-use  srli  a5,a5,48             | 1-use   srli  a5,a5,48
  ---insn2---                     | 1-dead  sh    a5,%lo(_Z1sv+6)(a4)
2-dead sh    a0,%lo(_Z1sv+2)(a3)  |  ---insn1---
3-dead sh    a1,%lo(_Z1sv+4)(a3)  |  ---insn2---
1-dead sh    a5,%lo(_Z1sv+6)(a3)  | 3-dead  sh    s10,%lo(_Z1sv+4)(a4)

The problem seems to be the longer live range of 2-def (on the left side). If
it were used/dead right after, 3-def wouldn't need a new register.

With that insight, I can now start looking into the sched1 dumps of the
corresponding BB.

;;   10--> b  0: i  35 r170#0=[r242+low(`u')]                  :alu:@GR_REGS+1(1)@FP_REGS+0(0)
;;   11--> b  0: i  79 r209=[r229+low(`f')]                    :alu:GR_REGS+0(0)FP_REGS+1(1)
;;   12--> b  0: i  76 r141=fix(r206)                          :alu:@GR_REGS+1(1)@FP_REGS+0(-1)
;;   13--> b  0: i  46 r180=zxt(r170,0x10,0x10)                :alu:@GR_REGS+1(1)@FP_REGS+0(0)
;;   14--> b  0: i  55 r188=r170 0>>0x20                       :alu:GR_REGS+1(1)FP_REGS+0(0)
;;   15--> b  0: i  81 r210=r141<<0x3                          :alu:GR_REGS+1(0)FP_REGS+0(0)
;;   16--> b  0: i  82 r211=r143+r210                          :alu:GR_REGS+1(0)FP_REGS+0(0)
;;   17--> b  0: i  44 [r230+low(`_Z1sv')]=r170#0              :alu:@GR_REGS+0(0)@FP_REGS+0(0)
;;   18--> b  0: i  65 r197=r170 0>>0x30                       :alu:GR_REGS+1(0)FP_REGS+0(0)
;;   19--> b  0: i  54 [r230+low(const(`_Z1sv'+0x2))]=r180#0   :alu:@GR_REGS+0(-1)@FP_REGS+0(0)
;;   20--> b  0: i  64 [r230+low(const(`_Z1sv'+0x4))]=r188#0   :alu:GR_REGS+0(-1)FP_REGS+0(0)
;;   21--> b  0: i  73 [r230+low(const(`_Z1sv'+0x6))]=r197#0   :alu:GR_REGS+0(-1)FP_REGS+0(0)

[Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spillls with -fschedule-insns

2024-04-17 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

--- Comment #10 from Vineet Gupta  ---
Debug update: -fsched-verbose=99 dumps (they are really verbose)

For the insn/regs under consideration, the canonical pre-scheduled sequence
with ideal live-range (but non-ideal load-to-use delay) is the following:

  ;;   ==
  ;;   -- basic block 3 from 17 to 98 -- before reload
  ;;   ==

  ;;|   35 |   10 | r170#0=[r242+low(`u')] alu
  ;;|   44 |6 | [r230+low(`_Z1sv')]=r170#0 alu

  ;;|   46 |7 | r180=zxt(r170,0x10,0x10)   alu
  ;;|   54 |6 | [r230+low(const(`_Z1sv'+0x2))]=r180#0 alu

  ;;|   55 |7 | r188=r170 0>>0x20  alu
  ;;|   64 |6 | [r230+low(const(`_Z1sv'+0x4))]=r188#0 alu

  ;;|   65 |7 | r197=r170 0>>0x30  alu
  ;;|   73 |6 | [r230+low(const(`_Z1sv'+0x6))]=r197#0 alu

r170 (insn 35) is the central character whose live range has to be longest 
because of dependencies.

 - {46, 55, 65} USE r170, and sources which create new pseudos
 - {54, 64, 73} are where these new pseudos sink.

How these 2 sets are interleaved defines the register pressure.
 - If above src1:sink1:src2:sink2:src3:sink3: 1 reg suffices
 - If src1:src2:src3: 3 reg needed
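
The effect of the interleaving on pressure can be sketched with a tiny counter. This is a toy model, not scheduler code: the encoding ('s' for a source insn that defines a new pseudo, 'k' for the sink that consumes one) and the name `peak_live` are assumptions made for illustration, and r170 itself is not counted.

```c
#include <stddef.h>

/* Peak number of simultaneously live new pseudos for a given issue order.
   's' = source insn (defines a pseudo), 'k' = sink insn (last use of one). */
static int peak_live(const char *order)
{
    int live = 0, peak = 0;
    for (size_t i = 0; order[i] != '\0'; i++) {
        if (order[i] == 's') {
            if (++live > peak)
                peak = live;
        } else if (order[i] == 'k' && live > 0) {
            live--;
        }
    }
    return peak;
}
```

Here `peak_live("sksksk")` is 1 (the pre-scheduled order) while `peak_live("ssskkk")` is 3 (all sources hoisted ahead of their sinks).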

Per sched1 dumps, the "source" set gets inducted into the ready queue together:

  ;;dependencies resolved: insn 65
  ;;tick updated: insn 65 into ready
  ;;dependencies resolved: insn 55
  ;;tick updated: insn 55 into ready
  ;;dependencies resolved: insn 46
  ;;tick updated: insn 46 into ready
  ;;dependencies resolved: insn 44
  ;;tick updated: insn 44 into ready
  ;;+--
  ;;| Pressure costs for ready queue
  ;;|  pressure points GR_REGS:[26->28 at 17:54] FP_REGS:[1->1 at 0:94]
  ;;+--
  ;;|  15   44 |6  +3 | GR_REGS:[0 base cost 0] FP_REGS:[0 base cost 0]
  ;;|  16   46 |7  +3 | GR_REGS:[1 base cost 0] FP_REGS:[0 base cost 0]
   
  ;;|  18   55 |7  +3 | GR_REGS:[1 base cost 1] FP_REGS:[0 base cost 0]
   
  ;;|  20   65 |7  +3 | GR_REGS:[1 base cost 1] FP_REGS:[0 base cost 0]
   
  ;;|  11   76 |   10  +2 | GR_REGS:[1 base cost 0] FP_REGS:[-1 base cost 0]
  ;;|   0   94 |2  +1 | GR_REGS:[0 base cost 0] FP_REGS:[0 base cost 0]
  ;;|  28   92 |5  +1 | GR_REGS:[0 base cost 0] FP_REGS:[1 base cost 0]
  ;;|  26   88 |5  +1 | GR_REGS:[0 base cost 0] FP_REGS:[1 base cost 0]
  ;;|  22   79 |9  +1 | GR_REGS:[0 base cost 0] FP_REGS:[1 base cost 0]
  ;;+--
  ;;  RFS_PRESSURE_DELAY: 7: 44 46 76 94
  ;;RFS_PRIORITY: 6: 92 88 79
  ;;  RFS_PRESSURE_INDEX: 2: 55
  ;;Ready list (t =  10):65:44(cost=1:prio=7:delay=3:idx=20) 
55:42(cost=1:prio=7:delay=3:idx=18)  44:39(cost=0:prio=6:delay=3:idx=15) 
46:40(cost=0:prio=7:delay=3:idx=16)  76:47(cost=0:prio=10:delay=2:idx=11) 
94:58(cost=0:prio=2:delay=1:idx=0)  92:56(cost=0:prio=5:delay=1:idx=28) 
88:54(cost=0:prio=5:delay=1:idx=26)  79:48(cost=0:prio=9:delay=1:idx=22)

As the algorithm converges, they move around a bit, but rarely are the src/sink
considered in the same iteration, and if at all, only 1:

  ;;+--
  ;;| Pressure costs for ready queue
  ;;|  pressure points GR_REGS:[29->29 at 0:94] FP_REGS:[1->1 at 0:94]
  ;;+--

...

  ;;|  19   64 |6  +0 | GR_REGS:[-1 base cost -1] FP_REGS:[0 base cost 0]
  ;;|  17   54 |6  +0 | GR_REGS:[-1 base cost -1] FP_REGS:[0 base cost 0]
  ;;|  20   65 |7  +0 | GR_REGS:[0 base cost 0] FP_REGS:[0 base cos


All of this leads to the pessimistic schedule emitted in the end.

I'm still trying to wrap my head around the humungous dump info.

[Bug target/114729] New: RISC-V SPEC2017 507.cactu excessive spillls with -fschedule-insns

2024-04-15 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

Bug ID: 114729
   Summary: RISC-V SPEC2017 507.cactu excessive spillls with
-fschedule-insns
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
CC: jeffreyalaw at gmail dot com, kito.cheng at gmail dot com,
rdapp at gcc dot gnu.org
  Target Milestone: ---

Created attachment 57953
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57953&action=edit
spec cactu reduced

In RISC-V SPEC runs, Cactu dynamic icounts are worst of all (compared to
aarch64 with similar build toggles: -Ofast). 

As of Upstream commit 3fed1609f610 of 2024-01-31:
   aarch64: 1,363,212,534,747  vs.
   risc-v : 2,852,277,890,338 

There's an existing issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265
which captures ongoing work to improve the stack/array accesses. However that
is more of damage control. The root cause happens to be excessive stack spills
on RISC-V. Robin noticed these were somehow triggered by first scheduling pass.
Disabling sched1 with -fno-schedule-insns brings down the total icount to half 
1,295,520,619,523 which is even slightly better than aarch64, all things
considered.

I ran a reducer (tracking token sfp in -verbose-asm output) and was able to get
a test which shows a single stack spill (store+load) with
default/-fschedule-insns and none with -fno-schedule-insns.

It seems sched1 is moving insn around, but the actual spills are generated by
IRA. So this is an interplay of sched1 and IRA.

```
ira

New iteration of spill/restore move
  Changing RTL for loop 2 (header bb6)
  Changing RTL for loop 1 (header bb4)
  26 vs parent 26:Creating newreg=246 from oldreg=137
  25 vs parent 25:Creating newreg=247 from oldreg=143
  11 vs parent 11:Creating newreg=248 from oldreg=223
  16 vs parent 16:Creating newreg=249 from oldreg=237

  Changing RTL for loop 3 (header bb3)
  26 vs parent 26:Creating newreg=250 from oldreg=246
  25 vs parent 25:Creating newreg=251 from oldreg=247
  -1 vs parent 11:Creating newreg=253 from oldreg=248
  16 vs parent 16:Creating newreg=254 from oldreg=249

...

scanning new insn with uid = 181.
scanning new insn with uid = 182.
scanning new insn with uid = 183.
scanning new insn with uid = 184.
changing bb of uid 194
  unscanned insn
scanning new insn with uid = 185.
scanning new insn with uid = 186.
scanning new insn with uid = 187.
scanning new insn with uid = 188.
changing bb of uid 195
  unscanned insn

...
+++Costs: overall 11650, reg 10680, mem 970, ld 485, st 485, move 1366
+++   move loops 0, new jumps 2
...

(insn 9 104 11 2 (set (reg/f:DI 137 [ r.4_4 ])
  (mem/f/c:DI (lo_sum:DI (reg/f:DI 155)
 (symbol_ref:DI ("r") [flags 0x86]  
   )) [4 r+0 S8 A64]))
  {*movdi_64bit}
(expr_list:REG_DEAD (reg/f:DI 155)
(expr_list:REG_EQUAL (mem/f/c:DI 
 (symbol_ref:DI ("r") [flags 0x86]  
) [4 r+0 S8 A64])

(insn 115 165 181 2 (set (reg:DI 245)
   (const_int 1 [0x1])) {*movdi_64bit}
 (expr_list:REG_EQUIV (const_int 1 [0x1])

   spill code start -

(insn 181 115 182 2 (set (reg/f:DI 246 
[orig:137 r.4_4 ] [137])
(reg/f:DI 137 [ r.4_4 ])) {*movdi_64bit}
 (expr_list:REG_DEAD (reg/f:DI 137 [ r.4_4 ])

(insn 182 181 183 2 (set (reg/f:DI 247 
[orig:143 w.9_10 ] [143])
(reg/f:DI 143 [ w.9_10 ])) {*movdi_64bit}
 (expr_list:REG_DEAD (reg/f:DI 143 [ w.9_10 ])

(insn 183 182 184 2 (set (reg:DI 248 
[orig:223 MEM[(int *)j.15_19 + 4B] ] [223])
(reg:DI 223 [ MEM[(int *)j.15_19 + 4B] ])) 
{*movdi_64bit}
 (expr_list:REG_DEAD (reg:DI 223 
 [ MEM[(int *)j.15_19 + 4B] ])

(insn 184 183 174 2 (set (reg:DI 249 
[orig:237 _38 ] [237])
(reg:DI 237 [ _38 ])) {*movdi_64bit}
 (expr_list:REG_DEAD (reg:DI 237 [ _38 ])

   spill code -

(jump_insn 174 184 175 2 (set (pc)
(label_ref 100)) 350 {jump}
 (nil)
 -> 100)

(barrier 175 174 196)

   spill code start -

(code_label 196 175 195 10 10 (nil) [1 uses])
(note 195 196 189 10 [bb 10] NOTE_INSN_BASIC_BLOCK)

(insn 189 195 190 10 (set (reg/f:DI 250 
[orig:137 r.4_4 ] [137])
(reg/f:DI 246 [orig:137 r.4_4 ] [137])) 
 {*movdi_64bit}
 (expr_list:REG_DEAD (reg/f:DI 246 
  [orig:137 r.4_4 ] [137])

(insn 190 189 191 10 (set (reg/f:DI 251 
[orig:143 w.9_10 ] [143])
(reg/f:DI 247 [orig:143 w.9_10 ] [143])) 
 {*movdi_64bit}
 (expr_list:REG_DEAD (reg/f:DI 247 
   [orig:143 w.9_10 ] [143])

(insn 191 190 192 10 (set (reg/v:DI 252 
 [orig:152 i ] [152])
(reg/v:DI 152 [ i ])) 208 {*movdi_64bit}
 (expr_list:REG_DEAD (reg/v:DI 152 [ i ])

(insn 192 191 193 10 (set (reg:DI 

[Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spillls with -fschedule-insns

2024-04-15 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

--- Comment #2 from Vineet Gupta  ---
FWIW -fsched-pressure is already default enabled for RISC-V.

[Bug rtl-optimization/114729] RISC-V SPEC2017 507.cactu excessive spillls with -fschedule-insns

2024-04-15 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729

--- Comment #4 from Vineet Gupta  ---
(In reply to Jeffrey A. Law from comment #3)

> Vineet, do we have this isolated enough that we know what function is most
> affected and presumably the most impacted blocks?  If so we can probably
> start to debug scheduler dumps.

I think so :-) But this is all anecdotal.

The attached test was reduced from the original/full ML_BSSN_RHS.ii (which,
granted, has only the 2nd most spills; the worst is ML_BSSN_Advect.ii, which I
have also reduced now). Anyhow, pretty much all of the file is one function, and
my reduction methodology was to see 1 spill with sched1 enabled and none
otherwise. I hope that is representative of the pathology seen in the
original/full ML_BSSN_RHS.ii.

> There's a flag -fsched-verbose=N that gives a lot more low level information
> about the scheduler's decisions.  I usually use N=99.  It makes for a huge
> dump, but gives extremely detailed information about the scheduler's view of
> the world.

I'll start diving into sched1 dumps as you suggest.

[Bug target/111501] RISC-V: non-optimal casting when shifting

2024-05-06 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111501

--- Comment #4 from Vineet Gupta  ---
Awesome !

The trunk is open and new stuff, RISC-V certainly, is already landing, so no
harm in sending it now ;-)

[Bug target/112817] RISC-V: RVV: provide attribute riscv_rvv_vector_bits for VLS codegen

2024-03-06 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112817

Vineet Gupta  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Last reconfirmed||2024-03-06

--- Comment #14 from Vineet Gupta  ---
To summarize this needs following 3 things

1. preprocessor macro __riscv_v_fixed_vlen if -march has explicit xxxvl
specified 
2. implement gcc toggle -mrvv-vector-bits=zvl which essentially copies the xxx
from -march string
3. Implement attribute riscv_rvv_vector_bits to specify vector length for user
types: cfr. https://godbolt.org/z/5Pc4PzPvs, https://godbolt.org/z/9hdMqh3jf,
https://godbolt.org/z/9WKM8s5rq
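
As a rough sketch of how items 1 and 3 fit together from the user's side: the attribute and macro names are the ones listed above, while the typedef name, the `FIXED_VINT32M1_BYTES` macro and the helper function are made up for illustration. The RVV-specific part is guarded so the snippet still builds where no fixed vector length is advertised.

```c
#include <stddef.h>

#if defined(__riscv_v_fixed_vlen)
#include <riscv_vector.h>
/* Fixed-length alias of a sizeless RVV type; the attribute argument is the
   vector length in bits that the compiler was told to assume. */
typedef vint32m1_t fixed_vint32m1_t
    __attribute__((riscv_rvv_vector_bits(__riscv_v_fixed_vlen)));
#define FIXED_VINT32M1_BYTES sizeof(fixed_vint32m1_t)
#else
/* No fixed vector length advertised: report 0 so callers can fall back. */
#define FIXED_VINT32M1_BYTES ((size_t)0)
#endif

/* Size in bytes of one fixed-length m1 vector of int32, or 0 if unknown. */
static size_t fixed_vint32m1_bytes(void)
{
    return FIXED_VINT32M1_BYTES;
}
```

The point of the fixed-length alias is that it has a known size, so it can be used as a struct member or array element, which the sizeless `vint32m1_t` cannot.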

[Bug target/112817] RISC-V: RVV: provide attribute riscv_rvv_vector_bits for VLS codegen

2024-03-06 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112817

--- Comment #15 from Vineet Gupta  ---
(In reply to Vineet Gupta from comment #14)
> 2. implement gcc toggle -mrvv-vector-bits=zvl which essentially copies the
> xxx from -march string

Done:

commit 0a01d1232ff0a8b094270fbf45c9fd0ea46df19f
Author: Pan Li 
Date:   Fri Feb 23 15:37:28 2024 +0800

RISC-V: Introduce gcc option mrvv-vector-bits for RVV

This patch would like to introduce one new gcc option for RVV. To
appoint the bits size of one RVV vector register. Valid arguments to
'-mrvv-vector-bits=' are:

* scalable
* zvl

The scalable will pick up the zvl*b in the march as the minimal vlen.
For example, the minimal vlen will be 512 when march=rv64gcv_zvl512b
and mrvv-vector-bits=scalable.

The zvl will pick up the zvl*b in the march as exactly vlen.
For example, the vlen will be 1024 exactly when march=rv64gcv_zvl1024b
and mrvv-vector-bits=zvl.

The internal option --param=riscv-autovec-preference will be replaced
by option -mrvv-vector-bits. Aka:

* -mrvv-vector-bits=scalable indicates
--param=riscv-autovec-preference=scalable
* -mrvv-vector-bits=zvl indicates
--param=riscv-autovec-preference=fixed-vlmax

You can also take -fno-tree-vectorize for
--param=riscv-autovec-preference=none.
The internal option --param=riscv-autovec-preference is unavailable after
this
patch.
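
The difference between the two arguments can be sketched as follows. This is only an illustration of the semantics described in the commit message, not gcc's option handling: all names are hypothetical, only the first `zvl<N>b` token in the string is considered, and the implicit minimum that applies when no `zvl*b` is spelled out is not modeled.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns N from the first "zvl<N>b" in an -march string, or 0 if none. */
static unsigned zvl_bits_from_march(const char *march)
{
    const char *p = march;
    while ((p = strstr(p, "zvl")) != NULL) {
        p += 3;
        if (isdigit((unsigned char)*p)) {
            char *end;
            unsigned long n = strtoul(p, &end, 10);
            if (*end == 'b')
                return (unsigned)n;
        }
    }
    return 0u;
}

enum vector_bits_mode { VECTOR_BITS_SCALABLE, VECTOR_BITS_ZVL };

/* Nonzero if code built for (march, mode) may assume hardware VLEN hw_vlen:
   scalable treats zvl<N>b as a minimum, zvl treats it as exact. */
static int vlen_allowed(const char *march, enum vector_bits_mode mode,
                        unsigned hw_vlen)
{
    unsigned zvl = zvl_bits_from_march(march);
    if (zvl == 0u)
        return 0; /* sketch only: implicit default not modeled */
    return mode == VECTOR_BITS_ZVL ? hw_vlen == zvl : hw_vlen >= zvl;
}
```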

[Bug target/105733] riscv: Poor codegen for large stack frames

2024-05-28 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105733

Vineet Gupta  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Vineet Gupta  ---
Fixed with aforementioned commit for gcc-15.

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2024-05-28 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

Vineet Gupta  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Vineet Gupta  ---
Two years hence and we are a little wiser.

The root-cause of spills is sched1
[PR/114729](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729).

The recent sum of two s12 patch does make the spill codegen better by having 1
less insn to materialize the stack access. And that shaves off 10% of cactu
dynamic icounts which should be enough to close this PR.

commit 4bfc4585c9935fbde75ccf04e44a15d24f42cde9
Author: Vineet Gupta 
Date:   Mon May 13 11:45:55 2024 -0700

RISC-V: avoid LUI based const materialization ... [part of PR/106265]

... if the constant can be represented as sum of two S12 values.
The two S12 values could instead be fused with subsequent ADD insn.
This helps
 - avoid an additional LUI insn
 - side benefits of not clobbering a reg

e.g.
                      | w/o patch        | w/ patch
long                  |                  |
plus(unsigned long i) | li   a5,4096     |
{                     | addi a5,a5,-2032 | addi a0, a0, 2047
   return i + 2064;   | add  a0,a0,a5    | addi a0, a0, 17
}                     | ret              | ret

NOTE: In theory not having const in a standalone reg might seem less
  CSE friendly, but for workloads in consideration these mat are
  from very late LRA reloads and follow on GCSE is not doing much
  currently.
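
The split itself is simple arithmetic. The sketch below is not the backend code; the function name and the choice of always using the extreme S12 value as the first addend are assumptions that happen to reproduce the 2047 + 17 split shown in the example.

```c
#include <stdbool.h>

#define S12_MIN (-2048)
#define S12_MAX 2047

/* If value does not fit one signed 12-bit immediate but is the sum of two,
   store the two addends and return true; otherwise return false. */
static bool split_sum_of_two_s12(long value, long *first, long *second)
{
    if (value > S12_MAX && value <= 2 * S12_MAX) {
        *first = S12_MAX;
        *second = value - S12_MAX;
        return true;
    }
    if (value < S12_MIN && value >= 2 * S12_MIN) {
        *first = S12_MIN;
        *second = value - S12_MIN;
        return true;
    }
    return false;
}
```

For 2064 this yields 2047 and 17, i.e. two ADDI immediates instead of LI + ADDI + ADD. Values that already fit one immediate, or that lie outside [-4096, 4094], are rejected.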

The real benefit however is seen in base+offset computation for array
accesses and especially for stack accesses which are finalized late in
optim pipeline, during LRA register allocation. Often the finalized
offsets trigger LRA reloads resulting in mind boggling repetition of
exact same insn sequence including LUI based constant materialization.

This shaves off 290 billion dynamic instructions (QEMU icounts) in
SPEC 2017 Cactu benchmark which is over 10% of workload. In the rest of
suite, there is an additional 10 billion shaved, with both gains and losses
in individual workloads as is usual with compiler changes.

 500.perlbench_r-0 |  1,214,534,029,025 | 1,212,887,959,387 |
 500.perlbench_r-1 |740,383,419,739 |   739,280,308,163 |
 500.perlbench_r-2 |692,074,638,817 |   691,118,734,547 |
 502.gcc_r-0   |190,820,141,435 |   190,857,065,988 |
 502.gcc_r-1   |225,747,660,839 |   225,809,444,357 | <- -0.02%
 502.gcc_r-2   |220,370,089,641 |   220,406,367,876 | <- -0.03%
 502.gcc_r-3   |179,111,460,458 |   179,135,609,723 | <- -0.02%
 502.gcc_r-4   |219,301,546,340 |   219,320,416,956 | <- -0.01%
 503.bwaves_r-0|278,733,324,691 |   278,733,323,575 | <- -0.01%
 503.bwaves_r-1|442,397,521,282 |   442,397,519,616 |
 503.bwaves_r-2|344,112,218,206 |   344,112,216,760 |
 503.bwaves_r-3|417,561,469,153 |   417,561,467,597 |
 505.mcf_r |669,319,257,525 |   669,318,763,084 |
 507.cactuBSSN_r   |  2,852,767,394,456 | 2,564,736,063,742 | <+ 10.10%
 508.namd_r|  1,855,884,342,110 | 1,855,881,110,934 |
 510.parest_r  |  1,654,525,521,053 | 1,654,402,859,174 |
 511.povray_r  |  2,990,146,655,619 | 2,990,060,324,589 |
 519.lbm_r |  1,158,337,294,525 | 1,158,337,294,529 |
 520.omnetpp_r |  1,021,765,791,283 | 1,026,165,661,394 |
 521.wrf_r |  1,715,955,652,503 | 1,714,352,737,385 |
 523.xalancbmk_r   |849,846,008,075 |   849,836,851,752 |
 525.x264_r-0  |277,801,762,763 |   277,488,776,427 |
 525.x264_r-1  |927,281,789,540 |   926,751,516,742 |
 525.x264_r-2  |915,352,631,375 |   914,667,785,953 |
 526.blender_r |  1,652,839,180,887 | 1,653,260,825,512 |
 527.cam4_r|  1,487,053,494,925 | 1,484,526,670,770 |
 531.deepsjeng_r   |  1,641,969,526,837 | 1,642,126,598,866 |
 538.imagick_r |  2,098,016,546,691 | 2,097,997,929,125 |
 541.leela_r   |  1,983,557,323,877 | 1,983,531,314,526 |
 544.nab_r |  1,516,061,611,233 | 1,516,061,407,715 |
 548.exchange2_r   |  2,072,594,330,215 | 2,072,591,648,318 |
 549.fotonik3d_r   |  1,001,499,307,366 | 1,001,478,944,189 |
 554.roms_r|  1,028,799,739,111 | 1,028,780,904,061 |
 557.xz_r-0|363,827,039,684 |   363,057,014,260 |
 557.xz_r-1|906,649,112,601 |   905,928,888,732 |
 557.xz_r-2|509,023,898,187 |   508,140,356,932 |
 997.specrand_fr   |402,535,577 |   403,052,561 |
 999.specrand_ir   |402,535,577 |   403,052,561 |
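
The percentage annotations can be re-derived from the two icount columns. A minimal sketch, assuming the first column is without the patch and the second is with it (the helper name is hypothetical):

```c
/* Percentage of dynamic instructions saved going from before to after.
   Positive means fewer instructions with the patch. */
static double percent_saved(unsigned long long before, unsigned long long after)
{
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

For 507.cactuBSSN_r, `percent_saved(2852767394456ULL, 2564736063742ULL)` comes out at roughly 10.1, i.e. about 288 billion instructions, in line with the figures quoted above.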

This should still be 

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2024-05-28 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 106265, which changed state.

Bug 106265 Summary: RISC-V SPEC2017 507.cactu code bloat due to address 
generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug target/115264] New: RISC-V: yet another instance of poor codegen related to stack (glibc tmpnam.c)

2024-05-28 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115264

Bug ID: 115264
   Summary: RISC-V: yet another instance of poor codegen related
to stack (glibc tmpnam.c)
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vineetg at gcc dot gnu.org
CC: jeffreyalaw at gmail dot com, kito.cheng at gmail dot com
  Target Milestone: ---

While working on the sum of two s12 optimization for function
prologue/epilogue, I noticed that glibc tempnam.c generates really poor code
(as compared to llvm trunk)

--->8-
typedef long unsigned int size_t;

extern int __gen_tempname (char *__tmpl, int __suffixlen, int __flags,
 int __kind);
extern char *__strdup (const char *__string);
extern int __path_search (char *__tmpl, size_t __tmpl_len,
 const char *__dir, const char *__pfx, int __try_tempdir);

char *
tempnam (const char *dir, const char *pfx)
{
  char buf[4096];

  if (__path_search (buf, 4096, dir, pfx, 1))
    return ((void *)0) ;

  if (__gen_tempname (buf, 0, 0, 2))
    return ((void *)0) ;

  return __strdup (buf);
}

--->8-

- There are two copies of the epilogue (for no obvious benefit)
- s0 is needlessly being spilled (likely one of the RA passes generates the
refs but subsequent passes fail to eliminate them); sum of two s12 makes it
worse.