[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2024-05-28 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

Vineet Gupta  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Vineet Gupta  ---
Two years hence and we are a little wiser.

The root-cause of spills is sched1
[PR/114729](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729).

The recent sum of two s12 patch does make the spill codegen better by having 1
less insn to materialize the stack access. And that shaves off 10% of cactu
dynamic icounts which should be enough to close this PR.

commit 4bfc4585c9935fbde75ccf04e44a15d24f42cde9
Author: Vineet Gupta 
Date:   Mon May 13 11:45:55 2024 -0700

RISC-V: avoid LUI based const materialization ... [part of PR/106265]

... if the constant can be represented as sum of two S12 values.
The two S12 values could instead be fused with subsequent ADD insn.
The helps
 - avoid an additional LUI insn
 - side benefits of not clobbering a reg

e.g.
w/o patch w/ patch
long  | |
plus(unsigned long i) | li  a5,4096 |
{ | addia5,a5,-2032 | addi a0, a0, 2047
   return i + 2064;   | add a0,a0,a5| addi a0, a0, 17
} | ret | ret

NOTE: In theory not having const in a standalone reg might seem less
  CSE friendly, but for workloads in consideration these mat are
  from very late LRA reloads and follow on GCSE is not doing much
  currently.

The real benefit however is seen in base+offset computation for array
accesses and especially for stack accesses which are finalized late in
optim pipeline, during LRA register allocation. Often the finalized
offsets trigger LRA reloads resulting in mind boggling repetition of
exact same insn sequence including LUI based constant materialization.

This shaves off 290 billion dynamic instrustions (QEMU icounts) in
SPEC 2017 Cactu benchmark which is over 10% of workload. In the rest of
suite, there additional 10 billion shaved, with both gains and losses
in indiv workloads as is usual with compiler changes.

 500.perlbench_r-0 |  1,214,534,029,025 | 1,212,887,959,387 |
 500.perlbench_r-1 |740,383,419,739 |   739,280,308,163 |
 500.perlbench_r-2 |692,074,638,817 |   691,118,734,547 |
 502.gcc_r-0   |190,820,141,435 |   190,857,065,988 |
 502.gcc_r-1   |225,747,660,839 |   225,809,444,357 | <- -0.02%
 502.gcc_r-2   |220,370,089,641 |   220,406,367,876 | <- -0.03%
 502.gcc_r-3   |179,111,460,458 |   179,135,609,723 | <- -0.02%
 502.gcc_r-4   |219,301,546,340 |   219,320,416,956 | <- -0.01%
 503.bwaves_r-0|278,733,324,691 |   278,733,323,575 | <- -0.01%
 503.bwaves_r-1|442,397,521,282 |   442,397,519,616 |
 503.bwaves_r-2|344,112,218,206 |   344,112,216,760 |
 503.bwaves_r-3|417,561,469,153 |   417,561,467,597 |
 505.mcf_r |669,319,257,525 |   669,318,763,084 |
 507.cactuBSSN_r   |  2,852,767,394,456 | 2,564,736,063,742 | <+ 10.10%
 508.namd_r|  1,855,884,342,110 | 1,855,881,110,934 |
 510.parest_r  |  1,654,525,521,053 | 1,654,402,859,174 |
 511.povray_r  |  2,990,146,655,619 | 2,990,060,324,589 |
 519.lbm_r |  1,158,337,294,525 | 1,158,337,294,529 |
 520.omnetpp_r |  1,021,765,791,283 | 1,026,165,661,394 |
 521.wrf_r |  1,715,955,652,503 | 1,714,352,737,385 |
 523.xalancbmk_r   |849,846,008,075 |   849,836,851,752 |
 525.x264_r-0  |277,801,762,763 |   277,488,776,427 |
 525.x264_r-1  |927,281,789,540 |   926,751,516,742 |
 525.x264_r-2  |915,352,631,375 |   914,667,785,953 |
 526.blender_r |  1,652,839,180,887 | 1,653,260,825,512 |
 527.cam4_r|  1,487,053,494,925 | 1,484,526,670,770 |
 531.deepsjeng_r   |  1,641,969,526,837 | 1,642,126,598,866 |
 538.imagick_r |  2,098,016,546,691 | 2,097,997,929,125 |
 541.leela_r   |  1,983,557,323,877 | 1,983,531,314,526 |
 544.nab_r |  1,516,061,611,233 | 1,516,061,407,715 |
 548.exchange2_r   |  2,072,594,330,215 | 2,072,591,648,318 |
 549.fotonik3d_r   |  1,001,499,307,366 | 1,001,478,944,189 |
 554.roms_r|  1,028,799,739,111 | 1,028,780,904,061 |
 557.xz_r-0|363,827,039,684 |   363,057,014,260 |
 557.xz_r-1|906,649,112,601 |   905,928,888,732 |
 557.xz_r-2|509,023,898,187 |   508,140,356,932 |
 997.specrand_fr   |402,535,577 |   403,052,561 |
 999.specrand_ir   |402,535,577 |   403,052,561 |

This should still be 

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2023-08-04 Thread vineetg at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

Vineet Gupta  changed:

   What|Removed |Added

 CC||vineetg at gcc dot gnu.org

--- Comment #11 from Vineet Gupta  ---
Revisited this with gcc-13.

The reduced test case no longer shows the extraneous LI 4096 (although the full
test still does). 

The key here is -funroll-loops which is needed for original issue to show as
well.

The was with middle-end update:

commit 19295e8607da2f743368fe6f5708146616aafa91
Author: Richard Biener 
Date:   Mon Oct 24 09:51:32 2022 +0200

tree-optimization/100756 - niter analysis and folding

niter analysis, specifically the part trying to simplify the computed
maybe_zero condition against the loop header copying condition, is
confused by us now simplifying

  _15 = n_8(D) * 4;
  if (_15 > 0)

to

  _15 = n_8(D) * 4;
  if (n_8(D) > 0)

which is perfectly sound at the point we do this transform.  One
solution might be to involve ranger in this simplification, another
is to be more aggressive when expanding expressions - the condition
we try to simplify is _15 > 0, so all we need is expanding that
to n_8(D) * 4 > 0.

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2022-07-22 Thread vineet.gupta at linux dot dev via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #10 from Vineet Gupta  ---
Created a small test case which emulates generation of 2 split consts. 

void foo(void)
{
bar(2072, 2096);
}

253r.expand has 4 instructions: Pair of LI 4096 + ADDI for each const.
260r.fwprop1 prunes out the redundant insn 7. However it runs before
300r.reload. The original issue is reload inserting the insns so no way to
prune these out.

---253r.expand
(insn 5 2 6 2 (set (reg:DI 72)
(const_int 4096 [0x1000])) "const-load.c":5:2 -1
 (nil))
(insn 6 5 7 2 (set (reg:DI 11 a1)
(plus:DI (reg:DI 72)
(const_int -2000 [0xf830]))) "const-load.c":5:2 -1
 (expr_list:REG_EQUAL (const_int 2096 [0x830])
(nil)))
(insn 7 6 8 2 (set (reg:DI 73)
(const_int 4096 [0x1000])) "const-load.c":5:2 -1
 (nil))
(insn 8 7 9 2 (set (reg:DI 10 a0)
(plus:DI (reg:DI 73)
(const_int -2024 [0xf818]))) "const-load.c":5:2 -1
 (expr_list:REG_EQUAL (const_int 2072 [0x818])
(nil)))

---300r.reload
(insn 5 2 6 2 (set (reg:DI 10 a0 [72])
(const_int 4096 [0x1000])) "const-load.c":5:2 139 {*movdi_64bit}
 (expr_list:REG_EQUIV (const_int 4096 [0x1000])
(nil)))
(insn 6 5 8 2 (set (reg:DI 11 a1)
(plus:DI (reg:DI 10 a0 [72])
(const_int -2000 [0xf830]))) "const-load.c":5:2 4
{adddi3}
 (expr_list:REG_EQUAL (const_int 2096 [0x830])
(nil)))
(insn 8 6 9 2 (set (reg:DI 10 a0)
(plus:DI (reg:DI 10 a0 [72])
(const_int -2024 [0xf818]))) "const-load.c":5:2 4
{adddi3}
 (expr_list:REG_EQUAL (const_int 2072 [0x818])
(nil)))

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2022-07-21 Thread vineet.gupta at linux dot dev via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #9 from Vineet Gupta  ---
The redundant Insn 2660 is reload inserted for Insn 1717

 1717: r1871:DI=frame:DI+r2813:DI
Inserting insn reload before:
 2660: r2814:DI=0x1000
 2661: r2813:DI=r2814:DI-0x7e8
  REG_EQUAL 0x818


Insn 1717 in turn is reload inserted for Insn 82

   82: r1870:DI=r1871:DI+r222:DI
  REG_DEAD r222:DI
Inserting insn reload before:
 1717: r1871:DI=frame:DI+0x28
Inserting insn reload after:
 1716: r223:DI=r1870:DI


Insn 82 comes from 253r.expand

(insn 80 79 81 5 (set (reg:DI 222 [ _282 ])
(ashift:DI (reg:DI 74 [ g.1_3 ])
(const_int 7 [0x7]))) "rhs-cred.cc":44:29 -1
 (nil))
(insn 81 80 82 5 (set (reg:DI 884)
(plus:DI (reg/f:DI 67 virtual-stack-vars)
(const_int -8 [0xfff8]))) "rhs-cred.cc":44:29 -1
 (nil))
(insn 82 81 83 5 (set (reg/f:DI 223 [ _283 ])
(plus:DI (reg:DI 884)
(reg:DI 222 [ _282 ]))) "rhs-cred.cc":44:29 -1
 (nil))


Stepping thru lra-constraints.cc, curr_insn_transform () is called twice for
Insn 1717 and in the 2nd call (lra_constraints_iter == 4), operand is required
to have a reload (goal_alt_win[2]==false) which leads to lra_emit_move () ->
riscv_move_integer ()

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2022-07-21 Thread vineet.gupta at linux dot dev via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #8 from Vineet Gupta  ---
Created attachment 53332
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53332=edit
Full reload output

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2022-07-12 Thread vineet.gupta at linux dot dev via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #7 from Vineet Gupta  ---
(In reply to Richard Biener from comment #5)
> So why do we even emit unsupported 'li 4096' and leave it to the linker to
> "optimize(?)"? 

li 4096 is really a pseudo-op - LUI is used to build 32-bit constants. For this
problem whether LI or LUI is used is just a detail.

> OTOH LRA rematerialization also could be the culprit, thinking rematerializing
> the constant is cheaper than spilling a register holding it.

Thx for the pointer. I tried disabling it -fno-lra-remat and it generates
exactly same code.

Re: [Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2022-07-12 Thread Andrew Waterman
To be clear, `li rx, 4096' isn't unsupported: it's a
very-much-supported idiom for `lui rx, 1`.


On Mon, Jul 11, 2022 at 11:45 PM rguenth at gcc dot gnu.org via
Gcc-bugs  wrote:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265
>
> --- Comment #5 from Richard Biener  ---
> So why do we even emit unsupported 'li 4096' and leave it to the linker to
> "optimize(?)"?  At least the cost of this should be reflected - IIRC powerpc
> recently got improvements for similar cases by changing the targets rtx_cost
> hook to properly const SET from CONST_INT so that CSE doesn't leave so many
> sets from constants around.
>
> OTOH LRA rematerialization also could be the culprit, thinking rematerializing
> the constant is cheaper than spilling a register holding it.


[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2022-07-12 Thread andrew at sifive dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #6 from Andrew Waterman  ---
To be clear, `li rx, 4096' isn't unsupported: it's a
very-much-supported idiom for `lui rx, 1`.


On Mon, Jul 11, 2022 at 11:45 PM rguenth at gcc dot gnu.org via
Gcc-bugs  wrote:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265
>
> --- Comment #5 from Richard Biener  ---
> So why do we even emit unsupported 'li 4096' and leave it to the linker to
> "optimize(?)"?  At least the cost of this should be reflected - IIRC powerpc
> recently got improvements for similar cases by changing the targets rtx_cost
> hook to properly const SET from CONST_INT so that CSE doesn't leave so many
> sets from constants around.
>
> OTOH LRA rematerialization also could be the culprit, thinking rematerializing
> the constant is cheaper than spilling a register holding it.

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2022-07-12 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #5 from Richard Biener  ---
So why do we even emit unsupported 'li 4096' and leave it to the linker to
"optimize(?)"?  At least the cost of this should be reflected - IIRC powerpc
recently got improvements for similar cases by changing the targets rtx_cost
hook to properly const SET from CONST_INT so that CSE doesn't leave so many
sets from constants around.

OTOH LRA rematerialization also could be the culprit, thinking rematerializing
the constant is cheaper than spilling a register holding it.

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2022-07-11 Thread vineet.gupta at linux dot dev via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #4 from Vineet Gupta  ---
Going back to first dump (upstream 6abe341558a w/o riscv_rtx_costs() adj): the
3rd instruction addi is marking a2 REG_DEAD at 315 cprop.hardreg

--->8 314r.rnreg

(insn 2663 2662 1714 3 (set (reg:DI 13 a3 [2815])
(plus:DI (reg:DI 12 a2 [2816])
(const_int -2024 [0xf818]))) "rhs-cred.cc":42:47 4
{adddi3}
 (expr_list:REG_EQUAL (const_int 2072 [0x818])
(nil)))


--->8 315r.cprop_hardreg

(insn 2663 2662 1714 3 (set (reg:DI 13 a3 [2815])
(plus:DI (reg:DI 12 a2 [2816])
(const_int -2024 [0xf818]))) "rhs-cred.cc":42:47 4
{adddi3}
 (expr_list:REG_DEAD (reg:DI 12 a2 [2816])
(expr_list:REG_EQUAL (const_int 2072 [0x818])
(nil

Not sure if that is the issue or the compiler just deducing that a2 is dead
given that the next instruction sets it up again was introduced much earlier in
300r.reload

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2022-07-11 Thread vineet.gupta at linux dot dev via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #3 from Vineet Gupta  ---
Digging into RTL dumps, the li instructions are introduced by 300r reload.

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2022-07-11 Thread vineet.gupta at linux dot dev via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #2 from Vineet Gupta  ---
I've experimented with riscv_rtx_costs() setting cost of const to 1 as
discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98596. This does
reduce the number of li 4096 instances to 10 (from 14), but the fragment pasted
above generates the same code.

# rhs-cred.cc:42: (*(double *)((char *))[k] + *(double *)((char *)0)[12] +
#(insn 2660 1703 76 (set (reg:DI 28 t3 [2817])
#(const_int 4096 [0x1000])) "rhs-cred.cc":42:47 139 {*movdi_64bit}
# (nil))
li  t3,4096 # tmp2817,  # 2660  [c=1 l=4] 
*movdi_64bit/1


# rhs-cred.cc:39: bg = ([az])[0] * *()[av] + *()[2];
#(insn 76 2660 2661 (set (reg/f:DI 16 a6 [orig:193 _239 ] [193])
#(plus:DI (reg:DI 12 a2 [1860])
#(reg:DI 16 a6 [orig:181 _224 ] [181]))) "rhs-cred.cc":39:40 4
{adddi3}
# (expr_list:REG_DEAD (reg:DI 12 a2 [1860])
#(nil)))
add a6,a2,a6# _224, _239, tmp1860   # 76[c=4 l=4] 
adddi3/0


# rhs-cred.cc:42: (*(double *)((char *))[k] + *(double *)((char *)0)[12] +
#(insn 2661 76 2658 (set (reg:DI 12 a2 [2816])
#(plus:DI (reg:DI 28 t3 [2817])
#(const_int -2040 [0xf808]))) "rhs-cred.cc":42:47 4
{adddi3}
# (expr_list:REG_DEAD (reg:DI 28 t3 [2817])
#(expr_list:REG_EQUAL (const_int 2056 [0x808])
#(nil
addia2,t3,-2040 #, tmp2816, tmp2817 # 2661  [c=4 l=4] 
adddi3/1

# rhs-cred.cc:44: *()[av * 2] - *()[k] - ([az])[2] +
#(insn 2658 2661 1702 (set (reg:DI 28 t3 [2815])
#(const_int 4096 [0x1000])) "rhs-cred.cc":44:29 139 {*movdi_64bit}
# (nil))

li  t3,4096 # tmp2815,  # 2658  [c=1 l=4] 
*movdi_64bit/1

[Bug target/106265] RISC-V SPEC2017 507.cactu code bloat due to address generation

2022-07-11 Thread vineet.gupta at linux dot dev via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265

--- Comment #1 from Vineet Gupta  ---
Analyzed a section of -dP dump where reg a2 is setup with exact same value
while being live.

rhs-cred.cc:42: (*(double *)((char *))[k] + *(double *)((char *)0)[12] +
#(insn 2662 1711 76 (set (reg:DI 12 a2 [2816])
#(const_int 4096 [0x1000])) "rhs-cred.cc":42:47 139 {*movdi_64bit}
# (nil))
li  a2,4096 # tmp2816,  # 2662  [c=4 l=4] 
*movdi_64bit/1


rhs-cred.cc:39: bg = ([az])[0] * *()[av] + *()[2];
#(insn 76 2662 2663 (set (reg/f:DI 20 s4 [orig:193 _239 ] [193])
#(plus:DI (reg:DI 13 a3 [1865])
#(reg:DI 20 s4 [orig:181 _224 ] [181]))) "rhs-cred.cc":39:40 4
{adddi3}
# (expr_list:REG_DEAD (reg:DI 13 a3 [1865])
#(nil)))
add s4,a3,s4# _224, _239, tmp1865   # 76[c=4 l=4] 
adddi3/0


rhs-cred.cc:42: (*(double *)((char *))[k] + *(double *)((char *)0)[12] +
#(insn 2663 76 2660 (set (reg:DI 13 a3 [2815])
#(plus:DI (reg:DI 12 a2 [2816])
#(const_int -2024 [0xf818]))) "rhs-cred.cc":42:47 4
{adddi3}
# (expr_list:REG_DEAD (reg:DI 12 a2 [2816])
#(expr_list:REG_EQUAL (const_int 2072 [0x818])
#(nil
addia3,a2,-2024 #, tmp2815, tmp2816 # 2663  [c=4 l=4] 
adddi3/1

rhs-cred.cc:44: *()[av * 2] - *()[k] - ([az])[2] +
#(insn 2660 2663 1710 (set (reg:DI 12 a2 [2814])
#(const_int 4096 [0x1000])) "rhs-cred.cc":44:29 139 {*movdi_64bit}
# (nil))
li  a2,4096 # tmp2814,  # 2660  [c=4 l=4] 
*movdi_64bit/1