[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-03-03 Thread law at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

Jeffrey A. Law  changed:

   What|Removed |Added

   Priority|P3  |P4

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-28 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #15 from Oleg Endo  ---
(In reply to Roger Sayle from comment #14)
> My apologies for not keeping folks updated on my thinking. Following Oleg's
> feedback, I've decided to slim down my proposed fix to the bare minimum, and
> postpone the other rtx_costs improvements until GCC 15 (or later), when I'll
> have more time to use to CSiBE to demonstrate the benefits/tradeoffs for -Os
> and -Oz.

Alright, I think we've got a couple of issues here.  It's a good idea to split
out and address the actual issue of this PR.  I'm curious to see the effects of
-Oz.


> For example, with fwprop about to transition to insn_cost, it
> would be good for the SH backend to provide a sh_insn_cost target hook.

Nothing against that.  I just really wanted to understand your line of thought
of altering the address costs (i.e. treating all addressing modes with equal
cost 0) in case of optimizing for size.

> 
> The current minimal patch to fix this specific regression is:
> 
> diff --git a/gcc/config/sh/sh.cc b/gcc/config/sh/sh.cc
> index 2c411c3..fba6c0fd465 100644
> --- a/gcc/config/sh/sh.cc
> +++ b/gcc/config/sh/sh.cc
> @@ -3313,7 +3313,8 @@ sh_rtx_costs (rtx x, machine_mode mode
> ATTRIBUTE_UNUSED, i
> nt outer_code,
> {
>   *total = sh_address_cost (XEXP (XEXP (x, 0), 0),
> GET_MODE (XEXP (x, 0)),
> -   MEM_ADDR_SPACE (XEXP (x, 0)), true);
> +   MEM_ADDR_SPACE (XEXP (x, 0)), true)
> +  + COSTS_N_INSNS (1);
>   return true;
> }
>return false;
> 
> The minor complication is that as explained above this results in:
> PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times addc 6
> PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times cmp/pz 25
> PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times shll 3
> PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times subc 14
> 
> which were failures that were fixed (or silenced) by my solution to PR111267.

Yeah, I can imagine that the correction to mem-sign-extend costs will pop
something else.  It should be adapted of course as well.  I'll try to check
that out myself. Although it might take (a bit overloaded at the moment).


> I will note that although the scan-assembler-times complain, that this
> tweak to sh_rtx_costs reduces the total number of instructions in pr59533-1.c
> which (normally) indicates that its an improvement.
> 
> *** old.s   Thu Jan 25 22:54:11 2024
> --- new.s   Thu Jan 25 22:54:23 2024
> ***
> *** 15,23 
> .global test_01
> .type   test_01, @function
>   test_01:
> -   mov.b   @r4,r0
> -   extu.b  r0,r0
> mov.b   @r4,r1
> cmp/pz  r1
> mov #0,r1
> rts
> --- 15,22 
> .global test_01
> .type   test_01, @function
>   test_01:
> mov.b   @r4,r1
> +   extu.b  r1,r0
> cmp/pz  r1
> mov #0,r1
> rts
> ...

Which test is that exactly?  In pr59533-1.c 'test_01' is this:

int
test_01 (unsigned char* a)
{
  /* 1x cmp/pz, 1x addc  */
  return a[0] + (a[0] < 128);
}

... which is totally different from what you posted above.  Something seems to
got mixed up.



> Hence I'm looking into PR59533, which has separate tests for sh2a and !sh2a,
> and my latest discoveries are the -m2a isn't supported if I build gcc using
> --target=sh3-linux-gnu, and that --target=sh2a-linux-gnu doesn't
> automatically default to --target=sh2aeb-linux-gnu and instead gives a fatal
> error about "SH2A does not support little-endian" during the build.  All
> part (joy?) of the learning curve.

Yeah, every backend has its own flavor of setup and dealing with options.
You don't need to build for linux-gnu to check these issues or even run the
CSiBE set. Just configure the build it for --target=sh-elf and it should be
good to go.  By default it builds all multilibs and the default option is -m1
(= SH1).  You can then use -m2a -mb for SH2A.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-27 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #14 from Roger Sayle  ---
My apologies for not keeping folks updated on my thinking. Following Oleg's
feedback, I've decided to slim down my proposed fix to the bare minimum, and
postpone the other rtx_costs improvements until GCC 15 (or later), when I'll
have more time to use to CSiBE to demonstrate the benefits/tradeoffs for -Os
and -Oz.  For example, with fwprop about to transition to insn_cost, it would
be good for the SH backend to provide a sh_insn_cost target hook.

The current minimal patch to fix this specific regression is:

diff --git a/gcc/config/sh/sh.cc b/gcc/config/sh/sh.cc
index 2c411c3..fba6c0fd465 100644
--- a/gcc/config/sh/sh.cc
+++ b/gcc/config/sh/sh.cc
@@ -3313,7 +3313,8 @@ sh_rtx_costs (rtx x, machine_mode mode ATTRIBUTE_UNUSED,
i
nt outer_code,
{
  *total = sh_address_cost (XEXP (XEXP (x, 0), 0),
GET_MODE (XEXP (x, 0)),
-   MEM_ADDR_SPACE (XEXP (x, 0)), true);
+   MEM_ADDR_SPACE (XEXP (x, 0)), true)
+  + COSTS_N_INSNS (1);
  return true;
}
   return false;

The minor complication is that as explained above this results in:
PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times addc 6
PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times cmp/pz 25
PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times shll 3
PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times subc 14

which were failures that were fixed (or silenced) by my solution to PR111267.
I will note that although the scan-assembler-times complain, that this
tweak to sh_rtx_costs reduces the total number of instructions in pr59533-1.c
which (normally) indicates that its an improvement.

*** old.s   Thu Jan 25 22:54:11 2024
--- new.s   Thu Jan 25 22:54:23 2024
***
*** 15,23 
.global test_01
.type   test_01, @function
  test_01:
-   mov.b   @r4,r0
-   extu.b  r0,r0
mov.b   @r4,r1
cmp/pz  r1
mov #0,r1
rts
--- 15,22 
.global test_01
.type   test_01, @function
  test_01:
mov.b   @r4,r1
+   extu.b  r1,r0
cmp/pz  r1
mov #0,r1
rts
...

Hence I'm looking into PR59533, which has separate tests for sh2a and !sh2a,
and my latest discoveries are the -m2a isn't supported if I build gcc using
--target=sh3-linux-gnu, and that --target=sh2a-linux-gnu doesn't automatically
default to --target=sh2aeb-linux-gnu and instead gives a fatal error about
"SH2A does not support little-endian" during the build.  All part (joy?) of the
learning curve.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-26 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #13 from Oleg Endo  ---
(In reply to Roger Sayle from comment #12)
> It should be mentioned that the fwprop fix for PR11267 also resolved several
> FAILs in gcc.target/sh/pr59533.c.  I mention this as tweaking the cost of
> SIGN_EXTEND in sh_rtx_costs interacts with the (redundant) extensions
> mentioned in the initial description of PR59533.

Good to know, thanks!  I'll try to look into it.


> It's still not entirely clear to me why we would want to squash the costs
> of addresses to 0 when optimizing for size?  What does effect does it have
> on the generated code?  I can't imagine how it would be possibly making
> any smaller code?

Roger, could you please comment on that?  I'm still somewhat puzzled...

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-26 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

Roger Sayle  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=59533

--- Comment #12 from Roger Sayle  ---
It should be mentioned that the fwprop fix for PR11267 also resolved several
FAILs in gcc.target/sh/pr59533.c.  I mention this as tweaking the cost of
SIGN_EXTEND in sh_rtx_costs interacts with the (redundant) extensions mentioned
in the initial description of PR59533.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-22 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #11 from Oleg Endo  ---
(In reply to Roger Sayle from comment #10)

> I've found an interesting table of SH cycle counts (for different CPUs) at
> http://www.shared-ptr.com/sh_insns.html

Yeah, I know.  I did that ;)

> In my proposed patch, the address cost (1) when optimizing for size attempts
> to return the additional size of an instruction based on the addressing
> mode.  For register, and reg+reg addressing modes there is no size increase
> (overhead), and for adressing modes with displacements, and displacements to
> address pointers, there is a cost.

AFAIR, I've added the 'sh_address_cost' function.  The intention was/is to
encourage/discourage usage of certain address modes based on the side effects
and impact on the surrounding code.  All insns/addr modes have the same length
and basically same execution time.  However, e.g. @(reg+reg) has a constraint
on 'r0' usage, so I weighted that heavier.  If there's anything that could use
@(reg+disp) as an alternative, that'd be better in some cases. (not sure if
such optimizations actually are done...)

> (2) when optimizing for speed, address
> cost remains between 0 and 3, and is used to prioritize between (equivalent
> numbers of) instructions.  Normally, rtx_costs are defined in terms of
> COST_N_INSNS, which multiplies by 4.  Hence on many platforms a single
> instruction that references memory may be encoded as COSTS_N_INSNS(1)+1 (or
> a more complex addressing mode as COSTS_N_INSNS(1)+2) to show that this is
> disfavored to a single instruction that doesn't reference memory,
> COSTS_N_INSNS(1)+0.

That's actually what sh_rtx_costs was supposed to do as well.  I think in usual
cases it does that, only that apparently I've screwed up the {SIGN|ZERO}_EXTEND
for the case of the mem load and it shows up only now, many years later.

It's still not entirely clear to me why we would want to squash the costs of
addresses to 0 when optimizing for size?  What does effect does it have on the
generated code?  I can't imagine how it would be possibly making any smaller
code?

With your patch, in case of the SIGN_EXTEND with mem operand, it would make the
address cost 0 with -Os, which would return COSTS_N_INSNS(1) for reg operand as
well as mem operand.  So both insns are equally weighted and could be
considered interchangeable.  And we might bump into this type of regression
again, if some (future) optimization decides that it can interchange/substitute
insns of the same cost... 


> For example, SH currently reports multiplications as a single cycle operation,

That doesn't seem to be the case.  It's supposed to be using the function
'multcosts' in sh.cc, which returns at least a cost of '2'.  Note that on SH1
and SH2 there is no dynamic (barrel) shift.  So actually some multiplications
could be faster than stitched shifts.


> sh_rtx_costs doesn't distinguish the machine mode, so the costs of SImode 
> multiplications are the same as DImode multiplications.

I guess this is because SH doesn't have real DImode multiplication (64 x 64 ->
64/128 bit).  It can only do 32 x 32 -> 64 bit widening multiplication.  Any
real DImode multiplication will result in either expanded sequence to calculate
sum of particial products or a libcall, AFAIR

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-22 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #10 from Roger Sayle  ---
Hi Oleg.  Great question.  The "speed" parameter passed to rtx_costs, and
address_cost indicates whether the middle-end is optimizing for peformance, and
interested in the nummber of cycles taken by each instruction, or optimizing
for size, and interested in the number of bytes used to encode the instruction.
 Previously, this speed parameter was ignored by the SH backend, so the costs
were the same independent of the objective function.

In my proposed patch, the address cost (1) when optimizing for size attempts to
return the additional size of an instruction based on the addressing mode.  For
register, and reg+reg addressing modes there is no size increase (overhead),
and for adressing modes with displacements, and displacements to address
pointers, there is a cost.  (2) when optimizing for speed, address cost remains
between 0 and 3, and is used to prioritize between (equivalent numbers of)
instructions.  Normally, rtx_costs are defined in terms of COST_N_INSNS, which
multiplies by 4.  Hence on many platforms a single instruction that references
memory may be encoded as COSTS_N_INSNS(1)+1 (or a more complex addressing mode
as COSTS_N_INSNS(1)+2) to show that this is disfavored to a single instruction
that doesn't reference memory, COSTS_N_INSNS(1)+0.

This is the fix for this particular regression; SIGN_EXTEND of a register now
costs COSTS_N_INSNS(1), and SIGN_EXTEND of a MEM now costs COSTS_N_INSNS(1)+1.

A useful way to debug rtx_costs is to use the -dP command line option, and then
look at the [c=X, l=Y] annotations in the assembly language file.  One way to
check/confirm that these are sensible is that ideally they should be correlated
when optimizing for size (with -Os or -Oz).

I've found an interesting table of SH cycle counts (for different CPUs) at
http://www.shared-ptr.com/sh_insns.html and these could be used to improve
sh_rtx_costs further.  For example, SH currently reports multiplications as
a single cycle operation, which doesn't match the hardware specs, and prevents
GCC from using synth_mult to produce faster (or shorter) sequences using shifts
and additions.  Likewise, sh_rtx_costs doesn't distinguish the machine mode,
so the costs of SImode multiplications are the same as DImode multiplications.

In comment #5 you mention GCC's defaults; it turns out that for rtx_costs the
default values that would be provided by the middle-end, may be more accurate
than the values (currently) specified by the backend.

I hope this answers your question.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-22 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #9 from Oleg Endo  ---
(In reply to Roger Sayle from comment #8)
> Created attachment 57190 [details]
> proposed patch
> 
> Proposed patch to provide a sane/saner set of rtx_costs for SH.  There's
> plenty more that could be done, but these changes are (more than) sufficient
> to resolve the code quality regression caused by improved fwprop.  If
> someone could try this out on SH, and report back the results, that would be
> great.


You've added differentiation for 'speed ?' in 'sh_address_cost'.  Like this
one.

   /* 'GBR + 0'.  Account one more because of R0 restriction.  */
   if (REG_P (x) && REGNO (x) == GBR_REG)
-return 2;
+return speed ? 2 : 0;

What's the intention here?  Why does the cost of the address computation
reduced when not optimizing for speed?  It distorts the address costs and makes
them all equal.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-22 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #8 from Roger Sayle  ---
Created attachment 57190
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57190=edit
proposed patch

Proposed patch to provide a sane/saner set of rtx_costs for SH.  There's plenty
more that could be done, but these changes are (more than) sufficient to
resolve the code quality regression caused by improved fwprop.  If someone
could try this out on SH, and report back the results, that would be great.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-22 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #7 from Oleg Endo  ---
(In reply to Roger Sayle from comment #6)
> To help diagnose the problem, I came up with this simple patch:

Thanks for looking into it!

> which then helps reveal that on sh3-linux-gnu with -O1 we see:

I think this will also happen on all sh-elf sub-targets, not necessarily
sh3-linux... if it helps anything ... 

> propagating insn 6 into insn 12, replacing:
> (set (reg:SI 174 [ _1 ])
> (sign_extend:SI (reg:QI 169 [ *a_7(D) ])))
> successfully matched this instruction to *extendqisi2_compact_snd:
> (set (reg:SI 174 [ _1 ])
> (sign_extend:SI (mem:QI (reg/v/f:SI 168 [ aD.1817 ]) [0 *a_7(D)+0 S1
> A8])))
> change is profitable (cost 4 -> cost 1)
> 
> which confirms Andrew's and Oleg's analyses above; the sh_rtx_costs function
> is a little odd... Reading from memory is four times faster than using a
> pseudo!?
> I'm investigating a "costs" patch for the SH backend.

Looks like sh_rtx_costs function assumes that the costs of the whole RTX are
summed up outside in the caller.

In sh_rtx_costs SIGN_EXTEND, ZERO_EXTEND, the 'sh_address_cost' is returned
directly for the MEM_P case. It should probably have went through COSTS_N_INSN
to get it into the same scale as for the arith_reg_operand case.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-22 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

Roger Sayle  changed:

   What|Removed |Added

   Last reconfirmed||2024-01-22
 Status|UNCONFIRMED |NEW
 CC||roger at nextmovesoftware dot 
com
 Ever confirmed|0   |1

--- Comment #6 from Roger Sayle  ---
To help diagnose the problem, I came up with this simple patch:
diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 7872609b336..dc563ac2ca1 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -492,6 +492,9 @@ try_fwprop_subst_pattern (obstack_watermark ,
insn_change _change,
   " (cost %d -> cost %d)\n", old_cost, new_cost);
ok = false;
  }
+   else if (dump_file)
+ fprintf (dump_file, "change is profitable"
+  " (cost %d -> cost %d)\n", old_cost, new_cost);
   }

   if (!ok)

which then helps reveal that on sh3-linux-gnu with -O1 we see:
propagating insn 6 into insn 12, replacing:
(set (reg:SI 174 [ _1 ])
(sign_extend:SI (reg:QI 169 [ *a_7(D) ])))
successfully matched this instruction to *extendqisi2_compact_snd:
(set (reg:SI 174 [ _1 ])
(sign_extend:SI (mem:QI (reg/v/f:SI 168 [ aD.1817 ]) [0 *a_7(D)+0 S1 A8])))
change is profitable (cost 4 -> cost 1)

which confirms Andrew's and Oleg's analyses above; the sh_rtx_costs function is
a little odd... Reading from memory is four times faster than using a pseudo!?
I'm investigating a "costs" patch for the SH backend.  My apologies for the
temporary inconvenience, and thanks to Jeff for catching/spotting this.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-21 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #5 from Oleg Endo  ---
(In reply to Andrew Pinski from comment #3)

> That seems to make the cost of a load/store if just an index, the same as the 
> cost
> as the index. Which definitely seems wrong. It should be the cost of the 
> load/store
> and the cost of the address formation. 

Yep, that doesn't seem to be accurate.

> The way aarch64 implements its _rtx_costs is that it cases SET and if the LHS 
> is a
> mem, then it is the cost of the store there and returns true (though you 
> might need
> to take into RHS if it can be more than just a register like). And then MEM
> includes the cost of the load.

That's what I'd also do intuitively.

> i386 does something similar too.

If every backend does that same thing, isn't it better to do it as a default
for everybody?  I can imagine the minimalist backends (like RX) will regress on
this case as well. (Haven't checked it though)

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #4 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #3)
> The way aarch64 implements its _rtx_costs is that it cases SET and if the
> LHS is a mem, then it is the cost of the store there and returns true
> (though you might need to take into RHS if it can be more than just a
> register like). And then MEM includes the cost of the load.

i386 does something similar too.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #3 from Andrew Pinski  ---
(In reply to Oleg Endo from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > This is most likely a cost model issue on sh3.
> 
> You mean this one (sh.cc, sh_rtx_costs)?
> 
> /* The cost of a mem access is mainly the cost of the address mode.  */
> case MEM:
>   *total = sh_address_cost (XEXP (x, 0), GET_MODE (x), MEM_ADDR_SPACE
> (x),
>   true);
>   return true;

That seems to make the cost of a load/store if just an index, the same as the
cost as the index. Which definitely seems wrong. It should be the cost of the
load/store and the cost of the address formation. 

The way aarch64 implements its _rtx_costs is that it cases SET and if the LHS
is a mem, then it is the cost of the store there and returns true (though you
might need to take into RHS if it can be more than just a register like). And
then MEM includes the cost of the load.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-21 Thread olegendo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #2 from Oleg Endo  ---
(In reply to Andrew Pinski from comment #1)
> This is most likely a cost model issue on sh3.

You mean this one (sh.cc, sh_rtx_costs)?

/* The cost of a mem access is mainly the cost of the address mode.  */
case MEM:
  *total = sh_address_cost (XEXP (x, 0), GET_MODE (x), MEM_ADDR_SPACE (x),
true);
  return true;

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

Andrew Pinski  changed:

   What|Removed |Added

 Target||sh3-linux-gnu
   Target Milestone|--- |14.0

--- Comment #1 from Andrew Pinski  ---
This is most likely a cost model issue on sh3.