[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-04-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|12.0|13.0
   Priority|P1  |P2

--- Comment #35 from Richard Biener  ---
(In reply to Richard Biener from comment #34)
> As noted the effect of
> 
>   if(...) {
>ux = 0.005;
>uy = 0.002;
>uz = 0.000;
>   }
> 
> is PRE of most(!) dependent instructions, creating
> 
>   # prephitmp_1099 = PHI <_1098(6),
> 6.4997172499889149648879538290202617645263671875e-1(5)>
>   # prephitmp_ = PHI <_1110(6),
> 1.08980570833178483570691241766326129436492919921875e-1(5)>
> ...
> 
> we successfully coalesce the non-constant incoming register with the result
> but have to emit copies for all constants on the other edge where we have
> quite a number of duplicate constants to deal with.
> 
> I've experimented with ensuring we get _full_ PRE of the dependent
> expressions
> by more aggressively re-associating (give PHIs with a constant incoming
> operand
> on at least one edge a rank similar to constants, 1).
> 
> This increases the number of PHIs further but reduces the followup
> computations
> more.  We still fail to simply tail-duplicate the merge block - another
> possibility to eventually save some of the overhead; our tail duplication
> code (gimple-ssa-split-paths.cc) doesn't handle this case since the
> diamond is not the one immediately preceding the loop exit/latch.
> 
> The result of "full PRE" is a little bit worse than the current state (so
> it's not a full solution here).

Btw, looking at coverage, the constant case is only an unimportant fraction
of the runtime, so the register pressure increase by the PRE dominates
(but the branch is predicted to be 50/50):

3562383000:  241:   if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
 55296000:  242:ux = 0.005;
 55296000:  243:uy = 0.002;
 55296000:  244:uz = 0.000;
-:  245:}

we can also see that PGO notices this and we do _not_ perform the PRE.

So the root cause is nothing we can fix for GCC 12; tuning to avoid
spilling to GPRs can recover part of the regression but will definitely
have effects elsewhere.

Re-targeting to GCC 13.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-04-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #34 from Richard Biener  ---
As noted the effect of

  if(...) {
   ux = 0.005;
   uy = 0.002;
   uz = 0.000;
  }

is PRE of most(!) dependent instructions, creating

  # prephitmp_1099 = PHI <_1098(6),
6.4997172499889149648879538290202617645263671875e-1(5)>
  # prephitmp_ = PHI <_1110(6),
1.08980570833178483570691241766326129436492919921875e-1(5)>
...

we successfully coalesce the non-constant incoming register with the result
but have to emit copies for all constants on the other edge where we have
quite a number of duplicate constants to deal with.

I've experimented with ensuring we get _full_ PRE of the dependent expressions
by more aggressively re-associating (give PHIs with a constant incoming operand
on at least one edge a rank similar to constants, 1).

This increases the number of PHIs further but reduces the followup computations
more.  We still fail to simply tail-duplicate the merge block - another
possibility to eventually save some of the overhead; our tail duplication
code (gimple-ssa-split-paths.cc) doesn't handle this case since the
diamond is not the one immediately preceding the loop exit/latch.

The result of "full PRE" is a little bit worse than the current state (so
it's not a full solution here).
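
For reference, a minimal C sketch (not the actual LBM source, just an
illustration of the effect described above): ux/uy/uz are constants on one
path only, so PRE turns the dependent subexpressions into PHIs that merge a
computed value with a compile-time constant, and each such constant needs a
copy on the edge leaving the conditional block.

  /* Hypothetical reduced example, for illustration only.  */
  double collide (double ux, double uy, double uz, double rho, int accel)
  {
    if (accel)          /* rarely taken at runtime, see comment #35 */
      {
        ux = 0.005;
        uy = 0.002;
        uz = 0.000;
      }
    /* With constant ux/uy/uz both u2 and t fold to constants on the if-path,
       so PRE of these computations creates PHIs of the form
         # prephitmp = PHI <t_computed(6), <constant>(5)>  */
    double u2 = 1.5 * (ux * ux + uy * uy + uz * uz);
    double t = 1.0 + 3.0 * ux + 4.5 * ux * ux - u2;
    return rho * t;
  }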

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-04-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #33 from Richard Biener  ---
(In reply to Richard Biener from comment #32)
> The diff with ! added is quite short; I have yet to measure any
> effect on LBM:
> 
> --- streamcollide.s.orig	2022-04-25 11:37:01.638733951 +0200
> +++ streamcollide.s2	2022-04-25 11:35:54.885849296 +0200
> @@ -33,28 +33,24 @@
> .p2align 4
> .p2align 3
>  .L12:
> -   movq.LC0(%rip), %rax
> -   vmovsd  .LC4(%rip), %xmm6
> +   vmovsd  .LC0(%rip), %xmm2
> +   vmovsd  .LC1(%rip), %xmm13
> +   movabsq $0x3ff01878b7a1c25d, %rax
> movabsq $0x3fef85af6c69b5a6, %rdi
> +   vmovsd  .LC2(%rip), %xmm12
> +   vmovsd  .LC3(%rip), %xmm14
> movabsq $0x3ff03db8fde2ef4e, %r8
> +   movabsq $0x3fefcea39c51dabe, %r9
> +   vmovsd  .LC4(%rip), %xmm6
> vmovsd  .LC5(%rip), %xmm7
> movq.LC8(%rip), %r11
> -   movabsq $0x3fefcea39c51dabe, %r9
> movq.LC6(%rip), %rdx
> movq.LC7(%rip), %rcx
> -   vmovq   %rax, %xmm2
> -   vmovq   %rax, %xmm4
> -   movq.LC1(%rip), %rax
> movq%r11, %rsi
> movq%r11, %r12
> -   vmovq   %rax, %xmm13
> -   vmovq   %rax, %xmm8
> -   movq.LC2(%rip), %rax
> -   vmovq   %rax, %xmm12
> -   vmovq   %rax, %xmm5
> -   movq.LC3(%rip), %rax
> -   vmovq   %rax, %xmm14
> -   movabsq $0x3ff01878b7a1c25d, %rax
> +   vmovsd  %xmm2, %xmm2, %xmm4
> +   vmovsd  %xmm13, %xmm13, %xmm8
> +   vmovsd  %xmm12, %xmm12, %xmm5
> vmovsd  %xmm14, -16(%rsp)
>  .L5:
> vmulsd  .LC9(%rip), %xmm0, %xmm3

Huh, and the net effect is that the + code is 9% _slower_!?

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-04-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #32 from Richard Biener  ---
So the bad "head" can be fixed via

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index c74edd1aaef..8f9f26e0a82 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3580,9 +3580,9 @@
 ;; Possible store forwarding (partial memory) stall in alternatives 4, 6 and 7.
 (define_insn "*movdf_internal"
   [(set (match_operand:DF 0 "nonimmediate_operand"
-"=Yf*f,m   ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,r ,v,r 
,o ,r  ,m")
+"=Yf*f,m   ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,!r,!v,r
 ,o ,r  ,m")
(match_operand:DF 1 "general_operand"
-"Yf*fm,Yf*f,G   ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x,v,r
,roF,rF,rmF,rC"))]
+"Yf*fm,Yf*f,G   ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m
,*x,!v,!r,roF,rF,rmF,rC"))]
   "!(MEM_P (operands[0]) && MEM_P (operands[1]))
&& (lra_in_progress || reload_completed
|| !CONST_DOUBLE_P (operands[1])

which is adding ! to r<->v alternatives.  That should eventually be done
by duplicating the alternatives and enabling one set via some enable
attribute based on some tunable.  I see those alternatives are already

   (set (attr "preferred_for_speed")
 (cond [(eq_attr "alternative" "3,4")
  (symbol_ref "TARGET_INTEGER_DFMODE_MOVES")
(eq_attr "alternative" "20")
  (symbol_ref "TARGET_INTER_UNIT_MOVES_FROM_VEC")
(eq_attr "alternative" "21")
  (symbol_ref "TARGET_INTER_UNIT_MOVES_TO_VEC")
   ] 
   (symbol_ref "true")))

not sure why it's preferred_for_speed here though - shouldn't that be
enabled for size if !TARGET_INTER_UNIT_MOVES_{TO,FROM}_VEC and otherwise
disabled?  Not sure if combining enabled and preferred_for_speed is
reasonably possible, but we have a preferred_for_size attribute here.

The diff with ! added is quite short; I have yet to measure any
effect on LBM:

--- streamcollide.s.orig	2022-04-25 11:37:01.638733951 +0200
+++ streamcollide.s2	2022-04-25 11:35:54.885849296 +0200
@@ -33,28 +33,24 @@
.p2align 4
.p2align 3
 .L12:
-   movq.LC0(%rip), %rax
-   vmovsd  .LC4(%rip), %xmm6
+   vmovsd  .LC0(%rip), %xmm2
+   vmovsd  .LC1(%rip), %xmm13
+   movabsq $0x3ff01878b7a1c25d, %rax
movabsq $0x3fef85af6c69b5a6, %rdi
+   vmovsd  .LC2(%rip), %xmm12
+   vmovsd  .LC3(%rip), %xmm14
movabsq $0x3ff03db8fde2ef4e, %r8
+   movabsq $0x3fefcea39c51dabe, %r9
+   vmovsd  .LC4(%rip), %xmm6
vmovsd  .LC5(%rip), %xmm7
movq.LC8(%rip), %r11
-   movabsq $0x3fefcea39c51dabe, %r9
movq.LC6(%rip), %rdx
movq.LC7(%rip), %rcx
-   vmovq   %rax, %xmm2
-   vmovq   %rax, %xmm4
-   movq.LC1(%rip), %rax
movq%r11, %rsi
movq%r11, %r12
-   vmovq   %rax, %xmm13
-   vmovq   %rax, %xmm8
-   movq.LC2(%rip), %rax
-   vmovq   %rax, %xmm12
-   vmovq   %rax, %xmm5
-   movq.LC3(%rip), %rax
-   vmovq   %rax, %xmm14
-   movabsq $0x3ff01878b7a1c25d, %rax
+   vmovsd  %xmm2, %xmm2, %xmm4
+   vmovsd  %xmm13, %xmm13, %xmm8
+   vmovsd  %xmm12, %xmm12, %xmm5
vmovsd  %xmm14, -16(%rsp)
 .L5:
vmulsd  .LC9(%rip), %xmm0, %xmm3

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-04-11 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed|2021-09-03 00:00:00 |2022-4-11

--- Comment #31 from Richard Biener  ---
For the testcase I still see 58 vmovq GPR <-> XMM at -Ofast -march=znver2,
resulting from spilling of XMMs.  And there are still cases like

movq.LC0(%rip), %rax
...
vmovq   %rax, %xmm2
vmovq   %rax, %xmm4

Honza - you said you wanted to have a look here.  As I understand from Vlad,
costing isn't taken into account when choosing alternatives for reloading.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-02-10 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #30 from Vladimir Makarov  ---
(In reply to Richard Biener from comment #29)
> (In reply to Vladimir Makarov from comment #28)
> > Could somebody benchmark the following patch on zen2 470.lbm.
> 
> Code generation changes quite a bit, with the patch the offending function
> is 16 bytes larger.  I see no large immediate moves to GPRs anymore but
> there is still a lot of spilling of XMMs to GPRs.  Performance is
> unchanged by the patch:
> 
> 470.lbm 13740128107 S   13740128107 S
> 470.lbm 13740128107 *   13740128107 S
> 470.lbm 13740128107 S   13740128107 *
> 
> 

Thank you very much for testing the patch, Richard.  The results mean the
patch is a no-go to me.

> Without knowing much of the code I wonder if we can check whether the move
> will be to a reg in GENERAL_REGS?  That is, do we know whether there are
> immediate moves (besides some special constants like zero) to the
> destination register class?
>

There is no such info from the target code.  Ideally we need to have the cost
of loading a *particular* immediate value into a register class on the same
cost basis as load/store.  Still, to use this info efficiently, choosing
alternatives should be based on costs, not on hints and machine-independent
general heuristics (as now).


> That said, given the result on LBM I'd not change this at this point.
> 
> Honza wanted to look at the move pattern to try to mitigate the
> GPR spilling of XMMs.
> 
> I do think that we need to take costs into account at some point and get
> rid of the reload style hand-waving with !?* in the move patterns.

In general I agree with the direction, but it will be quite hard to do.  I
know it well from my experience with changing the register class cost
calculation algorithm in IRA (the experimental code can be found on the branch
ira-select).  I expect a huge number of test failures, some benchmark
performance degradation for practically any target, and a big involvement of
target maintainers to fix them.  Although it is possible to try to do this for
one target at a time.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-02-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #29 from Richard Biener  ---
(In reply to Vladimir Makarov from comment #28)
> Could somebody benchmark the following patch on zen2 470.lbm.
> 
> diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
> index 9cee17479ba..76619aca8eb 100644
> --- a/gcc/lra-constraints.cc
> +++ b/gcc/lra-constraints.cc
> @@ -5084,7 +5089,9 @@ lra_constraints (bool first_p)
>  (x, lra_get_allocno_class (i)) == NO_REGS))
> || contains_symbol_ref_p (x
>   ira_reg_equiv[i].defined_p = false;
> -   if (contains_reg_p (x, false, true))
> +   if (contains_reg_p (x, false, true)
> +   || (CONST_DOUBLE_P (x)
> +   && maybe_ge (GET_MODE_SIZE (GET_MODE (x)), 8)))
>   ira_reg_equiv[i].profitable_p = false;
> if (get_equiv (reg) != reg)
>   bitmap_ior_into (equiv_insn_bitmap,
   &lra_reg_info[i].insn_bitmap);
> 
> If it improves the performance, I'll commit this patch.
> 
> The expander unconditionally uses memory pool for double constants.  I think
> the analogous treatment could be done for equiv double constants in LRA.
> 
> As far as I know, only x86_64 permits 64-bit constants as immediates for
> moving them into general regs.  As double FP operations are in most cases
> not done in general regs, the values have to be moved into FP regs, and this
> is costly, as Jan wrote.  So it makes sense to prohibit using equiv double
> constant values in LRA unconditionally.  If in the future we have a target
> which can move a double immediate into FP regs, we can introduce a target
> hook to deal with equiv double constants.  But right now I think there is no
> need for a new hook.

Code generation changes quite a bit, with the patch the offending function
is 16 bytes larger.  I see no large immediate moves to GPRs anymore but
there is still a lot of spilling of XMMs to GPRs.  Performance is
unchanged by the patch:

470.lbm 13740128107 S   13740128107 S
470.lbm 13740128107 *   13740128107 S
470.lbm 13740128107 S   13740128107 *

I've used

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 9cee17479ba..a0ec608c056 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -5084,7 +5084,9 @@ lra_constraints (bool first_p)
 (x, lra_get_allocno_class (i)) == NO_REGS))
|| contains_symbol_ref_p (x
  ira_reg_equiv[i].defined_p = false;
-   if (contains_reg_p (x, false, true))
+   if (contains_reg_p (x, false, true)
+   || (CONST_DOUBLE_P (x)
+   && maybe_ge (GET_MODE_SIZE (GET_MODE (x)), UNITS_PER_WORD)))
  ira_reg_equiv[i].profitable_p = false;
if (get_equiv (reg) != reg)
  bitmap_ior_into (equiv_insn_bitmap,
   &lra_reg_info[i].insn_bitmap);

note UNITS_PER_WORD vs. literal 8.

Without knowing much of the code I wonder if we can check whether the move
will be to a reg in GENERAL_REGS?  That is, do we know whether there are
immediate moves (besides some special constants like zero) to the
destination register class?

That said, given the result on LBM I'd not change this at this point.

Honza wanted to look at the move pattern to try to mitigate the
GPR spilling of XMMs.

I do think that we need to take costs into account at some point and get
rid of the reload style hand-waving with !?* in the move patterns.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-02-09 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #28 from Vladimir Makarov  ---
Could somebody benchmark the following patch on zen2 470.lbm.

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 9cee17479ba..76619aca8eb 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -5084,7 +5089,9 @@ lra_constraints (bool first_p)
 (x, lra_get_allocno_class (i)) == NO_REGS))
|| contains_symbol_ref_p (x
  ira_reg_equiv[i].defined_p = false;
-   if (contains_reg_p (x, false, true))
+   if (contains_reg_p (x, false, true)
+   || (CONST_DOUBLE_P (x)
+   && maybe_ge (GET_MODE_SIZE (GET_MODE (x)), 8)))
  ira_reg_equiv[i].profitable_p = false;
if (get_equiv (reg) != reg)
  bitmap_ior_into (equiv_insn_bitmap,
   &lra_reg_info[i].insn_bitmap);

If it improves the performance, I'll commit this patch.

The expander unconditionally uses the constant memory pool for double
constants.  I think analogous treatment could be done for equiv double
constants in LRA.

As far as I know, only x86_64 permits 64-bit constants as immediates for moving
them into general regs.  As double FP operations are in most cases not done in
general regs, the values have to be moved into FP regs, and this is costly, as
Jan wrote.  So it makes sense to prohibit using equiv double constant values in
LRA unconditionally.  If in the future we have a target which can move a double
immediate into FP regs, we can introduce a target hook to deal with equiv
double constants.  But right now I think there is no need for a new hook.
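
To make the trade-off concrete, a small illustration (hedged: not from the
benchmark itself; the constant value is taken from the dumps in comment #17):

  /* A double constant feeding FP code.  The dumps in this PR show two ways
     such a constant can reach an XMM register:
       (a) through a GPR immediate:  movabsq $<imm64>, %rax ; vmovq %rax, %xmm2
       (b) from the constant pool:   vmovsd  .LC0(%rip), %xmm2
     The patch above marks such an equiv constant as unprofitable in LRA so
     that (b) is kept instead of (a).  */
  double scale (double x)
  {
    return x * 9.8506899724167309977929107844829559326171875e-1;
  }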

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-28 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #27 from Vladimir Makarov  ---
(In reply to Richard Biener from comment #17)
> So in .reload we have (with unpatched trunk)
> 
>   401: NOTE_INSN_BASIC_BLOCK 6
>   462: ax:DF=[`*.LC0']
>   REG_EQUAL 9.8506899724167309977929107844829559326171875e-1
>   407: xmm2:DF=ax:DF
>   463: ax:DF=[`*.LC0']
>   REG_EQUAL 9.8506899724167309977929107844829559326171875e-1
>   408: xmm4:DF=ax:DF
> 
> why??!  We can load .LC0 into xmm4 directly.  IRA sees
> 
>   401: NOTE_INSN_BASIC_BLOCK 6
>   407: r118:DF=r482:DF
>   408: r119:DF=r482:DF
> 
> now I cannot really decipher IRA or LRA dumps but my guess would be that
> inheritance (causing us to load from LC0) interferes badly with register
> class assignment?
> 
> Changing pseudo 482 in operand 1 of insn 407 on equiv
> 9.8506899724167309977929107844829559326171875e-1
> ...
>   alt=21,overall=9,losers=1,rld_nregs=1
>  Choosing alt 21 in insn 407:  (0) v  (1) r {*movdf_internal}
>   Creating newreg=525, assigning class GENERAL_REGS to r525
>   407: r118:DF=r525:DF
> Inserting insn reload before:
>   462: r525:DF=[`*.LC0']
>   REG_EQUAL 9.8506899724167309977929107844829559326171875e-1
> 
> we should have preferred alt 14 I think (0) v (1) m, but that has
> 
>   alt=14,overall=13,losers=1,rld_nregs=0
> 0 Spill pseudo into memory: reject+=3
> Using memory insn operand 0: reject+=3
> 0 Non input pseudo reload: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=15,overall=28,losers=3 -- refuse
> 0 Costly set: reject++
> alt=16: Bad operand -- refuse
> 0 Costly set: reject++
> 1 Costly loser: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=17,overall=17,losers=2 -- refuse
> 0 Costly set: reject++
> 1 Spill Non-pseudo into memory: reject+=3
> Using memory insn operand 1: reject+=3
> 1 Non input pseudo reload: reject++
> alt=18,overall=14,losers=1 -- refuse
> 0 Spill pseudo into memory: reject+=3
> Using memory insn operand 0: reject+=3
> 0 Non input pseudo reload: reject++
> 1 Costly loser: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=19,overall=29,losers=3 -- refuse
> 0 Non-prefered reload: reject+=600
> 0 Non input pseudo reload: reject++
> alt=20,overall=607,losers=1 -- refuse
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> 
> I'm not sure I can decipher the reasoning but I don't understand how it
> doesn't seem to anticipate the cost of reloading the GPR in the alternative
> it chooses?
> 
> Vlad?

All these diagnostics are just a description of voodoo from the old reload
pass.  LRA chooses alternatives the same way as the old reload pass did (I
doubt that any other approach would not break all existing targets).  Simply
put, the old reload pass does not report its decisions in the dump.

The LRA code choosing the insn alternatives
(lra-constraints.cc::process_alt_operands), like the old reload pass, does not
use any memory or register move costs.  Instead, the alternative is chosen by
heuristics and insn constraint hints (like ? and !).  The only case where these
costs are used is when we have reg:=reg and the register move cost for this is
2; in that case LRA (like reload) does not bother to check the insn
constraints.
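
A rough conceptual sketch of what the "alt=...,overall=...,losers=..." lines
in comment #17 show (this is not the real lra-constraints.cc code; the names
and the per-loser factor are inferred from those dump lines): every
alternative accumulates "reject" penalties from the heuristics and constraint
hints, the overall value adds a fixed penalty per loser, and the smallest
overall value wins - no target register or memory move costs enter the
comparison.

  /* Illustrative only.  E.g. alt=21: reject 3 + 6*1 loser = overall 9.  */
  struct alt_score { int reject; int losers; };

  static int overall (struct alt_score s)
  {
    return s.reject + 6 * s.losers;
  }

  static int choose_alternative (const struct alt_score *alts, int n_alts)
  {
    int best = 0;
    for (int i = 1; i < n_alts; i++)
      if (overall (alts[i]) < overall (alts[best]))
        best = i;
    return best;   /* no register/memory move costs consulted */
  }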

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-28 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #26 from Vladimir Makarov  ---
(In reply to Richard Biener from comment #7)
> make costs in a way that IRA/LRA prefer re-materialization of constants
> from the constant pool over spilling to GPRs (if that's possible at all -
> Vlad?)

LRA rematerialization cannot rematerialize a constant value from the constant
memory pool.  It can only rematerialize the value of an expression consisting
of other pseudos (currently assigned to hard regs) and constants.

I guess the rematerialization pass could be extended to work for constants from
the constant memory pool.  It is a pretty doable project, as opposed to
rematerialization of any memory, which would require a lot of analysis
including aliasing and complicated cost/benefit calculation.  Maybe somebody
could pick this project up.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #25 from Hongtao.liu  ---
> I guess we need to let RA know that mem is cheaper than a GPR for r249.

Reduce sse_store cost?

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #24 from Hongtao.liu  ---
for
vmovq   %rdi, %xmm7 # 503   [c=4 l=4]  *movdf_internal/21
..
vmulsd  %xmm7, %xmm4, %xmm5 # 320   [c=12 l=4]  *fop_df_comm/2
..
movabsq $0x3fef85af6c69b5a6, %rdi   # 409   [c=5 l=10]  *movdf_internal/11



and 
(insn 320 319 322 8 (set (reg:DF 441)
        (mult:DF (reg:DF 166 [ _323 ])
            (reg:DF 249 [ _900 ]))) "../test.c":87:218 1072 {*fop_df_comm}
     (expr_list:REG_DEAD (reg:DF 249 [ _900 ])
        (nil)))

RA allocates rdi for r249 because the cost of a general reg is cheaper than mem:

  a66(r249,l1) costs: AREG:5964,5964 DREG:5964,5964 CREG:5964,5964
BREG:5964,5964 SIREG:5964,5964 DIREG:5964,5964 AD_REGS:5964,5964
CLOBBERED_REGS:5964,5964 Q_REGS:5964,5964 NON_Q_REGS:5964,5964
TLS_GOTBASE_REGS:5964,5964 GENERAL_REGS:5964,5964 FP_TOP_REG:19546,19546
FP_SECOND_REG:19546,19546 FLOAT_REGS:19546,19546 SSE_FIRST_REG:0,0
NO_REX_SSE_REGS:0,0 SSE_REGS:0,0 FLOAT_SSE_REGS:19546,19546
FLOAT_INT_REGS:19546,19546 INT_SSE_REGS:19546,19546
FLOAT_INT_SSE_REGS:19546,19546 MEM:6294,6294

 950r249: preferred SSE_REGS, alternative GENERAL_REGS, allocno
INT_SSE_REGS


Disposition:
 66:r249 l1 5

With -mtune=alderlake, for r249 the cost of general regs is more expensive than
mem, and RA will allocate memory for it; then no more movabsq/vmovq is needed.

 655  a66(r249,l1) costs: AREG:5964,5964 DREG:5964,5964 CREG:5964,5964
BREG:5964,5964 SIREG:5964,5964 DIREG:5964,5964 AD_REGS:5964,5964
CLOBBERED_REGS:5964,5964 Q_REGS:5964,5964 NO\
N_Q_REGS:5964,5964 TLS_GOTBASE_REGS:5964,5964 GENERAL_REGS:5964,5964
FP_TOP_REG:14908,14908 FP_SECOND_REG:14908,14908 FLOAT_REGS:14908,14908
SSE_FIRST_REG:0,0 NO_REX_SSE_REGS:0\
,0 SSE_REGS:0,0 FLOAT_SSE_REGS:14908,14908 FLOAT_INT_REGS:14908,14908
INT_SSE_REGS:14908,14908 FLOAT_INT_SSE_REGS:14908,14908 MEM:5632,5632


 950r249: preferred SSE_REGS, alternative NO_REGS, allocno SSE_REGS

 66:r249 l1   mem

vmulsd  -80(%rsp), %xmm2, %xmm3 # 320   [c=29 l=6]  *fop_df_comm/2

I guess we need to let RA know that mem is cheaper than a GPR for r249.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

Richard Biener  changed:

   What|Removed |Added

 Status|WAITING |NEW

--- Comment #23 from Richard Biener  ---
(In reply to H.J. Lu from comment #22)
> Is this related to PR 104059? Can you try:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589209.html

I verified that hardreg cprop does nothing on the testcase, so no.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

H.J. Lu  changed:

   What|Removed |Added

 Status|NEW |WAITING

--- Comment #22 from H.J. Lu  ---
Is this related to PR 104059? Can you try:

https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589209.html

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #21 from hubicka at kam dot mff.cuni.cz ---
> I would say so.  It saves code size and also uop space unless the two
> can magically fuse into an immediate-to-%xmm move (I doubt that).
I made a simple benchmark:

double a=10;
int
main()
{
  long int i;
  double sum = 0, val1, val2, val3, val4;
  for (i=0;i<10;i++)
    {
#if 1
#if 1
      /* variant 1: build the constant in a GPR, then move it to an XMM reg */
      asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8; vmovq %%r8, %0": "=x"(val1): :"r8","xmm11");
      asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8; vmovq %%r8, %0": "=x"(val2): :"r8","xmm11");
      asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8; vmovq %%r8, %0": "=x"(val3): :"r8","xmm11");
      asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8; vmovq %%r8, %0": "=x"(val4): :"r8","xmm11");
#else
      /* variant 2: load from memory into a GPR, then move it to an XMM reg */
      asm __volatile__("movq %1, %%r8; vmovq %%r8, %0": "=x"(val1):"m"(a) :"r8","xmm11");
      asm __volatile__("movq %1, %%r8; vmovq %%r8, %0": "=x"(val2):"m"(a) :"r8","xmm11");
      asm __volatile__("movq %1, %%r8; vmovq %%r8, %0": "=x"(val3):"m"(a) :"r8","xmm11");
      asm __volatile__("movq %1, %%r8; vmovq %%r8, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
#else
      /* variant 3: load from memory directly into an XMM reg */
      asm __volatile__("vmovq %1, %0": "=x"(val1):"m"(a) :"r8","xmm11");
      asm __volatile__("vmovq %1, %0": "=x"(val2):"m"(a) :"r8","xmm11");
      asm __volatile__("vmovq %1, %0": "=x"(val3):"m"(a) :"r8","xmm11");
      asm __volatile__("vmovq %1, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
      sum += val1 + val2 + val3 + val4;
    }
  return sum;
}

and indeed the third variant runs in 1.2s while the first two take an equal
2.4s on my Zen2 laptop.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #20 from rguenther at suse dot de  ---
On Thu, 27 Jan 2022, hubicka at kam dot mff.cuni.cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
> 
> --- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> > 
> > Yep, we also have code like
> > 
> > -   movabsq $0x3ff03db8fde2ef4e, %r8
> > ...
> > -   vmovq   %r8, %xmm11
> 
> It is loading a random constant into xmm11.  Since reg<->xmm moves are
> relatively cheap, it looks OK to me that we generate this.  Is it faster
> to load the constant from memory?

I would say so.  It saves code size and also uop space unless the two
can magically fuse into an immediate-to-%xmm move (I doubt that).

> > movq.LC11(%rip), %rax
> > vmovq   %rax, %xmm14
> This is odd indeed, and even more odd that we do both a movabs and a memory
> load...  The i386 FE plays some games with allowing some constants in SSE
> instructions (to allow simplification and combining) and splits them out
> to memory later.  It may be a consequence of this.

I've pasted the LRA dump pieces I think are relevant but I don't
understand them.  The constant load isn't visible originally but
is introduced by LRA so that may be the key to the mystery here.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #19 from rguenther at suse dot de  ---
On Thu, 27 Jan 2022, hubicka at kam dot mff.cuni.cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
> 
> --- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> > > According to znver2_cost
> > > 
> > The cost of sse_to_integer is a little bit less than fp_store; maybe
> > increasing the sse_to_integer cost (to more than fp_store) can help RA to
> > choose memory instead of GPR.
> > 
> > That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> > but GPR<->xmm should be more expensive than GPR/xmm<->stack.  As said above
> > Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat 
> > special,
> > but modeling that doesn't seem to be necessary.
> > 
> > We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> > gpr<->xmm move cost to 8.
> 
> I was simply following latencies here, so indeed reg<->mem bypass is not
> really modelled.  I recall doing few experiments which was kind of
> inconclusive.

Yes, I think xmm->gpr->xmm vs. xmm->mem->xmm isn't really the issue here
but it's mem->gpr->xmm vs. mem->xmm with all the constant pool remats.
Agner lists latency of 3 for gpr<->xmm and a latency of 4 for mem<->xmm
but then there's forwarding (and the clever renaming trick) which likely
makes xmm->mem->xmm cheaper than 4 + 4 but xmm->gpr->xmm will really
be 3 + 3 latency.  gpr<->xmm moves also seem to be more resource constrained.

In any case, for moving xmm to gpr it doesn't make sense to go through
memory, but it doesn't seem worthwhile to spill to xmm or gpr when we only
use gpr / xmm later.

That leaves aside the odd and bogus code we generate for the .LC0
re-materialization, which we should fix - and fixing it will likely fix LBM.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #18 from Richard Biener  ---
For the case of LBM what also helps is disabling PRE or using PGO (which
sees the useless PRE), given that the path on which the expressions become
partially compile-time computable is never taken at runtime.  In theory we
could isolate that path completely via -ftracer, but the pass pipeline setup
doesn't look optimal.
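
(For reference, that would be e.g. -fno-tree-pre to disable PRE, or
-fprofile-generate / -fprofile-use for PGO; the flags are named here just for
illustration.)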

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #17 from Richard Biener  ---
So in .reload we have (with unpatched trunk)

  401: NOTE_INSN_BASIC_BLOCK 6
  462: ax:DF=[`*.LC0']
  REG_EQUAL 9.8506899724167309977929107844829559326171875e-1
  407: xmm2:DF=ax:DF
  463: ax:DF=[`*.LC0']
  REG_EQUAL 9.8506899724167309977929107844829559326171875e-1
  408: xmm4:DF=ax:DF

why??!  We can load .LC0 into xmm4 directly.  IRA sees

  401: NOTE_INSN_BASIC_BLOCK 6
  407: r118:DF=r482:DF
  408: r119:DF=r482:DF

now I cannot really decipher IRA or LRA dumps but my guess would be that
inheritance (causing us to load from LC0) interferes badly with register
class assignment?

Changing pseudo 482 in operand 1 of insn 407 on equiv
9.8506899724167309977929107844829559326171875e-1
...
  alt=21,overall=9,losers=1,rld_nregs=1
 Choosing alt 21 in insn 407:  (0) v  (1) r {*movdf_internal}
  Creating newreg=525, assigning class GENERAL_REGS to r525
  407: r118:DF=r525:DF
Inserting insn reload before:
  462: r525:DF=[`*.LC0']
  REG_EQUAL 9.8506899724167309977929107844829559326171875e-1

we should have preferred alt 14 I think (0) v (1) m, but that has

  alt=14,overall=13,losers=1,rld_nregs=0
0 Spill pseudo into memory: reject+=3
Using memory insn operand 0: reject+=3
0 Non input pseudo reload: reject++
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++
alt=15,overall=28,losers=3 -- refuse
0 Costly set: reject++
alt=16: Bad operand -- refuse
0 Costly set: reject++
1 Costly loser: reject++
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++
alt=17,overall=17,losers=2 -- refuse
0 Costly set: reject++
1 Spill Non-pseudo into memory: reject+=3
Using memory insn operand 1: reject+=3
1 Non input pseudo reload: reject++
alt=18,overall=14,losers=1 -- refuse
0 Spill pseudo into memory: reject+=3
Using memory insn operand 0: reject+=3
0 Non input pseudo reload: reject++
1 Costly loser: reject++
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++
alt=19,overall=29,losers=3 -- refuse
0 Non-prefered reload: reject+=600
0 Non input pseudo reload: reject++
alt=20,overall=607,losers=1 -- refuse
1 Non-pseudo reload: reject+=2
1 Non input pseudo reload: reject++

I'm not sure I can decipher the reasoning but I don't understand how it
doesn't seem to anticipate the cost of reloading the GPR in the alternative
it chooses?

Vlad?

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #16 from hubicka at kam dot mff.cuni.cz ---
> 
> Yep, we also have code like
> 
> -   movabsq $0x3ff03db8fde2ef4e, %r8
> ...
> -   vmovq   %r8, %xmm11

It is loading a random constant into xmm11.  Since reg<->xmm moves are
relatively cheap, it looks OK to me that we generate this.  Is it faster
to load the constant from memory?
> movq.LC11(%rip), %rax
> vmovq   %rax, %xmm14
This is odd indeed, and even more odd that we do both a movabs and a memory
load...  The i386 FE plays some games with allowing some constants in SSE
instructions (to allow simplification and combining) and splits them out
to memory later.  It may be a consequence of this.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #15 from Richard Biener  ---
Created attachment 52300
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52300&action=edit
LBM_performStreamCollide testcase

This is the relevant function.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #14 from Richard Biener  ---
(In reply to Hongtao.liu from comment #9)
> 1703 :   401cb1: vmovq  %xmm1,%r9   (*)
>  834 :   401cb6: vmovq  %r8,%xmm1
> 1719 :   401cbb: vmovq  %r9,%xmm0   (*)
> 
> Look like %r9 is dead after the second (*), and it can be optimized to
> 
> 1703 :   401cb1: vmovq  %xmm1,%xmm0
>  834 :   401cb6: vmovq  %r8,%xmm1

Yep, we also have code like

-   movabsq $0x3ff03db8fde2ef4e, %r8
...
-   vmovq   %r8, %xmm11

or

movq.LC11(%rip), %rax
vmovq   %rax, %xmm14

which is extremely odd to see ... (I didn't check how we arrive at that)

When I do

diff --git a/gcc/config/i386/x86-tune-costs.h
b/gcc/config/i386/x86-tune-costs.h
index 017ffa69958..4c51358d7b6 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1585,7 +1585,7 @@ struct processor_costs znver2_cost = {
   in 32,64,128,256 and 512-bit.  */
   {8, 8, 8, 8, 16},/* cost of storing SSE registers
   in 32,64,128,256 and 512-bit.  */
-  6, 6,				/* SSE->integer and integer->SSE
+  8, 8,				/* SSE->integer and integer->SSE
 					   moves.  */
   8, 8,				/* mask->integer and integer->mask moves */
   {6, 6, 6},   /* cost of loading mask register

performance improves from 128 seconds to 115 seconds.  The result is
a lot more stack spilling in the code but there are still cases like

movq.LC8(%rip), %rax
vmovsd  .LC13(%rip), %xmm6
vmovsd  .LC16(%rip), %xmm11
vmovsd  .LC14(%rip), %xmm3
vmovsd  .LC12(%rip), %xmm14
vmovq   %rax, %xmm2
vmovq   %rax, %xmm0
movq.LC9(%rip), %rax

see how we load .LC8 to %rax just to move it to xmm2 and xmm0 instead of
at least moving xmm2 to xmm0 (maybe that's now cprop_hardreg) or also
loading directly to xmm0.

In the end register pressure is the main issue but how we deal with it
is bad.  It's likely caused by a combination of PRE & hoisting & sinking
which together exploit

  if( ((*((unsigned int*) ((void*) (&srcGrid)[((FLAGS)+N_CELL_ENTRIES*((0)+
(0)*(1*(100))+(0)*(1*(100))*(1*(100+(i)]))) & (ACCEL))) {
   ux = 0.005;
   uy = 0.002;
   uz = 0.000;
  }

which makes the following computations partly compile-time resolvable.

I still think the above code generation issues need to be analyzed and
we should figure out why we emit this weird code under register pressure.
I'll attach a testcase that has the function in question split out for
easier analysis.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread hubicka at kam dot mff.cuni.cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> > According to znver2_cost
> > 
> > The cost of sse_to_integer is a little bit less than fp_store; maybe
> > increasing the sse_to_integer cost (to more than fp_store) can help RA to
> > choose memory instead of GPR.
> 
> That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> but GPR<->xmm should be more expensive than GPR/xmm<->stack.  As said above
> Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
> but modeling that doesn't seem to be necessary.
> 
> We seem to have store costs of 8 and load costs of 6, I'll try bumping the
> gpr<->xmm move cost to 8.

I was simply following latencies here, so indeed reg<->mem bypass is not
really modelled.  I recall doing a few experiments which were kind of
inconclusive.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #12 from Richard Biener  ---
(In reply to Hongtao.liu from comment #10)
> (In reply to Richard Biener from comment #8)
> > So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
> > -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
> > this improves to 114 seconds, with sink2 disabled I get 108 seconds
> > and with the tune-ctrl ontop I get 113 seconds.
> > 
> > Note that Zen2 is quite special in that it has the ability to handle
> > load/store from the stack by mapping it to a register, effectively
> > making them zero latency (zen3 lost this ability).
> > 
> > So while moves between GPRs and XMM might not be bad anymore _spilling_
> > to a GPR (and I suppose XMM, too) is still a bad idea and the stack
> > should be preferred.
> > 
> 
> According to znver2_cost
> 
> The cost of sse_to_integer is a little bit less than fp_store; maybe
> increasing the sse_to_integer cost (to more than fp_store) can help RA to
> choose memory instead of GPR.

That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
but GPR<->xmm should be more expensive than GPR/xmm<->stack.  As said above
Zen2 can do reg -> mem, mem -> reg via renaming if 'mem' is somewhat special,
but modeling that doesn't seem to be necessary.

We seem to have store costs of 8 and load costs of 6, I'll try bumping the
gpr<->xmm move cost to 8.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #11 from Richard Biener  ---
(In reply to Richard Biener from comment #8)
> So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
> -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
> this improves to 114 seconds, with sink2 disabled I get 108 seconds
> and with the tune-ctrl on top I get 113 seconds.

With -Ofast -march=znver2 -fschedule-insns -fsched-pressure I get 113 seconds.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #10 from Hongtao.liu  ---
(In reply to Richard Biener from comment #8)
> So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
> -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
> this improves to 114 seconds, with sink2 disabled I get 108 seconds
> and with the tune-ctrl on top I get 113 seconds.
> 
> Note that Zen2 is quite special in that it has the ability to handle
> load/store from the stack by mapping it to a register, effectively
> making them zero latency (zen3 lost this ability).
> 
> So while moves between GPRs and XMM might not be bad anymore _spilling_
> to a GPR (and I suppose XMM, too) is still a bad idea and the stack
> should be preferred.
> 

According to znver2_cost

The cost of sse_to_integer is a little bit less than fp_store; maybe increasing
the sse_to_integer cost (to more than fp_store) can help RA to choose memory
instead of GPR.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #9 from Hongtao.liu  ---
1703 :   401cb1: vmovq  %xmm1,%r9   (*)
 834 :   401cb6: vmovq  %r8,%xmm1
1719 :   401cbb: vmovq  %r9,%xmm0   (*)

Look like %r9 is dead after the second (*), and it can be optimized to

1703 :   401cb1: vmovq  %xmm1,%xmm0
 834 :   401cb6: vmovq  %r8,%xmm1

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-26 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

Richard Biener  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #8 from Richard Biener  ---
So w/ -Ofast -march=znver2 I get a runtime of 130 seconds, when I add
-mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec then
this improves to 114 seconds, with sink2 disabled I get 108 seconds
and with the tune-ctrl on top I get 113 seconds.

Note that Zen2 is quite special in that it has the ability to handle
load/store from the stack by mapping it to a register, effectively
making them zero latency (zen3 lost this ability).

So while moves between GPRs and XMM might not be bad anymore _spilling_
to a GPR (and I suppose XMM, too) is still a bad idea and the stack
should be preferred.

Not sure if it's possible to do that though.

Doing the same experiment as above on a Zen3 machine would be nice, too.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-26 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

Richard Biener  changed:

   What|Removed |Added

  Component|tree-optimization   |rtl-optimization
 CC||vmakarov at gcc dot gnu.org
   Keywords||ra

--- Comment #7 from Richard Biener  ---
I see a lot more GPR <-> XMM moves in the 'after' case:

1035 :   401c8b: vaddsd %xmm1,%xmm0,%xmm0
1953 :   401c8f: vmovq  %rcx,%xmm1
 305 :   401c94: vaddsd %xmm8,%xmm1,%xmm1
3076 :   401c99: vmovq  %xmm0,%r14
 590 :   401c9e: vmovq  %r11,%xmm0
 267 :   401ca3: vmovq  %xmm1,%r8
 136 :   401ca8: vmovq  %rdx,%xmm1
 448 :   401cad: vaddsd %xmm1,%xmm0,%xmm1
1703 :   401cb1: vmovq  %xmm1,%r9   (*)
 834 :   401cb6: vmovq  %r8,%xmm1
1719 :   401cbb: vmovq  %r9,%xmm0   (*)
2782 :   401cc0: vaddsd %xmm0,%xmm1,%xmm1
   22135 :   401cc4: vmovsd %xmm1,%xmm1,%xmm0
1261 :   401cc8: vmovq  %r14,%xmm1
 646 :   401ccd: vaddsd %xmm0,%xmm1,%xmm0
   18136 :   401cd1: vaddsd %xmm2,%xmm5,%xmm1
 629 :   401cd5: vmovq  %xmm1,%r8
 142 :   401cda: vaddsd %xmm6,%xmm3,%xmm1
 177 :   401cde: vmovq  %xmm0,%r14
 288 :   401ce3: vmovq  %xmm1,%r9
 177 :   401ce8: vmovq  %r8,%xmm1
 174 :   401ced: vmovq  %r9,%xmm0

those look like RA / spilling artifacts and IIRC I saw Hongtao posting
patches in this area to regcprop I think?  The above is definitely
bad, for example (*) seems to swap %xmm0 and %xmm1 via %r9.

The function is LBM_performStreamCollide; the sinking pass does nothing wrong,
it moves unconditionally executed

-  _948 = _861 + _867;
-  _957 = _944 + _948;
-  _912 = _861 + _873;
...
-  _981 = _853 + _865;
-  _989 = _977 + _981;
-  _916 = _853 + _857;
-  _924 = _912 + _916;

into a conditionally executed block.  But that increases register pressure
by 5 FP regs (if I counted correctly) in that area.  So this would be the
usual issue of GIMPLE transforms not being register-pressure aware.
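
A minimal sketch (not the LBM code) of that register-pressure effect:

  /* Hypothetical reduced example.  Before sinking, only s and t are live
     across the branch; after s and t are sunk into the guarded block, the
     inputs a, b, c and d all stay live across it instead, raising FP
     register pressure in that area.  */
  double g (double a, double b, double c, double d, int p)
  {
    double s = a + b;   /* sinking moves these two statements ...  */
    double t = c + d;
    if (p)              /* ... into this conditionally executed block */
      return s * t;
    return 0.0;
  }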

-fschedule-insns -fsched-pressure seems to be able to somewhat mitigate this
(though I think EBB scheduling cannot undo such movement).

In postreload I see transforms like

-(insn 466 410 411 7 (set (reg:DF 0 ax [530])
-(mem/u/c:DF (symbol_ref/u:DI ("*.LC10") [flags 0x2]) [0  S8 A64]))
"lbm.c":241:5 141 {*movdf_internal}
- (expr_list:REG_EQUAL (const_double:DF
9.9397449830464503247640095651149749755859375e-1
[0x0.fe751ce28ed5fp+0])
-(nil)))
-(insn 411 466 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
+(insn 411 410 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
 (reg:DF 0 ax [530])) "lbm.c":241:5 141 {*movdf_internal}
  (nil))

which seems like we could have reloaded %xmm5 from .LC10.  But the spilling
to GPRs seems to be present already after LRA and cprop_hardreg doesn't
do anything bad either.

The differences can be seen on trunk with -Ofast -march=znver2
[-fdisable-tree-sink2].

We have X86_TUNE_INTER_UNIT_MOVES_TO_VEC/X86_TUNE_INTER_UNIT_MOVES_FROM_VEC
and the interesting thing is that when I disable them I do see some
spilling to the stack but also quite some re-materialized constants
(loads from .LC* as seen from the opportunity above).

It might be interesting to benchmark with
-mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec and find a way
to make costs in a way that IRA/LRA prefer re-materialization of constants
from the constant pool over spilling to GPRs (if that's possible at all -
Vlad?)