[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2021-05-14 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
Bug 89071 depends on bug 87007, which changed state.

Bug 87007 Summary: [8 Regression] 10% slowdown with -march=skylake-avx512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #22 from Peter Cordes  ---
Nice, that's exactly the kind of thing I suggested in bug 80571.  If this
covers 

* vsqrtss/sd  (mem),%merge_into, %xmm 
* vpcmpeqd    %same,%same, %dest    # false dep on KNL / Silvermont
* vcmptrueps  %same,%same, %ymm # splat -1 without AVX2.  false dep on all
known uarches

as well as int->FP conversions, then we could probably close that as fixed by
this as well.
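
For concreteness, a minimal int->FP example of my own (not from the patch):
with -mavx the conversion becomes vcvtsi2ss, which writes only the low
element and merges the rest of the destination, so GCC currently wants a
dep-breaking vxorps in front of it.

float
int_to_float (int i)
{
  return i;   /* vcvtsi2ss: writes only the low 32 bits of the destination */
}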

bug 80571 does suggest that we could look for any cold reg, like a non-zero
constant, instead of requiring an xor-zeroed vector, so it might go slightly
beyond what this patch does.

And looking for known-to-be-ready dead regs from earlier in the same dep chain
could certainly be useful for non-AVX code-gen, allowing us to copy-and-sqrt
without introducing a dependency on anything that's not already ready.

(In reply to h...@gcc.gnu.org from comment #21)
> Author: hjl
> Date: Fri Feb 22 15:54:08 2019
> New Revision: 269119

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-22 Thread hjl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #21 from hjl at gcc dot gnu.org  ---
Author: hjl
Date: Fri Feb 22 15:54:08 2019
New Revision: 269119

URL: https://gcc.gnu.org/viewcvs?rev=269119&root=gcc&view=rev
Log:
i386: Add pass_remove_partial_avx_dependency

With -mavx, for

$ cat foo.i
extern float f;
extern double d;
extern int i;

void
foo (void)
{
  d = f;
  f = i;
}

we need to generate

vxorp[ds]   %xmmN, %xmmN, %xmmN
...
vcvtss2sd   f(%rip), %xmmN, %xmmX
...
vcvtsi2ss   i(%rip), %xmmN, %xmmY

to avoid partial XMM register stall.  This patch adds a pass to generate
a single

vxorps  %xmmN, %xmmN, %xmmN

at entry of the nearest dominator for basic blocks with SF/DF conversions,
which is in the fake loop that contains the whole function, instead of
generating one

vxorp[ds]   %xmmN, %xmmN, %xmmN

for each SF/DF conversion.

NB: The LCM algorithm isn't appropriate here since it may place a vxorps
inside the loop.  A simple testcase shows this:

$ cat badcase.c

extern float f;
extern double d;

void
foo (int n, int k)
{
  for (int j = 0; j != n; j++)
if (j < k)
  d = f;
}

It generates

...
loop:
  if(j < k)
    vxorps  %xmm0, %xmm0, %xmm0
vcvtss2sd f(%rip), %xmm0, %xmm0
  ...
loopend
...

This is because LCM only works when there is a certain benefit.  But for a
conditional branch, LCM wouldn't move

   vxorps  %xmm0, %xmm0, %xmm0

out of the loop.  SPEC CPU 2017 on Intel Xeon with AVX512 shows:

1. The nearest dominator

|RATE   |Improvement|
|500.perlbench_r| 0.55% |
|538.imagick_r  | 8.43% |
|544.nab_r  | 0.71% |

2. LCM

|RATE   |Improvement|
|500.perlbench_r| -0.76% |
|538.imagick_r  | 7.96%  |
|544.nab_r  | -0.13% |

Performance impacts of SPEC CPU 2017 rate on Intel Xeon with AVX512
using

-Ofast -flto -march=skylake-avx512 -funroll-loops

before

commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576
Author: uros 
Date:   Thu Jan 31 20:06:42 2019 +

PR target/89071
* config/i386/i386.md (*extendsfdf2): Split out reg->reg
alternative to avoid partial SSE register stall for TARGET_AVX.
(truncdfsf2): Ditto.
(sse4_1_round2): Ditto.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@268427 138bc75d-0d04-0410-961f-82ee72b054a4

are:

|INT RATE   |Improvement|
|500.perlbench_r| 0.55% |
|502.gcc_r  | 0.14% |
|505.mcf_r  | 0.08% |
|523.xalancbmk_r| 0.18% |
|525.x264_r |-0.49% |
|531.deepsjeng_r|-0.04% |
|541.leela_r|-0.26% |
|548.exchange2_r|-0.3%  |
|557.xz_r   |BuildSame|

|FP RATE|Improvement|
|503.bwaves_r   |-0.29% |
|507.cactuBSSN_r| 0.04% |
|508.namd_r |-0.74% |
|510.parest_r   |-0.01% |
|511.povray_r   | 2.23% |
|519.lbm_r  | 0.1%  |
|521.wrf_r  | 0.49% |
|526.blender_r  | 0.13% |
|527.cam4_r | 0.65% |
|538.imagick_r  | 8.43% |
|544.nab_r  | 0.71% |
|549.fotonik3d_r| 0.15% |
|554.roms_r | 0.08% |

After commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576, on Skylake client,
impacts on 538.imagick_r with

-fno-unsafe-math-optimizations -march=native -Ofast -funroll-loops -flto

1. Size comparison:

before:

   text    data    bss     dec     hex filename
2436377    8352    4528 2449257  255f69 imagick_r

after:

   text    data    bss     dec     hex filename
2425249    8352    4528 2438129  2533f1 imagick_r

2. Number of vxorps:

before  after   difference
4948    4135    -19.66%

3. Performance improvement:

|RATE   |Improvement|
|538.imagick_r  | 5.5%  |

gcc/

2019-02-22  H.J. Lu  
Hongtao Liu  
Sunil K Pandey  

PR target/87007
* config/i386/i386-passes.def: Add
pass_remove_partial_avx_dependency.
* config/i386/i386-protos.h
(make_pass_remove_partial_avx_dependency): New.
* config/i386/i386.c (make_pass_remove_partial_avx_dependency):
New function.
(pass_data_remove_partial_avx_dependency): New.
(pass_remove_partial_avx_dependency): Likewise.
(make_pass_remove_partial_avx_dependency): Likewise.
* config/i386/i386.md (avx_partial_xmm_update): New attribute.
(*extendsfdf2): Add avx_partial_xmm_update.
(truncdfsf2): Likewise.
(*float2): Likewise.
(SF/DF conversion splitters): Disabled for TARGET_AVX.

gcc/testsuite/

2019-02-22  H.J. Lu  
Hongtao Liu  
Sunil K Pandey  

PR target/87007
* gcc.target/i386/pr87007-1.c: New test.
* gcc.target/i386/pr87007-2.c: Likewise.

Added:
trunk/gcc/testsuite/gcc.target/i386/pr87007-1.c

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-03 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

Uroš Bizjak  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #20 from Uroš Bizjak  ---
(In reply to H.J. Lu from comment #19)

> > Do we need XOR for cvtsd2ss mem->xmm?
> 
> Yes, we do since
> 
>  vcvtss2sd f(%rip), %xmm0, %xmm0
> 
> partially updates %xmm0.

This is part of PR 87007, so let's call this PR FIXED.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-03 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

Uroš Bizjak  changed:

   What|Removed |Added

   Target Milestone|--- |9.0

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-03 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #19 from H.J. Lu  ---
(In reply to Uroš Bizjak from comment #18)
> The only remaining question is on cvtsd2ss mem->xmm, where ICC goes with the
> same strategy as with other non-conversion SSE unops:
> 
>vmovsd    d(%rip), %xmm0
>vcvtsd2ss %xmm0, %xmm0, %xmm0
> 
> but with cvtss2sd:
> 
>vxorpd    %xmm0, %xmm0, %xmm0
>vcvtss2sd f(%rip), %xmm0, %xmm0
> 
> Do we need XOR for cvtsd2ss mem->xmm?

Yes, we do since

 vcvtss2sd f(%rip), %xmm0, %xmm0

partially updates %xmm0.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-03 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #18 from Uroš Bizjak  ---
The only remaining question is on cvtsd2ss mem->xmm, where ICC goes with the
same strategy as with other non-conversion SSE unops:

   vmovsd    d(%rip), %xmm0
   vcvtsd2ss %xmm0, %xmm0, %xmm0

but with cvtss2sd:

   vxorpd    %xmm0, %xmm0, %xmm0
   vcvtss2sd f(%rip), %xmm0, %xmm0

Do we need XOR for cvtsd2ss mem->xmm?

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-03 Thread uros at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #17 from uros at gcc dot gnu.org ---
Author: uros
Date: Sun Feb  3 16:48:41 2019
New Revision: 268496

URL: https://gcc.gnu.org/viewcvs?rev=268496&root=gcc&view=rev
Log:
PR target/89071
* config/i386/i386.md (*sqrt2_sse): Add (v,0) alternative.
Do not prefer (v,v) alternative for non-AVX targets and (m,v)
alternative for speed when TARGET_SSE_PARTIAL_REG_DEPENDENCY is set.
(*rcpsf2_sse): Ditto.
(*rsqrtsf2_sse): Ditto.
(sse4_1_round

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-03 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #16 from Uroš Bizjak  ---
(In reply to Peter Cordes from comment #15)
> (In reply to Uroš Bizjak from comment #13)
> > I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> > and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> > currently don't emit XOR clear in front of these instructions when they
> > operate with memory input.
> 
> They *do* have an output dependency.  It might or might not actually be a
> problem and be worth clogging the front-end with extra uops to avoid,
> depending on surrounding code. >.<

OK, I'll proceed with the patch from Comment #14 then.

> * CVTSS2SD vs. PD, and SD2SS vs. PD2PS
>   packed is slower on k8, bdver1-4 (scalar avoids the shuffle uop),
> Nano3000, KNL.  On Silvermont by just 1 cycle latency (so  even a MOVAPS on
> the critical path would make it equal.)  Similar on Atom.  Slower on CPUs
> that do 128-bit vectors as two 64-bit uops, like Bobcat, and Pentium M / K8
> and older.
> 
>   packed is *faster* on K10, Goldmont/GDM Plus (same latency, 1c vs. 2c
> throughput), Prescott, P4.  Much faster on Jaguar (1c vs. 8c throughput, and
> 1 uop vs. 2).

We do have infrastructure to convert scalar conversions to packed:

/* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
   from FP to FP.  This form of instructions avoids partial write to the
   destination.  */
DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
  m_AMDFAM10)

/* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
   from integer to FP. */
DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)

And, as can be seen from the above tunes, they are currently enabled only for
AMDFAM10; it is just a matter of enabling the relevant tune for the selected
target.
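
As a concrete illustration (my own sketch, not from the thread): a plain
scalar float->double conversion like the one below is the kind of site these
tunes affect; with X86_TUNE_USE_VECTOR_FP_CONVERTS enabled (currently only
m_AMDFAM10, per the DEF_TUNEs above), GCC can use the packed cvtps2pd form so
the whole destination register is written.

extern float f;

double
f_to_double (void)
{
  return f;   /* scalar FP->FP conversion, a candidate for the packed form */
}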

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #15 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #13)
> I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> currently don't emit XOR clear in front of these instrucitons, when they
> operate with memory input.

They *do* have an output dependency.  It might or might not actually be a
problem and be worth clogging the front-end with extra uops to avoid,
depending on surrounding code. >.<

e.g. ROUNDSD:  DEST[127:64] remains unchanged
Thanks, Intel.  You'd think by SSE4.1 they would have learned that false
dependencies suck, and that it's extremely rare to actually take advantage of
this merge behaviour, but no.

For register-source ROUNDSD / ROUNDSS, we can use ROUNDPD / ROUNDPS which write
the full destination register and have identical performance on all CPUs that
support them.  (Except Silvermont, where roundps/pd have 5c latency vs. 4c for
roundss/sd.  Goldmont makes them equal.)  KNL has faster (V)ROUNDPS/D than
ROUNDSS/SD, maybe only because of the SSE encoding?  Agner Fog isn't clear, and
doesn't have an entry that would match vroundss/sd.

Copy-and-round is good for avoiding extra MOVAPS instructions which can make
SSE code front-end bound, and reduce the effective size of the out-of-order
window.
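
A small intrinsics sketch of the copy-and-round idea (my own illustration,
assuming SSE4.1 via -msse4.1; the compiler would do this at the insn level,
not via intrinsics):

#include <smmintrin.h>

/* Round all four lanes even though only the low one is wanted: the packed
   form writes the full destination, so there is no merge into a stale value
   and no false dependency.  _MM_FROUND_NO_EXC suppresses precision
   exceptions, which ties in with the exception discussion below.  */
static inline __m128
round_low_copy (__m128 x)
{
  return _mm_round_ps (x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}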

Preserving FP exception semantics for packed instead of scalar register-source:

* if the upper element(s) of the source is/are known 0, we can always do this
with sqrt and round, and convert: they won't produce any FP exceptions, not
even inexact.  (But not rsqrt / rcpps, of course.)
  This will be the case after a scalar load, so if we need the original value
in memory *and* the result of one of these instructions, we're all set.

* with rounding, the immediate can control masking of precision exceptions, but
not Invalid which is always raised by SRC = SNaN.  If we can rule out SNaN in
the upper elements of the input, we can use ROUNDPS / ROUNDPD

roundps/d can't produce a denormal output.  I don't think denormal inputs slow
it down on any CPUs, but worth checking for cases where we don't care about
preserving exception semantics and want to use it with potentially-arbitrary
garbage in high elements.


rsqrtps can't produce a denormal output because sqrt makes the output closer to
1.0 (reducing the magnitude of the exponent).  (And thus neither can sqrtps.) 
SQRTPS/PD is the same performance as SQRTSS/SD on new CPUs, but old CPUs that
crack 128-bit ops into 64-bit are slower: Pentium III, Pentium M, and Bobcat. 
And Jaguar for sqrt.  Also Silvermont is *MUCH* slower for SQRTPD/PS than
SD/SS, and even Goldmont Plus has slower packed SQRT, RSQRT, and RCP than
scalar.

But RCPPS can produce a denormal.  (double)1.0/FLT_MAX = 2.938736e-39, which is
smaller than FLT_MIN = 1.175494e-38
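
A quick standalone check of that arithmetic (nothing GCC-specific, just the
numbers above):

#include <float.h>
#include <stdio.h>

int
main (void)
{
  /* 1/FLT_MAX is below FLT_MIN, so the reciprocal that rcpps approximates
     lands in the denormal range.  */
  printf ("1/FLT_MAX = %e, FLT_MIN = %e\n", 1.0 / FLT_MAX, (double) FLT_MIN);
  return 0;
}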



So according to Agner's tables:

* ROUNDPS/PD is never slower than ROUNDSS/SD on any CPU that support them.
* SQRTPS/PD *are* slower than scalar on Silvermont through Goldmont Plus, and
Bobcat, Nano 3000, and P4 Prescott/Nocona.  By about a factor of 2, enough that
we should probably care about it for tune=generic.  For single precision only
(not double), K10 and Jaguar also have slower sqrtps than sqrtss.  Also in
32-bit mode, P4,
Pentium M and earlier Intel, and Atom, are much slower for packed than scalar
sqrt.
  SQRTPD is *faster* than SQRTSD on KNL.  (But hopefully we're never tuning for
KNL without AVX available.)

* RSQRT / RCP: packed is slower on Atom, Silvermont, and Goldmont (multi-uop so
a big decode stall).  Somewhat slower on Goldmont Plus (1 uop but half
throughput).  Also slower on Nano3000, and slightly slower on Pentium 4 (before
and after Prescott/Nocona), and KNL.  (But hopefully KNL can always use
VRSQRT28PS/PD or scalar)
  Pentium M and older again decode as at least 2 uops for packed, same as
Bobcat and K8.
  Same performance for packed vs. scalar on Jaguar, K10, bdver1-4, ryzen, Core2
and later, and SnB-family.

* CVTSS2SD vs. PD, and SD2SS vs. PD2PS
  packed is slower on k8, bdver1-4 (scalar avoids the shuffle uop), Nano3000,
KNL.  On Silvermont by just 1 cycle latency (so  even a MOVAPS on the critical
path would make it equal.)  Similar on Atom.  Slower on CPUs that do 128-bit
vectors as two 64-bit uops, like Bobcat, and Pentium M / K8 and older.

  packed is *faster* on K10, Goldmont/GDM Plus (same latency, 1c vs. 2c
throughput), Prescott, P4.  Much faster on Jaguar (1c vs. 8c throughput, and 1
uop vs. 2).

  same speed (but without the false dep) for SnB-family (mostly), Core 2,
Ryzen.

  Odd stuff: Agner reports:
Nehalem: ps2pd = 2 uops / 2c, ss2sd = 1 uop / 1c.  (I guess just
zero-padding the significand, no rounding required).  pd2ps and sd2ss are equal
at 2 uops / 4c latency.
SnB: cvtpd2ps is 1c higher latency than sd2ss.
IvB: ps2pd on IvB is 1c vs. 2c for ss2sd
On HSW and later things have settled down to 

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-01 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

Uroš Bizjak  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2019-02-01
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Ever confirmed|0   |1

--- Comment #14 from Uroš Bizjak  ---
Created attachment 45582
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45582&action=edit
Additional patch to break partial SSE reg dependencies

Here is another patch that may help with partial SSE reg dependencies for
{R,}SQRTS{S,D}, RCPS{S,D} and ROUNDS{S,D} instructions. It takes the same
strategy as both ICC and clang take, that is:

a) load from mem with MOVS{S,D} and
b) in case of SSE, match input and output register.

The implementation uses the preferred_for_speed attribute, so in cold sections
or when compiled with -Os, the compiler is still able to generate a direct load
from memory (SSE, AVX) and use unmatched registers for SSE targets.
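
For reference, a minimal testcase of my own that exercises this (the global z
matches the asm below; compiling with something like -O2 -fno-math-errno is
assumed so the call inlines to a bare sqrtsd):

extern double z;

double
sqrt_z (void)
{
  return __builtin_sqrt (z);   /* scalar sqrt with a memory source */
}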

So, the sqrt from memory is now compiled to:

movsd   z(%rip), %xmm0
sqrtsd  %xmm0, %xmm0


(SSE) or

vmovsd  z(%rip), %xmm1
vsqrtsd %xmm1, %xmm1, %xmm0

(AVX).

And sqrt from an unmatched input register will compile to:

sqrtsd  %xmm1, %xmm1
movapd  %xmm1, %xmm0

(SSE) or

   vsqrtsd %xmm1, %xmm1, %xmm0
(AVX).

HJ, can you please benchmark this patch?

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-01 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #13 from Uroš Bizjak  ---
I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP and
ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
currently don't emit XOR clear in front of these instructions when they
operate with memory input.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-31 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #12 from Uroš Bizjak  ---
(In reply to Peter Cordes from comment #10)

> It also bizarrely uses it for VMOVSS, which gcc should only emit if it
> actually wants to merge (right?).  *If* this part of the patch isn't a bug
> 
> - return "vmovss\t{%1, %0, %0|%0, %0, %1}";
> + return "vmovss\t{%d1, %0|%0, %d1}";
>  
> then even better would be vmovaps %1, %0 (which can benefit from
> mov-elimination, and doesn't need a port-5-only ALU uop.)  Same for vmovsd
> of course.

This is actually overridden in mode calculations, where it is disabled for
TARGET_SSE_PARTIAL_REG_DEPENDENCY targets.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-31 Thread uros at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #11 from uros at gcc dot gnu.org ---
Author: uros
Date: Thu Jan 31 20:06:42 2019
New Revision: 268427

URL: https://gcc.gnu.org/viewcvs?rev=268427&root=gcc&view=rev
Log:
PR target/89071
* config/i386/i386.md (*extendsfdf2): Split out reg->reg
alternative to avoid partial SSE register stall for TARGET_AVX.
(truncdfsf2): Ditto.
(sse4_1_round2): Ditto.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.md

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #10 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #9)
> There was similar patch for sqrt [1], I think that the approach is
> straightforward, and could be applied to other reg->reg scalar insns as
> well, independently of PR87007 patch.
> 
> [1] https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00202.html

Yeah, that looks good.  So I think it's just vcvtss2sd and sd2ss, and
VROUNDSS/SD that aren't done yet.

That patch covers VSQRTSS/SD, VRCPSS, and VRSQRTSS.

It also bizarrely uses it for VMOVSS, which gcc should only emit if it actually
wants to merge (right?).  *If* this part of the patch isn't a bug

-   return "vmovss\t{%1, %0, %0|%0, %0, %1}";
+   return "vmovss\t{%d1, %0|%0, %d1}";

then even better would be vmovaps %1, %0 (which can benefit from
mov-elimination, and doesn't need a port-5-only ALU uop.)  Same for vmovsd of
course.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #9 from Uroš Bizjak  ---
There was similar patch for sqrt [1], I think that the approach is
straightforward, and could be applied to other reg->reg scalar insns as well,
independently of PR87007 patch.

[1] https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00202.html

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #8 from Peter Cordes  ---
Created attachment 45544
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit
testloop-cvtss2sd.asm

(In reply to H.J. Lu from comment #7)
> I fixed assembly codes and run it on different AVX machines.
> I got similar results:
> 
> ./test
> sse  : 28346518
> sse_clear: 28046302
> avx  : 28214775
> avx2 : 28251195
> avx_clear: 28092687
> 
> avx_clear:
>   vxorps  %xmm0, %xmm0, %xmm0
>   vcvtsd2ss   %xmm1, %xmm0, %xmm0
>   ret
> 
> is slightly faster.


I'm pretty sure that's a coincidence, or an unrelated microarchitectural effect
where adding any extra uop makes a difference.  Or just chance of code
alignment for the uop-cache (32-byte or maybe 64-byte boundaries).

You're still testing with the caller compiled without optimization.  The loop
is a mess of sign-extension and reloads, of course, but most importantly
keeping the loop counter in memory creates a dependency chain involving
store-forwarding latency.

Attempting a load later can make it succeed more quickly in store-forwarding
cases, on Intel Sandybridge-family, so perhaps an extra xor-zeroing uop is
reducing the average latency of the store/reloads for the loop counter (which
is probably the real bottleneck.)

https://stackoverflow.com/questions/49189685/adding-a-redundant-assignment-speeds-up-code-when-compiled-without-optimization

Loads are weird in general: the scheduler anticipates their latency and
dispatches uops that will consume their results in the cycle when it expects a
load will put the result on the forwarding network.  But if the load *isn't*
ready when expected, it may have to replay the uops that wanted that input. 
See
https://stackoverflow.com/questions/54084992/weird-performance-effects-from-nearby-dependent-stores-in-a-pointer-chasing-loop
for a detailed analysis of this effect on IvyBridge.  (Skylake doesn't have the
same restrictions on stores next to loads, but other effects can cause
replays.)

https://stackoverflow.com/questions/52351397/is-there-a-penalty-when-baseoffset-is-in-a-different-page-than-the-base/52358810#52358810
is an interesting case for pointer-chasing where the load port speculates that
it can use the base pointer for TLB lookups, instead of the base+offset. 
https://stackoverflow.com/questions/52527325/why-does-the-number-of-uops-per-iteration-increase-with-the-stride-of-streaming
shows load replays on cache misses.

So there's a huge amount of complicating factors from using a calling loop that
keeps its loop counter in memory, because SnB-family doesn't have a simple
fixed latency for store forwarding.





If I put the tests in a different order, I sometimes get results like:

./test
sse  : 26882815
sse_clear: 26207589
avx_clear: 25968108
avx  : 25920897
avx2 : 25956683

Often avx (with the false dep on the load result into XMM1) is slower than
avx_clear or avx2, but there's a ton of noise.



Adding vxorps  %xmm2, %xmm2, %xmm2  to avx.S also seems to have sped it up; now
it's the same speed as the others, even though I'm *not* breaking the
dependency chain anymore.  XMM2 is unrelated, nothing touches it.

This basically proves that your benchmark is sensitive to extra instructions,
whether they interact with vcvtsd2ss or not.


We know that in the general case, throwing in extra NOPs or xor-zeroing
instructions on unused registers does not make code faster, so we should
definitely distrust the result of this microbenchmark.




I've attached my NASM loop.  It has various commented-out loop bodies, and
notes in comments on results I found with performance counters.  I don't know
if it will be useful (because it's a bit messy), but it's what I use for
testing snippets of asm in a static binary with near-zero startup overhead.  I
just run perf stat on the whole executable and look at cycles / uops.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #7 from H.J. Lu  ---
I fixed assembly codes and run it on different AVX machines.
I got similar results:

./test
sse  : 28346518
sse_clear: 28046302
avx  : 28214775
avx2 : 28251195
avx_clear: 28092687

avx_clear:
vxorps  %xmm0, %xmm0, %xmm0
vcvtsd2ss   %xmm1, %xmm0, %xmm0
ret

is slightly faster.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #6 from Peter Cordes  ---
(In reply to Peter Cordes from comment #5)
> But whatever the effect is, it's totally unrelated to what you were *trying*
> to test. :/

After adding a `ret` to each AVX function, all 5 are basically the same speed
(compiling the C with `-O2` or -O2 -march=native), with just noise making it
hard to see anything clearly.  sse_clear tends to be faster than sse in a group
of runs, but if there are differences it's more likely due to weird front-end
effects and all the loads of inputs + store/reload of the return address by
call/ret.

I did  while ./test;  : ;done   to factor out CPU clock-speed ramp up and maybe
some cache warmup stuff, but it's still noisy from run to run.  Making
printf/write system calls between tests will cause TLB / branch-prediction
effects because of kernel spectre mitigation, so I guess every test is in the
same boat, running right after a system call.

Adding loads and stores into the mix makes microbenchmarking a lot harder.

Also notice that since `xmm0` and `xmm1` pointers are global, those pointers
are reloaded every time through the loop even with optimization.  I guess
you're not trying to minimize the amount of work outside of the asm functions,
to measure them as part of a messy loop.  So for the version that have a false
dependency, you're making that dependency on the result of this:

mov    rax,QWORD PTR [rip+0x2ebd]  # reload xmm1
vmovapd xmm1,XMMWORD PTR [rax+rbx*1]   # index xmm1

Anyway, I think there's too much noise in the data, and lots of reason to
expect that vcvtsd2ss %xmm0, %xmm0, %xmm1 is strictly better than
VPXOR+convert, except in cases where adding an extra uop actually helps, or
where code-alignment effects matter.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #5 from Peter Cordes  ---
(In reply to H.J. Lu from comment #4)
> (In reply to Peter Cordes from comment #2)

> >  Can you show some
> > asm where this performs better?
> 
> Please try cvtsd2ss branch at:
> 
> https://github.com/hjl-tools/microbenchmark/
> 
> On Intel Core i7-6700K, I got

I have the same CPU.

> [hjl@gnu-skl-2 microbenchmark]$ make
> gcc -g -I. -c -o test.o test.c
> gcc -g   -c -o sse.o sse.S
> gcc -g   -c -o sse-clear.o sse-clear.S
> gcc -g   -c -o avx.o avx.S
> gcc -g   -c -o avx2.o avx2.S
> gcc -g   -c -o avx-clear.o avx-clear.S
> gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
> ./test
> sse  : 24533145
> sse_clear: 24286462
> avx  : 64117779
> avx2 : 62186716
> avx_clear: 58684727
> [hjl@gnu-skl-2 microbenchmark]$

You forgot the RET at the end of the AVX functions (but not the SSE ones); The
AVX functions fall through into each other, then into __libc_csu_init before
jumping around and eventually returning.  That's why they're much slower. 
Single-step through the loop in GDB...

   │0x5660    vcvtsd2ss xmm0,xmm0,xmm1
  >│0x5664    nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x566e    xchg   ax,ax
   │0x5670    vcvtsd2ss xmm0,xmm1,xmm1
   │0x5674    nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x567e    xchg   ax,ax
   │0x5680    vxorps xmm0,xmm0,xmm0
   │0x5684    vcvtsd2ss xmm0,xmm0,xmm1
   │0x5688    nop    DWORD PTR [rax+rax*1+0x0]
   │0x5690 <__libc_csu_init>    endbr64
   │0x5694 <__libc_csu_init+4>  push   r15
   │0x5696 <__libc_csu_init+6>  mov    r15,rdx

And BTW, SSE vs. SSE_clear are about the same speed because your loop
bottlenecks on the store/reload latency of keeping a loop counter in memory
(because you compiled the C without optimization).  Plus, the C caller loads
write-only into XMM0 and XMM1 every iteration, breaking any loop-carried
dependency the false dep would create.

I'm not sure why it makes a measurable difference to run the extra NOPS, and 3x
vcvtsd2ss instead of 1 for avx() vs. avx_clear(), because the C caller should
still be breaking dependencies for the AVX-128 instructions.

But whatever the effect is, it's totally unrelated to what you were *trying* to
test. :/

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #4 from H.J. Lu  ---
(In reply to Peter Cordes from comment #2)
> (In reply to H.J. Lu from comment #1)
> > But
> > 
> > vxorps  %xmm0, %xmm0, %xmm0
> > vcvtsd2ss   %xmm1, %xmm0, %xmm0
> > 
> > are faster than both.
> 
> On Skylake-client (i7-6700k), I can't reproduce this result in a
> hand-written asm loop.  (I was using NASM to make a static executable that
> runs a 100M iteration loop so I could measure with perf).  Can you show some
> asm where this performs better?

Please try cvtsd2ss branch at:

https://github.com/hjl-tools/microbenchmark/

On Intel Core i7-6700K, I got

[hjl@gnu-skl-2 microbenchmark]$ make
gcc -g -I. -c -o test.o test.c
gcc -g   -c -o sse.o sse.S
gcc -g   -c -o sse-clear.o sse-clear.S
gcc -g   -c -o avx.o avx.S
gcc -g   -c -o avx2.o avx2.S
gcc -g   -c -o avx-clear.o avx-clear.S
gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
./test
sse  : 24533145
sse_clear: 24286462
avx  : 64117779
avx2 : 62186716
avx_clear: 58684727
[hjl@gnu-skl-2 microbenchmark]$

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #3 from Peter Cordes  ---
(In reply to H.J. Lu from comment #1)
> I have a patch for PR 87007:
> 
> https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html
> 
> which inserts a vxorps at the last possible position.  vxorps
> will be executed only once in a function.

That's talking about the mem,reg case, which like I said is different.  I
reported Bug 80571 a while ago about the mem,reg case (or gp-reg for si2ss/d),
so it's great that you have a fix for that, doing one xor-zeroing and reusing
that as a merge target for a whole function / loop.

But this bug is about the reg,reg case, where I'm pretty sure there's nothing
to be gained from xor-zeroing anything.  We can fully avoid any false dep just
by choosing both source registers = src, making the destination properly
write-only.

If you *have* an xor-zeroed register, there's no apparent harm in using it as
the merge-target for a reg-reg vcvt, vsqrt, vround, or whatever, but there's no
benefit either vs. just setting both source registers the same.  So whichever
is easier to implement, but ideally we want to avoid introducing a vxorps into
functions / blocks that don't need it at all.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #2 from Peter Cordes  ---
(In reply to H.J. Lu from comment #1)
> But
> 
>   vxorps  %xmm0, %xmm0, %xmm0
>   vcvtsd2ss   %xmm1, %xmm0, %xmm0
> 
> are faster than both.

On Skylake-client (i7-6700k), I can't reproduce this result in a hand-written
asm loop.  (I was using NASM to make a static executable that runs a 100M
iteration loop so I could measure with perf).  Can you show some asm where this
performs better?

vcvtsd2ss src-reg,dst,dst is always 2 uops, regardless of the merge destination
being an xor-zeroed register.  (Either zeroed outside the loop, or inside, or
once per 4 converts with an unrolled loop.)

I can't construct a case where  vcvtsd2ss %xmm1, %xmm1, %xmm0  is worse in any
way (dependencies, uops, latency, throughput) than VXORPS + vcvtsd2ss with dst
= middle source.  I wasn't mixing it with other instructions other than VXORPS,
but I don't think anything is going to get rid of its 2nd uop, and choosing
both inputs = the same source removes any benefit from dep-breaking the output.

If adding a VXORPS helped, it's probably due to some other side-effect.

Could the effect you saw have been due to code-gen changes for memory sources,
maybe  vxorps + vcvtsd2ss (mem), %xmm0, %xmm0   vs.  vmovsd + vcvtsd2ss %xmm1,
%xmm1, %xmm0?  (Those should be about equal, but memory-source SS2SD is
cheaper, no port5 uop.)



BTW, the false-dependency effect is much more obvious with SS2SD, where the
latency from src1 to output is 4 cycles, vs. 1 cycle for SD2SS.

Even without dependency-breaking, repeated

 vcvtsd2ss  %xmm1, %xmm0, %xmm0

can run at 1 per clock (same as with dep breaking), because the port-5 uop that
merges into the low 32 bits of xmm0 with 1 cycle latency is 2nd.  So latency
from xmm0 -> xmm0 for that [v]cvtsd2ss %xmm1, %xmm0 is 1 cycle.

With dep-breaking, they both still bottleneck on the port5 uop if you're doing
nothing else.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

H.J. Lu  changed:

   What|Removed |Added

 Depends on||87007

--- Comment #1 from H.J. Lu  ---
vcvtsd2ss   %xmm1, %xmm1, %xmm0

is faster than

vcvtsd2ss   %xmm1, %xmm0, %xmm0

But

vxorps  %xmm0, %xmm0, %xmm0
vcvtsd2ss   %xmm1, %xmm0, %xmm0

are faster than both.  I have a patch for PR 87007:

https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html

which inserts a vxorps at the last possible position.  vxorps
will be executed only once in a function.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
[Bug 87007] [8/9 Regression] 10% slowdown with -march=skylake-avx512