[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

Bug 89071 depends on bug 87007, which changed state.

Bug 87007 Summary: [8 Regression] 10% slowdown with -march=skylake-avx512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007

           What       |Removed    |Added
           ------------------------------------
           Status     |ASSIGNED   |RESOLVED
           Resolution |---        |FIXED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #22 from Peter Cordes ---
(In reply to h...@gcc.gnu.org from comment #21)
> Author: hjl
> Date: Fri Feb 22 15:54:08 2019
> New Revision: 269119

Nice, that's exactly the kind of thing I suggested in bug 80571.

If this covers

 * vsqrtss/sd  (mem), %merge_into, %xmm
 * vpcmpeqd    %same, %same, %dest    # false dep on KNL / Silvermont
 * vcmptrueps  %same, %same, %ymm     # splat -1 without AVX2; false dep on
                                      # all known uarches

as well as int->FP conversions, then we could probably close that as fixed by
this as well.

bug 80571 does suggest that we could look for any cold reg, like a non-zero
constant, instead of requiring an xor-zeroed vector, so it might go slightly
beyond what this patch does.  And looking for known-to-be-ready dead regs from
earlier in the same dep chain could certainly be useful for non-AVX code-gen,
allowing us to copy-and-sqrt without introducing a dependency on anything
that's not already ready.
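Roughly the code-gen I have in mind, reusing one cold known-zero (or just
known-ready) register as the merge source; the register choices and memory
operands here are only illustrative, not what the pass emits today:

   vxorps    %xmm15, %xmm15, %xmm15    # one dep-breaking clear, hoisted to a dominator
   ...
   vsqrtss   x(%rip), %xmm15, %xmm0    # merge into the cold zeroed reg: no false dep
   vcvtsi2ss %eax, %xmm15, %xmm1       # same zero reused for an int->FP conversion
   vpcmpeqd  %xmm15, %xmm15, %xmm2     # all-ones splat; reading the already-ready reg
                                       # avoids the false dep on KNL / Silvermont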
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #21 from hjl at gcc dot gnu.org ---
Author: hjl
Date: Fri Feb 22 15:54:08 2019
New Revision: 269119

URL: https://gcc.gnu.org/viewcvs?rev=269119&root=gcc&view=rev
Log:
i386: Add pass_remove_partial_avx_dependency

With -mavx, for

$ cat foo.i
extern float f;
extern double d;
extern int i;

void
foo (void)
{
  d = f;
  f = i;
}

we need to generate

	vxorp[ds]	%xmmN, %xmmN, %xmmN
	...
	vcvtss2sd	f(%rip), %xmmN, %xmmX
	...
	vcvtsi2ss	i(%rip), %xmmN, %xmmY

to avoid a partial XMM register stall.  This patch adds a pass to generate
a single

	vxorps		%xmmN, %xmmN, %xmmN

at the entry of the nearest dominator for basic blocks with SF/DF
conversions, which is in the fake loop that contains the whole function,
instead of generating one

	vxorp[ds]	%xmmN, %xmmN, %xmmN

for each SF/DF conversion.

NB: The LCM algorithm isn't appropriate here since it may place a vxorps
inside the loop.  A simple testcase shows this:

$ cat badcase.c
extern float f;
extern double d;

void
foo (int n, int k)
{
  for (int j = 0; j != n; j++)
    if (j < k)
      d = f;
}

It generates

	...
loop:
	if (j < k)
	  vxorps	%xmm0, %xmm0, %xmm0
	  vcvtss2sd	f(%rip), %xmm0, %xmm0
	...
loopend
	...

This is because LCM only works when there is a certain benefit.  But for a
conditional branch, LCM wouldn't move

	vxorps	%xmm0, %xmm0, %xmm0

out of the loop.

SPEC CPU 2017 on Intel Xeon with AVX512 shows:

1. The nearest dominator

  |RATE           |Improvement|
  |500.perlbench_r| 0.55%     |
  |538.imagick_r  | 8.43%     |
  |544.nab_r      | 0.71%     |

2. LCM

  |RATE           |Improvement|
  |500.perlbench_r| -0.76%    |
  |538.imagick_r  | 7.96%     |
  |544.nab_r      | -0.13%    |

Performance impacts of SPEC CPU 2017 rate on Intel Xeon with AVX512 using

  -Ofast -flto -march=skylake-avx512 -funroll-loops

before commit

  e739972ad6ad05e32a1dd5c29c0b950a4c4bd576
  Author: uros
  Date:   Thu Jan 31 20:06:42 2019 +0000

      PR target/89071
      * config/i386/i386.md (*extendsfdf2): Split out reg->reg
      alternative to avoid partial SSE register stall for TARGET_AVX.
      (truncdfsf2): Ditto.
      (sse4_1_round2): Ditto.

      git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@268427
      138bc75d-0d04-0410-961f-82ee72b054a4

are:

  |INT RATE       |Improvement|
  |500.perlbench_r| 0.55%     |
  |502.gcc_r      | 0.14%     |
  |505.mcf_r      | 0.08%     |
  |523.xalancbmk_r| 0.18%     |
  |525.x264_r     |-0.49%     |
  |531.deepsjeng_r|-0.04%     |
  |541.leela_r    |-0.26%     |
  |548.exchange2_r|-0.3%      |
  |557.xz_r       |BuildSame  |

  |FP RATE        |Improvement|
  |503.bwaves_r   |-0.29%     |
  |507.cactuBSSN_r| 0.04%     |
  |508.namd_r     |-0.74%     |
  |510.parest_r   |-0.01%     |
  |511.povray_r   | 2.23%     |
  |519.lbm_r      | 0.1%      |
  |521.wrf_r      | 0.49%     |
  |526.blender_r  | 0.13%     |
  |527.cam4_r     | 0.65%     |
  |538.imagick_r  | 8.43%     |
  |544.nab_r      | 0.71%     |
  |549.fotonik3d_r| 0.15%     |
  |554.roms_r     | 0.08%     |

After commit e739972ad6ad05e32a1dd5c29c0b950a4c4bd576, on Skylake client,
impacts on 538.imagick_r with

  -fno-unsafe-math-optimizations -march=native -Ofast -funroll-loops -flto

1. Size comparison:

  before:
     text	   data	    bss	    dec	    hex	filename
  2436377	   8352	   4528	2449257	 255f69	imagick_r
  after:
     text	   data	    bss	    dec	    hex	filename
  2425249	   8352	   4528	2438129	 2533f1	imagick_r

2. Number of vxorps:

  before   after   difference
  4948     4135    -19.66%

3. Performance improvement:

  |RATE           |Improvement|
  |538.imagick_r  | 5.5%      |

gcc/

2019-02-22  H.J. Lu
	    Hongtao Liu
	    Sunil K Pandey

	PR target/87007
	* config/i386/i386-passes.def: Add
	pass_remove_partial_avx_dependency.
	* config/i386/i386-protos.h
	(make_pass_remove_partial_avx_dependency): New.
	* config/i386/i386.c (make_pass_remove_partial_avx_dependency):
	New function.
	(pass_data_remove_partial_avx_dependency): New.
	(pass_remove_partial_avx_dependency): Likewise.
	(make_pass_remove_partial_avx_dependency): Likewise.
	* config/i386/i386.md (avx_partial_xmm_update): New attribute.
	(*extendsfdf2): Add avx_partial_xmm_update.
	(truncdfsf2): Likewise.
	(*float2): Likewise.
	(SF/DF conversion splitters): Disabled for TARGET_AVX.

gcc/testsuite/

2019-02-22  H.J. Lu
	    Hongtao Liu
	    Sunil K Pandey

	PR target/87007
	* gcc.target/i386/pr87007-1.c: New test.
	* gcc.target/i386/pr87007-2.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr87007-1.c
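For the badcase.c loop in the log above, the code-gen this pass aims for is
roughly the following (just a sketch; register numbers and labels are made
up, not actual compiler output):

	vxorps	%xmm1, %xmm1, %xmm1		# single clear at the nearest
						# dominator, outside the loop
.Lloop:
	...					# if (j < k)
	vcvtss2sd	f(%rip), %xmm1, %xmm0	# merges into the cold, zeroed %xmm1
	vmovsd	%xmm0, d(%rip)
	...
	jne	.Lloop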
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

Uroš Bizjak changed:

           What       |Removed    |Added
           ------------------------------------
           Status     |ASSIGNED   |RESOLVED
           Resolution |---        |FIXED

--- Comment #20 from Uroš Bizjak ---
(In reply to H.J. Lu from comment #19)
> > Do we need XOR for cvtsd2ss mem->xmm?
>
> Yes, we do since
>
>    vcvtss2sd f(%rip), %xmm0, %xmm0
>
> partially updates %xmm0.

This is part of PR 87007, so let's call this PR FIXED.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

Uroš Bizjak changed:

           What             |Removed    |Added
           ------------------------------------
           Target Milestone |---        |9.0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #19 from H.J. Lu ---
(In reply to Uroš Bizjak from comment #18)
> The only remaining question is on cvtsd2ss mem->xmm, where ICC goes with the
> same strategy as with other non-conversion SSE unops:
>
>    vmovsd    d(%rip), %xmm0
>    vcvtsd2ss %xmm0, %xmm0, %xmm0
>
> but with cvtss2sd:
>
>    vxorpd    %xmm0, %xmm0, %xmm0
>    vcvtss2sd f(%rip), %xmm0, %xmm0
>
> Do we need XOR for cvtsd2ss mem->xmm?

Yes, we do since

   vcvtss2sd f(%rip), %xmm0, %xmm0

partially updates %xmm0.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #18 from Uroš Bizjak ---
The only remaining question is on cvtsd2ss mem->xmm, where ICC goes with the
same strategy as with other non-conversion SSE unops:

   vmovsd    d(%rip), %xmm0
   vcvtsd2ss %xmm0, %xmm0, %xmm0

but with cvtss2sd:

   vxorpd    %xmm0, %xmm0, %xmm0
   vcvtss2sd f(%rip), %xmm0, %xmm0

Do we need XOR for cvtsd2ss mem->xmm?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #17 from uros at gcc dot gnu.org ---
Author: uros
Date: Sun Feb  3 16:48:41 2019
New Revision: 268496

URL: https://gcc.gnu.org/viewcvs?rev=268496&root=gcc&view=rev
Log:
	PR target/89071
	* config/i386/i386.md (*sqrt2_sse): Add (v,0) alternative.
	Do not prefer (v,v) alternative for non-AVX targets and (m,v)
	alternative for speed when TARGET_SSE_PARTIAL_REG_DEPENDENCY is set.
	(*rcpsf2_sse): Ditto.
	(*rsqrtsf2_sse): Ditto.
	(sse4_1_round
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #16 from Uroš Bizjak ---
(In reply to Peter Cordes from comment #15)
> (In reply to Uroš Bizjak from comment #13)
> > I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> > and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> > currently don't emit XOR clear in front of these instructions, when they
> > operate with memory input.
>
> They *do* have an output dependency.  It might or might not actually be a
> problem and be worth clogging the front-end with extra uops to avoid,
> depending on surrounding code.  >.<

OK, I'll proceed with the patch from Comment #14 then.

> * CVTSS2SD vs. PD, and SD2SS vs. PD2PS
>   packed is slower on k8, bdver1-4 (scalar avoids the shuffle uop),
>   Nano3000, KNL.  On Silvermont by just 1 cycle latency (so even a MOVAPS on
>   the critical path would make it equal.)  Similar on Atom.  Slower on CPUs
>   that do 128-bit vectors as two 64-bit uops, like Bobcat, and Pentium M /
>   K8 and older.
>
>   packed is *faster* on K10, Goldmont/GDM Plus (same latency, 1c vs. 2c
>   throughput), Prescott, P4.  Much faster on Jaguar (1c vs. 8c throughput,
>   and 1 uop vs. 2).

We do have infrastructure to convert scalar conversions to packed:

/* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
   from FP to FP.  This form of instructions avoids partial write to the
   destination.  */
DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
          m_AMDFAM10)

/* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
   from integer to FP.  */
DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)

As can be seen from the above tunes, they are currently enabled only for
AMDFAM10; it is just a matter of selecting the relevant tune for the
selected target.
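For reference, with use_vector_fp_converts the scalar FP conversion is
emitted in its packed form, which writes the whole destination register, so
no clearing insn is needed.  A sketch of the shape of that code (illustrative
only, not actual -mtune=amdfam10 output):

   vmovss    f(%rip), %xmm0    # scalar load already zeroes the upper elements
   vcvtps2pd %xmm0, %xmm0      # packed convert: full-register write, no partial
                               # update, so no dep-breaking xor is needed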
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #15 from Peter Cordes ---
(In reply to Uroš Bizjak from comment #13)
> I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> currently don't emit XOR clear in front of these instructions, when they
> operate with memory input.

They *do* have an output dependency.  It might or might not actually be a
problem and be worth clogging the front-end with extra uops to avoid,
depending on surrounding code.  >.<

e.g. ROUNDSD: DEST[127:64] remains unchanged.  Thanks, Intel.  You'd think by
SSE4.1 they would have learned that false dependencies suck, and that it's
extremely rare to actually take advantage of this merge behaviour, but no.

For register-source ROUNDSD / ROUNDSS, we can use ROUNDPD / ROUNDPS, which
write the full destination register and have identical performance on all
CPUs that support them.  (Except Silvermont, where roundps/pd have 5c latency
vs. 4c for roundss/sd.  Goldmont makes them equal.)

KNL has faster (V)ROUNDPS/D than ROUNDSS/SD, maybe only because of the SSE
encoding?  Agner Fog isn't clear, and doesn't have an entry that would match
vroundss/sd.

Copy-and-round is good for avoiding extra MOVAPS instructions, which can make
SSE code front-end bound and reduce the effective size of the out-of-order
window.

Preserving FP exception semantics for packed instead of scalar
register-source:

* if the upper element(s) of the source is/are known 0, we can always do this
  with sqrt and round, and convert: they won't produce any FP exceptions, not
  even inexact.  (But not rsqrt / rcpps, of course.)
  This will be the case after a scalar load, so if we need the original value
  in memory *and* the result of one of these instructions, we're all set.

* with rounding, the immediate can control masking of precision exceptions,
  but not Invalid, which is always raised by SRC = SNaN.  If we can rule out
  SNaN in the upper elements of the input, we can use ROUNDPS / ROUNDPD.

roundps/d can't produce a denormal output.  I don't think denormal inputs
slow it down on any CPUs, but it's worth checking for cases where we don't
care about preserving exception semantics and want to use it with
potentially-arbitrary garbage in high elements.

rsqrtps can't produce a denormal output because sqrt makes the output closer
to 1.0 (reducing the magnitude of the exponent).  (And thus neither can
sqrtps.)

SQRTPS/PD is the same performance as SQRTSS/SD on new CPUs, but old CPUs that
crack 128-bit ops into 64-bit are slower: Pentium III, Pentium M, and Bobcat.
And Jaguar for sqrt.  Also Silvermont is *MUCH* slower for SQRTPD/PS than
SD/SS, and even Goldmont Plus has slower packed SQRT, RSQRT, and RCP than
scalar.

But RCPPS can produce a denormal.  (double)1.0/FLT_MAX = 2.938736e-39, which
is smaller than FLT_MIN = 1.175494e-38.

So according to Agner's tables:

* ROUNDPS/PD is never slower than ROUNDSS/SD on any CPU that supports them.

* SQRTPS/PD *are* slower than scalar on Silvermont through Goldmont Plus, and
  Bobcat, Nano 3000, and P4 Prescott/Nocona.  By about a factor of 2, enough
  that we should probably care about it for tune=generic.

  For ss/ps only (not double), also K10 and Jaguar have slower sqrtps than ss.

  Also in 32-bit mode, P4, Pentium M and earlier Intel, and Atom, are much
  slower for packed than scalar sqrt.

  SQRTPD is *faster* than SQRTSD on KNL.  (But hopefully we're never tuning
  for KNL without AVX available.)
* RSQRT / RCP: packed is slower on Atom, Silvermont, and Goldmont (multi-uop,
  so a big decode stall).  Somewhat slower on Goldmont Plus (1 uop but half
  throughput).  Also slower on Nano3000, and slightly slower on Pentium 4
  (before and after Prescott/Nocona), and KNL.  (But hopefully KNL can always
  use VRSQRT28PS/PD or scalar.)

  Pentium M and older again decode packed as at least 2 uops, same as Bobcat
  and K8.

  Same performance for packed vs. scalar on Jaguar, K10, bdver1-4, Ryzen,
  Core 2 and later, and SnB-family.

* CVTSS2SD vs. PD, and SD2SS vs. PD2PS:
  packed is slower on K8, bdver1-4 (scalar avoids the shuffle uop), Nano3000,
  KNL.  On Silvermont by just 1 cycle latency (so even a MOVAPS on the
  critical path would make it equal).  Similar on Atom.  Slower on CPUs that
  do 128-bit vectors as two 64-bit uops, like Bobcat, and Pentium M / K8 and
  older.

  packed is *faster* on K10, Goldmont/GDM Plus (same latency, 1c vs. 2c
  throughput), Prescott, P4.  Much faster on Jaguar (1c vs. 8c throughput,
  and 1 uop vs. 2).

  Same speed (but without the false dep) for SnB-family (mostly), Core 2,
  Ryzen.

  Odd stuff Agner reports:
    Nehalem: ps2pd = 2 uops / 2c, ss2sd = 1 uop / 1c (I guess just
    zero-padding the significand, no rounding required).  pd2ps and sd2ss are
    equal at 2 uops / 4c latency.
    SnB: cvtpd2ps is 1c higher latency than sd2ss.
    IvB: ps2pd is 1c vs. 2c for ss2sd.
  On HSW and later things have settled down to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

Uroš Bizjak changed:

           What             |Removed                       |Added
           -----------------------------------------------------------------------
           Status           |UNCONFIRMED                   |ASSIGNED
           Last reconfirmed |                              |2019-02-01
           Assignee         |unassigned at gcc dot gnu.org |ubizjak at gmail dot com
           Ever confirmed   |0                             |1

--- Comment #14 from Uroš Bizjak ---
Created attachment 45582
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45582&action=edit
Additional patch to break partial SSE reg dependencies

Here is another patch that may help with partial SSE reg dependencies for
{R,}SQRTS{S,D}, RCPS{S,D} and ROUNDS{S,D} instructions.  It takes the same
strategy as both ICC and clang, that is: a) load from mem with MOVS{S,D} and
b) in the SSE case, match the input and output registers.

The implementation uses the preferred_for_speed attribute, so in cold
sections or when compiled with -Os, the compiler is still able to create a
direct load from memory (SSE, AVX) and use unmatched registers for SSE
targets.

So, sqrt from memory is now compiled to:

   movsd  z(%rip), %xmm0
   sqrtsd %xmm0, %xmm0           (SSE)

or

   vmovsd  z(%rip), %xmm1
   vsqrtsd %xmm1, %xmm1, %xmm0   (AVX).

And sqrt from an unmatched input register will compile to:

   sqrtsd %xmm1, %xmm1
   movapd %xmm1, %xmm0           (SSE)

or

   vsqrtsd %xmm1, %xmm1, %xmm0 .

HJ, can you please benchmark this patch?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #13 from Uroš Bizjak ---
I assume that memory inputs are not problematic for the SSE/AVX {R,}SQRT, RCP
and ROUND instructions.  Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
currently don't emit an XOR clear in front of these instructions when they
operate with a memory input.
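In other words, today's code-gen looks something like this (operands are only
illustrative):

   vsqrtsd   d(%rip), %xmm0, %xmm0     # no clear emitted: still merges the upper
                                       # half of the old %xmm0 (output dependency)
   vxorps    %xmm0, %xmm0, %xmm0       # ...whereas the conversions do get a
   vcvtss2sd f(%rip), %xmm0, %xmm0     # dep-breaking clear in front of them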
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #12 from Uroš Bizjak ---
(In reply to Peter Cordes from comment #10)
> It also bizarrely uses it for VMOVSS, which gcc should only emit if it
> actually wants to merge (right?).  *If* this part of the patch isn't a bug
>
> -      return "vmovss\t{%1, %0, %0|%0, %0, %1}";
> +      return "vmovss\t{%d1, %0|%0, %d1}";
>
> then even better would be vmovaps %1, %0 (which can benefit from
> mov-elimination, and doesn't need a port-5-only ALU uop.)  Same for vmovsd
> of course.

This is actually overridden in mode calculations, where it is disabled for
TARGET_SSE_PARTIAL_REG_DEPENDENCY targets.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #11 from uros at gcc dot gnu.org ---
Author: uros
Date: Thu Jan 31 20:06:42 2019
New Revision: 268427

URL: https://gcc.gnu.org/viewcvs?rev=268427&root=gcc&view=rev
Log:
	PR target/89071
	* config/i386/i386.md (*extendsfdf2): Split out reg->reg
	alternative to avoid partial SSE register stall for TARGET_AVX.
	(truncdfsf2): Ditto.
	(sse4_1_round2): Ditto.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #10 from Peter Cordes ---
(In reply to Uroš Bizjak from comment #9)
> There was similar patch for sqrt [1], I think that the approach is
> straightforward, and could be applied to other reg->reg scalar insns as
> well, independently of PR87007 patch.
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00202.html

Yeah, that looks good.  So I think it's just vcvtss2sd and sd2ss, and
VROUNDSS/SD, that aren't done yet.  That patch covers VSQRTSS/SD, VRCPSS, and
VRSQRTSS.

It also bizarrely uses it for VMOVSS, which gcc should only emit if it
actually wants to merge (right?).  *If* this part of the patch isn't a bug

-      return "vmovss\t{%1, %0, %0|%0, %0, %1}";
+      return "vmovss\t{%d1, %0|%0, %d1}";

then even better would be vmovaps %1, %0 (which can benefit from
mov-elimination, and doesn't need a port-5-only ALU uop.)  Same for vmovsd of
course.
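Concretely, the two choices for the reg-reg copy are (a sketch; the
port/elimination details are from Agner's tables for SnB-family):

   vmovss  %xmm1, %xmm1, %xmm0   # what the %d1 form prints: no false dep (both
                                 # sources are %xmm1) but still a real ALU uop,
                                 # port 5 only on SnB-family
   vmovaps %xmm1, %xmm0          # full-register copy: eligible for mov-elimination,
                                 # zero latency and no execution port when eliminated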
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #9 from Uroš Bizjak ---
There was a similar patch for sqrt [1].  I think that the approach is
straightforward and could be applied to other reg->reg scalar insns as well,
independently of the PR 87007 patch.

[1] https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00202.html
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #8 from Peter Cordes ---
Created attachment 45544
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit
testloop-cvtss2sd.asm

(In reply to H.J. Lu from comment #7)
> I fixed the assembly code and ran it on different AVX machines.
> I got similar results:
>
> ./test
> sse      : 28346518
> sse_clear: 28046302
> avx      : 28214775
> avx2     : 28251195
> avx_clear: 28092687
>
> avx_clear:
> 	vxorps		%xmm0, %xmm0, %xmm0
> 	vcvtsd2ss	%xmm1, %xmm0, %xmm0
> 	ret
>
> is slightly faster.

I'm pretty sure that's a coincidence, or an unrelated microarchitectural
effect where adding any extra uop makes a difference.  Or just chance of code
alignment for the uop cache (32-byte or maybe 64-byte boundaries).

You're still testing with the caller compiled without optimization.  The loop
is a mess of sign-extension and reloads, of course, but most importantly
keeping the loop counter in memory creates a dependency chain involving
store-forwarding latency.

Attempting a load later can make it succeed more quickly in store-forwarding
cases on Intel Sandybridge-family, so perhaps an extra xor-zeroing uop is
reducing the average latency of the store/reloads for the loop counter (which
is probably the real bottleneck).
https://stackoverflow.com/questions/49189685/adding-a-redundant-assignment-speeds-up-code-when-compiled-without-optimization

Loads are weird in general: the scheduler anticipates their latency and
dispatches uops that will consume their results in the cycle when it expects
a load will put the result on the forwarding network.  But if the load
*isn't* ready when expected, it may have to replay the uops that wanted that
input.  See
https://stackoverflow.com/questions/54084992/weird-performance-effects-from-nearby-dependent-stores-in-a-pointer-chasing-loop
for a detailed analysis of this effect on IvyBridge.  (Skylake doesn't have
the same restrictions on stores next to loads, but other effects can cause
replays.)

https://stackoverflow.com/questions/52351397/is-there-a-penalty-when-baseoffset-is-in-a-different-page-than-the-base/52358810#52358810
is an interesting case for pointer-chasing where the load port speculates
that it can use the base pointer for TLB lookups, instead of base+offset.

https://stackoverflow.com/questions/52527325/why-does-the-number-of-uops-per-iteration-increase-with-the-stride-of-streaming
shows load replays on cache misses.

So there are a huge number of complicating factors from using a calling loop
that keeps its loop counter in memory, because SnB-family doesn't have a
simple fixed latency for store forwarding.

If I put the tests in a different order, I sometimes get results like:

./test
sse      : 26882815
sse_clear: 26207589
avx_clear: 25968108
avx      : 25920897
avx2     : 25956683

Often avx (with the false dep on the load result into XMM1) is slower than
avx_clear or avx2, but there's a ton of noise.

Adding vxorps %xmm2, %xmm2, %xmm2 to avx.S also seems to have sped it up; now
it's the same speed as the others, even though I'm *not* breaking the
dependency chain anymore.  XMM2 is unrelated; nothing touches it.  This
basically proves that your benchmark is sensitive to extra instructions,
whether they interact with vcvtsd2ss or not.  We know that in the general
case, throwing in extra NOPs or xor-zeroing instructions on unused registers
does not make code faster, so we should definitely distrust the result of
this microbenchmark.

I've attached my NASM loop.
It has various commented-out loop bodies, and notes in comments on results I found with performance counters. I don't know if it will be useful (because it's a bit messy), but it's what I use for testing snippets of asm in a static binary with near-zero startup overhead. I just run perf stat on the whole executable and look at cycles / uops.
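The core of it is just something like this (NASM syntax; a minimal sketch of
that kind of test, not the attached file itself): keep the loop counter in a
register and feed the convert result back so the chain being measured is
actually serialized:

        mov       ecx, 100000000
testloop:                               ; xmm1 assumed initialized earlier
        vcvtsd2ss xmm0, xmm1, xmm1      ; both sources = xmm1: no false dep on xmm0
        vcvtss2sd xmm1, xmm0, xmm0      ; feed the result back to serialize the chain
        dec       ecx
        jnz       testloop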
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #7 from H.J. Lu ---
I fixed the assembly code and ran it on different AVX machines.
I got similar results:

./test
sse      : 28346518
sse_clear: 28046302
avx      : 28214775
avx2     : 28251195
avx_clear: 28092687

avx_clear:
	vxorps		%xmm0, %xmm0, %xmm0
	vcvtsd2ss	%xmm1, %xmm0, %xmm0
	ret

is slightly faster.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #6 from Peter Cordes ---
(In reply to Peter Cordes from comment #5)
> But whatever the effect is, it's totally unrelated to what you were *trying*
> to test. :/

After adding a `ret` to each AVX function, all 5 are basically the same speed
(compiling the C with `-O2` or -O2 -march=native), with just noise making it
hard to see anything clearly.  sse_clear tends to be faster than sse in a
group of runs, but if there are differences it's more likely due to weird
front-end effects and all the loads of inputs + store/reload of the return
address by call/ret.

I did `while ./test; do :; done` to factor out CPU clock-speed ramp-up and
maybe some cache warmup stuff, but it's still noisy from run to run.  Making
printf/write system calls between tests will cause TLB / branch-prediction
effects because of kernel Spectre mitigation, so I guess every test is in the
same boat, running right after a system call.

Adding loads and stores into the mix makes microbenchmarking a lot harder.
Also notice that since the `xmm0` and `xmm1` pointers are global, those
pointers are reloaded every time through the loop even with optimization.  I
guess you're not trying to minimize the amount of work outside of the asm
functions, to measure them as part of a messy loop.

So for the versions that have a false dependency, you're making that
dependency on the result of this:

   mov     rax,QWORD PTR [rip+0x2ebd]      # reload xmm1
   vmovapd xmm1,XMMWORD PTR [rax+rbx*1]    # index xmm1

Anyway, I think there's too much noise in the data, and lots of reason to
expect that vcvtsd2ss %xmm0, %xmm0, %xmm1 is strictly better than
VPXOR+convert, except in cases where adding an extra uop actually helps, or
where code-alignment effects matter.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #5 from Peter Cordes ---
(In reply to H.J. Lu from comment #4)
> (In reply to Peter Cordes from comment #2)
> > Can you show some
> > asm where this performs better?
>
> Please try cvtsd2ss branch at:
>
> https://github.com/hjl-tools/microbenchmark/
>
> On Intel Core i7-6700K, I got

I have the same CPU.

> [hjl@gnu-skl-2 microbenchmark]$ make
> gcc -g -I. -c -o test.o test.c
> gcc -g -c -o sse.o sse.S
> gcc -g -c -o sse-clear.o sse-clear.S
> gcc -g -c -o avx.o avx.S
> gcc -g -c -o avx2.o avx2.S
> gcc -g -c -o avx-clear.o avx-clear.S
> gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
> ./test
> sse      : 24533145
> sse_clear: 24286462
> avx      : 64117779
> avx2     : 62186716
> avx_clear: 58684727
> [hjl@gnu-skl-2 microbenchmark]$

You forgot the RET at the end of the AVX functions (but not the SSE ones);
the AVX functions fall through into each other, then into __libc_csu_init
before jumping around and eventually returning.  That's why they're much
slower.  Single-step through the loop in GDB...

   │0x5660   vcvtsd2ss xmm0,xmm0,xmm1
  >│0x5664   nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x566e   xchg   ax,ax
   │0x5670   vcvtsd2ss xmm0,xmm1,xmm1
   │0x5674   nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x567e   xchg   ax,ax
   │0x5680   vxorps xmm0,xmm0,xmm0
   │0x5684   vcvtsd2ss xmm0,xmm0,xmm1
   │0x5688   nop    DWORD PTR [rax+rax*1+0x0]
   │0x5690   <__libc_csu_init>     endbr64
   │0x5694   <__libc_csu_init+4>   push   r15
   │0x5696   <__libc_csu_init+6>   mov    r15,rdx

And BTW, SSE vs. SSE_clear are about the same speed because your loop
bottlenecks on the store/reload latency of keeping a loop counter in memory
(because you compiled the C without optimization).  Plus, the C caller loads
write-only into XMM0 and XMM1 every iteration, breaking any loop-carried
dependency the false dep would create.

I'm not sure why it makes a measurable difference to run the extra NOPs, and
3x vcvtsd2ss instead of 1 for avx() vs. avx_clear(), because the C caller
should still be breaking dependencies for the AVX-128 instructions.

But whatever the effect is, it's totally unrelated to what you were *trying*
to test. :/
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #4 from H.J. Lu ---
(In reply to Peter Cordes from comment #2)
> (In reply to H.J. Lu from comment #1)
> > But
> >
> >    vxorps    %xmm0, %xmm0, %xmm0
> >    vcvtsd2ss %xmm1, %xmm0, %xmm0
> >
> > are faster than both.
>
> On Skylake-client (i7-6700k), I can't reproduce this result in a
> hand-written asm loop. (I was using NASM to make a static executable that
> runs a 100M iteration loop so I could measure with perf). Can you show some
> asm where this performs better?

Please try cvtsd2ss branch at:

https://github.com/hjl-tools/microbenchmark/

On Intel Core i7-6700K, I got

[hjl@gnu-skl-2 microbenchmark]$ make
gcc -g -I. -c -o test.o test.c
gcc -g -c -o sse.o sse.S
gcc -g -c -o sse-clear.o sse-clear.S
gcc -g -c -o avx.o avx.S
gcc -g -c -o avx2.o avx2.S
gcc -g -c -o avx-clear.o avx-clear.S
gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
./test
sse      : 24533145
sse_clear: 24286462
avx      : 64117779
avx2     : 62186716
avx_clear: 58684727
[hjl@gnu-skl-2 microbenchmark]$
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #3 from Peter Cordes ---
(In reply to H.J. Lu from comment #1)
> I have a patch for PR 87007:
>
> https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html
>
> which inserts a vxorps at the last possible position.  vxorps
> will be executed only once in a function.

That's talking about the mem,reg case, which like I said is different.  I
reported Bug 80571 a while ago about the mem,reg case (or gp-reg for
si2ss/d), so it's great that you have a fix for that, doing one xor-zeroing
and reusing that as a merge target for a whole function / loop.

But this bug is about the reg,reg case, where I'm pretty sure there's nothing
to be gained from xor-zeroing anything.  We can fully avoid any false dep
just by choosing both source registers = src, making the destination properly
write-only.

If you *have* an xor-zeroed register, there's no apparent harm in using it as
the merge target for a reg-reg vcvt, vsqrt, vround, or whatever, but there's
no benefit either vs. just setting both source registers the same.  So do
whichever is easier to implement, but ideally we want to avoid introducing a
vxorps into functions / blocks that don't need it at all.
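For the reg,reg case that means simply (sketch):

   vcvtsd2ss %xmm1, %xmm0, %xmm0   # src1 = old %xmm0: false dependency on %xmm0
   vcvtsd2ss %xmm1, %xmm1, %xmm0   # src1 = src2 = %xmm1: %xmm0 is properly write-only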
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #2 from Peter Cordes ---
(In reply to H.J. Lu from comment #1)
> But
>
>    vxorps    %xmm0, %xmm0, %xmm0
>    vcvtsd2ss %xmm1, %xmm0, %xmm0
>
> are faster than both.

On Skylake-client (i7-6700k), I can't reproduce this result in a hand-written
asm loop.  (I was using NASM to make a static executable that runs a 100M
iteration loop so I could measure with perf.)  Can you show some asm where
this performs better?

vcvtsd2ss src-reg,dst,dst is always 2 uops, regardless of the merge
destination being an xor-zeroed register.  (Either zeroed outside the loop,
or inside, or once per 4 converts with an unrolled loop.)

I can't construct a case where vcvtsd2ss %xmm1, %xmm1, %xmm0 is worse in any
way (dependencies, uops, latency, throughput) than VXORPS + vcvtsd2ss with
dst = middle source.  I wasn't mixing it with instructions other than VXORPS,
but I don't think anything is going to get rid of its 2nd uop, and choosing
both inputs = the same source removes any benefit from dep-breaking the
output.

If adding a VXORPS helped, it's probably due to some other side-effect.
Could the effect you saw have been due to code-gen changes for memory
sources, maybe vxorps + vcvtsd2ss (mem), %xmm0, %xmm0 vs. vmovsd +
vcvtsd2ss %xmm1, %xmm1, %xmm0?  (Those should be about equal, but
memory-source SS2SD is cheaper, no port5 uop.)

BTW, the false-dependency effect is much more obvious with SS2SD, where the
latency from src1 to output is 4 cycles, vs. 1 cycle for SD2SS.  Even without
dependency-breaking, repeated vcvtsd2ss %xmm1, %xmm0, %xmm0 can run at 1 per
clock (same as with dep breaking), because the port-5 uop that merges into
the low 32 bits of xmm0 with 1 cycle latency is 2nd.  So latency from
xmm0 -> xmm0 for that [v]cvtsd2ss %xmm1, %xmm0 is 1 cycle.  With
dep-breaking, they both still bottleneck on the port5 uop if you're doing
nothing else.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

H.J. Lu changed:

           What       |Removed    |Added
           ------------------------------------
           Depends on |           |87007

--- Comment #1 from H.J. Lu ---

   vcvtsd2ss %xmm1, %xmm1, %xmm0

is faster than

   vcvtsd2ss %xmm1, %xmm0, %xmm0

But

   vxorps    %xmm0, %xmm0, %xmm0
   vcvtsd2ss %xmm1, %xmm0, %xmm0

are faster than both.  I have a patch for PR 87007:

https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html

which inserts a vxorps at the last possible position.  vxorps will be
executed only once in a function.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87007
[Bug 87007] [8/9 Regression] 10% slowdown with -march=skylake-avx512