[Bug target/101456] Unnecessary vzeroupper when upper bits of YMM registers already zero
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101456

Richard Biener changed:

           What    |Removed     |Added
----------------------------------------
   Target Milestone|12.4        |12.5

--- Comment #13 from Richard Biener ---
GCC 12.4 is being released, retargeting bugs to GCC 12.5.
Richard Biener changed:

           What    |Removed     |Added
----------------------------------------
   Target Milestone|12.3        |12.4

--- Comment #12 from Richard Biener ---
GCC 12.3 is being released, retargeting bugs to GCC 12.4.
Richard Biener changed:

           What    |Removed     |Added
----------------------------------------
   Target Milestone|12.2        |12.3

--- Comment #11 from Richard Biener ---
GCC 12.2 is being released, retargeting bugs to GCC 12.3.
Jakub Jelinek changed:

           What    |Removed     |Added
----------------------------------------
   Target Milestone|12.0        |12.2

--- Comment #10 from Jakub Jelinek ---
GCC 12.1 is being released, retargeting bugs to GCC 12.2.
--- Comment #9 from Hongtao.liu ---
(In reply to H.J. Lu from comment #8)
> It turns out that reading YMM registers with all zero bits needs VZEROUPPER
> on Sandy Bridge, Ivy Bridge, Haswell, Broadwell and Alder Lake to avoid
> SSE <-> AVX transition penalty.

We should add a target tuning for r12-2571-g9775e465c1fbfc32656de77c618c61acf5bd905d.
H.J. Lu changed:

           What    |Removed     |Added
----------------------------------------
     Ever confirmed|0           |1
         Resolution|FIXED       |---
             Status|RESOLVED    |REOPENED
   Last reconfirmed|            |2022-02-15

--- Comment #8 from H.J. Lu ---
It turns out that reading YMM registers whose bits are all zero still requires
VZEROUPPER on Sandy Bridge, Ivy Bridge, Haswell, Broadwell and Alder Lake to
avoid the SSE <-> AVX transition penalty.
H.J. Lu changed:

           What    |Removed     |Added
----------------------------------------
   Target Milestone|---         |12.0
             Status|UNCONFIRMED |RESOLVED
         Resolution|---         |FIXED

--- Comment #7 from H.J. Lu ---
Fixed for GCC 12. Please reopen it if there are other cases where vzeroupper
should be skipped.
--- Comment #6 from CVS Commits ---
The master branch has been updated by H.J. Lu:

https://gcc.gnu.org/g:9775e465c1fbfc32656de77c618c61acf5bd905d

commit r12-2571-g9775e465c1fbfc32656de77c618c61acf5bd905d
Author: H.J. Lu
Date:   Tue Jul 27 07:46:04 2021 -0700

    x86: Don't set AVX_U128_DIRTY when zeroing YMM/ZMM register

    There is no SSE <-> AVX transition penalty if the upper bits of YMM/ZMM
    registers are unchanged and YMM/ZMM store doesn't change the upper bits of
    YMM/ZMM registers.

    1. Since zeroing YMM/ZMM register is implemented with zeroing XMM register,
    don't set AVX_U128_DIRTY when zeroing YMM/ZMM register.
    2. Since store doesn't change the INIT state on the upper bits of YMM/ZMM
    register, don't set AVX_U128_DIRTY on store if the source of store was
    never non-zero.

    Here are the vzeroupper count differences on SPEC CPU 2017 with
    -Ofast -march=skylake-avx512:

                          Before    After      Diff
    500.perlbench_r          226      225    -0.44%
    502.gcc_r               1263     1103   -12.67%
    503.bwaves_r              14       14     0.00%
    505.mcf_r                 29       28    -3.45%
    507.cactuBSSN_r         4651     4628    -0.49%
    508.namd_r               433      432    -0.23%
    510.parest_r           20380    19347    -5.07%
    511.povray_r             495      452    -8.69%
    519.lbm_r                  2        2     0.00%
    520.omnetpp_r           5954     5677    -4.65%
    521.wrf_r              12353    12339    -0.11%
    523.xalancbmk_r        13137    13001    -1.04%
    525.x264_r               192      191    -0.52%
    526.blender_r           2515     2366    -5.92%
    527.cam4_r              4601     4583    -0.39%
    531.deepsjeng_r           20       19    -5.00%
    538.imagick_r            898      805   -10.36%
    541.leela_r              427      399    -6.56%
    544.nab_r                 74       74     0.00%
    548.exchange2_r           72       72     0.00%
    549.fotonik3d_r          318      318     0.00%
    554.roms_r               558      554    -0.72%
    557.xz_r                  79       52   -34.18%

    and performance differences are within noise range.

    gcc/

            PR target/101456
            * config/i386/i386.c (ix86_avx_u128_mode_needed): Don't set
            AVX_U128_DIRTY when all bits are zero.

    gcc/testsuite/

            PR target/101456
            * gcc.target/i386/pr101456-1.c: New test.
            * gcc.target/i386/pr101456-2.c: Likewise.
--- Comment #5 from H.J. Lu ---
We need to verify that LOADING the zero YMM register won't trigger the
SSE <-> AVX transition penalty.
H.J. Lu changed:

           What    |Removed     |Added
----------------------------------------
  Attachment #51153|0           |1
        is obsolete|            |

--- Comment #4 from H.J. Lu ---
Created attachment 51157
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51157&action=edit
A new patch
--- Comment #3 from H.J. Lu ---
(In reply to Arjan van de Ven from comment #1)
> Actually it's not that they're zero (they are) but they're in "init" state
> since the vpxor wrote to xmm not ymm

We generate:

        vxorpd  %xmm0, %xmm0, %xmm0     # 5     [c=4 l=4]  movv4df_internal/0

to zero all bits in YMM and ZMM registers.
--- Comment #2 from H.J. Lu ---
Created attachment 51153
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51153&action=edit
A patch
Arjan van de Ven changed:

           What    |Removed     |Added
----------------------------------------------------------
                 CC|            |arjan at linux dot intel.com

--- Comment #1 from Arjan van de Ven ---
Actually it's not that they're zero (they are) but they're in "init" state
since the vpxor wrote to xmm not ymm