On Thu, May 13, 2021 at 11:18 AM Hongtao Liu <crazy...@gmail.com> wrote:
>
> Hi:
>   When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> vzeroupper pattern does not carry any CLOBBERs or SETs before LRA,
> which leads to incorrect optimization in pass_reload.
>   In order to solve this problem, this patch introduces a pre_reload
> splitter which adds CLOBBERs to vzeroupper's pattern; that solves the
> problem in the PR.
>   At the same time, in order to optimize the low 128 bits in
> post_reload CSE, this patch also transforms those CLOBBERs to SETs in
> pass_vzeroupper.
>   It works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
> are callee-saved, so even if there are no other uses of xmm6-xmm15 in
> the function, pro_epilog will save and restore those registers because
> of vzeroupper's pattern, which is obviously redundant. In order to
> eliminate this redundancy, a post_reload splitter is introduced which
> drops those SETs, until an epilogue_completed splitter adds them back.
> This looks to be safe since there's no CSE between post_reload split2
> and epilogue_completed split3??? Also, frame info needs to be updated
> in pro_epilog, which saves and restores xmm6-xmm15 only if there's a
> use other than the explicit vzeroupper pattern.
>
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> Ok for trunk?
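
For context, the wrong-code scenario boils down to a 256-bit value kept
live across an explicit vzeroupper. A minimal sketch along those lines
(the actual pr82735-*.c tests may differ), compiled with -mavx:

#include <immintrin.h>

/* Sketch only: ymm is live across an explicit vzeroupper.  If the RTL
   pattern carries no CLOBBERs/SETs, LRA may keep ymm in a hard register
   across the vzeroupper, and the second store then reads a value whose
   upper 128 bits have been zeroed.  */
void
foo (char *dest)
{
  __m256i ymm = _mm256_set1_epi8 (1);
  _mm256_storeu_si256 ((__m256i *) dest, ymm);
  __builtin_ia32_vzeroupper ();   /* same as _mm256_zeroupper ()  */
  _mm256_storeu_si256 ((__m256i *) (dest + 32), ymm);
}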
Some time ago, support for CLOBBER_HIGH RTX was added (and later
removed for some reason). Perhaps we could resurrect the patch for the
purpose of ferrying 128bit modes via vzeroupper RTX?

+(define_split
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX && ix86_pre_reload_split ()"
+  [(match_dup 0)]
+{
+  /* When vzeroupper is explicitly used, for LRA purpose, make it clear
+     the instruction kills sse registers.  */
+  gcc_assert (cfun->machine->has_explicit_vzeroupper);
+  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+  rtvec vec = rtvec_alloc (nregs + 1);
+  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
+                                                gen_rtvec (1, const1_rtx),
+                                                UNSPECV_VZEROUPPER);
+  for (unsigned int i = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      rtx reg = gen_rtx_REG (V2DImode, regno);
+      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+    }
+  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
+})

Wouldn't this also kill lower 128bit values that are not touched by
vzeroupper? A CLOBBER_HIGH would be more appropriate here (see the
sketch at the end of this message).

Uros.

> gcc/ChangeLog:
>
>         PR target/82735
>         * config/i386/i386-expand.c (ix86_expand_builtin): Count
>         number of __builtin_ia32_vzeroupper.
>         * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
>         Transform CLOBBERs to SETs for explicit vzeroupper pattern so
>         that CSE can optimize lower 128 bits.
>         * config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
>         New.
>         (ix86_save_reg): If there's no use of xmm6~xmm15 other than
>         explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
>         REGNO.
>         (ix86_finalize_stack_frame_flags): Recompute frame layout if
>         there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
>         * config/i386/i386.h (struct machine_function): Change type of
>         has_explicit_vzeroupper from BOOL_BITFIELD to unsigned int.
>         * config/i386/sse.md (*avx_vzeroupper_2): New post-reload
>         splitter which will drop all SETs for explicit vzeroupper
>         patterns.
>         (*avx_vzeroupper_1): Generate SET reg to reg instead of
>         CLOBBER, and add pre-reload splitter after it.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/82735
>         * gcc.target/i386/pr82735-1.c: New test.
>         * gcc.target/i386/pr82735-2.c: New test.
>         * gcc.target/i386/pr82735-3.c: New test.
>         * gcc.target/i386/pr82735-4.c: New test.
>         * gcc.target/i386/pr82735-5.c: New test.
>
>
> --
> BR,
> Hongtao
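
To make the CLOBBER_HIGH suggestion concrete, a rough sketch only: it
assumes the removed CLOBBER_HIGH RTX code (and its generated
gen_rtx_CLOBBER_HIGH (VOIDmode, reg) constructor) were resurrected, and
it is not part of the posted patch. The loop in the quoted splitter
would then mark only the bits above the low 128 as clobbered:

  /* Hypothetical variant, assuming CLOBBER_HIGH is brought back.
     (clobber_high (reg:V2DI N)) would say that only the bits above
     V2DImode (i.e. above the low 128 bits) of the register are
     clobbered, so the low halves stay live for CSE and the register
     allocator.  */
  for (unsigned int i = 0; i < nregs; ++i)
    {
      unsigned int regno = GET_SSE_REGNO (i);
      rtx reg = gen_rtx_REG (V2DImode, regno);
      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER_HIGH (VOIDmode, reg);
    }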