On Thu, May 13, 2021 at 11:40 AM Uros Bizjak <ubiz...@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 11:18 AM Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > Hi:
> >   When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> > vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
> > which leads to incorrect optimization in pass_reload.
> > In order to solve this problem, this patch introduces a pre_reload
> > splitter which adds CLOBBERS to vzeroupper's pattern, it can solve the
> > problem in pr.
> >
> > At the same time, in order to optimize the low 128 bits in
> > post_reload CSE, this patch also transforms those CLOBBERS to SETs in
> > pass_vzeroupper.
> >
> > It works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
> > are callee-saved, so even if there're no other uses of xmm6-xmm15 in the
> > function, because of vzeroupper's pattern, pro_epilog will save and
> > restore those registers, which is obviously redundant. In order to
> > eliminate this redundancy, a post_reload splitter is introduced, which
> > drops those SETs, until epilogue_completed splitter adds those SETs
> > back, it looks to be safe since there's no CSE between post_reload
> > split2 and epilogue_completed split3??? Also frame info needs to be
> > updated in pro_epilog, which saves and restores xmm6-xmm15 only if
> > there's usage other than explicit vzeroupper pattern.
> >
> >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> >   Ok for trunk?
>
> Some time ago a support for CLOBBER_HIGH RTX was added (and later
> removed for some reason). Perhaps we could resurrect the patch for the
> purpose of ferrying 128bit modes via vzeroupper RTX?

https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html

Uros.

>
> +(define_split
> +  [(match_parallel 0 "vzeroupper_pattern"
> +     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> +  "TARGET_AVX && ix86_pre_reload_split ()"
> +  [(match_dup 0)]
> +{
> +  /* When vzeroupper is explictly used, for LRA purpose, make it clear
> +     the instruction kills sse registers.  */
> +  gcc_assert (cfun->machine->has_explicit_vzeroupper);
> +  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> +  rtvec vec = rtvec_alloc (nregs + 1);
> +  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
> +                        gen_rtvec (1, const1_rtx),
> +                        UNSPECV_VZEROUPPER);
> +  for (unsigned int i = 0; i < nregs; ++i)
> +    {
> +      unsigned int regno = GET_SSE_REGNO (i);
> +      rtx reg = gen_rtx_REG (V2DImode, regno);
> +      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
> +    }
> +  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
> +})
>
> Wouldn't this also kill lower 128bit values that are not touched by
> vzeroupper? A CLOBBER_HIGH would be more appropriate here.
>
> Uros.
>
>
> > gcc/ChangeLog:
> >
> >         PR target/82735
> >         * config/i386/i386-expand.c (ix86_expand_builtin): Count
> >         number of __builtin_ia32_vzeroupper.
> >         * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
> >         Transform CLOBBERs to SETs for explicit vzeroupper pattern so
> >         that CSE can optimize lower 128 bits.
> >         * config/i386/i386.c 
> > (ix86_handle_explicit_vzeroupper_in_pro_epilog):
> >         New.
> >         (ix86_save_reg): If there's no use of xmm6~xmm15 other than
> >         explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
> >         REGNO.
> >         (ix86_finalize_stack_frame_flags): Recompute frame layout if
> >         there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
> >         * config/i386/i386.h (struct machine_function): Change type of
> >         has_explicit_vzeroupper from BOOL_BITFILED to unsigned int.
> >         * config/i386/sse.md (*avx_vzeroupper_2): New post-reload
> >         splitter which will drop all SETs for explicit vzeroupper
> >         patterns.
> >         (*avx_vzeroupper_1): Generate SET reg to reg instead of
> >         CLOBBER, and add pre-reload splitter after it.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR target/82735
> >         * gcc.target/i386/pr82735-1.c: New test.
> >         * gcc.target/i386/pr82735-2.c: New test.
> >         * gcc.target/i386/pr82735-3.c: New test.
> >         * gcc.target/i386/pr82735-4.c: New test.
> >         * gcc.target/i386/pr82735-5.c: New test.
> >
> >
> > --
> > BR,
> > Hongtao

Reply via email to