On Fri, Jan 23, 2026 at 8:38 AM Liu, Hongtao <[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Thursday, January 22, 2026 8:19 PM
> > To: Hongtao Liu <[email protected]>
> > Cc: Liu, Hongtao <[email protected]>; [email protected]
> > Subject: Re: [PATCH] [x86] Add u-arch tune prefer_bcst_from_integer.
> >
> > On Thu, 22 Jan 2026, Hongtao Liu wrote:
> >
> > > On Thu, Jan 22, 2026 at 6:37 PM Richard Biener <[email protected]> wrote:
> > > >
> > > > On Tue, 20 Jan 2026, liuhongt wrote:
> > > >
> > > > > > /* X86_TUNE_PREFER_BCST_FROM_INTEGER: Enable broadcast from integer for
> > > > >    128/256/512-bit vector, if disabled, the move will be done by
> > > > >    broadcast/load from constant pool
> > > > >
> > > > >    broadcast from integer:
> > > > >       mov    $0xa,%eax
> > > > >       vmovd  %eax,%xmm0
> > > > >       vpbroadcastd %xmm0,%xmm0
> > > > >
> > > > >    broadcast/load from constant pool:
> > > > >       vpbroadcastd CST.0(%rip), %xmm0  */
> > > > >
> > > > > The tune is on by default.
> > > >
> > > > Thanks!
> > > >
> > > > I wonder if the broadcast/load from constant pool could be late
> > > > split into the former when a GPR is available (and the tune
> > > > indicates it's profitable).
> > > Good point, maybe the implementation of this tune could also be
> > > changed to convert the load of const_vector into broadcast from memory
> > > before RA, and then decide after RA whether we need to use a GPR to
> > > avoid this load. I'm not sure whether doing this optimization after RA
> > > has potential issues; I suspect there could be some level of
> > > conflict between this and RA.
> >
> > I'd say it will clearly help the case when there's a lot of GPR register 
> > pressure.
> Based on similar situations I've seen before, I think in this case the RA 
> will put this constant into memory and then directly broadcast from it. 
> However, I will continue to verify whether this is the case.

The testcase in the link below has high register pressure, and RA
"allocates" the constant as a broadcast from memory instead of putting
it in a GPR and broadcasting from there.
https://godbolt.org/z/6aE3obz5b

Maybe there are still some mis-optimizations, but it is at least tunable
through the cost of integer <-> SSE moves.
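
For reference, a minimal sketch of the kind of source where such a
constant broadcast shows up. This is not the testcase behind the link;
the function names are made up for illustration, and it uses GNU C
vector extensions rather than intrinsics so that it builds on any
target. Which of the two sequences is actually emitted depends on the
GCC version, -march and the tune settings, so no expected assembly is
quoted here.

```c
/* Hypothetical reduced example, not the testcase from the godbolt link.
   Uses GNU C vector extensions (supported by GCC and Clang).  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Broadcast the integer constant 10 (0xa) to all four 32-bit lanes.
   On x86 this can become either a broadcast from a GPR or a
   broadcast/load from the constant pool.  */
v4si
bcst_const (void)
{
  return (v4si) { 10, 10, 10, 10 };
}

/* Use the broadcast constant in a lane-wise addition.  */
v4si
add_const (v4si x)
{
  return x + bcst_const ();
}
```

Assuming the patch is applied, the two variants could presumably be
compared by compiling with something like -O2 -mavx2, with and without
-mtune-ctrl=^prefer_bcst_from_integer (-mtune-ctrl is an existing
developer option; the feature name is the one the patch adds).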
>
> > OTOH possibly the bcast from memory pattern could have an alternative with
> > a scratch GPR for later use in splitting, so the RA would have a choice?  That
> > alternative could be disabled when !X86_TUNE_PREFER_BCST_FROM_INTEGER?
> >
> > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >       PR target/123631
> > > > >       * config/i386/i386-expand.cc (ix86_vector_duplicate_value):
> > > > > >       Don't force CONST_INT to reg !TARGET_PREFER_BCST_FROM_INTEGER,
> > > > >       force it to mem instead.
> > > > > >       * config/i386/i386.h (TARGET_PREFER_BCST_FROM_INTEGER): New macro.
> > > > >       * config/i386/x86-tune.def
> > > > >       (X86_TUNE_PREFER_BCST_FROM_INTEGER): New tune.
> > > > > ---
> > > > >  gcc/config/i386/i386-expand.cc | 17 +++++++++++++----
> > > > >  gcc/config/i386/i386.h         |  3 +++
> > > > >  gcc/config/i386/x86-tune.def   | 15 +++++++++++++++
> > > > >  3 files changed, 31 insertions(+), 4 deletions(-)
> > > > >
> > > > > > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> > > > > > index d6525ddcdd0..a82bb4399c9 100644
> > > > > > --- a/gcc/config/i386/i386-expand.cc
> > > > > > +++ b/gcc/config/i386/i386-expand.cc
> > > > > > @@ -17361,12 +17361,21 @@ ix86_vector_duplicate_value (machine_mode mode, rtx target, rtx val)
> > > > >        machine_mode innermode = GET_MODE_INNER (mode);
> > > > >        rtx reg;
> > > > >
> > > > > -      /* If that fails, force VAL into a register.  */
> > > > > +      /* If that fails, force VAL into a register or mem.  */
> > > > >
> > > > >        start_sequence ();
> > > > > -      reg = force_reg (innermode, val);
> > > > > -      if (GET_MODE (reg) != innermode)
> > > > > -     reg = gen_lowpart (innermode, reg);
> > > > > +
> > > > > +      if (!TARGET_PREFER_BCST_FROM_INTEGER && CONST_INT_P (val)
> > > > > > +       && GET_MODE_BITSIZE (innermode) <= HOST_BITS_PER_WIDE_INT
> > > > > +       && GET_MODE_BITSIZE(mode) >= 128)
> > > > > +     reg = validize_mem (force_const_mem (innermode, val));
> > > > > +      else
> > > > > +     {
> > > > > +       reg = force_reg (innermode, val);
> > > > > +       if (GET_MODE (reg) != innermode)
> > > > > +         reg = gen_lowpart (innermode, reg);
> > > > > +     }
> > > > > +
> > > > >        SET_SRC (PATTERN (insn)) = gen_vec_duplicate (mode, reg);
> > > > >        seq = end_sequence ();
> > > > >        if (seq)
> > > > > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > > > > > index 71bacc22052..888edfed88f 100644
> > > > > > --- a/gcc/config/i386/i386.h
> > > > > > +++ b/gcc/config/i386/i386.h
> > > > > > @@ -409,6 +409,9 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
> > > > >       ix86_tune_features[X86_TUNE_INTER_UNIT_MOVES_FROM_VEC]
> > > > >  #define TARGET_INTER_UNIT_CONVERSIONS \
> > > > >       ix86_tune_features[X86_TUNE_INTER_UNIT_CONVERSIONS]
> > > > > +#define TARGET_PREFER_BCST_FROM_INTEGER \
> > > > > +  ix86_tune_features[X86_TUNE_PREFER_BCST_FROM_INTEGER]
> > > > > +
> > > > > >  #define TARGET_FOUR_JUMP_LIMIT ix86_tune_features[X86_TUNE_FOUR_JUMP_LIMIT]
> > > > > >  #define TARGET_SCHEDULE ix86_tune_features[X86_TUNE_SCHEDULE]
> > > > > >  #define TARGET_USE_BT ix86_tune_features[X86_TUNE_USE_BT]
> > > > > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > > > > > index a1944620daf..53cf1a19433 100644
> > > > > > --- a/gcc/config/i386/x86-tune.def
> > > > > > +++ b/gcc/config/i386/x86-tune.def
> > > > > > @@ -488,6 +488,21 @@ DEF_TUNE (X86_TUNE_INTER_UNIT_MOVES_FROM_VEC, "inter_unit_moves_from_vec",
> > > > > >  DEF_TUNE (X86_TUNE_INTER_UNIT_CONVERSIONS, "inter_unit_conversions",
> > > > > >            ~(m_AMDFAM10 | m_BDVER))
> > > > >
> > > > > > +/* X86_TUNE_PREFER_BCST_FROM_INTEGER: Enable broadcast from integer for
> > > > > +   128/256/512-bit vector, if disabled, the move will be done by
> > > > > +   broadcast/load from constant pool
> > > > > +
> > > > > +   broadcast from integer:
> > > > > +      mov    $0xa,%eax
> > > > > +      vmovd  %eax,%xmm0
> > > > > +      vpbroadcastd %xmm0,%xmm0
> > > > > +
> > > > > +   broadcast/load from constant pool:
> > > > > +      vpbroadcastd CST.0(%rip), %xmm0  */
> > > > > +
> > > > > > +DEF_TUNE (X86_TUNE_PREFER_BCST_FROM_INTEGER, "prefer_bcst_from_integer",
> > > > > +          m_ALL)
> > > > > +
> > > > > >  /* X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS: Try to split memory operand for
> > > > > >     fp converts to destination register.  */
> > > > > >  DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, "split_mem_opnd_for_fp_converts",
> > > > >
> > > >
> > > > --
> > > > Richard Biener <[email protected]>
> > > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > > > Nuernberg, Germany;
> > > > GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > > > Nuernberg)
> > >
> > >
> > >
> > >
> >
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > Nuernberg)



-- 
BR,
Hongtao
