On Thu, Aug 27, 2020 at 8:24 PM Jakub Jelinek <ja...@redhat.com> wrote:
>
> On Thu, Jul 09, 2020 at 04:33:46PM +0800, Hongtao Liu via Gcc-patches wrote:
> > +static void
> > +replace_constant_pool_with_broadcast (rtx_insn* insn)
> > +{
> > +  subrtx_ptr_iterator::array_type array;
> > +  FOR_EACH_SUBRTX_PTR (iter, array, &PATTERN (insn), ALL)
> > +    {
> > +      rtx *loc = *iter;
> > +      rtx x = *loc;
> > +      rtx broadcast_mem, vec_dup, constant, first;
> > +      machine_mode mode;
> > +      if (GET_CODE (x) != MEM
>
> MEM_P
>
> > +       || GET_CODE (XEXP (x, 0)) != SYMBOL_REF
>
> SYMBOL_REF_P
>
> > +       || !CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)))
> > +     continue;
> > +
> > +      mode = GET_MODE (x);
> > +      if (!VECTOR_MODE_P (mode))
> > +     return;
> > +
> > +      constant = get_pool_constant (XEXP (x, 0));
> > +      first = XVECEXP (constant, 0, 0);
>
> Shouldn't this verify first that GET_CODE (constant) == CONST_VECTOR
> and punt otherwise?
>
> > +      broadcast_mem = force_const_mem (GET_MODE_INNER (mode), first);
> > +      vec_dup = gen_rtx_VEC_DUPLICATE (mode, broadcast_mem);
> > +      *loc = vec_dup;
> > +      INSN_CODE (insn) = -1;
> > +      /* Revert change if there's no corresponding pattern.  */
> > +      if (recog_memoized (insn) < 0)
> > +             {
> > +               *loc = x;
> > +               recog_memoized (insn);
> > +             }
>
> The usual way of doing this would be through
>   validate_change (insn, loc, vec_dup, 0);
>
> Also, isn't the pass also useful for TARGET_AVX and above (but in that case
> only if it is a simple memory load)?  Or are avx/avx2 broadcast slower than
> full vector loads?
>

Yes, broadcast is insufficient. refer to [1]
[1]. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87767#c5

Since pass_combine won't combine load instruction into memory operand,
i.e. it wouldn't combine *mov mem xmm* and *vaddps xmm, xmm, xmm* into
*vaddps mem, xmm, xmm*
It just left this to ira, broadcast under avx512f isn't excluded.

Maybe patch should be rewritten by using post reload splitter, then
there would be no concern of an extra pass, but still have the latter
issue as you mentioned.

> As Jeff wrote, I wonder if when successfully replacing those pool constants
> the old constant pool entries will be omitted.
>
> Another thing I wonder about is whether more analysis shouldn't be used.
> E.g. if the constant pool entry is already emitted into .rodata anyway
> (e.g. some earlier function needed it), using the broadcast will mean

Yes, some later function may need it either, so we need a global view
to decide the replacement, hope it could be done by generic constant
pool code.

> actually larger .rodata.  If {1to8} and similar is as fast as reading all
> the same elements from memory (or faster), perhaps in that case it should
> broadcast from the first element of the existing constant pool full vector
> rather than creating a new one.
> And similarly, perhaps the function should look at all constant pool entries
> in the current function (not yet emitted into .rodata) and if it would
> succeed for some and not for others, either use broadcast from its first
> element or not perform it for the others too.
>
>         Jakub
>


-- 
BR,
Hongtao

Reply via email to