"Yangfei (Felix)" <felix.y...@huawei.com> writes:
> Hi,
>
>   I noticed a tiny SVE-related performance issue:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95254
>
>   For the given test case, SLP succeeds with VNx8HImode with or without 
> option -msve-vector-bits=256.
>   The root cause for the difference is that we choose a different mode in 
> aarch64_vectorize_related_mode under -msve-vector-bits=256: VNx2HImode 
> instead of V4HImode.
>   Then in the final tree ssa forwprop pass, we need to do a VIEW_CONVERT from 
> V4HImode to VNx2HImode.
>
>   The patch catches and simplifies the pattern in aarch64_expand_sve_mem_move,
> emitting a V4HImode mov pattern instead.
>   I am assuming that endianness does not make a difference for this
> simplification.
>   Bootstrapped and tested on aarch64-linux-gnu.  Comments?

I think we should try to catch this at the gimple level, possibly
during SLP itself.  Although the patch handles the case in which
the V4HI is stored directly to memory, I assume it won't help
if the code instead does:

    for (i = 0; i < 4; i++)
      b[i] = u.a[i] + 1;

targetm.can_change_mode_class (..., ALL_REGS) would be a good
indicator of whether the needed VIEW_CONVERT_EXPR is cheap or expensive.
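Purely for illustration (the modes and register class below are just the
ones from the PR, and this is only a sketch of the kind of check I mean,
not code from any existing pass):

    /* Sketch: the punning is effectively free if the target allows the
       mode change in all register classes; otherwise the subreg implies
       a round trip through memory and should be treated as expensive.  */
    bool expensive_p
      = !targetm.can_change_mode_class (V4HImode, VNx2HImode, ALL_REGS);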

I agree it might still be worth handling this in the move patterns too.
It feels like a target-independent optimisation though, and for example
should also apply to V4HI moves involving subregs of VNx2HIs.

So I think it would be worth trying to do this in emit_move_insn.
In principle it would be useful for:

  // M1 and M2 equal size, !targetm.can_change_mode_class (M1, M2, ALL_REGS)

  (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...)))
  (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR))
  (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...)))
  (set (subreg:M1 (reg:M2 ...)) (constant C))

It would be nice if we could do this even without the can_change_mode_class
condition, provided that it doesn't turn valid M1 constants or MEMs into
invalid M2 ones (or at least, M2 ones that need to be validated).
Unfortunately, going that far is likely to interfere with target-specific
choices, so it's probably too dangerous.

With the can_change_mode_class condition it should be fine though,
since it's avoiding an implicit round trip through memory.  The change
should be a win even if the MEMs or constants need legitimising for M2.
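Roughly, and only as an untested sketch of the shape I have in mind
(the variable names are made up), the emit_move_insn side could do
something like:

    /* Sketch: if one side of the move is a same-sized subreg of a
       register whose mode cannot be changed to the move's mode without
       going through memory, strip the subreg and move in the inner mode.  */
    rtx x_inner = NULL_RTX;
    if (SUBREG_P (x)
        && REG_P (SUBREG_REG (x))
        && known_eq (GET_MODE_SIZE (mode),
                     GET_MODE_SIZE (GET_MODE (SUBREG_REG (x))))
        && !targetm.can_change_mode_class (mode,
                                           GET_MODE (SUBREG_REG (x)),
                                           ALL_REGS))
      x_inner = SUBREG_REG (x);

with any MEMs and constants then being converted to the inner mode
(and legitimised where necessary) before the move is emitted.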

> [...]
> +  if (inner != NULL_RTX
> +      && aarch64_classify_vector_mode (inner_mode) == VEC_ADVSIMD
> +      && GET_MODE_INNER (mode) == GET_MODE_INNER (inner_mode)
> +      && known_eq (GET_MODE_SIZE (mode), GET_MODE_SIZE (inner_mode))
> +      && GET_MODE_BITSIZE (inner_mode).to_constant () <= alignment)
> +    {
> +      rtx addr, mem;
> +      if (MEM_P (src))
> +     {
> +       addr = XEXP (src, 0);
> +       mem = gen_rtx_MEM (inner_mode, addr);
> +       emit_move_insn (inner, mem);
> +     }
> +      else
> +     {
> +       addr = XEXP (dest, 0);
> +       mem = gen_rtx_MEM (inner_mode, addr);
> +       emit_move_insn (mem, inner);

gen_rtx_MEM shouldn't be used to create new versions of existing MEMs,
since it drops all the attributes.  It's better to use something like
adjust_address instead.  That will also take care of making sure that
the address is valid for the new mode.
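I.e. for the load side something like (sketch only):

    /* adjust_address keeps the MEM attributes from SRC and revalidates
       the address for the Advanced SIMD mode.  */
    mem = adjust_address (src, inner_mode, 0);
    emit_move_insn (inner, mem);

and similarly for the store side.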

Thanks,
Richard
