Kyrill  Tkachov <kyrylo.tkac...@foss.arm.com> writes:
> Hi all,
>
> We've a deficiency in our vec_set family of patterns.  We don't
> support directly loading a vector lane using LD1 for V2DImode and all
> the vector floating-point modes.  We do do it correctly for the other
> integer vector modes (V4SI, V8HI etc) though.
>
> The alternatives on the relative floating-point patterns only allow a
> register-to-register INS instruction.  That means if we want to load a
> value into a vector lane we must first load it into a scalar register
> and then perform an INS, which is wasteful.
>
> There is also an explicit V2DI vec_set expander dangling around for no
> reason that I can see. It seems to do the exact same things as the
> other vec_set expanders. This patch removes that.  It now unifies all
> vec_set expansions into a single "vec_set<mode>" define_expand using
> the catch-all VALL_F16 iterator.
>
> I decided to leave two aarch64_simd_vec_set<mode> define_insns. One
> for the integer vector modes (that now include V2DI) and one for the
> floating-point vector modes. That is so that we can avoid specifying
> "w,r" alternatives for floating-point modes in case the
> register-allocator gets confused and starts gratuitously moving
> registers between the two banks.  So the floating-point pattern only
> two alternatives, one for SIMD-to-SIMD INS and one for LD1.

Did you see any cases in which this was necessary?  In some ways it
seems to run counter to Wilco's recent patches, which tended to remove
the * markers from the "unnatural" register class and trust the register
allocator to make a sensible decision.

I think our default position should be trust the allocator here.
If the consumers all require "w" registers then the RA will surely
try to use "w" registers if at all possible.  But if the consumers
don't care then it seems reasonable to offer both, since in those
cases it doesn't really make much difference whether the payload
happens to be SF or SI (say).

There are also cases in which the consumer could actively require
an integer register.  E.g. some code uses unions to bitcast floats
to ints and then do bitwise arithmetic on them.

> With this patch we avoid loading values into scalar registers and then
> doing an explicit INS on them to move them into the desired vector
> lanes. For example for:
>
> typedef float v4sf __attribute__ ((vector_size (16)));
> typedef long long v2di __attribute__ ((vector_size (16)));
>
> v2di
> foo_v2di (long long *a, long long *b)
> {
>    v2di res = { *a, *b };
>    return res;
> }
>
> v4sf
> foo_v4sf (float *a, float *b, float *c, float *d)
> {
>    v4sf res = { *a, *b, *c, *d };
>    return res;
> }
>
> we currently generate:
>
> foo_v2di:
>          ldr     d0, [x0]
>          ldr     x0, [x1]
>          ins     v0.d[1], x0
>          ret
>
> foo_v4sf:
>          ldr     s0, [x0]
>          ldr     s3, [x1]
>          ldr     s2, [x2]
>          ldr     s1, [x3]
>          ins     v0.s[1], v3.s[0]
>          ins     v0.s[2], v2.s[0]
>          ins     v0.s[3], v1.s[0]
>          ret
>
> but with this patch we generate the much cleaner:
> foo_v2di:
>          ldr     d0, [x0]
>          ld1     {v0.d}[1], [x1]
>          ret
>
> foo_v4sf:
>          ldr     s0, [x0]
>          ld1     {v0.s}[1], [x1]
>          ld1     {v0.s}[2], [x2]
>          ld1     {v0.s}[3], [x3]
>          ret

Nice!  The original reason for:

  /* FIXME: At the moment the cost model seems to underestimate the
     cost of using elementwise accesses.  This check preserves the
     traditional behavior until that can be fixed.  */
  if (*memory_access_type == VMAT_ELEMENTWISE
      && !STMT_VINFO_STRIDED_P (stmt_info)
      && !(stmt == GROUP_FIRST_ELEMENT (stmt_info)
           && !GROUP_NEXT_ELEMENT (stmt_info)
           && !pow2p_hwi (GROUP_SIZE (stmt_info))))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "not falling back to elementwise accesses\n");
      return false;
    }

was that we seemed to be too optimistic about how cheap it was to
construct a vector from scalars.  Maybe this patch brings the code
closer to the cost (for AArch64 only of course).

FWIW, the patch looks good to me bar the GPR/FPR split.

Thanks,
Richard

Reply via email to