Jennifer Schmitz <jschm...@nvidia.com> writes:
> About the tests: Non-power-of-2 patterns are already being tested in
> gcc.target/aarch64/sve/acle/general/whilelt_5.c.

OK

> For the case of svptrue_b16 () with 8-bit load, I added a test case for it.
> Currently, it has a single test case, but if necessary I can add more tests
> for other data types and for stores as well.

Nah, that should be fine.

> I bootstrapped and tested and the check-function-bodies test in
> gcc.target/aarch64/pr117557.c is currently failing.
> The current GCC trunk produces:
> f:
>         add     x3, x1, 8
>         cmp     x2, x3
>         add     x3, x2, 60
>         ccmp    x1, x3, 2, cc
>         bcc     .L2
>         ptrue   p7.h, vl8
>         index   z31.s, #0, #8
>         ld1b    z0.h, p7/z, [x1]
>         punpklo p6.h, p7.b
>         lsl     z0.h, z0.h, #2
>         punpkhi p7.h, p7.b
>         uunpklo z1.s, z0.h
>         uunpkhi z0.s, z0.h
>         ld1w    z29.s, p6/z, [x0, z1.s, sxtw]
>         ld1w    z30.s, p7/z, [x0, z0.s, sxtw]
>         st1w    z29.s, p6, [x2, z31.s, sxtw]
>         incb    x2, all, mul #2
>         st1w    z30.s, p7, [x2, z31.s, sxtw]
>         ret
> …
>
> WITH my patch it produces:
> f:
>         add     x3, x1, 8
>         cmp     x2, x3
>         add     x3, x2, 60
>         ccmp    x1, x3, 2, cc
>         bcc     .L2
>         ldr     d31, [x1]
>         addvl   sp, sp, #-1
>         ptrue   p5.b, all
>         addpl   x1, sp, #4
>         ptrue   p7.h, vl8
>         punpklo p6.h, p7.b
>         punpkhi p7.h, p7.b
>         str     d31, [x1]
>         ld1b    z31.h, p5/z, [sp, #1, mul vl]
>         lsl     z31.h, z31.h, #2
>         uunpklo z30.s, z31.h
>         uunpkhi z31.s, z31.h
>         ld1w    z29.s, p6/z, [x0, z30.s, sxtw]
>         ld1w    z30.s, p7/z, [x0, z31.s, sxtw]
>         index   z31.s, #0, #8
>         st1w    z29.s, p6, [x2, z31.s, sxtw]
>         incb    x2, all, mul #2
>         st1w    z30.s, p7, [x2, z31.s, sxtw]
>         addvl   sp, sp, #1
>         ret
> …
>
> The patch seems to fold the ptrue p7.h, vl8 and ld1b z0.h, p7/z, [x1] to an
> LDR, but because the ptrue predicate in p7 is needed again later, this is
> not really an optimization. Could we prevent the fold somehow in case the
> predicate is used multiple times?

Using LDR and STR is still better in many cases, because it has one fewer
register to rename and because there's the possibility that the compiler
might form LDPs and STPs.

> [...]
> @@ -3698,6 +3706,19 @@ aarch64_partial_ptrue_length (rtx_vector_builder &builder,
>      return vl;
>  }
>  
> +/* For predicate PRED, return the number of active lanes.  */

That's not really what the function does.  How about:

/* Return:
   * -1 if all bits of PRED are set
   * N if PRED has N leading set bits followed by all clear bits
   * 0 if PRED does not have any of these forms.  */
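FWIW, to make the three cases concrete, here's a rough standalone model of
that contract.  It's only an illustration over a plain bit vector rather
than the rtx-based implementation, and the name partial_ptrue_length_model
is made up:

  #include <cstddef>
  #include <cstdio>
  #include <vector>

  /* Model of the comment above: return -1 if all bits are set, N if there
     are N leading set bits followed by only clear bits, and 0 for any
     other form.  */
  static int
  partial_ptrue_length_model (const std::vector<bool> &bits)
  {
    std::size_t n = bits.size ();
    std::size_t leading = 0;
    while (leading < n && bits[leading])
      leading++;
    if (leading == n)
      return -1;		/* All bits set.  */
    for (std::size_t i = leading; i < n; i++)
      if (bits[i])
        return 0;		/* A set bit after a clear bit: no match.  */
    return (int) leading;	/* N leading set bits, rest clear.  */
  }

  int
  main ()
  {
    printf ("%d\n", partial_ptrue_length_model ({true, true, true, true}));   /* -1 */
    printf ("%d\n", partial_ptrue_length_model ({true, true, false, false})); /* 2 */
    printf ("%d\n", partial_ptrue_length_model ({true, false, true, false})); /* 0 */
  }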
> +int
> +aarch64_partial_ptrue_length (rtx pred)
> +{
> +  rtx_vector_builder builder;
> +  if (!aarch64_get_sve_pred_bits (builder, pred))
> +    return 0;
> +
> +  auto elt_size = vector_element_size (GET_MODE_BITSIZE (GET_MODE (pred)),
> +                                       GET_MODE_NUNITS (GET_MODE (pred)));
> +  return aarch64_partial_ptrue_length (builder, elt_size);
> +}
> +
>  /* See if there is an svpattern that encodes an SVE predicate of mode
>     PRED_MODE in which the first VL bits are set and the rest are clear.
>     Return the pattern if so, otherwise return AARCH64_NUM_SVPATTERNS.
> @@ -6410,8 +6431,32 @@ aarch64_stack_protect_canary_mem (machine_mode mode, rtx decl_rtl,
>    return gen_rtx_MEM (mode, force_reg (Pmode, addr));
>  }
>  
> -/* Emit an SVE predicated move from SRC to DEST.  PRED is a predicate
> -   that is known to contain PTRUE.  */
> +/* Emit a load/store from a subreg of SRC to a subreg of DEST.
> +   The subregs have mode NEW_MODE.  Use only for reg<->mem moves.  */
> +void
> +aarch64_emit_load_store_through_mode (rtx dest, rtx src, machine_mode new_mode)
> +{
> +  gcc_assert ((REG_P (src) && MEM_P (dest))
> +              || (REG_P (dest) && MEM_P (src)));

We should allow subregs too, so how about:

  gcc_assert ((MEM_P (dest) && register_operand (src, VOIDmode))
              || (MEM_P (src) && register_operand (dest, VOIDmode)));

> +  auto mode = GET_MODE (dest);
> +  auto int_mode = aarch64_sve_int_mode (mode);
> +  if (MEM_P (src))
> +    {
> +      rtx tmp = force_reg (new_mode, adjust_address (src, new_mode, 0));
> +      tmp = force_lowpart_subreg (int_mode, tmp, new_mode);
> +      emit_move_insn (dest, force_lowpart_subreg (mode, tmp, int_mode));
> +    }
> +  else
> +    {
> +      src = force_lowpart_subreg (int_mode, src, mode);
> +      emit_move_insn (adjust_address (dest, new_mode, 0),
> +                      force_lowpart_subreg (new_mode, src, int_mode));
> +    }
> +}
> +
> +/* PRED is a predicate that is known to contain PTRUE.
> +   For 128-bit VLS loads/stores, emit LDR/STR.
> +   Else, emit an SVE predicated move from SRC to DEST.  */
>  
>  void
>  aarch64_emit_sve_pred_move (rtx dest, rtx pred, rtx src)
> @@ -6421,16 +6466,7 @@ aarch64_emit_sve_pred_move (rtx dest, rtx pred, rtx src)
>        && known_eq (GET_MODE_SIZE (mode), 16)
>        && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA
>        && !BYTES_BIG_ENDIAN)
> -    {
> -      if (MEM_P (src))
> -        {
> -          rtx tmp = force_reg (V16QImode, adjust_address (src, V16QImode, 0));
> -          emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
> -        }
> -      else
> -        emit_move_insn (adjust_address (dest, V16QImode, 0),
> -                        force_lowpart_subreg (V16QImode, src, mode));
> -    }
> +    aarch64_emit_load_store_through_mode (dest, src, V16QImode);
>    else
>      {
>        expand_operand ops[3];
> @@ -23526,6 +23562,31 @@ aarch64_simd_valid_imm (rtx op, simd_immediate_info *info,
>    return false;
>  }
>  
> +/* If the predicate in operands[2] is a patterned SVE PTRUE predicate
> +   with patterns VL1, VL2, VL4, VL8, or VL16 and at most the bottom
> +   128 bits are loaded/stored, emit an ASIMD load/store and return true.
> +   Else, return false.  */
> +bool
> +aarch64_sve_maskloadstore (rtx *operands, machine_mode mode)

How about s/sve_maskloadstore/expand_maskloadstore/?  Since the name of
the function is general either way, how about a function comment:

/* Try to optimize the expansion of a maskload or maskstore with
   the operands in OPERANDS, given that the vector being loaded or
   stored has mode MODE.  Return true on success or false if the
   normal expansion should be used.  */

Then we can describe the individual optimisations in the function body
(admittedly just one optimisation for now), using your comment above
with "and return true..." removed.

> +{
> +  int vl = aarch64_partial_ptrue_length (operands[2]);
> +  int width = vl * GET_MODE_UNIT_BITSIZE (mode);
> +  if (width <= 128 && pow2p_hwi (vl) && (vl == 1 || !BYTES_BIG_ENDIAN))

It looks like the check for fully-packed modes got dropped.  How about:

  if (width <= 128
      && pow2p_hwi (vl)
      && (vl == 1
          || (!BYTES_BIG_ENDIAN
              && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA)))
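FWIW, a rough standalone sketch of how that condition classifies the cases;
the names (use_asimd_load_store, fully_packed_sve_data) are just
illustrative placeholders, not anything in the patch:

  #include <cstdio>

  /* Model of the condition above: take the ASIMD LDR/STR path only when
     the VL active elements fit in the bottom 128 bits, VL is a power of
     two, and (for VL > 1) the target is little-endian and the mode is a
     fully-packed SVE data mode.  */
  static bool
  use_asimd_load_store (int vl, int unit_bits, bool big_endian,
                        bool fully_packed_sve_data)
  {
    int width = vl * unit_bits;
    bool pow2_vl = vl > 0 && (vl & (vl - 1)) == 0;
    return (width <= 128
            && pow2_vl
            && (vl == 1 || (!big_endian && fully_packed_sve_data)));
  }

  int
  main ()
  {
    /* VL8 over 16-bit elements: 8 * 16 = 128 bits, so the ASIMD path applies.  */
    printf ("%d\n", use_asimd_load_store (8, 16, false, true));
    /* VL3 is not a power of two, so the normal expansion is used.  */
    printf ("%d\n", use_asimd_load_store (3, 32, false, true));
    /* Big-endian with VL > 1 also falls back to the normal expansion.  */
    printf ("%d\n", use_asimd_load_store (4, 32, true, true));
  }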
Otherwise it looks good, thanks.

Richard