> On 2 May 2025, at 18:01, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> Jennifer Schmitz <jschm...@nvidia.com> writes:
>> SVE loads/stores using predicates that select the bottom 8, 16, 32, 64,
>> or 128 bits of a register can be folded to ASIMD LDR/STR, thus avoiding the
>> predicate.
>> For example,
>> svuint8_t foo (uint8_t *x) {
>>  return svld1 (svwhilelt_b8 (0, 16), x);
>> }
>> was previously compiled to:
>> foo:
>>      ptrue   p3.b, vl16
>>      ld1b    z0.b, p3/z, [x0]
>>      ret
>> 
>> and is now compiled to:
>> foo:
>>      ldr     q0, [x0]
>>      ret
>> 
>> The optimization is applied during the expand pass and was implemented
>> by making the following changes to maskload<mode><vpred> and
>> maskstore<mode><vpred>:
>> - the existing define_insns were renamed and new define_expands for maskloads
>>  and maskstores were added with predicates for the SVE predicate that match
>>  both register operands and constant-vector operands.
>> - if the SVE predicate is a constant vector and contains a pattern as
>>  described above, an ASIMD load/store is emitted instead of the SVE load/store.
>> 
>> The patch implements the optimization for LD1 and ST1, for 8-bit, 16-bit,
>> 32-bit, 64-bit, and 128-bit moves, for all full SVE data vector modes.
>> Note that VNx8HFmode and VNx2BFmode with a VL2 pattern were excluded, because
>> there are no move patterns for V2HFmode and V2BFmode (yet).
>> 
>> Follow-up patches for LD2/3/4 and ST2/3/4 and potentially partial SVE vector
>> modes are planned.
>> 
>> The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
>> OK for mainline?
>> 
>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>> 
>> gcc/
>>      PR target/117978
>>      * config/aarch64/aarch64-protos.h: Declare
>>      aarch64_simd_container_mode, aarch64_sve_full_data_mode_p,
>>      aarch64_count_pred_pat_128, aarch64_emit_load_store_through_mode.
>>      * config/aarch64/aarch64-sve.md
>>      (maskload<mode><vpred>): New define_expand folding maskloads with
>>      certain predicate patterns to ASIMD loads.
>>      (*aarch64_maskload<mode><vpred>): Renamed from maskload<mode><vpred>.
>>      (maskstore<mode><vpred>): New define_expand folding maskstores with
>>      certain predicate patterns to ASIMD stores.
>>      (*aarch64_maskstore<mode><vpred>): Renamed from maskstore<mode><vpred>.
>>      * config/aarch64/aarch64.cc
>>      (aarch64_sve_full_data_mode_p): New function returning true if a given
>>      mode is a full SVE data vector mode.
>>      (aarch64_emit_load_store_through_mode): New function emitting a
>>      load/store through subregs of a given mode.
>>      (aarch64_emit_sve_pred_move): Refactor to use
>>      aarch64_emit_load_store_through_mode.
>>      (aarch64_v8_mode): New function returning an 8-bit mode.
>>      (aarch64_v16_mode): New function returning a 16-bit mode.
>>      (aarch64_v32_mode): New function returning a 32-bit mode.
>>      (aarch64_simd_container_mode): Make public and extend to find
>>      8-bit, 16-bit, and 32-bit container modes.
>>      (aarch64_count_pred_pat_128): New function to find SVE predicates
>>      with VL1, VL2, VL4, VL8, or VL16 patterns.
>>      * config/aarch64/iterators.md (elem_bits): Extend to cover partial
>>      SVE vector modes.
>>      * config/aarch64/predicates.md (aarch64_sve_reg_or_const_pred): New
>>      predicate matching register operands or constant-vector operands.
>> 
>> gcc/testsuite/
>>      PR target/117978
>>      * gcc.target/aarch64/sve/acle/general/whilelt_5.c: Adjust expected
>>      outcome.
>>      * gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c: New test.
>>      * gcc.target/aarch64/sve/while_7.c: Adjust expected outcome.
>>      * gcc.target/aarch64/sve/while_9.c: Adjust expected outcome.
>> ---
>> gcc/config/aarch64/aarch64-protos.h           |   4 +
>> gcc/config/aarch64/aarch64-sve.md             |  62 ++++++++-
>> gcc/config/aarch64/aarch64.cc                 | 128 +++++++++++++++---
>> gcc/config/aarch64/iterators.md               |  19 ++-
>> gcc/config/aarch64/predicates.md              |   4 +
>> .../aarch64/sve/acle/general/whilelt_5.c      |  24 +++-
>> .../aarch64/sve/ldst_ptrue_pat_128_to_neon.c  |  81 +++++++++++
>> .../gcc.target/aarch64/sve/while_7.c          |   4 +-
>> .../gcc.target/aarch64/sve/while_9.c          |   2 +-
>> 9 files changed, 296 insertions(+), 32 deletions(-)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c
>> 
>> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
>> index 1ca86c9d175..a03f091fe3a 100644
>> --- a/gcc/config/aarch64/aarch64-protos.h
>> +++ b/gcc/config/aarch64/aarch64-protos.h
>> @@ -857,6 +857,7 @@ enum aarch64_symbol_type aarch64_classify_symbolic_expression (rtx);
>> bool aarch64_advsimd_struct_mode_p (machine_mode mode);
>> opt_machine_mode aarch64_v64_mode (scalar_mode);
>> opt_machine_mode aarch64_v128_mode (scalar_mode);
>> +machine_mode aarch64_simd_container_mode (scalar_mode, poly_int64);
>> opt_machine_mode aarch64_full_sve_mode (scalar_mode);
>> bool aarch64_can_const_movi_rtx_p (rtx x, machine_mode mode);
>> bool aarch64_valid_fp_move (rtx, rtx, machine_mode);
>> @@ -903,8 +904,10 @@ opt_machine_mode aarch64_advsimd_vector_array_mode (machine_mode,
>>                                                  unsigned HOST_WIDE_INT);
>> opt_machine_mode aarch64_sve_data_mode (scalar_mode, poly_uint64);
>> bool aarch64_sve_mode_p (machine_mode);
>> +bool aarch64_sve_full_data_mode_p (machine_mode);
>> HOST_WIDE_INT aarch64_fold_sve_cnt_pat (aarch64_svpattern, unsigned int);
>> bool aarch64_sve_cnt_immediate_p (rtx);
>> +int aarch64_count_pred_pat_128 (rtx, machine_mode);
>> bool aarch64_sve_scalar_inc_dec_immediate_p (rtx);
>> bool aarch64_sve_rdvl_immediate_p (rtx);
>> bool aarch64_sve_addvl_addpl_immediate_p (rtx);
>> @@ -1026,6 +1029,7 @@ rtx aarch64_ptrue_reg (machine_mode, unsigned int);
>> rtx aarch64_ptrue_reg (machine_mode, machine_mode);
>> rtx aarch64_pfalse_reg (machine_mode);
>> bool aarch64_sve_same_pred_for_ptest_p (rtx *, rtx *);
>> +void aarch64_emit_load_store_through_mode (rtx, rtx, machine_mode);
>> void aarch64_emit_sve_pred_move (rtx, rtx, rtx);
>> void aarch64_expand_sve_mem_move (rtx, rtx, machine_mode);
>> bool aarch64_maybe_expand_sve_subreg_move (rtx, rtx);
>> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
>> index d4af3706294..d9392e3611a 100644
>> --- a/gcc/config/aarch64/aarch64-sve.md
>> +++ b/gcc/config/aarch64/aarch64-sve.md
>> @@ -1286,7 +1286,36 @@
>> ;; -------------------------------------------------------------------------
>> 
>> ;; Predicated LD1 (single).
>> -(define_insn "maskload<mode><vpred>"
>> +(define_expand "maskload<mode><vpred>"
>> +  [(set (match_operand:SVE_ALL 0 "register_operand")
>> +     (unspec:SVE_ALL
>> +       [(match_operand:<VPRED> 2 "aarch64_sve_reg_or_const_pred")
> 
> I don't think we need a new predicate for this.  nonmemory_operand
> should be good enough.
> 
>> +        (match_operand:SVE_ALL 1 "memory_operand")
>> +        (match_operand:SVE_ALL 3 "aarch64_maskload_else_operand")]
>> +       UNSPEC_LD1_SVE))]
>> +  "TARGET_SVE"
>> +  {
>> +    int pat_cnt = aarch64_count_pred_pat_128 (operands[2], <MODE>mode);
>> +    int width = <elem_bits> * pat_cnt;
>> +    if (aarch64_sve_full_data_mode_p (<MODE>mode)
>> +     && pat_cnt && (pat_cnt == 1 || !BYTES_BIG_ENDIAN)
>> +     && known_le (width, 128))
>> +      {
>> +     machine_mode mode = aarch64_simd_container_mode (<VEL>mode, width);
> 
> I'm not sure this needs to be so complex.  It should be enough to pick
> an arbitrary 128-bit vector mode (say V16QI) for width==128,
> an arbitrary 64-bit vector mode (say V8QI) for width==64,
> and an integer mode for smaller widths
> (int_mode_for_size (...).require ()).
> 
> Experimenting locally, I assume you did the above to avoid ICEs while
> forming the subregs.  But that's something we should fix in
> aarch64_emit_load_store_through_mode.  Writing it as:
> 
> /* Emit a load/store from a subreg of SRC to a subreg of DEST.
>   The subregs have mode NEW_MODE. Use only for reg<->mem moves.  */
> void
> aarch64_emit_load_store_through_mode (rtx dest, rtx src, machine_mode new_mode)
> {
>  gcc_assert ((REG_P (src) && MEM_P (dest))
>              || (REG_P (dest) && MEM_P (src)));
>  auto mode = GET_MODE (dest);
>  auto int_mode = aarch64_sve_int_mode (mode);
>  if (MEM_P (src))
>    {
>      rtx tmp = force_reg (new_mode, adjust_address (src, new_mode, 0));
>      tmp = force_lowpart_subreg (int_mode, tmp, new_mode);
>      emit_move_insn (dest, force_lowpart_subreg (mode, tmp, int_mode));
>    }
>  else
>    {
>      src = force_lowpart_subreg (int_mode, src, mode);
>      emit_move_insn (adjust_address (dest, new_mode, 0),
>                      force_lowpart_subreg (new_mode, src, int_mode));
>    }
> }
> 
> should cope with the general case, and means that the new test has
> no SVE loads and stores.
> 
>> +     if (mode != VOIDmode)
>> +     {
>> +       aarch64_emit_load_store_through_mode (operands[0],
>> +                                             operands[1], mode);
>> +       DONE;
>> +     }
>> +      }
>> +    if (!REG_P (operands[2]))
> 
> Probably better to use if (CONSTANT_P (...)) or if (!register_operand (...)),
> so that we don't create a new register if operands[2] is a subreg.
> 
>> +      operands[2] = force_reg (<VPRED>mode, operands[2]);
>> +  }
>> +)
>> +
>> +;; Predicated LD1 (single).
>> +(define_insn "*aarch64_maskload<mode><vpred>"
>>   [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
>>      (unspec:SVE_ALL
>>        [(match_operand:<VPRED> 2 "register_operand" "Upl")
>> @@ -2287,7 +2316,36 @@
>> ;; -------------------------------------------------------------------------
>> 
>> ;; Predicated ST1 (single).
>> -(define_insn "maskstore<mode><vpred>"
>> +(define_expand "maskstore<mode><vpred>"
>> +  [(set (match_operand:SVE_ALL 0 "memory_operand")
>> +     (unspec:SVE_ALL
>> +       [(match_operand:<VPRED> 2 "aarch64_sve_reg_or_const_pred")
>> +        (match_operand:SVE_ALL 1 "register_operand")
>> +        (match_dup 0)]
>> +       UNSPEC_ST1_SVE))]
>> +  "TARGET_SVE"
>> +  {
>> +    int pat_cnt = aarch64_count_pred_pat_128 (operands[2], <MODE>mode);
>> +    int width = <elem_bits> * pat_cnt;
>> +    if (aarch64_sve_full_data_mode_p (<MODE>mode)
>> +     && pat_cnt && (pat_cnt == 1 || !BYTES_BIG_ENDIAN)
>> +     && known_le (width, 128))
>> +      {
>> +     machine_mode mode = aarch64_simd_container_mode (<VEL>mode, width);
>> +     if (mode != VOIDmode)
>> +     {
>> +       aarch64_emit_load_store_through_mode (operands[0],
>> +                                             operands[1], mode);
>> +       DONE;
>> +     }
> 
> This seems a bit too complex for cut-&-paste.  Could you put it in a common
> aarch64.cc helper?
> 
>> +      }
>> +    if (!REG_P (operands[2]))
>> +      operands[2] = force_reg (<VPRED>mode, operands[2]);
>> +  }
>> +)
>> +
>> +;; Predicated ST1 (single).
>> +(define_insn "*aarch64_maskstore<mode><vpred>"
>>   [(set (match_operand:SVE_ALL 0 "memory_operand" "+m")
>>      (unspec:SVE_ALL
>>        [(match_operand:<VPRED> 2 "register_operand" "Upl")
>> [...]
>> @@ -23526,6 +23602,26 @@ aarch64_simd_valid_imm (rtx op, simd_immediate_info *info,
>>   return false;
>> }
>> 
>> +/* If PRED is a patterned SVE PTRUE predicate with patterns
>> +   VL1, VL2, VL4, VL8, or VL16, return the number of active lanes
>> +   for the mode MODE. Else return 0.  */
>> +int
>> +aarch64_count_pred_pat_128 (rtx pred, machine_mode mode)
>> +{
>> +  struct simd_immediate_info info;
>> +  bool is_valid;
>> +  is_valid = aarch64_simd_valid_imm (pred, &info, AARCH64_CHECK_MOV);
>> +  if (!is_valid || info.insn != simd_immediate_info::PTRUE)
>> +    return 0;
>> +  aarch64_svpattern pattern = info.u.pattern;
>> +  unsigned int cnt
>>     = aarch64_fold_sve_cnt_pat (pattern, 128 / GET_MODE_UNIT_BITSIZE (mode));
>> +  if (pattern <= AARCH64_SV_VL16 && pow2p_hwi (cnt))
>> +    return cnt;
>> +  else
>> +    return 0;
>> +}
>> +
> 
> It would be better to avoid the round-trip through aarch64_simd_valid_imm,
> since the caller doesn't really care whether the length is a valid ptrue
> operand.  How about instead:
> 
> int
> aarch64_partial_ptrue_length (rtx pred)
> {
>  rtx_vector_builder builder;
>  if (!aarch64_get_sve_pred_bits (builder, pred))
>    return 0;
> 
>  auto elt_size = vector_element_size (GET_MODE_BITSIZE (GET_MODE (pred)),
>                                       GET_MODE_NUNITS (GET_MODE (pred)));
>  return aarch64_partial_ptrue_length (builder, elt_size);
> }
> 
> This needs an extra change to aarch64_partial_ptrue_length to handle
> the VL1 case:  (I remember the VL1 case being discussed recently in
> a different context.)
> 
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index fff8d9da49d..1c4194ce883 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -3667,6 +3674,14 @@ aarch64_partial_ptrue_length (rtx_vector_builder &builder,
>   if (builder.nelts_per_pattern () == 3)
>     return 0;
> 
> +  /* It is conservatively correct to drop the element size to a lower value,
> +     and we must do so if the predicate consists of a leading "foreground"
> +     sequence that is smaller than the element size.  Without this,
> +     we would test only one bit and so treat everything as either an
> +     all-true or an all-false predicate.  */
> +  if (builder.nelts_per_pattern () == 2)
> +    elt_size = MIN (elt_size, builder.npatterns ());
> +
>   /* Skip over leading set bits.  */
>   unsigned int nelts = builder.encoded_nelts ();
>   unsigned int i = 0;
> 
> I haven't bootstrapped & regression-tested that yet, but I'll give it a go.
Dear Richard,
Thanks for the detailed and helpful feedback. I made the changes as suggested.
About the tests: non-power-of-2 patterns are already being tested in
gcc.target/aarch64/sve/acle/general/whilelt_5.c. For the case of svptrue_b16 ()
with an 8-bit load, I added a test case (quoted below). Currently there is only
this single test case, but if necessary I can add more tests for other data
types and for stores as well.
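
For reference, the added case from ldst_ptrue_pat_128_to_neon.c is:

svint8_t foo (int8_t *x)
{
  return svld1_s8 (svptrue_b16 (), x);
}

which is expected to keep the predicated SVE sequence (ptrue p.h, all + ld1b)
rather than being folded to an ASIMD load.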

I bootstrapped and tested the patch, and the check-function-bodies test in
gcc.target/aarch64/pr117557.c is currently failing.
The current GCC trunk produces:
f:
        add x3, x1, 8
        cmp x2, x3
        add x3, x2, 60
        ccmp x1, x3, 2, cc
        bcc .L2
        ptrue p7.h, vl8
        index z31.s, #0, #8
        ld1b z0.h, p7/z, [x1]
        punpklo p6.h, p7.b
        lsl z0.h, z0.h, #2
        punpkhi p7.h, p7.b
        uunpklo z1.s, z0.h
        uunpkhi z0.s, z0.h
        ld1w z29.s, p6/z, [x0, z1.s, sxtw]
        ld1w z30.s, p7/z, [x0, z0.s, sxtw]
        st1w z29.s, p6, [x2, z31.s, sxtw]
        incb x2, all, mul #2
        st1w z30.s, p7, [x2, z31.s, sxtw]
        ret
        …

WITH my patch it produces:
f:
        add     x3, x1, 8
        cmp     x2, x3
        add     x3, x2, 60
        ccmp    x1, x3, 2, cc
        bcc     .L2
        ldr     d31, [x1]
        addvl   sp, sp, #-1
        ptrue   p5.b, all
        addpl   x1, sp, #4
        ptrue   p7.h, vl8
        punpklo p6.h, p7.b
        punpkhi p7.h, p7.b
        str     d31, [x1]
        ld1b    z31.h, p5/z, [sp, #1, mul vl]
        lsl     z31.h, z31.h, #2
        uunpklo z30.s, z31.h
        uunpkhi z31.s, z31.h
        ld1w    z29.s, p6/z, [x0, z30.s, sxtw]
        ld1w    z30.s, p7/z, [x0, z31.s, sxtw]
        index   z31.s, #0, #8
        st1w    z29.s, p6, [x2, z31.s, sxtw]
        incb    x2, all, mul #2
        st1w    z30.s, p7, [x2, z31.s, sxtw]
        addvl   sp, sp, #1
        ret
        …

The patch seems to fold the ptrue p7.h, vl8 and ld1b z0.h, p7/z, [x1] to an LDR,
but because the ptrue predicate in p7 is needed again later, this is not really 
an optimization. Could we prevent the fold somehow in case the predicate is used
multiple times?

Thanks,
Jennifer


SVE loads/stores using predicates that select the bottom 8, 16, 32, 64,
or 128 bits of a register can be folded to ASIMD LDR/STR, thus avoiding the
predicate.
For example,
svuint8_t foo (uint8_t *x) {
  return svld1 (svwhilelt_b8 (0, 16), x);
}
was previously compiled to:
foo:
        ptrue   p3.b, vl16
        ld1b    z0.b, p3/z, [x0]
        ret

and is now compiled to:
foo:
        ldr     q0, [x0]
        ret

The optimization is applied during the expand pass and was implemented
by making the following changes to maskload<mode><vpred> and
maskstore<mode><vpred>:
- the existing define_insns were renamed and new define_expands for maskloads
  and maskstores were added, using nonmemory_operand for the SVE predicate
  operand so that it accepts both register operands and constant-vector
  operands.
- if the SVE predicate is a constant vector and contains a pattern as
  described above, an ASIMD load/store is emitted instead of the SVE load/store.

The patch implements the optimization for LD1 and ST1, for 8-bit, 16-bit,
32-bit, 64-bit, and 128-bit moves, for all full SVE data vector modes.
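
For example, a 64-bit case such as the following (illustrative, mirroring the
new test's 4-element 16-bit load; function name chosen for the example)

svuint16_t bar (uint16_t *x) {
  return svld1 (svwhilelt_b16 (0, 4), x);
}

is now expected to compile to (register allocation may differ):

bar:
        ldr     d0, [x0]
        ret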

Follow-up patches for LD2/3/4 and ST2/3/4 and potentially partial SVE vector
modes are planned.

The patch was bootstrapped and tested on aarch64-linux-gnu. Currently,
gcc.target/aarch64/pr117557.c fails and still needs to be fixed.

Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>

gcc/
        PR target/117978
        * config/aarch64/aarch64-protos.h: Declare
        aarch64_emit_load_store_through_mode and aarch64_sve_maskloadstore.
        * config/aarch64/aarch64-sve.md
        (maskload<mode><vpred>): New define_expand folding maskloads with
        certain predicate patterns to ASIMD loads.
        (*aarch64_maskload<mode><vpred>): Renamed from maskload<mode><vpred>.
        (maskstore<mode><vpred>): New define_expand folding maskstores with
        certain predicate patterns to ASIMD stores.
        (*aarch64_maskstore<mode><vpred>): Renamed from maskstore<mode><vpred>.
        * config/aarch64/aarch64.cc
        (aarch64_emit_load_store_through_mode): New function emitting a
        load/store through subregs of a given mode.
        (aarch64_emit_sve_pred_move): Refactor to use
        aarch64_emit_load_store_through_mode.
        (aarch64_sve_maskloadstore): New function to emit ASIMD loads/stores
        for maskloads/stores with SVE predicates with VL1, VL2, VL4, VL8, or
        VL16 patterns.
        (aarch64_partial_ptrue_length): New function returning the number of
        active lanes for a given SVE predicate.

gcc/testsuite/
        PR target/117978
        * gcc.target/aarch64/sve/acle/general/whilelt_5.c: Adjust expected
        outcome.
        * gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c: New test.
        * gcc.target/aarch64/sve/while_7.c: Adjust expected outcome.
        * gcc.target/aarch64/sve/while_9.c: Adjust expected outcome.
---
 gcc/config/aarch64/aarch64-protos.h           |  2 +
 gcc/config/aarch64/aarch64-sve.md             | 38 ++++++++-
 gcc/config/aarch64/aarch64.cc                 | 85 ++++++++++++++++---
 .../aarch64/sve/acle/general/whilelt_5.c      | 24 ++++--
 .../aarch64/sve/ldst_ptrue_pat_128_to_neon.c  | 81 ++++++++++++++++++
 .../gcc.target/aarch64/sve/while_7.c          |  4 +-
 .../gcc.target/aarch64/sve/while_9.c          |  2 +-
 7 files changed, 214 insertions(+), 22 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 1ca86c9d175..b2003f77af8 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1026,6 +1026,8 @@ rtx aarch64_ptrue_reg (machine_mode, unsigned int);
 rtx aarch64_ptrue_reg (machine_mode, machine_mode);
 rtx aarch64_pfalse_reg (machine_mode);
 bool aarch64_sve_same_pred_for_ptest_p (rtx *, rtx *);
+void aarch64_emit_load_store_through_mode (rtx, rtx, machine_mode);
+bool aarch64_sve_maskloadstore (rtx *, machine_mode);
 void aarch64_emit_sve_pred_move (rtx, rtx, rtx);
 void aarch64_expand_sve_mem_move (rtx, rtx, machine_mode);
 bool aarch64_maybe_expand_sve_subreg_move (rtx, rtx);
diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 7bf12ff25cc..a34b0bfeaa2 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -1286,7 +1286,24 @@
 ;; -------------------------------------------------------------------------
 
 ;; Predicated LD1 (single).
-(define_insn "maskload<mode><vpred>"
+(define_expand "maskload<mode><vpred>"
+  [(set (match_operand:SVE_ALL 0 "register_operand")
+       (unspec:SVE_ALL
+         [(match_operand:<VPRED> 2 "nonmemory_operand")
+          (match_operand:SVE_ALL 1 "memory_operand")
+          (match_operand:SVE_ALL 3 "aarch64_maskload_else_operand")]
+         UNSPEC_LD1_SVE))]
+  "TARGET_SVE"
+  {
+    if (aarch64_sve_maskloadstore (operands, <MODE>mode))
+      DONE;
+    if (CONSTANT_P (operands[2]))
+      operands[2] = force_reg (<VPRED>mode, operands[2]);
+  }
+)
+
+;; Predicated LD1 (single).
+(define_insn "*aarch64_maskload<mode><vpred>"
   [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
        (unspec:SVE_ALL
          [(match_operand:<VPRED> 2 "register_operand" "Upl")
@@ -2287,7 +2304,24 @@
 ;; -------------------------------------------------------------------------
 
 ;; Predicated ST1 (single).
-(define_insn "maskstore<mode><vpred>"
+(define_expand "maskstore<mode><vpred>"
+  [(set (match_operand:SVE_ALL 0 "memory_operand")
+       (unspec:SVE_ALL
+         [(match_operand:<VPRED> 2 "nonmemory_operand")
+          (match_operand:SVE_ALL 1 "register_operand")
+          (match_dup 0)]
+         UNSPEC_ST1_SVE))]
+  "TARGET_SVE"
+  {
+    if (aarch64_sve_maskloadstore (operands, <MODE>mode))
+      DONE;
+    if (CONSTANT_P (operands[2]))
+      operands[2] = force_reg (<VPRED>mode, operands[2]);
+  }
+)
+
+;; Predicated ST1 (single).
+(define_insn "*aarch64_maskstore<mode><vpred>"
   [(set (match_operand:SVE_ALL 0 "memory_operand" "+m")
        (unspec:SVE_ALL
          [(match_operand:<VPRED> 2 "register_operand" "Upl")
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index fff8d9da49d..2287e28f6d2 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -3667,6 +3667,14 @@ aarch64_partial_ptrue_length (rtx_vector_builder &builder,
   if (builder.nelts_per_pattern () == 3)
     return 0;
 
+  /* It is conservatively correct to drop the element size to a lower value,
+     and we must do so if the predicate consists of a leading "foreground"
+     sequence that is smaller than the element size.  Without this,
+     we would test only one bit and so treat everything as either an
+     all-true or an all-false predicate.  */
+  if (builder.nelts_per_pattern () == 2)
+    elt_size = MIN (elt_size, builder.npatterns ());
+
   /* Skip over leading set bits.  */
   unsigned int nelts = builder.encoded_nelts ();
   unsigned int i = 0;
@@ -3698,6 +3706,19 @@ aarch64_partial_ptrue_length (rtx_vector_builder &builder,
   return vl;
 }
 
+/* For predicate PRED, return the number of active lanes.  */
+int
+aarch64_partial_ptrue_length (rtx pred)
+{
+  rtx_vector_builder builder;
+  if (!aarch64_get_sve_pred_bits (builder, pred))
+    return 0;
+
+  auto elt_size = vector_element_size (GET_MODE_BITSIZE (GET_MODE (pred)),
+                                      GET_MODE_NUNITS (GET_MODE (pred)));
+  return aarch64_partial_ptrue_length (builder, elt_size);
+}
+
 /* See if there is an svpattern that encodes an SVE predicate of mode
    PRED_MODE in which the first VL bits are set and the rest are clear.
    Return the pattern if so, otherwise return AARCH64_NUM_SVPATTERNS.
@@ -6410,8 +6431,32 @@ aarch64_stack_protect_canary_mem (machine_mode mode, rtx decl_rtl,
   return gen_rtx_MEM (mode, force_reg (Pmode, addr));
 }
 
-/* Emit an SVE predicated move from SRC to DEST.  PRED is a predicate
-   that is known to contain PTRUE.  */
+/* Emit a load/store from a subreg of SRC to a subreg of DEST.
+   The subregs have mode NEW_MODE. Use only for reg<->mem moves.  */
+void
+aarch64_emit_load_store_through_mode (rtx dest, rtx src, machine_mode new_mode)
+{
+  gcc_assert ((REG_P (src) && MEM_P (dest))
+             || (REG_P (dest) && MEM_P (src)));
+  auto mode = GET_MODE (dest);
+  auto int_mode = aarch64_sve_int_mode (mode);
+  if (MEM_P (src))
+    {
+      rtx tmp = force_reg (new_mode, adjust_address (src, new_mode, 0));
+      tmp = force_lowpart_subreg (int_mode, tmp, new_mode);
+      emit_move_insn (dest, force_lowpart_subreg (mode, tmp, int_mode));
+    }
+  else
+    {
+      src = force_lowpart_subreg (int_mode, src, mode);
+      emit_move_insn (adjust_address (dest, new_mode, 0),
+                     force_lowpart_subreg (new_mode, src, int_mode));
+    }
+}
+
+/* PRED is a predicate that is known to contain PTRUE.
+   For 128-bit VLS loads/stores, emit LDR/STR.
+   Else, emit an SVE predicated move from SRC to DEST.  */
 
 void
 aarch64_emit_sve_pred_move (rtx dest, rtx pred, rtx src)
@@ -6421,16 +6466,7 @@ aarch64_emit_sve_pred_move (rtx dest, rtx pred, rtx src)
       && known_eq (GET_MODE_SIZE (mode), 16)
       && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA
       && !BYTES_BIG_ENDIAN)
-    {
-      if (MEM_P (src))
-       {
-         rtx tmp = force_reg (V16QImode, adjust_address (src, V16QImode, 0));
-         emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
-       }
-      else
-       emit_move_insn (adjust_address (dest, V16QImode, 0),
-                       force_lowpart_subreg (V16QImode, src, mode));
-    }
+    aarch64_emit_load_store_through_mode (dest, src, V16QImode);
   else
     {
       expand_operand ops[3];
@@ -23526,6 +23562,31 @@ aarch64_simd_valid_imm (rtx op, simd_immediate_info *info,
   return false;
 }
 
+/* If the predicate in operands[2] is a patterned SVE PTRUE predicate
+   with patterns VL1, VL2, VL4, VL8, or VL16 and at most the bottom
+   128 bits are loaded/stored, emit an ASIMD load/store and return true.
+   Else, return false.  */
+bool
+aarch64_sve_maskloadstore (rtx *operands, machine_mode mode)
+{
+  int vl = aarch64_partial_ptrue_length (operands[2]);
+  int width = vl * GET_MODE_UNIT_BITSIZE (mode);
+  if (width <= 128 && pow2p_hwi (vl) && (vl == 1 || !BYTES_BIG_ENDIAN))
+    {
+      machine_mode new_mode;
+      if (known_eq (width, 128))
+       new_mode = V16QImode;
+      else if (known_eq (width, 64))
+       new_mode = V8QImode;
+      else
+       new_mode = int_mode_for_size (width, 0).require ();
+      aarch64_emit_load_store_through_mode (operands[0], operands[1],
+                                           new_mode);
+      return true;
+    }
+  return false;
+}
+
 /* Return true if OP is a valid SIMD move immediate for SVE or AdvSIMD.  */
 bool
 aarch64_simd_valid_mov_imm (rtx op)
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/whilelt_5.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/whilelt_5.c
index f06a74aa2da..05e266aad7d 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/whilelt_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/whilelt_5.c
@@ -11,8 +11,7 @@ extern "C" {
 
 /*
 ** load_vl1:
-**     ptrue   (p[0-7])\.[bhsd], vl1
-**     ld1h    z0\.h, \1/z, \[x0\]
+**     ldr     h0, \[x0\]
 **     ret
 */
 svint16_t
@@ -22,7 +21,12 @@ load_vl1 (int16_t *ptr)
 }
 
 /*
-** load_vl2:
+** load_vl2: { target aarch64_little_endian }
+**     ldr     s0, \[x0\]
+**     ret
+*/
+/*
+** load_vl2: { target aarch64_big_endian }
 **     ptrue   (p[0-7])\.h, vl2
 **     ld1h    z0\.h, \1/z, \[x0\]
 **     ret
@@ -46,7 +50,12 @@ load_vl3 (int16_t *ptr)
 }
 
 /*
-** load_vl4:
+** load_vl4: { target aarch64_little_endian }
+**     ldr     d0, \[x0\]
+**     ret
+*/
+/*
+** load_vl4: { target aarch64_big_endian }
 **     ptrue   (p[0-7])\.h, vl4
 **     ld1h    z0\.h, \1/z, \[x0\]
 **     ret
@@ -94,7 +103,12 @@ load_vl7 (int16_t *ptr)
 }
 
 /*
-** load_vl8:
+** load_vl8: { target aarch64_little_endian }
+**     ldr     q0, \[x0\]
+**     ret
+*/
+/*
+** load_vl8: { target aarch64_big_endian }
 **     ptrue   (p[0-7])\.h, vl8
 **     ld1h    z0\.h, \1/z, \[x0\]
 **     ret
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c b/gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c
new file mode 100644
index 00000000000..2d47c1f1a3d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c
@@ -0,0 +1,81 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-require-effective-target aarch64_little_endian } */
+
+#include <arm_sve.h>
+
+#define TEST(TYPE, TY, W, B)                                           \
+  sv##TYPE                                                             \
+  ld1_##TY##W##B##_1 (TYPE *x)                                         \
+  {                                                                    \
+    svbool_t pg = svwhilelt_b##B (0, W);                               \
+    return svld1_##TY##B (pg, x);                                      \
+  }                                                                    \
+  sv##TYPE                                                             \
+  ld1_##TY##W##B##_2 (TYPE *x)                                         \
+  {                                                                    \
+    svbool_t pg = svptrue_pat_b##B ((enum svpattern) (W > 8 ? 9 : W)); \
+    return svld1_##TY##B (pg, x);                                      \
+  }                                                                    \
+  void                                                                 \
+  st1_##TY##W##B##_1 (TYPE *x, sv##TYPE data)                          \
+  {                                                                    \
+    svbool_t pg = svwhilelt_b##B (0, W);                               \
+    return svst1_##TY##B (pg, x, data);                                \
+  }                                                                    \
+  void                                                                 \
+  st1_##TY##W##B##_2 (TYPE *x, sv##TYPE data)                          \
+  {                                                                    \
+    svbool_t pg = svptrue_pat_b##B ((enum svpattern) (W > 8 ? 9 : W)); \
+    return svst1_##TY##B (pg, x, data);                                \
+  }                                                                    \
+
+#define TEST64(TYPE, TY, B)                            \
+  TEST (TYPE, TY, 1, B)                                        \
+  TEST (TYPE, TY, 2, B)                                        \
+
+#define TEST32(TYPE, TY, B)                            \
+  TEST64 (TYPE, TY, B)                                 \
+  TEST (TYPE, TY, 4, B)                                        \
+
+#define TEST16(TYPE, TY, B)                            \
+  TEST32 (TYPE, TY, B)                                 \
+  TEST (TYPE, TY, 8, B)                                        \
+
+#define TEST8(TYPE, TY, B)                             \
+  TEST16 (TYPE, TY, B)                                 \
+  TEST (TYPE, TY, 16, B)
+
+#define T(TYPE, TY, B)                 \
+  TEST##B (TYPE, TY, B)
+
+T (bfloat16_t, bf, 16)
+T (float16_t, f, 16)
+T (float32_t, f, 32)
+T (float64_t, f, 64)
+T (int8_t, s, 8)
+T (int16_t, s, 16)
+T (int32_t, s, 32)
+T (int64_t, s, 64)
+T (uint8_t, u, 8)
+T (uint16_t, u, 16)
+T (uint32_t, u, 32)
+T (uint64_t, u, 64)
+
+/* { dg-final { scan-assembler-times {\tldr\tq0, \[x0\]} 24 } } */
+/* { dg-final { scan-assembler-times {\tldr\td0, \[x0\]} 24 } } */
+/* { dg-final { scan-assembler-times {\tldr\ts0, \[x0\]} 18 } } */
+/* { dg-final { scan-assembler-times {\tldr\th0, \[x0\]} 12 } } */
+/* { dg-final { scan-assembler-times {\tldr\tb0, \[x0\]} 4 } } */
+
+/* { dg-final { scan-assembler-times {\tstr\tq0, \[x0\]} 24 } } */
+/* { dg-final { scan-assembler-times {\tstr\td0, \[x0\]} 24 } } */
+/* { dg-final { scan-assembler-times {\tstr\ts0, \[x0\]} 18 } } */
+/* { dg-final { scan-assembler-times {\tstr\th0, \[x0\]} 12 } } */
+/* { dg-final { scan-assembler-times {\tstr\tb0, \[x0\]} 4 } } */
+
+svint8_t foo (int8_t *x)
+{
+  return svld1_s8 (svptrue_b16 (), x);
+}
+/* { dg-final { scan-assembler-times {\tptrue\tp[0-7]\.h, all\n\tld1b} 1 } } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/while_7.c b/gcc/testsuite/gcc.target/aarch64/sve/while_7.c
index a66a20d21f6..ab2fa3646fc 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/while_7.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/while_7.c
@@ -19,7 +19,7 @@
 
 TEST_ALL (ADD_LOOP)
 
-/* { dg-final { scan-assembler-times {\tptrue\tp[0-7]\.b, vl8\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tptrue\tp[0-7]\.h, vl8\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tldr\td[0-9]+, \[x0\]} 1 } } */
+/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+, \[x0\]} 1 } } */
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s,} 2 } } */
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d,} 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/while_9.c b/gcc/testsuite/gcc.target/aarch64/sve/while_9.c
index dd3f404ab39..99940dd73fa 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/while_9.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/while_9.c
@@ -19,7 +19,7 @@
 
 TEST_ALL (ADD_LOOP)
 
-/* { dg-final { scan-assembler-times {\tptrue\tp[0-7]\.b, vl16\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tldr\tq[0-9]+\, \[x0\]} 1 } } */
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h,} 2 } } */
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s,} 2 } } */
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d,} 2 } } */
-- 
2.34.1

> 
> On the tests: it would be good to have a second testcase for things that
> can't use the new path, such as non-power-of-2 ptrue VLs, or passing
> svptrue_b16() to an 8-bit load.  (Unless that's already covered elsewhere.)
> 
> Thanks,
> Richard

