Jennifer Schmitz <jschm...@nvidia.com> writes:
> SVE loads/stores using predicates that select the bottom 8, 16, 32, 64,
> or 128 bits of a register can be folded to ASIMD LDR/STR, thus avoiding the
> predicate.
> For example,
> svuint8_t foo (uint8_t *x) {
>   return svld1 (svwhilelt_b8 (0, 16), x);
> }
> was previously compiled to:
> foo:
>         ptrue   p3.b, vl16
>         ld1b    z0.b, p3/z, [x0]
>         ret
>
> and is now compiled to:
> foo:
>         ldr     q0, [x0]
>         ret
>
> The optimization is applied during the expand pass and was implemented
> by making the following changes to maskload<mode><vpred> and
> maskstore<mode><vpred>:
> - the existing define_insns were renamed and new define_expands for maskloads
>   and maskstores were added with predicates for the SVE predicate that match
>   both register operands and constant-vector operands.
> - if the SVE predicate is a constant vector and contains a pattern as
>   described above, an ASIMD load/store is emitted instead of the SVE
>   load/store.
>
> The patch implements the optimization for LD1 and ST1, for 8-bit, 16-bit,
> 32-bit, 64-bit, and 128-bit moves, for all full SVE data vector modes.
> Note that VNx8HFmode and VNx2BFmode with a VL2 pattern were excluded, because
> there are no move patterns for V2HFmode and V2BFmode (yet).
>
> Follow-up patches for LD2/3/4 and ST2/3/4 and potentially partial SVE vector
> modes are planned.
>
> The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>
> gcc/
>         PR target/117978
>         * config/aarch64/aarch64-protos.h: Declare
>         aarch64_simd_container_mode, aarch64_sve_full_data_mode_p,
>         aarch64_count_pred_pat_128, aarch64_emit_load_store_through_mode.
>         * config/aarch64/aarch64-sve.md
>         (maskload<mode><vpred>): New define_expand folding maskloads with
>         certain predicate patterns to ASIMD loads.
>         (*aarch64_maskload<mode><vpred>): Renamed from maskload<mode><vpred>.
>         (maskstore<mode><vpred>): New define_expand folding maskstores with
>         certain predicate patterns to ASIMD stores.
>         (*aarch64_maskstore<mode><vpred>): Renamed from maskstore<mode><vpred>.
>         * config/aarch64/aarch64.cc
>         (aarch64_sve_full_data_mode_p): New function returning true if a given
>         mode is a full SVE data vector mode.
>         (aarch64_emit_load_store_through_mode): New function emitting a
>         load/store through subregs of a given mode.
>         (aarch64_emit_sve_pred_move): Refactor to use
>         aarch64_emit_load_store_through_mode.
>         (aarch64_v8_mode): New function returning an 8-bit mode.
>         (aarch64_v16_mode): New function returning a 16-bit mode.
>         (aarch64_v32_mode): New function returning a 32-bit mode.
>         (aarch64_simd_container_mode): Make public and extend to find
>         8-bit, 16-bit, and 32-bit container modes.
>         (aarch64_count_pred_pat_128): New function to find SVE predicates
>         with VL1, VL2, VL4, VL8, or VL16 patterns.
>         * config/aarch64/iterators.md (elem_bits): Extend to cover partial
>         SVE vector modes.
>         * config/aarch64/predicates.md (aarch64_sve_reg_or_const_pred): New
>         predicate matching register operands or constant-vector operands.
>
> gcc/testsuite/
>         PR target/117978
>         * gcc.target/aarch64/sve/acle/general/whilelt_5.c: Adjust expected
>         outcome.
>         * gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c: New test.
>         * gcc.target/aarch64/sve/while_7.c: Adjust expected outcome.
>         * gcc.target/aarch64/sve/while_9.c: Adjust expected outcome.
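(As an aside, for anyone skimming the cover letter: my understanding is
that stores are folded in the same way, so e.g.

void bar (uint8_t *x, svuint8_t v) {
  svst1 (svwhilelt_b8 (0, 16), x, v);
}

should now become a single "str q0, [x0]" rather than a ptrue + st1b
pair.  That's my own example rather than one taken from the patch, so
the exact output is a guess.)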
> ---
>  gcc/config/aarch64/aarch64-protos.h           |   4 +
>  gcc/config/aarch64/aarch64-sve.md             |  62 ++++++++-
>  gcc/config/aarch64/aarch64.cc                 | 128 +++++++++++++++---
>  gcc/config/aarch64/iterators.md               |  19 ++-
>  gcc/config/aarch64/predicates.md              |   4 +
>  .../aarch64/sve/acle/general/whilelt_5.c      |  24 +++-
>  .../aarch64/sve/ldst_ptrue_pat_128_to_neon.c  |  81 +++++++++++
>  .../gcc.target/aarch64/sve/while_7.c          |   4 +-
>  .../gcc.target/aarch64/sve/while_9.c          |   2 +-
>  9 files changed, 296 insertions(+), 32 deletions(-)
>  create mode 100644
> gcc/testsuite/gcc.target/aarch64/sve/ldst_ptrue_pat_128_to_neon.c
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> index 1ca86c9d175..a03f091fe3a 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -857,6 +857,7 @@ enum aarch64_symbol_type
> aarch64_classify_symbolic_expression (rtx);
>  bool aarch64_advsimd_struct_mode_p (machine_mode mode);
>  opt_machine_mode aarch64_v64_mode (scalar_mode);
>  opt_machine_mode aarch64_v128_mode (scalar_mode);
> +machine_mode aarch64_simd_container_mode (scalar_mode, poly_int64);
>  opt_machine_mode aarch64_full_sve_mode (scalar_mode);
>  bool aarch64_can_const_movi_rtx_p (rtx x, machine_mode mode);
>  bool aarch64_valid_fp_move (rtx, rtx, machine_mode);
> @@ -903,8 +904,10 @@ opt_machine_mode aarch64_advsimd_vector_array_mode
> (machine_mode,
>                                                      unsigned HOST_WIDE_INT);
>  opt_machine_mode aarch64_sve_data_mode (scalar_mode, poly_uint64);
>  bool aarch64_sve_mode_p (machine_mode);
> +bool aarch64_sve_full_data_mode_p (machine_mode);
>  HOST_WIDE_INT aarch64_fold_sve_cnt_pat (aarch64_svpattern, unsigned int);
>  bool aarch64_sve_cnt_immediate_p (rtx);
> +int aarch64_count_pred_pat_128 (rtx, machine_mode);
>  bool aarch64_sve_scalar_inc_dec_immediate_p (rtx);
>  bool aarch64_sve_rdvl_immediate_p (rtx);
>  bool aarch64_sve_addvl_addpl_immediate_p (rtx);
> @@ -1026,6 +1029,7 @@ rtx aarch64_ptrue_reg (machine_mode, unsigned int);
>  rtx aarch64_ptrue_reg (machine_mode, machine_mode);
>  rtx aarch64_pfalse_reg (machine_mode);
>  bool aarch64_sve_same_pred_for_ptest_p (rtx *, rtx *);
> +void aarch64_emit_load_store_through_mode (rtx, rtx, machine_mode);
>  void aarch64_emit_sve_pred_move (rtx, rtx, rtx);
>  void aarch64_expand_sve_mem_move (rtx, rtx, machine_mode);
>  bool aarch64_maybe_expand_sve_subreg_move (rtx, rtx);
> diff --git a/gcc/config/aarch64/aarch64-sve.md
> b/gcc/config/aarch64/aarch64-sve.md
> index d4af3706294..d9392e3611a 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -1286,7 +1286,36 @@
>  ;; -------------------------------------------------------------------------
>
>  ;; Predicated LD1 (single).
> -(define_insn "maskload<mode><vpred>"
> +(define_expand "maskload<mode><vpred>"
> +  [(set (match_operand:SVE_ALL 0 "register_operand")
> +        (unspec:SVE_ALL
> +          [(match_operand:<VPRED> 2 "aarch64_sve_reg_or_const_pred")
I don't think we need a new predicate for this.  nonmemory_operand
should be good enough.

> +           (match_operand:SVE_ALL 1 "memory_operand")
> +           (match_operand:SVE_ALL 3 "aarch64_maskload_else_operand")]
> +          UNSPEC_LD1_SVE))]
> +  "TARGET_SVE"
> +  {
> +    int pat_cnt = aarch64_count_pred_pat_128 (operands[2], <MODE>mode);
> +    int width = <elem_bits> * pat_cnt;
> +    if (aarch64_sve_full_data_mode_p (<MODE>mode)
> +        && pat_cnt && (pat_cnt == 1 || !BYTES_BIG_ENDIAN)
> +        && known_le (width, 128))
> +      {
> +        machine_mode mode = aarch64_simd_container_mode (<VEL>mode, width);

I'm not sure this needs to be so complex.  It should be enough to pick
an arbitrary 128-bit vector mode (say V16QI) for width==128, an
arbitrary 64-bit vector mode (say V8QI) for width==64, and an integer
mode for smaller widths (int_mode_for_size (...).require ()).

Experimenting locally, I assume you did the above to avoid ICEs while
forming the subregs.  But that's something we should fix in
aarch64_emit_load_store_through_mode.  Writing it as:

/* Emit a load/store from a subreg of SRC to a subreg of DEST.  The
   subregs have mode NEW_MODE.  Use only for reg<->mem moves.  */
void
aarch64_emit_load_store_through_mode (rtx dest, rtx src,
				      machine_mode new_mode)
{
  gcc_assert ((REG_P (src) && MEM_P (dest))
	      || (REG_P (dest) && MEM_P (src)));
  auto mode = GET_MODE (dest);
  auto int_mode = aarch64_sve_int_mode (mode);
  if (MEM_P (src))
    {
      rtx tmp = force_reg (new_mode, adjust_address (src, new_mode, 0));
      tmp = force_lowpart_subreg (int_mode, tmp, new_mode);
      emit_move_insn (dest, force_lowpart_subreg (mode, tmp, int_mode));
    }
  else
    {
      src = force_lowpart_subreg (int_mode, src, mode);
      emit_move_insn (adjust_address (dest, new_mode, 0),
		      force_lowpart_subreg (new_mode, src, int_mode));
    }
}

should cope with the general case, and means that the new test has no
SVE loads and stores.

> +        if (mode != VOIDmode)
> +          {
> +            aarch64_emit_load_store_through_mode (operands[0],
> +                                                  operands[1], mode);
> +            DONE;
> +          }
> +      }
> +    if (!REG_P (operands[2]))

Probably better to use if (CONSTANT_P (...)) or
if (!register_operand (...)), so that we don't create a new register
if operands[2] is a subreg.

> +      operands[2] = force_reg (<VPRED>mode, operands[2]);
> +  }
> +)
> +
> +;; Predicated LD1 (single).
> +(define_insn "*aarch64_maskload<mode><vpred>"
>    [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
>          (unspec:SVE_ALL
>            [(match_operand:<VPRED> 2 "register_operand" "Upl")
> @@ -2287,7 +2316,36 @@
>  ;; -------------------------------------------------------------------------
>
>  ;; Predicated ST1 (single).
> -(define_insn "maskstore<mode><vpred>"
> +(define_expand "maskstore<mode><vpred>"
> +  [(set (match_operand:SVE_ALL 0 "memory_operand")
> +        (unspec:SVE_ALL
> +          [(match_operand:<VPRED> 2 "aarch64_sve_reg_or_const_pred")
> +           (match_operand:SVE_ALL 1 "register_operand")
> +           (match_dup 0)]
> +          UNSPEC_ST1_SVE))]
> +  "TARGET_SVE"
> +  {
> +    int pat_cnt = aarch64_count_pred_pat_128 (operands[2], <MODE>mode);
> +    int width = <elem_bits> * pat_cnt;
> +    if (aarch64_sve_full_data_mode_p (<MODE>mode)
> +        && pat_cnt && (pat_cnt == 1 || !BYTES_BIG_ENDIAN)
> +        && known_le (width, 128))
> +      {
> +        machine_mode mode = aarch64_simd_container_mode (<VEL>mode, width);
> +        if (mode != VOIDmode)
> +          {
> +            aarch64_emit_load_store_through_mode (operands[0],
> +                                                  operands[1], mode);
> +            DONE;
> +          }

This seems a bit too complex for cut-&-paste.  Could you put it in a
common aarch64.cc helper?
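For concreteness, here is an (untested) sketch of what such a helper in
aarch64.cc could look like, using the simpler mode choice above and the
rtx overload of aarch64_partial_ptrue_length suggested below.  The
function name is just a placeholder:

/* Sketch only: try to expand the maskload/maskstore described by
   OPERANDS (with SVE vector mode MODE) as an Advanced SIMD access
   of at most 128 bits.  Return true on success.  */
bool
aarch64_expand_maskloadstore (rtx *operands, machine_mode mode)
{
  if (!aarch64_sve_full_data_mode_p (mode)
      || !CONST_VECTOR_P (operands[2]))
    return false;
  int vl = aarch64_partial_ptrue_length (operands[2]);
  if (vl == 0 || !pow2p_hwi (vl) || (vl > 1 && BYTES_BIG_ENDIAN))
    return false;
  unsigned int width = vl * GET_MODE_UNIT_BITSIZE (mode);
  if (width > 128)
    return false;
  machine_mode new_mode;
  if (width == 128)
    new_mode = V16QImode;
  else if (width == 64)
    new_mode = V8QImode;
  else
    new_mode = int_mode_for_size (width, 0).require ();
  aarch64_emit_load_store_through_mode (operands[0], operands[1], new_mode);
  return true;
}

so that both define_expands can just do something like

    if (aarch64_expand_maskloadstore (operands, <MODE>mode))
      DONE;

before falling back to forcing the predicate into a register.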
> +      }
> +    if (!REG_P (operands[2]))
> +      operands[2] = force_reg (<VPRED>mode, operands[2]);
> +  }
> +)
> +
> +;; Predicated ST1 (single).
> +(define_insn "*aarch64_maskstore<mode><vpred>"
>    [(set (match_operand:SVE_ALL 0 "memory_operand" "+m")
>          (unspec:SVE_ALL
>            [(match_operand:<VPRED> 2 "register_operand" "Upl")
> [...]
> @@ -23526,6 +23602,26 @@ aarch64_simd_valid_imm (rtx op, simd_immediate_info
> *info,
>    return false;
>  }
>
> +/* If PRED is a patterned SVE PTRUE predicate with patterns
> +   VL1, VL2, VL4, VL8, or VL16, return the number of active lanes
> +   for the mode MODE.  Else return 0.  */
> +int
> +aarch64_count_pred_pat_128 (rtx pred, machine_mode mode)
> +{
> +  struct simd_immediate_info info;
> +  bool is_valid;
> +  is_valid = aarch64_simd_valid_imm (pred, &info, AARCH64_CHECK_MOV);
> +  if (!is_valid || info.insn != simd_immediate_info::PTRUE)
> +    return 0;
> +  aarch64_svpattern pattern = info.u.pattern;
> +  unsigned int cnt
> +    = aarch64_fold_sve_cnt_pat (pattern, 128 / GET_MODE_UNIT_BITSIZE (mode));
> +  if (pattern <= AARCH64_SV_VL16 && pow2p_hwi (cnt))
> +    return cnt;
> +  else
> +    return 0;
> +}
> +

It would be better to avoid the round-trip through
aarch64_simd_valid_imm, since the caller doesn't really care whether
the length is a valid ptrue operand.  How about instead:

int
aarch64_partial_ptrue_length (rtx pred)
{
  rtx_vector_builder builder;
  if (!aarch64_get_sve_pred_bits (builder, pred))
    return 0;

  auto elt_size = vector_element_size (GET_MODE_BITSIZE (GET_MODE (pred)),
				       GET_MODE_NUNITS (GET_MODE (pred)));
  return aarch64_partial_ptrue_length (builder, elt_size);
}

This needs an extra change to aarch64_partial_ptrue_length to handle
the VL1 case.  (I remember the VL1 case being discussed recently in a
different context.)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index fff8d9da49d..1c4194ce883 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -3667,6 +3674,14 @@ aarch64_partial_ptrue_length (rtx_vector_builder &builder,
   if (builder.nelts_per_pattern () == 3)
     return 0;
 
+  /* It is conservatively correct to drop the element size to a lower value,
+     and we must do so if the predicate consists of a leading "foreground"
+     sequence that is smaller than the element size.  Without this,
+     we would test only one bit and so treat everything as either an
+     all-true or an all-false predicate.  */
+  if (builder.nelts_per_pattern () == 2)
+    elt_size = MIN (elt_size, builder.npatterns ());
+
   /* Skip over leading set bits.  */
   unsigned int nelts = builder.encoded_nelts ();
   unsigned int i = 0;

I haven't bootstrapped & regression-tested that yet, but I'll give it
a go.

On the tests: it would be good to have a second testcase for things
that can't use the new path, such as non-power-of-2 ptrue VLs, or
passing svptrue_b16() to an 8-bit load.  (Unless that's already
covered elsewhere.)

Thanks,
Richard
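P.S. For concreteness, the kind of second test I mean might look
something like the following.  This is only a sketch, not something
I've run, and the functions and dg-final patterns are guesses:

/* { dg-do compile } */
/* { dg-options "-O2" } */

#include <arm_sve.h>

/* Three active lanes: not a power of two, so the new path should not
   apply and we should keep the predicated SVE load.  */
svuint8_t
load_vl3 (uint8_t *x)
{
  return svld1 (svwhilelt_b8 (0, 3), x);
}

/* A .h ptrue with an 8-bit load selects every other byte, so it cannot
   become a contiguous LDR either.  */
svuint8_t
load_b16_pred (uint8_t *x)
{
  return svld1_u8 (svptrue_b16 (), x);
}

/* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
/* { dg-final { scan-assembler-not {\tldr\tq[0-9]+} } } */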