Jennifer Schmitz <jschm...@nvidia.com> writes:
> About the tests: Non-power-of-2 patterns are already being tested in
> gcc.target/aarch64/sve/acle/general/whilelt_5.c.

OK

> For the case of svptrue_b16 () with 8-bit load, I added a test case for it.
> Currently, it has a single test case, but if necessary I can add more tests
> for other data types and for stores as well.

Nah, that should be fine.

> I bootstrapped and tested and the check-function-bodies test in
> gcc.target/aarch64/pr117557.c is currently failing.
> The current GCC trunk produces:
> f:
>         add     x3, x1, 8
>         cmp     x2, x3
>         add     x3, x2, 60
>         ccmp    x1, x3, 2, cc
>         bcc     .L2
>         ptrue   p7.h, vl8
>         index   z31.s, #0, #8
>         ld1b    z0.h, p7/z, [x1]
>         punpklo p6.h, p7.b
>         lsl     z0.h, z0.h, #2
>         punpkhi p7.h, p7.b
>         uunpklo z1.s, z0.h
>         uunpkhi z0.s, z0.h
>         ld1w    z29.s, p6/z, [x0, z1.s, sxtw]
>         ld1w    z30.s, p7/z, [x0, z0.s, sxtw]
>         st1w    z29.s, p6, [x2, z31.s, sxtw]
>         incb    x2, all, mul #2
>         st1w    z30.s, p7, [x2, z31.s, sxtw]
>         ret
> …
>
> WITH my patch it produces:
> f:
>         add     x3, x1, 8
>         cmp     x2, x3
>         add     x3, x2, 60
>         ccmp    x1, x3, 2, cc
>         bcc     .L2
>         ldr     d31, [x1]
>         addvl   sp, sp, #-1
>         ptrue   p5.b, all
>         addpl   x1, sp, #4
>         ptrue   p7.h, vl8
>         punpklo p6.h, p7.b
>         punpkhi p7.h, p7.b
>         str     d31, [x1]
>         ld1b    z31.h, p5/z, [sp, #1, mul vl]
>         lsl     z31.h, z31.h, #2
>         uunpklo z30.s, z31.h
>         uunpkhi z31.s, z31.h
>         ld1w    z29.s, p6/z, [x0, z30.s, sxtw]
>         ld1w    z30.s, p7/z, [x0, z31.s, sxtw]
>         index   z31.s, #0, #8
>         st1w    z29.s, p6, [x2, z31.s, sxtw]
>         incb    x2, all, mul #2
>         st1w    z30.s, p7, [x2, z31.s, sxtw]
>         addvl   sp, sp, #1
>         ret
> …
>
> The patch seems to fold the ptrue p7.h, vl8 and ld1b z0.h, p7/z, [x1] to an
> LDR, but because the ptrue predicate in p7 is needed again later, this is
> not really an optimization. Could we prevent the fold somehow in case the
> predicate is used multiple times?

Using LDR and STR is still better in many cases, because it has one fewer
register to rename and because there's the possibility that the compiler
might form LDPs and STPs.

> [...]
> @@ -3698,6 +3706,19 @@ aarch64_partial_ptrue_length (rtx_vector_builder &builder,
>      return vl;
>  }
>  
> +/* For predicate PRED, return the number of active lanes.  */

That's not really what the function does.  How about:

/* Return:
   * -1 if all bits of PRED are set
   * N if PRED has N leading set bits followed by all clear bits
   * 0 if PRED does not have any of these forms.  */
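FWIW, to make the three cases concrete, here's a rough standalone model of
that contract.  It's only an illustration over a plain bit vector rather
than the rtx-based implementation, and the name partial_ptrue_length_model
is made up:

  #include <cstddef>
  #include <cstdio>
  #include <vector>

  /* Model of the comment above: return -1 if all bits are set, N if there
     are N leading set bits followed by only clear bits, and 0 for any
     other form.  */
  static int
  partial_ptrue_length_model (const std::vector<bool> &bits)
  {
    std::size_t n = bits.size ();
    std::size_t leading = 0;
    while (leading < n && bits[leading])
      leading++;
    if (leading == n)
      return -1;		/* All bits set.  */
    for (std::size_t i = leading; i < n; i++)
      if (bits[i])
        return 0;		/* A set bit after a clear bit: no match.  */
    return (int) leading;	/* N leading set bits, rest clear.  */
  }

  int
  main ()
  {
    printf ("%d\n", partial_ptrue_length_model ({true, true, true, true}));   /* -1 */
    printf ("%d\n", partial_ptrue_length_model ({true, true, false, false})); /* 2 */
    printf ("%d\n", partial_ptrue_length_model ({true, false, true, false})); /* 0 */
  }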
> +int
> +aarch64_partial_ptrue_length (rtx pred)
> +{
> +  rtx_vector_builder builder;
> +  if (!aarch64_get_sve_pred_bits (builder, pred))
> +    return 0;
> +
> +  auto elt_size = vector_element_size (GET_MODE_BITSIZE (GET_MODE (pred)),
> +                                       GET_MODE_NUNITS (GET_MODE (pred)));
> +  return aarch64_partial_ptrue_length (builder, elt_size);
> +}
> +
>  /* See if there is an svpattern that encodes an SVE predicate of mode
>     PRED_MODE in which the first VL bits are set and the rest are clear.
>     Return the pattern if so, otherwise return AARCH64_NUM_SVPATTERNS.
> @@ -6410,8 +6431,32 @@ aarch64_stack_protect_canary_mem (machine_mode mode, rtx decl_rtl,
>    return gen_rtx_MEM (mode, force_reg (Pmode, addr));
>  }
>  
> -/* Emit an SVE predicated move from SRC to DEST.  PRED is a predicate
> -   that is known to contain PTRUE.  */
> +/* Emit a load/store from a subreg of SRC to a subreg of DEST.
> +   The subregs have mode NEW_MODE.  Use only for reg<->mem moves.  */
> +void
> +aarch64_emit_load_store_through_mode (rtx dest, rtx src, machine_mode new_mode)
> +{
> +  gcc_assert ((REG_P (src) && MEM_P (dest))
> +              || (REG_P (dest) && MEM_P (src)));

We should allow subregs too, so how about:

  gcc_assert ((MEM_P (dest) && register_operand (src, VOIDmode))
              || (MEM_P (src) && register_operand (dest, VOIDmode)));

> +  auto mode = GET_MODE (dest);
> +  auto int_mode = aarch64_sve_int_mode (mode);
> +  if (MEM_P (src))
> +    {
> +      rtx tmp = force_reg (new_mode, adjust_address (src, new_mode, 0));
> +      tmp = force_lowpart_subreg (int_mode, tmp, new_mode);
> +      emit_move_insn (dest, force_lowpart_subreg (mode, tmp, int_mode));
> +    }
> +  else
> +    {
> +      src = force_lowpart_subreg (int_mode, src, mode);
> +      emit_move_insn (adjust_address (dest, new_mode, 0),
> +                      force_lowpart_subreg (new_mode, src, int_mode));
> +    }
> +}
> +
> +/* PRED is a predicate that is known to contain PTRUE.
> +   For 128-bit VLS loads/stores, emit LDR/STR.
> +   Else, emit an SVE predicated move from SRC to DEST.  */
>  
>  void
>  aarch64_emit_sve_pred_move (rtx dest, rtx pred, rtx src)
> @@ -6421,16 +6466,7 @@ aarch64_emit_sve_pred_move (rtx dest, rtx pred, rtx src)
>        && known_eq (GET_MODE_SIZE (mode), 16)
>        && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA
>        && !BYTES_BIG_ENDIAN)
> -    {
> -      if (MEM_P (src))
> -        {
> -          rtx tmp = force_reg (V16QImode, adjust_address (src, V16QImode, 0));
> -          emit_move_insn (dest, lowpart_subreg (mode, tmp, V16QImode));
> -        }
> -      else
> -        emit_move_insn (adjust_address (dest, V16QImode, 0),
> -                        force_lowpart_subreg (V16QImode, src, mode));
> -    }
> +    aarch64_emit_load_store_through_mode (dest, src, V16QImode);
>    else
>      {
>        expand_operand ops[3];
> @@ -23526,6 +23562,31 @@ aarch64_simd_valid_imm (rtx op, simd_immediate_info *info,
>    return false;
>  }
>  
> +/* If the predicate in operands[2] is a patterned SVE PTRUE predicate
> +   with patterns VL1, VL2, VL4, VL8, or VL16 and at most the bottom
> +   128 bits are loaded/stored, emit an ASIMD load/store and return true.
> +   Else, return false.  */
> +bool
> +aarch64_sve_maskloadstore (rtx *operands, machine_mode mode)

How about s/sve_maskloadstore/expand_maskloadstore/?  Since the name of
the function is general either way, how about a function comment:

/* Try to optimize the expansion of a maskload or maskstore with
   the operands in OPERANDS, given that the vector being loaded or
   stored has mode MODE.  Return true on success or false if the
   normal expansion should be used.  */

Then we can describe the individual optimisations in the function body
(admittedly just one optimisation for now), using your comment above
with "and return true..." removed.

> +{
> +  int vl = aarch64_partial_ptrue_length (operands[2]);
> +  int width = vl * GET_MODE_UNIT_BITSIZE (mode);
> +  if (width <= 128 && pow2p_hwi (vl) && (vl == 1 || !BYTES_BIG_ENDIAN))

It looks like the check for fully-packed modes got dropped.  How about:

  if (width <= 128
      && pow2p_hwi (vl)
      && (vl == 1
          || (!BYTES_BIG_ENDIAN
              && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA)))
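FWIW, a rough standalone sketch of how that condition classifies the cases;
the names (use_asimd_load_store, fully_packed_sve_data) are just
illustrative placeholders, not anything in the patch:

  #include <cstdio>

  /* Model of the condition above: take the ASIMD LDR/STR path only when
     the VL active elements fit in the bottom 128 bits, VL is a power of
     two, and (for VL > 1) the target is little-endian and the mode is a
     fully-packed SVE data mode.  */
  static bool
  use_asimd_load_store (int vl, int unit_bits, bool big_endian,
                        bool fully_packed_sve_data)
  {
    int width = vl * unit_bits;
    bool pow2_vl = vl > 0 && (vl & (vl - 1)) == 0;
    return (width <= 128
            && pow2_vl
            && (vl == 1 || (!big_endian && fully_packed_sve_data)));
  }

  int
  main ()
  {
    /* VL8 over 16-bit elements: 8 * 16 = 128 bits, so the ASIMD path applies.  */
    printf ("%d\n", use_asimd_load_store (8, 16, false, true));
    /* VL3 is not a power of two, so the normal expansion is used.  */
    printf ("%d\n", use_asimd_load_store (3, 32, false, true));
    /* Big-endian with VL > 1 also falls back to the normal expansion.  */
    printf ("%d\n", use_asimd_load_store (4, 32, true, true));
  }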
Otherwise it looks good, thanks.

Richard