On 12/11/2025 10:36, Richard Sandiford wrote:
As a general comment, we do already support constructors of variable-length vectors, thanks to Tejas's earlier ACLE work. For example: https://godbolt.org/z/Eqd8sM4cv . The quality of the output is awful, but it does look correct.
Thanks, Richard! That does look inefficient, although seemingly no worse than the code currently generated by my patch series: it shows the same pattern of zeroing an SVE register, storing it to the stack, modifying it, reloading it, storing it to another stack location, modifying it, reloading it, and so on.
I guess you mean commit 17b520a10cdaabbcbaeaf30fc5228986b368ce0c. The function that Tejas modified was
  tree build_vector_from_ctor (tree type, const vec<constructor_elt, va_gc> *v);

in tree.cc ("the low level primitives for operating on tree nodes"). It declares a local tree_vector_builder instead of requiring one to be passed in by its caller, as gimple_build_vector does:

  inline tree gimple_build_vector (gimple_seq *seq, tree_vector_builder *builder)

I'll take a closer look at whether that approach can be adapted for BB SLP, since it might avoid the need for my current constant_lower_bound change.
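For my own reference, the difference between the two interfaces, as I understand it, looks roughly like the sketch below. This is an untested usage-level sketch only: the wrapper names and parameters are mine, it assumes the usual GCC include preamble (plus tree.h, tree-vector-builder.h and gimple-fold.h), and it only covers the fixed-length case; for VLA types the right npatterns/nelts_per_pattern choice is exactly what is under discussion in this thread.

  /* Caller-managed builder, as vect_create_constant_vectors uses today for
     fixed-length vectors: the caller decides how many elements the builder
     describes before handing it to gimple_build_vector.  */
  static tree
  build_via_gimple_build_vector (gimple_seq *seq, tree vector_type,
				 const vec<tree> &scalar_elts)
  {
    /* Fixed-length case only: one element per subpart.  */
    unsigned int nelts = TYPE_VECTOR_SUBPARTS (vector_type).to_constant ();
    gcc_assert (scalar_elts.length () == nelts);
    tree_vector_builder elts (vector_type, nelts, 1);
    for (unsigned int i = 0; i < nelts; ++i)
      elts.quick_push (scalar_elts[i]);
    return gimple_build_vector (seq, &elts);
  }

  /* Builder managed internally, as build_vector_from_ctor does: the caller
     only supplies the CONSTRUCTOR elements.  */
  static tree
  build_via_build_vector_from_ctor (tree vector_type, tree ctor)
  {
    return build_vector_from_ctor (vector_type, CONSTRUCTOR_ELTS (ctor));
  }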
Sorry if that was already common ground -- just mentioning it because I didn't see it in this subthread.

Richard Biener <[email protected]> writes:

On Fri, 7 Nov 2025, Christopher Bazley wrote:

On 07/11/2025 13:35, Richard Biener wrote:

On Wed, 5 Nov 2025, Christopher Bazley wrote:

On 28/10/2025 13:51, Richard Biener wrote:

On Tue, 28 Oct 2025, Christopher Bazley wrote:

vect_create_constant_vectors is updated to pad with zeros between the end of a group and the end of a vector of the type chosen for the SLP node, when used for BB SLP. This function calls gimple_build_vector, which also has to be updated for SVE vector types (by using the lower bound as the number of elements, e.g., 16 for VNx16QI).

---
 gcc/gimple-fold.cc   |  2 +-
 gcc/tree-vect-slp.cc | 43 +++++++++++++++++++++++++++++++++++--------
 2 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
index edcc04adc08..e5fe0ea12a7 100644
--- a/gcc/gimple-fold.cc
+++ b/gcc/gimple-fold.cc
@@ -11275,7 +11275,7 @@ gimple_build_vector (gimple_stmt_iterator *gsi,
 {
   gimple_seq seq = NULL;
   tree type = builder->type ();
-  unsigned int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+  unsigned int nelts = constant_lower_bound (TYPE_VECTOR_SUBPARTS (type));

I don't think this is desirable in a generic helper? How does the 'builder' class deal with the all-constant case? It seems handling for constant vs. non-constant will now differ semantically (instead of ICEing in one case previously).

This was the most minimal change I could make to get the feature working (whilst debugging many other issues) and it seemed harmless to me, so I didn't spend much time thinking about it. I know very little about the builder, but my understanding is that it would behave as though elements beyond the lower bound do not exist. e.g., if the vector type is VNx16QI then TREE_CONSTANT would return true for the CONSTRUCTOR node created by build_constructor if elements 0..15 are constant. This is presumably not safe general-purpose behaviour, because it would leave any other elements uninitialised (which does not matter for my use-case). I have no objection to trying to solve this elsewhere (probably in vect_create_constant_vectors) but I'll first need to revert this change and remind myself what breaks.

Fixing this upthread would be definitely better. Not sure exactly how. Alternatively the change could be done in a way to assert that the tree_vector_builder has less than or exactly the same number of elements as constant_lower_bound of nelts. I don't exactly remember what the builder tracks here and what constraints for initialization of VLA vectors are.

I've done some further investigation. One of the tests that failed without my change to gimple_build_vector was gcc.target/aarch64/sve/slp_6.c. I made that change to enable building of the following constant (among others):

  _70 = {_85, _21, _55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

That constant has only 16 elements although the type of _70 is vector([16,16]) unsigned char:

void vec_slp_int8_t (int8_t * restrict a, int8_t * restrict b, int n)
{
  ...
  vector([16,16]) signed char vect_x0_43.58;
  vector([16,16]) signed char vect__90.57;
  vector([16,16]) unsigned char vect__89.56;
  vector([16,16]) unsigned char vect__87.55;
  vector([16,16]) signed char vect_x0_26.54;
  vector([16,16]) signed char vect_x0_34.47;
  ...
  vector([16,16]) signed char vect_x1_35.41;
  ...
  vector([16,16]) signed char vect_x2_36.35;
  ...
  void * _8;
  vector([16,16]) signed char[3] * _9;
  ...
  unsigned char _21;
  vector([16,16]) unsigned char _22;
  unsigned char _55;
  vector([16,16]) unsigned char _56;
  ...
  vector([16,16]) unsigned char _70;
  vector([16,16]) unsigned char _84;
  unsigned char _85;
  ...

  <bb 5> [local count: 105119324]:
  _84 = (vector([16,16]) unsigned char) vect_x0_34.47_82;
  _85 = .REDUC_PLUS (_84);
  _22 = (vector([16,16]) unsigned char) vect_x1_35.41_38;
  _21 = .REDUC_PLUS (_22);
  _56 = (vector([16,16]) unsigned char) vect_x2_36.35_58;
  _55 = .REDUC_PLUS (_56);
  _70 = {_85, _21, _55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
  vect__89.56_3 = vect__87.55_2 + _70;
  vect__90.57_4 = VIEW_CONVERT_EXPR<vector([16,16]) signed char>(vect__89.56_3);

  <bb 6> [local count: 118111600]:
  # vect_x0_43.58_5 = PHI <vect__90.57_4(5), vect_x0_26.54_1(8)>
  .MASK_STORE (b_25(D), 8B, { -1, -1, -1, 0, 0, 0, 0, 0, ... }, vect_x0_43.58_5); [tail call]
  return;

The compiled code looks correct, although movi d0, #0 only zeros the first 16 bytes of the variable-length vector constant:

Architecturally, it zeros the whole register, including the "upper" SVE-only bits.
Yes, you're right: https://developer.arm.com/documentation/ddi0602/2025-09/Shared-Pseudocode/shared-functions-registers?lang=en#impl-shared.V.write.2

I wasn't previously aware of that behaviour.
But what mode is the move done in? Does the RTL move insn use a 16-byte mode or a VL-byte mode?
It seems to use a VL-byte mode:

        movi    d0, #0  //      // 41  [c=4 l=4]  *aarch64_sve_movvnx16qi_ldr_str/3
        st1b    z0.b, p6, [sp, #2, mul vl]      //, tmp141,     // 46  [c=8 l=4]  aarch64_pred_movvnx16qi/2

The relevant RTL is:

(insn 41 94 46 (set (reg:VNx16QI 32 v0)
        (const_vector:VNx16QI repeat [
                (const_int 0 [0])
            ])) 5577 {*aarch64_sve_movvnx16qi_ldr_str}
     (nil))
(insn:TI 46 41 43 (set (mem/c:VNx16QI (plus:DI (reg/f:DI 31 sp)
                (const_poly_int:DI [32, 32])) [0 S[16, 16] A128])
        (unspec:VNx16QI [
                (reg:VNx16BI 74 p6 [141])
                (reg:VNx16QI 32 v0)
            ] UNSPEC_PRED_X)) 5612 {aarch64_pred_movvnx16qi}
     (expr_list:REG_DEAD (reg:VNx16QI 32 v0)
        (nil)))
My understanding is that the RTL generated by my patchset guarantees zero-initialisation of the whole SVE register, at least for the use-case I intend.
I suspect it exercises this code path in aarch64_output_sve_mov_immediate:

      if (info.u.mov.value == const0_rtx && TARGET_NON_STREAMING)
	snprintf (templ, sizeof (templ), "movi\t%%d0, #0");

which was changed by Tamar last year (to use movi d0, #0).
        addvl   x0, sp, #2
        movi    d0, #0
        st1b    z0.b, p6, [sp, #2, mul vl]
        uaddv   d27, p6, z27.b
        uaddv   d26, p6, z26.b
        uaddv   d25, p6, z25.b
        str     b27, [x0]
        addvl   x0, sp, #1
        add     x0, x0, 1
        ptrue   p7.b, vl3
        ld1b    z0.b, p6/z, [sp, #2, mul vl]
        st1b    z0.b, p6, [sp, #1, mul vl]
        str     b26, [x0]
        ld1b    z0.b, p6/z, [sp, #1, mul vl]
        st1b    z0.b, p6, [sp]
        str     b25, [sp, 2]
        ld1b    z0.b, p6/z, [sp]
        add     z28.b, z0.b, z28.b
        st1b    z28.b, p7, [x1]
        addvl   sp, sp, #3
        .cfi_def_cfa_offset 0
        ret

(This code has already been noted to be inefficient, which I plan to address separately.)

The decision about how many bytes to zero is made in the calling function, vect_create_constant_vectors (which also uses constant_lower_bound), rather than in gimple_build_vector:

	  unsigned int elt_count = group_size;
	  if (is_a<bb_vec_info> (vinfo))
	    {
	      /* We don't use duplicate_and_interleave for basic block
		 vectorization.  We know that the group size fits within a
		 single vector, so all we need to do for VLA is to pad the
		 constant to the minimum vector length.  */
	      nunits = constant_lower_bound (TYPE_VECTOR_SUBPARTS (vector_type));
	      elt_count = MAX (nunits, group_size);
	    }

My current understanding is that you don't object to this part of my change. Whatever happens in gimple_build_vector won't alter the fact that only the minimum number of bytes are zeroed, and in most cases that's the desirable outcome. I therefore plan to keep my modification to gimple_build_vector, but add an assertion that

  builder->encoded_nelts () <= constant_lower_bound (TYPE_VECTOR_SUBPARTS (builder->type ()))

so that the modified function never builds fewer elements than expected when one of them is non-constant. Would that be OK?

I'm not sure builder->encoded_nelts () is the correct thing to check here. In particular any stepped encoding should be rejected as well, so nelts_per_pattern () must be <= 2. And even then the interpretation is then to fill with the last value IIRC, and as you get zero-filling with building a "short" CTOR that last element should be a zero. I'm not sure how to get at the "last" value, but I think that given we create a "short" CTOR we need to check that all remaining elements of the VLA vector are actually encoded as zeros?
Does that mean the tree_vector_builder initialised by vect_create_constant_vectors could need an extra zero at the end?
Agreed. The only valid situations seem to be:
(1) a duplicate of a single zero, where:
npatterns == nelts_per_pattern == encoded_nelts == 1
and the only encoded value is zero
(2) the combination of:
- nelts_per_pattern == 2
- multiple_p (TYPE_VECTOR_SUBPARTS (type), npatterns)
- the second half of the encoded elements are all zeros
But these combinations would not come about by chance. The caller
would have to take steps to ensure that they're true. So rather
than check for these relatively complex conditions, it might
be clearer to add a new gimple_build interface that explicitly
fills with zeros, using a normal array (instead of a
tree_vector_builder) for the explicitly-initialised elements.
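For concreteness, my reading of those two situations as a check on the builder would be something like the sketch below. It is untested, the function name is mine, and it relies on my understanding of the vector_builder accessors (npatterns, nelts_per_pattern, encoded_nelts).

  /* Untested sketch: return true if BUILDER's implicit extension to the
     full (possibly variable) vector length is guaranteed to be zeros.  */
  static bool
  builder_zero_extends_p (const tree_vector_builder &builder)
  {
    /* (1) A duplicate of a single zero: npatterns == nelts_per_pattern
       == encoded_nelts == 1 and the only encoded value is zero.  */
    if (builder.encoded_nelts () == 1)
      return integer_zerop (builder[0]);
    /* (2) Two elements per pattern, with the number of patterns dividing
       the number of subparts and the second half of the encoded elements
       (the values that get repeated) all zero.  */
    if (builder.nelts_per_pattern () != 2
	|| !multiple_p (TYPE_VECTOR_SUBPARTS (builder.type ()),
			builder.npatterns ()))
      return false;
    for (unsigned int i = builder.npatterns ();
	 i < builder.encoded_nelts (); ++i)
      if (!integer_zerop (builder[i]))
	return false;
    return true;
  }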
Would a new gimple_build_*_with_zeros function remove the need for vect_create_constant_vectors to pad with zeros at all?
The design of vect_create_constant_vectors seems to be heavily built around use of a tree_vector_builder. I'm a bit reluctant to do anything that would require significant refactoring of vect_create_constant_vectors, or that would require this seemingly rather ordinary case to be treated specially.
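For the record, the shape of interface I understand you to be suggesting is something like the following. This is hypothetical and untested, the name and signature are mine, and it ignores the gimplification of non-constant elements that the real gimple_build_* helpers perform; it simply mirrors the zero padding that vect_create_constant_vectors already does, but with the explicitly-initialised elements supplied as a plain array.

  /* Hypothetical sketch: build a CONSTRUCTOR for vector type TYPE whose
     first NELTS elements come from ELTS and whose remaining elements, up
     to the minimum number of subparts, are explicit zeros.  */
  static tree
  build_vector_ctor_with_zeros (tree type, tree *elts, unsigned int nelts)
  {
    unsigned int min_subparts
      = constant_lower_bound (TYPE_VECTOR_SUBPARTS (type));
    gcc_assert (nelts <= min_subparts);
    vec<constructor_elt, va_gc> *v = NULL;
    vec_alloc (v, min_subparts);
    for (unsigned int i = 0; i < min_subparts; ++i)
      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
			      i < nelts ? elts[i]
					: build_zero_cst (TREE_TYPE (type)));
    return build_constructor (type, v);
  }

I could look into something along those lines if you think it is worth the churn.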
--
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/
