On Wed, 26 Nov 2025, Christopher Bazley wrote:

> 
> On 24/11/2025 13:14, Richard Biener wrote:
> > On Mon, 17 Nov 2025, Christopher Bazley wrote:
> >
> >> On 14/11/2025 11:15, Richard Sandiford wrote:
> >>> Christopher Bazley <[email protected]> writes:
> >>>>>> Agreed.  The only valid situations seem to be:
> >>>>>>
> >>>>>> (1) a duplicate of a single zero, where:
> >>>>>>
> >>>>>>        npatterns == nelts_per_pattern == encoded_nelts == 1
> >>>>>>
> >>>>>>        and the only encoded value is zero
> >>>>>>
> >>>>>> (2) the combination of:
> >>>>>>
> >>>>>>        - nelts_per_pattern == 2
> >>>>>>        - multiple_p (TYPE_VECTOR_SUBPARTS (type), npatterns)
> >>>>>>        - the second half of the encoded elements are all zeros
> >>>>>>
> >>>>>> But these combinations would not come about by chance.  The caller
> >>>>>> would have to take steps to ensure that they're true.  So rather
> >>>>>> than check for these relatively complex conditions, it  might
> >>>>>> be clearer to add a new gimple_build interface that explicitly
> >>>>>> fills with zeros, using a normal array (instead of a
> >>>>>> tree_vector_builder) for the explicitly-initialised elements.
> >>>>> Would a new gimple_build_*_with_zeros function remove the need for
> >>>>> vect_create_constant_vectors to pad with zeros at all?
> >>>>>
> >>>>> The design of vect_create_constant_vectors seems to be heavily built
> >>>>> around use of a tree_vector_builder. I'm a bit reluctant to do
> >>>>> anything that would require significant refactoring of
> >>>>> vect_create_constant_vectors, or that would require this seemingly
> >>>>> rather ordinary case to be treated specially.
> >>> The current code is built for the normal VLA loop case, where the
> >>> sequence of scalar constants needs to be repeated to fill a vector.
> >>> For example, in:
> >>>
> >>>     for (int i = 0; i < 100; ++i)
> >>>       {
> >>>         x[i*2] += 1;
> >>>         x[i*2 + 1] += 2;
> >>>       }
> >>>
> >>> we need { 1, 2, 1, 2, 1, 2, ... }.
> >>>
> >>> We can't do that filling explicitly at compile-time because we don't
> >>> know how many copies are needed -- that depends on the runtime vector
> >>> length.  So instead we use a tree_vector_builder that encodes { 1, 2 }
> >>> and says that the pattern needs to be repeated to fill a vector.
> >>>
> >>> This also works for fixed-length loop vectorisation because, in the
> >>> general case, filling is needed there too.  We could of course do the
> >>> filling explicitly at compile time, but it would be somewhat wasted
> >>> effort, since the resulting constant would be canonicalised back to
> >>> the "{ 1, 2 } repeating" encoding.
> >>>
> >>> If you want to do something different for BB SLP then I think it makes
> >>> sense that there is some difference in the way that the constant is
> >>> constructed.  It doesn't need to be a big difference.  tree_vector_builder
> >>> inherits from auto_vec, so it would be possible to create a new
> >>> gimple_build_* that takes a vec (or, better, an array_slice) and still
> >>> share the current tree_vector_builder code in
> >>> vect_create_constant_vectors.
> >> I have an alternative to my original solution now, which doesn't require
> >> modification of the gimple_build_vector function. Instead, I have added a
> >> new
> >> gimple_build_vector_with_zero_padding function. It:
> >>
> >> * Prepares a vector of constructor elements and find out whether all of the
> >> element values are constant.
> >>
> >> * If all element values are constant then it returns a new VECTOR_CST node.
> >> Any elements for which no value is supplied will be zero.
> >>
> >> * Otherwise, it builds a constructor for only those element values that the
> >> caller provided, then assigns the result of that constructor to a temporary
> >> object.
> > In GIMPLE a CONSTRUCTOR node has not mentioned elements zero-filled
> > auto-magically.  So iff you assume that the target can create a VLA
> > vector with a n-element prefix (with n <= lower_bound (nunits)) then
> > you shouldn't need to do anything special.
> 
> If it is sufficient to build a constructor comprising only the lower elements
> of known non-constant value, then the existing gimple_build_vector already
> does that with only the minor modification to use the lower bound that was in
> my original patch set.
> 
> Should I therefore delete the gimple_build_vector_with_zero_padding function?
> I only created it because I thought that was what you and Richard Sandiford
> wanted. He suggested '...add a new gimple_build interface that explicitly
> fills with zeros...', so that's what I did.

I think that, for non-constant vectors, you should simply build a
CTOR directly and not go via gimple_build_vector.

> > Of course this requires the targets vec_init to do the heavy lifting
> > as IIRC there's no fallback in RTL expansion for VLA vector CONSTRUCTORs.
> >
> > But if code quality is awful why would we want to do this at all?
> > In general I'd expect N operations to construct a vector with N
> > non-zero leading elements.
> >
> > Richard.
> 
> The generated code quality is much better since I made a change yesterday to
> handle variable-length vector types in store_constructor, which I have not yet
> pushed to the mailing list. Previously, that function used a fallback path of
> calling store_constructor_field upon discovering that the number of subparts
> in the vector type was not a constant multiple of the number of subparts in
> the element type.
> 
> For example, yesterday's change allows GCC to generate the following AArch64
> assembly language output for the tail of a reduction in the slp_6 test:
> 
>     uaddv d31, p6, z31.b
>     uaddv d27, p6, z27.b
>     uaddv d26, p6, z26.b
>     movi  d30, #0
>     insr  z30.b, b26
>     insr  z30.b, b27
>     insr  z30.b, b31
>     add   z25.b, z25.b, z30.b
> 
> instead of the following output (with predicated tails for basic block SLP
> vectorization but without this change):
> 
>     addvl  x0, sp, #2
>     movi   d0, #0
>     st1b   z0.b, p6, [sp, #2, mul vl]
>     uaddv  d27, p6, z27.b
>     uaddv  d26, p6, z26.b
>     uaddv  d25, p6, z25.b
>     str    b27, [x0]
>     addvl  x0, sp, #1
>     add    x0, x0, 1
>     ptrue  p7.b, vl3
>     ld1b   z0.b, p6/z, [sp, #2, mul vl]
>     st1b   z0.b, p6, [sp, #1, mul vl]
>     str    b26, [x0]
>     ld1b   z0.b, p6/z, [sp, #1, mul vl]
>     st1b   z0.b, p6, [sp]
>     str    b25, [sp, 2]
>     ld1b   z0.b, p6/z, [sp]
>     add    z28.b, z28.b, z0.b
>     st1b   z28.b, p7, [x1]
>     addvl  sp, sp, #3
> 
> or the original assembly language output (with neither predicated tails for
> basic block SLP vectorization nor this change):
> 
>     uaddv  d31, p6, z31.b
>     fmov   x0, d31
>     uaddv  d31, p6, z26.b
>     add    w6, w6, w0
>     fmov   x0, d31
>     uaddv  d31, p6, z27.b
>     add    w5, w5, w0
>     fmov   x0, d31
>     add    w4, w4, w0
> 
> >> * If there are no implicitly-zero trailing elements then it returns the
> >> value
> >> of the temporary object.
> >>
> >> * Otherwise, it builds a mask that exposes only those lanes of the
> >> destination
> >> vector type for which the caller provided values...
> >>
> >> * ... then copies unmasked elements from the temporary object to the
> >> destination vector and assigns zero to masked elements.
> >>
> >> (The temporary object is needed because a non-constant constructor isn’t
> >> valid
> >> for use in a ternary operation.)
> >>
> >> Unfortunately, this approach currently produces less efficient output, for
> >> example:
> >>
> >>     ptrue    p15.b, vl16 ; *** :o(
> >> ....
> >>     ptrue    p7.b, vl3
> >> ....
> >>     movi    d25, #0 ; *** :o(
> >> ....
> >>      ld1b    z1.b, p6/z, [sp]
> >>      sel    z25.b, p15, z1.b, z25.b ; *** :o(
> >>      add    z28.b, z28.b, z25.b
> >>      st1b    z28.b, p7, [x1]
> >>
> >>
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Reply via email to