On 24/11/2025 13:14, Richard Biener wrote:
On Mon, 17 Nov 2025, Christopher Bazley wrote:

On 14/11/2025 11:15, Richard Sandiford wrote:
Christopher Bazley <[email protected]> writes:
Agreed.  The only valid situations seem to be:

(1) a duplicate of a single zero, where:

       npatterns == nelts_per_pattern == encoded_nelts == 1

       and the only encoded value is zero

(2) the combination of:

       - nelts_per_pattern == 2
       - multiple_p (TYPE_VECTOR_SUBPARTS (type), npatterns)
       - the second half of the encoded elements are all zeros

But these combinations would not come about by chance.  The caller
would have to take steps to ensure that they're true.  So rather
than check for these relatively complex conditions, it might
be clearer to add a new gimple_build interface that explicitly
fills with zeros, using a normal array (instead of a
tree_vector_builder) for the explicitly-initialised elements.
Would a new gimple_build_*_with_zeros function remove the need for
vect_create_constant_vectors to pad with zeros at all?

The design of vect_create_constant_vectors seems to be heavily built
around use of a tree_vector_builder. I'm a bit reluctant to do
anything that would require significant refactoring of
vect_create_constant_vectors, or that would require this seemingly
rather ordinary case to be treated specially.
The current code is built for the normal VLA loop case, where the
sequence of scalar constants needs to be repeated to fill a vector.
For example, in:

    for (int i = 0; i < 100; ++i)
      {
        x[i*2] += 1;
        x[i*2 + 1] += 2;
      }

we need { 1, 2, 1, 2, 1, 2, ... }.

We can't do that filling explicitly at compile-time because we don't
know how many copies are needed -- that depends on the runtime vector
length.  So instead we use a tree_vector_builder that encodes { 1, 2 }
and says that the pattern needs to be repeated to fill a vector.

This also works for fixed-length loop vectorisation because, in the
general case, filling is needed there too.  We could of course do the
filling explicitly at compile time, but it would be somewhat wasted
effort, since the resulting constant would be canonicalised back to
the "{ 1, 2 } repeating" encoding.
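
As an illustration only, the repeating encoding can be modelled in a few
lines of standalone C++.  This is a sketch under stated assumptions, not
GCC's tree_vector_builder: the names encoded_vector, element_at and expand
are hypothetical, and only nelts_per_pattern values of 1 and 2 are modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone model of a pattern-encoded vector constant.
// It is not GCC's tree_vector_builder; it only mimics the encodings
// discussed in this thread (nelts_per_pattern of 1 or 2).
struct encoded_vector
{
  std::size_t npatterns;          // number of interleaved patterns
  std::size_t nelts_per_pattern;  // encoded elements per pattern (1 or 2)
  std::vector<int> encoded;       // npatterns * nelts_per_pattern values
};

// Return element I of the full vector described by ENC.  Elements beyond
// the encoded ones repeat the last encoded element of their pattern.
int
element_at (const encoded_vector &enc, std::size_t i)
{
  std::size_t pattern = i % enc.npatterns;
  std::size_t index = i / enc.npatterns;
  if (index >= enc.nelts_per_pattern)
    index = enc.nelts_per_pattern - 1;
  return enc.encoded[index * enc.npatterns + pattern];
}

// Expand ENC to NELTS elements, as would happen once the vector length
// is known.
std::vector<int>
expand (const encoded_vector &enc, std::size_t nelts)
{
  std::vector<int> result;
  for (std::size_t i = 0; i < nelts; ++i)
    result.push_back (element_at (enc, i));
  return result;
}
```

In this model, expand ({2, 1, {1, 2}}, 6) yields 1, 2, 1, 2, 1, 2, and
expand ({2, 2, {1, 2, 0, 0}}, 8) yields 1, 2 followed by six zeros, which
corresponds to situation (2) described earlier.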

If you want to do something different for BB SLP then I think it makes
sense that there is some difference in the way that the constant is
constructed.  It doesn't need to be a big difference.  tree_vector_builder
inherits from auto_vec, so it would be possible to create a new
gimple_build_* that takes a vec (or, better, an array_slice) and still
share the current tree_vector_builder code in vect_create_constant_vectors.
I have an alternative to my original solution now, which doesn't require
modification of the gimple_build_vector function. Instead, I have added a new
gimple_build_vector_with_zero_padding function. It:

* Prepares a vector of constructor elements and finds out whether all of the
element values are constant.

* If all element values are constant then it returns a new VECTOR_CST node.
Any elements for which no value is supplied will be zero.

* Otherwise, it builds a constructor for only those element values that the
caller provided, then assigns the result of that constructor to a temporary
object.
In GIMPLE, elements not mentioned in a CONSTRUCTOR node are zero-filled
auto-magically.  So iff you assume that the target can create a VLA
vector with an n-element prefix (with n <= lower_bound (nunits)) then
you shouldn't need to do anything special.

If it is sufficient to build a constructor comprising only the lower elements of known non-constant value, then the existing gimple_build_vector already does that, given only the minor modification from my original patch set that makes it use the lower bound.

Should I therefore delete the gimple_build_vector_with_zero_padding function? I only created it because I thought that was what you and Richard Sandiford wanted. He suggested '...add a new gimple_build interface that explicitly fills with zeros...', so that's what I did.

Of course this requires the target's vec_init to do the heavy lifting
as IIRC there's no fallback in RTL expansion for VLA vector CONSTRUCTORs.

But if code quality is awful why would we want to do this at all?
In general I'd expect N operations to construct a vector with N
non-zero leading elements.

Richard.

The generated code quality is much better since I made a change yesterday to handle variable-length vector types in store_constructor, which I have not yet pushed to the mailing list. Previously, that function used a fallback path of calling store_constructor_field upon discovering that the number of subparts in the vector type was not a constant multiple of the number of subparts in the element type.

For example, yesterday's change allows GCC to generate the following AArch64 assembly language output for the tail of a reduction in the slp_6 test:

    uaddv d31, p6, z31.b
    uaddv d27, p6, z27.b
    uaddv d26, p6, z26.b
    movi  d30, #0
    insr  z30.b, b26
    insr  z30.b, b27
    insr  z30.b, b31
    add   z25.b, z25.b, z30.b

instead of the following output (with predicated tails for basic block SLP vectorization but without this change):

    addvl  x0, sp, #2
    movi   d0, #0
    st1b   z0.b, p6, [sp, #2, mul vl]
    uaddv  d27, p6, z27.b
    uaddv  d26, p6, z26.b
    uaddv  d25, p6, z25.b
    str    b27, [x0]
    addvl  x0, sp, #1
    add    x0, x0, 1
    ptrue  p7.b, vl3
    ld1b   z0.b, p6/z, [sp, #2, mul vl]
    st1b   z0.b, p6, [sp, #1, mul vl]
    str    b26, [x0]
    ld1b   z0.b, p6/z, [sp, #1, mul vl]
    st1b   z0.b, p6, [sp]
    str    b25, [sp, 2]
    ld1b   z0.b, p6/z, [sp]
    add    z28.b, z28.b, z0.b
    st1b   z28.b, p7, [x1]
    addvl  sp, sp, #3

or the original assembly language output (with neither predicated tails for basic block SLP vectorization nor this change):

    uaddv  d31, p6, z31.b
    fmov   x0, d31
    uaddv  d31, p6, z26.b
    add    w6, w6, w0
    fmov   x0, d31
    uaddv  d31, p6, z27.b
    add    w5, w5, w0
    fmov   x0, d31
    add    w4, w4, w0
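
The shortest of the three sequences above zeroes a register and then uses
one INSR per leading element.  A rough standalone C++ model of that idea
follows.  It is a sketch only: insr and build_leading are hypothetical
helper names, and the model ignores element sizes and predication.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the SVE INSR instruction on a vector of lanes:
// every lane moves up by one position (the top lane is discarded) and
// SCALAR is written to lane 0.
void
insr (std::vector<int> &lanes, int scalar)
{
  if (lanes.empty ())
    return;
  for (std::size_t i = lanes.size () - 1; i > 0; --i)
    lanes[i] = lanes[i - 1];
  lanes[0] = scalar;
}

// Build a vector of NLANES lanes whose leading lanes are VALUES (in
// order) and whose remaining lanes are zero, using one insr per value.
std::vector<int>
build_leading (const std::vector<int> &values, std::size_t nlanes)
{
  std::vector<int> lanes (nlanes, 0);   // models "movi d30, #0"
  // Insert in reverse order so that values[0] ends up in lane 0.
  for (std::size_t i = values.size (); i > 0; --i)
    insr (lanes, values[i - 1]);
  return lanes;
}
```

In this model, build_leading ({10, 20, 30}, 8) gives 10, 20, 30 followed
by five zeros, using N insertions for N leading elements.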

* If there are no implicitly-zero trailing elements then it returns the value
of the temporary object.

* Otherwise, it builds a mask that exposes only those lanes of the destination
vector type for which the caller provided values...

* ... then copies unmasked elements from the temporary object to the
destination vector and assigns zero to masked elements.

(The temporary object is needed because a non-constant constructor isn’t valid
for use in a ternary operation.)
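
The mask-and-select steps above can be modelled roughly as follows.  This
is a standalone C++ sketch, not the GIMPLE that the new function emits:
select_with_zero_padding is a hypothetical name, and lanes of the temporary
beyond the provided ones are assumed to hold arbitrary values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the final select: lanes [0, nprovided) are taken
// from TEMP and every other lane becomes zero.
std::vector<int>
select_with_zero_padding (const std::vector<int> &temp,
                          std::size_t nprovided)
{
  // Mask that is true only for lanes the caller provided values for.
  std::vector<bool> mask (temp.size ());
  for (std::size_t i = 0; i < temp.size (); ++i)
    mask[i] = i < nprovided;

  // Lane-wise select between the temporary and a zero vector.
  std::vector<int> result (temp.size ());
  for (std::size_t i = 0; i < temp.size (); ++i)
    result[i] = mask[i] ? temp[i] : 0;
  return result;
}
```

For example, select_with_zero_padding ({7, 8, 9, 42, 42, 42, 42, 42}, 3)
gives 7, 8, 9 followed by five zeros in this model.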

Unfortunately, this approach currently produces less efficient output, for
example:

     ptrue    p15.b, vl16 ; *** :o(
....
     ptrue    p7.b, vl3
....
     movi    d25, #0 ; *** :o(
....
     ld1b    z1.b, p6/z, [sp]
     sel    z25.b, p15, z1.b, z25.b ; *** :o(
     add    z28.b, z28.b, z25.b
     st1b    z28.b, p7, [x1]


--
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/
