On 28/10/2025 13:29, Richard Biener wrote:
>> +/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
>> +   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
>> +   Masking is only required for the tail, therefore NULL_TREE is returned for
>> +   every value of INDEX except the last. Insert any set-up statements before
>> +   GSI. */
>
> I think it might happen that some vectors are fully masked, say for
> a conversion from double to int and V2DImode vs. V4SImode when we
> have 5 lanes the conversion likely expects 4 V2DImode inputs to
> produce 2 V4SImode outputs, but the 4th V2DImode input has no active
> lanes at all.
>
> But maybe you handle this situation differently, I'll see.
You hypothesise a conversion from 4 of V2DI = 8DI (8DI - 5DI = 3DI
inactive, and floor(3DI / 2DI)=1 of 2DI fully masked) to 2 of V4SI = 8SI
(8SI - 5SI = 3SI inactive and floor(3SI / 4SI)=0 of V4SI fully masked).
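Just to make that arithmetic concrete, here is a throwaway C program
(not GCC code; it assumes the fixed-width V2DI and V4SI shapes from
your example):

#include <stdio.h>

/* Count how many whole input/output vectors a DI -> SI conversion of
   GROUP_SIZE lanes would need, and how many of those vectors would
   have no active lanes at all. */
int main(void)
{
    const int group_size = 5;  /* scalar lanes */
    const int out_nunits = 4;  /* SI lanes per V4SI */
    const int in_nunits = 2;   /* DI lanes per V2DI */

    int out_vectors = (group_size + out_nunits - 1) / out_nunits; /* 2 */
    int total_lanes = out_vectors * out_nunits;                   /* 8 */
    int in_vectors = total_lanes / in_nunits;                     /* 4 */

    int inactive = total_lanes - group_size;                      /* 3 */
    int masked_inputs = inactive / in_nunits;                     /* 1 */
    int masked_outputs = inactive / out_nunits;                   /* 0 */

    printf("%d inputs (%d fully masked), %d outputs (%d fully masked)\n",
           in_vectors, masked_inputs, out_vectors, masked_outputs);
    return 0;
}

It prints "4 inputs (1 fully masked), 2 outputs (0 fully masked)",
matching your description.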
I don't think the "1 of 2DI is fully masked" case would ever happen,
though, because a group of 5DI would be split long before the vectoriser
attempts to materialise masks. The only way a group of 5DI could survive
that long is if the number of subparts of the natural vector type (the
one currently being tried by vect_slp_region) were at least 5, a factor
of 5, or both. No such vector type exists for 64-bit elements.
For example, consider this translation unit:
#include <stdint.h>

void convert(const uint64_t (*const di)[5], uint32_t (*const si)[5])
{
    (*si)[0] = (*di)[0];
    (*si)[1] = (*di)[1];
    (*si)[2] = (*di)[2];
    (*si)[3] = (*di)[3];
    (*si)[4] = (*di)[4];
}
This is compiled (with -O2 -ftree-vectorize -march=armv9-a+sve
--param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable) as:
convert:
.LFB0:
        .cfi_startproc
        ldp     q30, q31, [x0]    ; vector load the first four lanes
        ptrue   p7.d, vl2         ; enable two lanes for vector stores
        add     x2, x1, 8
        ldr     x0, [x0, 32]      ; load the fifth lane
        st1w    z30.d, p7, [x1]   ; store least-significant 32 bits of the first two lanes
        st1w    z31.d, p7, [x2]   ; store least-significant 32 bits of lanes 3 and 4
        str     w0, [x1, 16]      ; store least-significant 32 bits of the fifth lane
        ret
        .cfi_endproc
The slp2 dump shows:
note: Starting SLP discovery for
note: (*si_13(D))[0] = _2;
note: (*si_13(D))[1] = _4;
note: (*si_13(D))[2] = _6;
note: (*si_13(D))[3] = _8;
note: (*si_13(D))[4] = _10;
note: Created SLP node 0x4bd9e00
note: starting SLP discovery for node 0x4bd9e00
note: get vectype for scalar type (group size 5): uint32_t
note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring
group size 5): vector([4,4]) unsigned int
note: vectype: vector([4,4]) unsigned int
note: nunits = [4,4]
missed: Build SLP failed: unrolling required in basic block SLP
This fails the check in vect_record_nunits because the group size of 5
may be larger than the number of subparts of vector([4,4]) unsigned int
(which could be as low as 4) and 5 is never an integral multiple of [4,4].
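To make the divisibility argument concrete, here is a minimal sketch
(plain C, not GCC's poly_int machinery) of what that check has to
establish for a scalable type whose subpart count is 4 + 4*x for some
unknown x >= 0:

#include <stdio.h>
#include <stdbool.h>

/* A group of 5 may only survive if it is guaranteed either to fit in
   a single vector, or to be an exact multiple of the subpart count,
   for every possible value of the runtime VL multiplier x. */
int main(void)
{
    const int group_size = 5;
    bool always_fits = true, always_multiple = true;

    for (int x = 0; x <= 15; x++)  /* sample plausible VL multipliers */
    {
        int nunits = 4 + 4 * x;    /* subparts of vector([4,4]) */
        if (group_size > nunits)
            always_fits = false;
        if (group_size % nunits != 0)
            always_multiple = false;
    }

    printf("always fits: %d, always a multiple: %d\n",
           always_fits, always_multiple);  /* prints 0 and 0 */
    return 0;
}

Neither property holds, so the check fails.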
The vectoriser therefore splits the group of 5SI into 4SI + 1SI:
note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring
group size 4): vector([4,4]) unsigned int
note: Splitting SLP group at stmt 4
note: Split group into 4 and 1
note: Starting SLP discovery for
note: (*si_13(D))[0] = _2;
note: (*si_13(D))[1] = _4;
note: (*si_13(D))[2] = _6;
note: (*si_13(D))[3] = _8;
note: Created SLP node 0x4bd9ec0
note: starting SLP discovery for node 0x4bd9ec0
note: get vectype for scalar type (group size 4): uint32_t
note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring
group size 4): vector([4,4]) unsigned int
note: vectype: vector([4,4]) unsigned int
note: nunits = [4,4]
note: Build SLP for (*si_13(D))[0] = _2;
note: Build SLP for (*si_13(D))[1] = _4;
note: Build SLP for (*si_13(D))[2] = _6;
note: Build SLP for (*si_13(D))[3] = _8;
note: vect_is_simple_use: operand (unsigned int) _1, type of def: internal
note: vect_is_simple_use: operand (unsigned int) _3, type of def: internal
note: vect_is_simple_use: operand (unsigned int) _5, type of def: internal
note: vect_is_simple_use: operand (unsigned int) _7, type of def: internal
... which goes well until it looks at the 64-bit inputs:
note: Created SLP node 0x4bda040
note: starting SLP discovery for node 0x4bda040
note: get vectype for scalar type (group size 4): const uint64_t
note: get_vectype_for_scalar_type: natural type for const uint64_t
(ignoring group size 4): const vector([2,2]) long unsigned int
note: vectype: const vector([2,2]) long unsigned int
note: nunits = [2,2]
missed: Build SLP failed: unrolling required in basic block SLP
This fails the check in vect_record_nunits because the group size of 4
may be larger than the number of subparts of vector([2,2]) long unsigned
int (which could be as low as 2) and 4 is not necessarily an integral
multiple of [2,2] (e.g. the subpart count could be 2 + (2 * 3) = 8 if
the vectors are 512 bits wide).
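The contrast with the earlier case is that 4 *is* a multiple of the
subpart count for some vector lengths, just not for all of them. A small
sketch (again plain C, assuming the SVE rule that a vector([2,2]) type
of 64-bit elements has VL/64 subparts for a vector length of VL bits):

#include <stdio.h>

/* For each legal SVE vector length (a multiple of 128 bits), a
   vector([2,2]) long unsigned int has VL/64 subparts (equivalently
   2 + 2*x with x = VL/128 - 1).  Check whether a group of 4 divides
   evenly for every length. */
int main(void)
{
    const int group_size = 4;

    for (int vl_bits = 128; vl_bits <= 2048; vl_bits += 128)
    {
        int nunits = vl_bits / 64;
        printf("VL=%4d bits: nunits=%2d, %d %s a multiple\n",
               vl_bits, nunits, group_size,
               group_size % nunits == 0 ? "is" : "is not");
    }
    return 0;
}

Because the quotient is not the same constant for every VL, the
compile-time answer has to be "not necessarily a multiple", and the
check fails here too.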
The vectoriser doesn't give up though. Instead, it falls back to scalars
for the external node representing the 64-bit inputs:
note: Build SLP for _1 = (*di_12(D))[0];
note: Build SLP for _3 = (*di_12(D))[1];
note: Build SLP for _5 = (*di_12(D))[2];
note: Build SLP for _7 = (*di_12(D))[3];
note: SLP discovery for node 0x4bda040 failed
note: Building vector operands from scalars
note: Created SLP node 0x4bda100
note: SLP discovery for node 0x4bd9f80 succeeded
note: SLP discovery for node 0x4bd9ec0 succeeded
note: SLP size 3 vs. limit 16.
note: Final SLP tree for instance 0x4b174b0:
note: node 0x4bd9ec0 (nunits.min=4, nunits.max=4, refcnt=2)
vector([4,4]) unsigned int
note: op template: (*si_13(D))[0] = _2;
note: stmt 0 (*si_13(D))[0] = _2;
note: stmt 1 (*si_13(D))[1] = _4;
note: stmt 2 (*si_13(D))[2] = _6;
note: stmt 3 (*si_13(D))[3] = _8;
note: children 0x4bd9f80
note: node 0x4bd9f80 (nunits.min=4, nunits.max=4, refcnt=2)
vector([4,4]) unsigned int
note: op template: _2 = (unsigned int) _1;
note: stmt 0 _2 = (unsigned int) _1;
note: stmt 1 _4 = (unsigned int) _3;
note: stmt 2 _6 = (unsigned int) _5;
note: stmt 3 _8 = (unsigned int) _7;
note: children 0x4bda100
note: node (external) 0x4bda100 (nunits.min=18446744073709551615,
nunits.max=1, refcnt=1)
note: { _1, _3, _5, _7 }
The convert node wants vector([2,2]) long unsigned int (two or more
64-bit values), whose subpart count does not necessarily divide the
4 lanes exactly, so the vectoriser falls back to building its operands
from scalars:
note: === vect_slp_analyze_operations ===
note: ==> examining statement: _2 = (unsigned int) _1;
note: get_vectype_for_scalar_type: natural type for long unsigned int
(ignoring group size 4): vector([2,2]) long unsigned int
note: inferred vector type vector([2,2]) long unsigned int
missed: lanes=4 is not divisible by subparts=2.
missed: incompatible vector types for invariants
note: get_vectype_for_scalar_type: natural type for long unsigned int
(ignoring group size 4): vector([2,2]) long unsigned int
note: get_vectype_for_scalar_type: natural type for long unsigned int
(ignoring group size 4): vector([2,2]) long unsigned int
missed: not vectorized: relevant stmt not supported: _2 = (unsigned
int) _1;
note: Building vector operands of 0x4bd9f80 from scalars instead
note: ==> examining statement: (*si_13(D))[0] = _2;
note: updated vectype of operand 0x4bd9f80 with 4 lanes to
vector([4,4]) unsigned int
note: vect_model_store_cost: aligned.
note: vect_model_store_cost: inside_cost = 1, prologue_cost = 0 .
note: vect_prologue_cost_for_slp: node 0x4bd9f80, vector type
vector([4,4]) unsigned int, group_size 4
note: === vect_bb_partition_graph ===
note: ***** Analysis succeeded with vector mode VNx2DI
note: SLPing BB part
However, the vectorisation with mode VNx2DI is not deemed profitable:
note: Costing subgraph:
note: node 0x4bd9ec0 (nunits.min=4, nunits.max=4, refcnt=1)
vector([4,4]) unsigned int
note: op template: (*si_13(D))[0] = _2;
note: stmt 0 (*si_13(D))[0] = _2;
note: stmt 1 (*si_13(D))[1] = _4;
note: stmt 2 (*si_13(D))[2] = _6;
note: stmt 3 (*si_13(D))[3] = _8;
note: children 0x4bd9f80
note: node (external) 0x4bd9f80 (nunits.min=4, nunits.max=4, refcnt=1)
vector([4,4]) unsigned int
note: stmt 0 _2 = (unsigned int) _1;
note: stmt 1 _4 = (unsigned int) _3;
note: stmt 2 _6 = (unsigned int) _5;
note: stmt 3 _8 = (unsigned int) _7;
note: children 0x4bda100
note: node (external) 0x4bda100 (nunits.min=18446744073709551615,
nunits.max=1, refcnt=1)
note: { _1, _3, _5, _7 }
note: Cost model analysis:
_2 1 times scalar_store costs 1 in body
_4 1 times scalar_store costs 1 in body
_6 1 times scalar_store costs 1 in body
_8 1 times scalar_store costs 1 in body
_2 1 times vector_store costs 1 in body
node 0x4bd9f80 1 times vec_construct costs 3 in prologue
note: Cost model analysis for part in loop 0:
Vector cost: 11
Scalar cost: 4
missed: not vectorized: vectorization is not profitable.
note: ***** The result for vector mode VNx16QI would be the same
note: ***** The result for vector mode VNx8QI would be the same
note: ***** The result for vector mode VNx4QI would be the same
The vectoriser then successfully analyses the same block with VNx2QI:
note: === vect_analyze_slp ===
note: Starting SLP discovery for
note: (*si_13(D))[0] = _2;
note: (*si_13(D))[1] = _4;
note: (*si_13(D))[2] = _6;
note: (*si_13(D))[3] = _8;
note: (*si_13(D))[4] = _10;
note: Created SLP node 0x4bd9ec0
note: starting SLP discovery for node 0x4bd9ec0
note: get vectype for scalar type (group size 5): uint32_t
note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring
group size 5): vector([2,2]) unsigned int
note: vectype: vector([2,2]) unsigned int
note: nunits = [2,2]
missed: Build SLP failed: unrolling required in basic block SLP
note: Build SLP for (*si_13(D))[0] = _2;
note: Build SLP for (*si_13(D))[1] = _4;
note: Build SLP for (*si_13(D))[2] = _6;
note: Build SLP for (*si_13(D))[3] = _8;
note: Build SLP for (*si_13(D))[4] = _10;
note: SLP discovery for node 0x4bd9ec0 failed
It splits the group of 5 into 4 + 1:
note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring
group size 4): vector([2,2]) unsigned int
note: Splitting SLP group at stmt 4
note: Split group into 4 and 1
note: Starting SLP discovery for
note: (*si_13(D))[0] = _2;
note: (*si_13(D))[1] = _4;
note: (*si_13(D))[2] = _6;
note: (*si_13(D))[3] = _8;
note: Created SLP node 0x4bd9f80
note: starting SLP discovery for node 0x4bd9f80
note: get vectype for scalar type (group size 4): uint32_t
note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring
group size 4): vector([2,2]) unsigned int
note: vectype: vector([2,2]) unsigned int
note: nunits = [2,2]
missed: Build SLP failed: unrolling required in basic block SLP
note: Build SLP for (*si_13(D))[0] = _2;
note: Build SLP for (*si_13(D))[1] = _4;
note: Build SLP for (*si_13(D))[2] = _6;
note: Build SLP for (*si_13(D))[3] = _8;
note: SLP discovery for node 0x4bd9f80 failed
It then splits the group of 4 into 2 + 2:
note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring
group size 2): vector([2,2]) unsigned int
note: Splitting SLP group at stmt 2
note: Split group into 2 and 2
note: Starting SLP discovery for
note: (*si_13(D))[0] = _2;
note: (*si_13(D))[1] = _4;
note: Created SLP node 0x4bda100
note: starting SLP discovery for node 0x4bda100
note: get vectype for scalar type (group size 2): uint32_t
note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring
group size 2): vector([2,2]) unsigned int
note: vectype: vector([2,2]) unsigned int
note: nunits = [2,2]
note: Build SLP for (*si_13(D))[0] = _2;
note: Build SLP for (*si_13(D))[1] = _4;
note: vect_is_simple_use: operand (unsigned int) _1, type of def: internal
note: vect_is_simple_use: operand (unsigned int) _3, type of def: internal
note: Created SLP node 0x4bd9e00
note: starting SLP discovery for node 0x4bd9e00
note: get vectype for scalar type (group size 2): unsigned int
note: get_vectype_for_scalar_type: natural type for unsigned int
(ignoring group size 2): vector([2,2]) unsigned int
note: vectype: vector([2,2]) unsigned int
note: nunits = [2,2]
note: Build SLP for _2 = (unsigned int) _1;
note: Build SLP for _4 = (unsigned int) _3;
note: vect_is_simple_use: operand (*di_12(D))[0], type of def: internal
note: vect_is_simple_use: operand (*di_12(D))[1], type of def: internal
note: Created SLP node 0x4bda040
note: starting SLP discovery for node 0x4bda040
note: get vectype for scalar type (group size 2): const uint64_t
note: get_vectype_for_scalar_type: natural type for const uint64_t
(ignoring group size 2): const vector([2,2]) long unsigned int
note: vectype: const vector([2,2]) long unsigned int
note: nunits = [2,2]
note: Build SLP for _1 = (*di_12(D))[0];
note: Build SLP for _3 = (*di_12(D))[1];
note: SLP discovery for node 0x4bda040 succeeded
note: SLP discovery for node 0x4bd9e00 succeeded
note: SLP discovery for node 0x4bda100 succeeded
note: SLP size 3 vs. limit 16.
note: Final SLP tree for instance 0x4b174b0:
note: node 0x4bda100 (nunits.min=2, nunits.max=2, refcnt=2)
vector([2,2]) unsigned int
note: op template: (*si_13(D))[0] = _2;
note: stmt 0 (*si_13(D))[0] = _2;
note: stmt 1 (*si_13(D))[1] = _4;
note: children 0x4bd9e00
note: node 0x4bd9e00 (nunits.min=2, nunits.max=2, refcnt=2)
vector([2,2]) unsigned int
note: op template: _2 = (unsigned int) _1;
note: stmt 0 _2 = (unsigned int) _1;
note: stmt 1 _4 = (unsigned int) _3;
note: children 0x4bda040
note: node 0x4bda040 (nunits.min=2, nunits.max=2, refcnt=2) const
vector([2,2]) long unsigned int
note: op template: _1 = (*di_12(D))[0];
note: stmt 0 _1 = (*di_12(D))[0];
note: stmt 1 _3 = (*di_12(D))[1];
note: load permutation { 0 1 }
Unlike the previous attempt, this one is deemed profitable.
The resultant GIMPLE is:
void convert (const uint64_t[5] * const di, uint32_t[5] * const si)
{
  uint32_t * vectp.14;
  vector([2,2]) unsigned int * vectp_si.13;
  vector([2,2]) unsigned int vect__6.12;
  const vector([2,2]) long unsigned int vect__5.11;
  const uint64_t * vectp.10;
  const vector([2,2]) long unsigned int * vectp_di.9;
  uint32_t * vectp.8;
  vector([2,2]) unsigned int * vectp_si.7;
  vector([2,2]) unsigned int vect__2.6;
  const vector([2,2]) long unsigned int vect__1.5;
  const uint64_t * vectp.4;
  const vector([2,2]) long unsigned int * vectp_di.3;
  long unsigned int _1;
  unsigned int _2;
  long unsigned int _3;
  unsigned int _4;
  long unsigned int _5;
  unsigned int _6;
  long unsigned int _7;
  unsigned int _8;
  long unsigned int _9;
  unsigned int _10;
  vector([2,2]) <signed-boolean:8> slp_mask_20;
  vector([2,2]) <signed-boolean:8> slp_mask_24;
  vector([2,2]) <signed-boolean:8> slp_mask_27;
  vector([2,2]) <signed-boolean:8> slp_mask_31;

  <bb 2> [local count: 1073741824]:
  vectp.4_19 = &(*di_12(D))[0];
  slp_mask_20 = .WHILE_ULT (0, 2, { 0, ... });
  vect__1.5_21 = .MASK_LOAD (vectp.4_19, 64B, slp_mask_20, { 0, ... });
  vect__2.6_22 = (vector([2,2]) unsigned int) vect__1.5_21;
  _1 = (*di_12(D))[0];
  _2 = (unsigned int) _1;
  _3 = (*di_12(D))[1];
  _4 = (unsigned int) _3;
  vectp.8_23 = &(*si_13(D))[0];
  slp_mask_24 = .WHILE_ULT (0, 2, { 0, ... });
  .MASK_STORE (vectp.8_23, 32B, slp_mask_24, vect__2.6_22);
  vectp.10_26 = &(*di_12(D))[2];
  slp_mask_27 = .WHILE_ULT (0, 2, { 0, ... });
  vect__5.11_28 = .MASK_LOAD (vectp.10_26, 64B, slp_mask_27, { 0, ... });
  vect__6.12_29 = (vector([2,2]) unsigned int) vect__5.11_28;
  _5 = (*di_12(D))[2];
  _6 = (unsigned int) _5;
  _7 = (*di_12(D))[3];
  _8 = (unsigned int) _7;
  vectp.14_30 = &(*si_13(D))[2];
  slp_mask_31 = .WHILE_ULT (0, 2, { 0, ... });
  .MASK_STORE (vectp.14_30, 32B, slp_mask_31, vect__6.12_29);
  _9 = (*di_12(D))[4];
  _10 = (unsigned int) _9;
  (*si_13(D))[4] = _10;
  return;
}
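For reference, the shape of that GIMPLE corresponds to roughly what one
might write by hand with ACLE intrinsics, along these lines (only a
sketch, not what GCC emits internally; convert_acle is a made-up name):

#include <arm_sve.h>
#include <stdint.h>

void convert_acle(const uint64_t (*const di)[5], uint32_t (*const si)[5])
{
    /* .WHILE_ULT (0, 2, ...): a predicate with exactly two active
       64-bit lanes, like the "ptrue p7.d, vl2" in the assembly. */
    svbool_t p2 = svwhilelt_b64_s32(0, 2);

    /* First pair of lanes: masked load, then a truncating store of the
       low 32 bits of each 64-bit element (st1w on .d elements). */
    svuint64_t v0 = svld1_u64(p2, &(*di)[0]);
    svst1w_u64(p2, &(*si)[0], v0);

    /* Second pair of lanes. */
    svuint64_t v1 = svld1_u64(p2, &(*di)[2]);
    svst1w_u64(p2, &(*si)[2], v1);

    /* The fifth lane stays scalar, as in the split-off group of 1. */
    (*si)[4] = (uint32_t)(*di)[4];
}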
--
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/