Hi Hongtao,
Thanks for your bug report. I have managed to reproduce the crash.
On 09/06/2026 07:38, Hongtao Liu wrote:
On Wed, Jun 3, 2026 at 11:25 PM Christopher Bazley <[email protected]> wrote:
Add two new fields to SLP tree nodes, which are accessed as
SLP_TREE_CAN_USE_PARTIAL_VECTORS_P and SLP_TREE_PARTIAL_VECTORS_STYLE.
SLP_TREE_CAN_USE_PARTIAL_VECTORS_P is analogous to the existing
predicate LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P. It is initialized to
true. This flag just records whether the target could vectorize a
node using a partial vector; it does not say anything about
whether the vector actually is partial, or how the target would support
use of a partial vector. Some kinds of node require mask/length for
partial vectors; others don't. In the latter case (e.g., for add
operations), SLP_TREE_CAN_USE_PARTIAL_VECTORS_P will remain true.
SLP_TREE_PARTIAL_VECTORS_STYLE is analogous to the existing field
LOOP_VINFO_PARTIAL_VECTORS_STYLE. Both are initialized to 'none'.
The vect_partial_vectors_avx512 enumerator is not used for BB SLP.
Unlike loop vectorization, a different style of partial vectors can be
chosen for each node during analysis of that node.
Implement the recently-introduced wrapper functions,
vect_record_(len|mask), for BB SLP by setting
SLP_TREE_PARTIAL_VECTORS_STYLE to indicate that a mask or length should
be used for a given SLP node. The passed-in vec_info is ignored.
Implement the vect_fully_(masked|with_length)_p wrapper functions for
BB SLP by checking the SLP_TREE_PARTIAL_VECTORS_STYLE. This should be
sufficient because at most one of vect_record_(len|mask) and
vect_cannot_use_partial_vectors is expected to be called for any
given SLP node. SLP_TREE_CAN_USE_PARTIAL_VECTORS_P should be true if
the style is not 'none', but its value isn't used beyond the analysis
phase.
The implementations of vect_get_mask and vect_get_len for BB SLP are
non-trivial (albeit simpler than for loop vectorization), therefore they
are delegated to SLP-specific functions defined in tree-vect-slp.cc.
Implement the vect_cannot_use_partial_vectors wrapper function by
setting the SLP_TREE_CAN_USE_PARTIAL_VECTORS_P flag to false.
To prevent regressions, vect_can_use_partial_vectors_p still returns
false for BB SLP regardless (for now). This prevents vect_record_mask
or vect_record_len from being called.
gcc/ChangeLog:
* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize new
partial_vector_style, can_use_partial_vectors and
num_partial_vectors members.
(vect_slp_analyze_node_operations): Account for worst-case
prologue costs of per-node partial-vector mask or length
materialisation.
(vect_slp_record_bb_style): Set the partial vector style of an
SLP node, checking that the style does not flip-flop between mask
and length.
(vect_slp_record_bb_mask): Use vect_slp_record_bb_style to set
the partial vector style of the SLP tree node to
vect_partial_vectors_while_ult.
(vect_slp_get_bb_mask): New function to materialize a mask for
basic block SLP vectorization.
(vect_slp_record_bb_len): Use vect_slp_record_bb_style to set
the partial vector style of the SLP tree node to
vect_partial_vectors_len.
(vect_slp_get_bb_len): New function to materialize a length for
basic block SLP vectorization.
* tree-vect-stmts.cc (vectorizable_internal_function):
(vect_record_mask): Handle the basic block SLP use case by
delegating to vect_slp_record_bb_mask.
(vect_get_mask): Handle the basic block SLP use case by
delegating to vect_slp_get_bb_mask.
(vect_record_len): Handle the basic block SLP use case by
delegating to vect_slp_record_bb_len.
(vect_get_len): Handle the basic block SLP use case by
delegating to vect_slp_get_bb_len.
(vect_gen_while_ssa_name): New function containing code
refactored out of vect_gen_while for reuse by
vect_slp_get_bb_mask.
(vect_gen_while): Use vect_gen_while_ssa_name instead of custom
code for some of the implementation.
* tree-vectorizer.h (enum vect_partial_vector_style): Move this
definition earlier to allow reuse by struct _slp_tree.
(struct _slp_tree): Add a partial_vector_style member to record
whether to use a length or mask for the SLP tree node, if
partial vectors are required and supported.
Add a can_use_partial_vectors member to record whether partial
vectors are supported for the SLP tree node.
Add a num_partial_vectors member for costing.
(SLP_TREE_PARTIAL_VECTORS_STYLE): New member accessor macro.
(SLP_TREE_CAN_USE_PARTIAL_VECTORS_P): New member accessor macro.
(SLP_TREE_NUM_PARTIAL_VECTORS): New member accessor macro.
(vect_gen_while_ssa_name): Declaration of a new function.
(vect_slp_get_bb_mask): As above.
(vect_slp_get_bb_len): As above.
(vect_cannot_use_partial_vectors): Handle the basic block SLP
use-case by setting SLP_TREE_CAN_USE_PARTIAL_VECTORS_P to
false.
(vect_fully_with_length_p): Handle the basic block SLP use
case by checking whether the SLP_TREE_PARTIAL_VECTORS_STYLE is
vect_partial_vectors_len.
(vect_fully_masked_p): Handle the basic block SLP use case by
checking whether the SLP_TREE_PARTIAL_VECTORS_STYLE is
vect_partial_vectors_while_ult.
---
gcc/tree-vect-slp.cc | 182 +++++++++++++++++++++++++++++++++++++++++
gcc/tree-vect-stmts.cc | 52 +++++++-----
gcc/tree-vectorizer.h | 52 ++++++++----
3 files changed, 247 insertions(+), 39 deletions(-)
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 075e93f04a9..4dd7e6e1e21 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -125,6 +125,9 @@ _slp_tree::_slp_tree ()
SLP_TREE_GS_BASE (this) = NULL_TREE;
this->ldst_lanes = false;
this->avoid_stlf_fail = false;
+ SLP_TREE_PARTIAL_VECTORS_STYLE (this) = vect_partial_vectors_none;
+ SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (this) = true;
+ SLP_TREE_NUM_PARTIAL_VECTORS (this) = 0;
SLP_TREE_VECTYPE (this) = NULL_TREE;
SLP_TREE_REPRESENTATIVE (this) = NULL;
this->cycle_info.id = -1;
@@ -8958,6 +8961,40 @@ vect_slp_analyze_node_operations (vec_info *vinfo,
slp_tree node,
vect_prologue_cost_for_slp (vinfo, child, cost_vec);
}
+ if (res)
+ {
+ /* Take care of special costs for partial vectors.
+ Costing each partial vector is excessive for many SLP instances,
+ because it is common to materialise identical masks/lengths for related
+ operations (e.g., for vector loads and stores of the same length).
+ Masks/lengths can also be shared between SLP subgraphs or eliminated by
+ pattern-based lowering during instruction selection. However, it's
+ simpler and safer to use the worst-case cost; if this ends up being the
+ tie-breaker between vectorizing or not, then it's probably better not
+ to vectorize. */
+ const int num_partial_vectors = SLP_TREE_NUM_PARTIAL_VECTORS (node);
+
+ if (SLP_TREE_PARTIAL_VECTORS_STYLE (node)
+ == vect_partial_vectors_while_ult)
+ {
+ gcc_assert (num_partial_vectors > 0);
+ record_stmt_cost (cost_vec, num_partial_vectors, vector_stmt, NULL,
+ NULL, NULL_TREE, 0, vect_prologue);
+ }
+ else if (SLP_TREE_PARTIAL_VECTORS_STYLE (node)
+ == vect_partial_vectors_len)
+ {
+ /* Need to set up a length in the prologue. */
+ gcc_assert (num_partial_vectors > 0);
+ record_stmt_cost (cost_vec, num_partial_vectors, scalar_stmt, NULL,
+ NULL, NULL_TREE, 0, vect_prologue);
+ }
+ else
+ {
+ gcc_assert (num_partial_vectors == 0);
+ }
+ }
+
/* If this node or any of its children can't be vectorized, try pruning
the tree here rather than felling the whole thing. */
if (!res && vect_slp_convert_to_external (vinfo, node, node_instance))
@@ -12441,3 +12478,148 @@ vect_schedule_slp (vec_info *vinfo, const
vec<slp_instance> &slp_instances)
}
}
}
+
+/* Record that a specific partial vector style could be used to vectorize
+ SLP_NODE if required. */
+
+static void
+vect_slp_record_bb_style (slp_tree slp_node, vect_partial_vector_style style)
+{
+ gcc_assert (style != vect_partial_vectors_none);
+ gcc_assert (style != vect_partial_vectors_avx512);
+
+ if (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) == vect_partial_vectors_none)
+ SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) = style;
+ else
+ gcc_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) == style);
+}
+
+/* Record that a complete set of masks associated with SLP_NODE would need to
+ contain a sequence of NVECTORS masks that each control a vector of type
+ VECTYPE. If SCALAR_MASK is nonnull, the fully-masked loop would AND
+ these vector masks with the vector version of SCALAR_MASK. */
+void
+vect_slp_record_bb_mask (slp_tree slp_node, unsigned int /* nvectors */,
+ tree /* vectype */, tree /* scalar_mask */)
+{
+ vect_slp_record_bb_style (slp_node, vect_partial_vectors_while_ult);
+
+ /* FORNOW: this often overestimates the number of masks for costing purposes
+ because, after lowering, masks have often been eliminated, shared between
+ SLP nodes, or even shared between SLP subgraphs. */
+ SLP_TREE_NUM_PARTIAL_VECTORS (slp_node)++;
+}
+
+/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
+ operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
+ Insert any set-up statements before GSI. */
+
+tree
+vect_slp_get_bb_mask (slp_tree slp_node, gimple_stmt_iterator *gsi,
+ unsigned int nvectors, tree vectype, unsigned int index)
+{
+ gcc_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
+ == vect_partial_vectors_while_ult);
+ gcc_assert (nvectors >= 1);
+ gcc_assert (index < nvectors);
+
+ const poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+ const unsigned int group_size = SLP_TREE_LANES (slp_node);
+ unsigned int mask_size = group_size;
+ const tree masktype = truth_type_for (vectype);
+
+ if (nunits.is_constant ())
+ {
+ /* Only the last vector can be a partial vector. */
+ if (index + 1 < nvectors)
+ return build_minus_one_cst (masktype);
+
+ /* Return a mask for a possibly-partial tail vector. */
+ const unsigned int const_nunits = nunits.to_constant ();
+ const unsigned int head_size = (nvectors - 1) * const_nunits;
+ gcc_assert (head_size <= group_size);
+ mask_size = group_size - head_size;
+
+ if (mask_size == const_nunits)
+ return build_minus_one_cst (masktype);
+ }
+ else
+ {
+ /* Return a mask for a single variable-length vector. */
+ gcc_assert (nvectors == 1);
+ gcc_assert (known_le (mask_size, nunits));
+ }
+
+ /* FORNOW: don't bother maintaining a set of mask constants to allow
+ sharing between nodes belonging to the same instance of bb_vec_info
+ or even within the same SLP subgraph. */
+ gimple_seq stmts = NULL;
+ const tree cmp_type = size_type_node;
+ const tree start_index = build_zero_cst (cmp_type);
+ const tree end_index = build_int_cst (cmp_type, mask_size);
+ const tree mask = make_temp_ssa_name (masktype, NULL, "slp_mask");
+ vect_gen_while_ssa_name (&stmts, masktype, start_index, end_index, mask);
This is not a review, but I encountered an ICE when compiling for x86 with AVX512:
./gcc/xgcc -B ./gcc -O3 -march=sapphirerapids slp_pred_1.c -S
during GIMPLE pass: slp
slp_pred_1.c: In function ‘f’:
slp_pred_1.c:11:1: internal compiler error: in
vect_gen_while_ssa_name, at tree-vect-stmts.cc:14883
11 | f (uint8_t *x)
| ^
0x26038eb internal_error(char const*, ...)
../../slp_pred_tail/gcc/diagnostic-global-context.cc:787
0x9e8768 fancy_abort(char const*, int, char const*)
../../slp_pred_tail/gcc/diagnostics/context.cc:1813
0x8dca22 vect_gen_while_ssa_name(gimple**, tree_node*, tree_node*,
tree_node*, tree_node*)
../../slp_pred_tail/gcc/tree-vect-stmts.cc:14883
0x14f182a vect_slp_get_bb_mask(_slp_tree*, gimple_stmt_iterator*,
unsigned int, tree_node*, unsigned int)
../../slp_pred_tail/gcc/tree-vect-slp.cc:12688
0x149cab7 vectorizable_load
../../slp_pred_tail/gcc/tree-vect-stmts.cc:11522
0x14ad760 vect_transform_stmt(vec_info*, _stmt_vec_info*,
gimple_stmt_iterator*, _slp_tree*, _slp_instance*)
../../slp_pred_tail/gcc/tree-vect-stmts.cc:13581
0x14eee89 vect_schedule_slp_node
../../slp_pred_tail/gcc/tree-vect-slp.cc:12171
0x15123d1 vect_schedule_slp_node
../../slp_pred_tail/gcc/tree-vect-slp.cc:11940
0x15123d1 vect_schedule_scc
../../slp_pred_tail/gcc/tree-vect-slp.cc:12418
0x151236a vect_schedule_scc
../../slp_pred_tail/gcc/tree-vect-slp.cc:12399
0x151236a vect_schedule_scc
../../slp_pred_tail/gcc/tree-vect-slp.cc:12399
0x1512a49 vect_schedule_slp(vec_info*, vec<_slp_instance*, va_heap,
vl_ptr> const&)
../../slp_pred_tail/gcc/tree-vect-slp.cc:12563
0x15145af vect_slp_region
../../slp_pred_tail/gcc/tree-vect-slp.cc:10445
0x151640b vect_slp_bbs
../../slp_pred_tail/gcc/tree-vect-slp.cc:10557
0x15169b4 vect_slp_function(function*)
../../slp_pred_tail/gcc/tree-vect-slp.cc:10679
0x1521ad2 execute
../../slp_pred_tail/gcc/tree-vectorizer.cc:1570
It materializes BB-SLP tail masks with WHILE_ULT, which x86 doesn’t support.
True.
I assume that I need to add a guard equivalent to the existing
direct_internal_fn_supported_p (IFN_WHILE_ULT, ...) call in
can_produce_all_loop_masks_p, which is reached via
vect_analyze_loop -> vect_analyze_loop_1 -> vect_analyze_loop_2 ->
vect_verify_full_masking.
That function is not used for basic block vectorisation.
When debugging the slp_pred_1.c test on x86 with your suggested
configuration, the vect_get_partial_vector_style helper function
consistently returns vect_partial_vectors_while_ult because
targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode) &&
can_vec_mask_load_store_p (vecmode, mask_mode, is_load, NULL, elsvals).
This helper was my invention. Its return type seems questionable with
hindsight, but vect_slp_get_bb_mask does not fail because of the return
value of vect_get_partial_vector_style; it fails because
vect_slp_get_bb_mask does not distinguish between the WHILE_ULT and
AVX512 scenarios.
vect_slp_record_bb_mask sets SLP_TREE_PARTIAL_VECTORS_STYLE to
vect_partial_vectors_while_ult, regardless of whether AVX512-style masks
or WHILE_ULT masks are required. The assertion in vect_slp_get_bb_mask
that SLP_TREE_PARTIAL_VECTORS_STYLE == vect_partial_vectors_while_ult is
probably a contributing factor because it appears to mean that it is
safe to use vect_gen_while_ssa_name, which is untrue.
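To make the failure mode concrete, here is a minimal standalone sketch
(not GCC code) of the constant-nunits path of vect_slp_get_bb_mask. It
computes the number of active lanes in the tail vector and then spells
the mask out as a constant, which is what WHILE_ULT (0, mask_size) would
otherwise produce at run time. The names tail_mask_size and
constant_tail_mask are hypothetical, and mask lanes are modelled as
int8_t values of -1 or 0.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: number of active lanes in vector INDEX out of
// NVECTORS vectors of NUNITS lanes each, which together cover GROUP_SIZE
// scalar lanes.  Only the last vector can be partial.
unsigned
tail_mask_size (unsigned group_size, unsigned nunits, unsigned nvectors,
                unsigned index)
{
  assert (nvectors >= 1 && index < nvectors);
  if (index + 1 < nvectors)
    return nunits;
  const unsigned head_size = (nvectors - 1) * nunits;
  assert (head_size <= group_size);
  assert (group_size - head_size <= nunits);
  return group_size - head_size;
}

// Hypothetical model of a constant mask: the first MASK_SIZE lanes are
// all-ones (-1) and the rest are zero, matching WHILE_ULT (0, MASK_SIZE).
std::vector<int8_t>
constant_tail_mask (unsigned mask_size, unsigned nunits)
{
  assert (mask_size <= nunits);
  std::vector<int8_t> mask (nunits, 0);
  for (unsigned i = 0; i < mask_size; ++i)
    mask[i] = -1;
  return mask;
}
```

For example, a 16-lane group held in a single 32-lane vector has a tail
mask size of 16, so the constant mask has its first 16 lanes set and the
remaining 16 clear.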
I propose this (modulo other review comments):
1. Rename vect_get_partial_vector_style as
vect_get_load_store_partial_vector_style.
2. Create a new three-value enumerated type, e.g.
vect_load_store_partial_vectors_(none|len|mask).
3. Use vect_load_store_partial_vectors_style as the return type of
vect_get_load_store_partial_vector_style.
4. Add a missing BB-SLP analysis step that sets the value of
SLP_TREE_PARTIAL_VECTORS_STYLE based on direct_internal_fn_supported_p
(IFN_WHILE_ULT, ...) and members of the SLP node. (I think Richard wants
the existing code to be reused here.)
5. In vect_slp_get_bb_mask, generate the kind of mask specified by
SLP_TREE_PARTIAL_VECTORS_STYLE instead of assuming WHILE_ULT.
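Steps 4 and 5 could look roughly like the following standalone sketch.
It is not GCC code: the enumeration, the function name
choose_bb_mask_style and its two boolean inputs (standing in for the
results of can_vec_mask_load_store_p and direct_internal_fn_supported_p
(IFN_WHILE_ULT, ...)) are all assumptions made for illustration. It also
assumes a fixed-length vector, for which a constant mask can always be
written out.

```cpp
#include <cassert>

// Hypothetical stand-in for the mask-related partial vector styles.
enum class bb_mask_style
{
  none,          // Partial vectors cannot be used for this node.
  while_ult,     // Materialize the mask with WHILE_ULT.
  constant_mask  // Materialize the mask as a vector constant.
};

// Pick a style during analysis, so that the transform phase never has
// to assume that WHILE_ULT is available on the target.
bb_mask_style
choose_bb_mask_style (bool masked_load_store_supported,
                      bool while_ult_supported)
{
  if (!masked_load_store_supported)
    return bb_mask_style::none;
  if (while_ult_supported)
    return bb_mask_style::while_ult;
  return bb_mask_style::constant_mask;
}
```

The point of deciding this during analysis is that vect_slp_get_bb_mask
can then dispatch on the recorded style rather than asserting one.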
After manually using a constant mask for AVX512, I encountered another
issue, this time with performance.
If I change slp_pred_1.c to
void
f (uint8_t *x)
{
x[0] += 1;
x[1] += 2;
x[2] += 1;
x[3] += 2;
x[4] += 1;
x[5] += 2;
x[6] += 1;
x[7] += 2;
x[8] += 1;
x[9] += 2;
x[10] += 1;
x[11] += 2;
x[12] += 1;
x[13] += 2;
x[14] += 1;
x[15] += 4;
}
(This modified version of slp_pred_1.c looks similar to slp_pred_3.c)
with -march=sapphirerapids -O3, it generates
<bb 2> [local count: 1073741824]:
vectp.4_51 = x_34(D);
vect__1.5_52 = .MASK_LOAD (vectp.4_51, 8B, { -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0 }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 });
vect__2.6_53 = vect__1.5_52 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2,
1, 2, 1, 4, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1,
4 };
_1 = *x_34(D);
That looks equivalent to the GIMPLE generated for -march=armv8.2-a+sve:
<bb 2> [local count: 1073741824]:
vectp.4_51 = x_34(D);
slp_mask_52 = .WHILE_ULT (0, 16, { 0, ... });
vect__1.5_53 = .MASK_LOAD (vectp.4_51, 8B, slp_mask_52, { 0, ... });
vect__2.6_54 = vect__1.5_53 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2,
1, 2, 1, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... };
_1 = *x_34(D);
When the target is AArch64, the masked load of 'vector([16,16]) unsigned
char' is eventually optimized into an unmasked load of 128 bits,
therefore the generated assembly language has no masks in it:
adrp x1, .LANCHOR0
ldr q31, [x0]
ldr q30, [x1, #:lo12:.LANCHOR0]
add z30.b, z30.b, z31.b
str q30, [x0]
But a 128-bit vector without a mask should be used here, instead of a
256-bit vector with the upper 128 bits masked off.
<bb 2> [local count: 1073741824]:
vectp.4_51 = x_34(D);
vect__1.5_52 = MEM <vector(16) unsigned char> [(uint8_t *)vectp.4_51];
vect__2.6_53 = vect__1.5_52 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2,
1, 2, 1, 4 };
_1 = *x_34(D);
_2 = _1 + 1;
_3 = MEM[(uint8_t *)x_34(D) + 1B];
Similarly, for the original slp_pred_1.c, a 128-bit vector with a mask
should be used instead of a 256-bit vector.
I think the best solution would be to first determine the smallest
vector type whose number of subparts is at least
1 << ceil_log2 (group_size), and only then check whether a mask or
length can be used to fix up usage of that type if it is oversized.
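As a rough illustration of that ordering, the standalone sketch below
first rounds the group size up to a power of two, then picks the
smallest candidate lane count that is large enough, and only then
reports whether a mask or length is needed. The candidate list and the
names ceil_pow2, vector_choice and choose_nunits are hypothetical; real
code would query the target's vector modes instead. It needs C++17 for
std::optional.

```cpp
#include <cassert>
#include <optional>
#include <vector>

struct vector_choice
{
  unsigned nunits;     // Lanes in the chosen vector type.
  bool needs_partial;  // True if a mask or length must disable lanes.
};

// Smallest power of two that is >= X, i.e. 1 << ceil_log2 (X).
unsigned
ceil_pow2 (unsigned x)
{
  assert (x >= 1 && x <= (1u << 31));
  unsigned p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

// Choose the smallest candidate lane count that can hold the group,
// or nothing if no candidate is large enough.
std::optional<vector_choice>
choose_nunits (unsigned group_size, const std::vector<unsigned> &candidates)
{
  const unsigned wanted = ceil_pow2 (group_size);
  std::optional<vector_choice> best;
  for (unsigned nunits : candidates)
    if (nunits >= wanted && (!best || nunits < best->nunits))
      best = vector_choice { nunits, group_size < nunits };
  return best;
}
```

With candidates of 16, 32 and 64 lanes, a group of 16 selects 16 lanes
with no mask, whereas a group of 13 selects 16 lanes plus a mask or
length that disables the last three.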
Thanks,
--
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/