Hi Hongtao,

Thanks for your bug report. I have managed to reproduce the crash.

On 09/06/2026 07:38, Hongtao Liu wrote:
On Wed, Jun 3, 2026 at 11:25 PM Christopher Bazley <[email protected]> wrote:

Add two new fields to SLP tree nodes, which are accessed as
SLP_TREE_CAN_USE_PARTIAL_VECTORS_P and SLP_TREE_PARTIAL_VECTORS_STYLE.

SLP_TREE_CAN_USE_PARTIAL_VECTORS_P is analogous to the existing
predicate LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P. It is initialized to
true. This flag just records whether the target could vectorize a
node using a partial vector; it does not say anything about
whether the vector actually is partial, or how the target would support
use of a partial vector. Some kinds of node require mask/length for
partial vectors; others don't. In the latter case (e.g., for add
operations), SLP_TREE_CAN_USE_PARTIAL_VECTORS_P will remain true.

SLP_TREE_PARTIAL_VECTORS_STYLE is analogous to the existing field
LOOP_VINFO_PARTIAL_VECTORS_STYLE. Both are initialized to 'none'.
The vect_partial_vectors_avx512 enumerator is not used for BB SLP.
Unlike loop vectorization, a different style of partial vectors can be
chosen for each node during analysis of that node.

Implement the recently-introduced wrapper functions,
vect_record_(len|mask), for BB SLP by setting
SLP_TREE_PARTIAL_VECTORS_STYLE to indicate that a mask or length should
be used for a given SLP node. The passed-in vec_info is ignored.

Implement the vect_fully_(masked|with_length)_p wrapper functions for
BB SLP by checking the SLP_TREE_PARTIAL_VECTORS_STYLE. This should be
sufficient because at most one of vect_record_(len|mask) and
vect_cannot_use_partial_vectors is expected to be called for any
given SLP node. SLP_TREE_CAN_USE_PARTIAL_VECTORS_P should be true if
the style is not 'none', but its value isn't used beyond the analysis
phase.

The implementations of vect_get_mask and vect_get_len for BB SLP are
non-trivial (albeit simpler than for loop vectorization), therefore they
are delegated to SLP-specific functions defined in tree-vect-slp.cc.

Implement the vect_cannot_use_partial_vectors wrapper function by
setting the SLP_TREE_CAN_USE_PARTIAL_VECTORS_P flag to false.
To prevent regressions, vect_can_use_partial_vectors_p still returns
false for BB SLP regardless (for now). This prevents vect_record_mask
or vect_record_len from being called.

gcc/ChangeLog:

         * tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize new
         partial_vector_style, can_use_partial_vectors and
         num_partial_vectors members.
         (vect_slp_analyze_node_operations): Account for worst-case
         prologue costs of per-node partial-vector mask or length
         materialisation.
         (vect_slp_record_bb_style): Set the partial vector style of an
         SLP node, checking that the style does not flip-flop between mask
         and length.
         (vect_slp_record_bb_mask): Use vect_slp_record_bb_style to set
         the partial vector style of the SLP tree node to
         vect_partial_vectors_while_ult.
         (vect_slp_get_bb_mask): New function to materialize a mask for
         basic block SLP vectorization.
         (vect_slp_record_bb_len): Use vect_slp_record_bb_style to set
         the partial vector style of the SLP tree node to
         vect_partial_vectors_len.
         (vect_slp_get_bb_len): New function to materialize a length for
         basic block SLP vectorization.
         * tree-vect-stmts.cc (vectorizable_internal_function):
         (vect_record_mask): Handle the basic block SLP use case by
         delegating to vect_slp_record_bb_mask.
         (vect_get_mask): Handle the basic block SLP use case by
         delegating to vect_slp_get_bb_mask.
         (vect_record_len): Handle the basic block SLP use case by
         delegating to vect_slp_record_bb_len.
         (vect_get_len): Handle the basic block SLP use case by
         delegating to vect_slp_get_bb_len.
         (vect_gen_while_ssa_name): New function containing code
         refactored out of vect_gen_while for reuse by
         vect_slp_get_bb_mask.
         (vect_gen_while): Use vect_gen_while_ssa_name instead of custom
         code for some of the implementation.
         * tree-vectorizer.h (enum vect_partial_vector_style): Move this
         definition earlier to allow reuse by struct _slp_tree.
         (struct _slp_tree): Add a partial_vector_style member to record
         whether to use a length or mask for the SLP tree node, if
         partial vectors are required and supported.
         Add a can_use_partial_vectors member to record whether partial
         vectors are supported for the SLP tree node.
         Add a num_partial_vectors member for costing.
         (SLP_TREE_PARTIAL_VECTORS_STYLE): New member accessor macro.
         (SLP_TREE_CAN_USE_PARTIAL_VECTORS_P): New member accessor macro.
         (SLP_TREE_NUM_PARTIAL_VECTORS): New member accessor macro.
         (vect_gen_while_ssa_name): Declaration of a new function.
         (vect_slp_get_bb_mask): As above.
         (vect_slp_get_bb_len): As above.
         (vect_cannot_use_partial_vectors): Handle the basic block SLP
         use case by setting SLP_TREE_CAN_USE_PARTIAL_VECTORS_P to
         false.
         (vect_fully_with_length_p): Handle the basic block SLP use
         case by checking whether the SLP_TREE_PARTIAL_VECTORS_STYLE is
         vect_partial_vectors_len.
         (vect_fully_masked_p): Handle the basic block SLP use case by
         checking whether the SLP_TREE_PARTIAL_VECTORS_STYLE is
         vect_partial_vectors_while_ult.
---
  gcc/tree-vect-slp.cc   | 182 +++++++++++++++++++++++++++++++++++++++++
  gcc/tree-vect-stmts.cc |  52 +++++++-----
  gcc/tree-vectorizer.h  |  52 ++++++++----
  3 files changed, 247 insertions(+), 39 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 075e93f04a9..4dd7e6e1e21 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -125,6 +125,9 @@ _slp_tree::_slp_tree ()
    SLP_TREE_GS_BASE (this) = NULL_TREE;
    this->ldst_lanes = false;
    this->avoid_stlf_fail = false;
+  SLP_TREE_PARTIAL_VECTORS_STYLE (this) = vect_partial_vectors_none;
+  SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (this) = true;
+  SLP_TREE_NUM_PARTIAL_VECTORS (this) = 0;
    SLP_TREE_VECTYPE (this) = NULL_TREE;
    SLP_TREE_REPRESENTATIVE (this) = NULL;
    this->cycle_info.id = -1;
@@ -8958,6 +8961,40 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
           vect_prologue_cost_for_slp (vinfo, child, cost_vec);
         }

+  if (res)
+    {
+      /* Take care of special costs for partial vectors.
+        Costing each partial vector is excessive for many SLP instances,
+        because it is common to materialise identical masks/lengths for related
+        operations (e.g., for vector loads and stores of the same length).
+        Masks/lengths can also be shared between SLP subgraphs or eliminated by
+        pattern-based lowering during instruction selection.  However, it's
+        simpler and safer to use the worst-case cost; if this ends up being the
+        tie-breaker between vectorizing or not, then it's probably better not
+        to vectorize.  */
+      const int num_partial_vectors = SLP_TREE_NUM_PARTIAL_VECTORS (node);
+
+      if (SLP_TREE_PARTIAL_VECTORS_STYLE (node)
+         == vect_partial_vectors_while_ult)
+       {
+         gcc_assert (num_partial_vectors > 0);
+         record_stmt_cost (cost_vec, num_partial_vectors, vector_stmt, NULL,
+                           NULL, NULL_TREE, 0, vect_prologue);
+       }
+      else if (SLP_TREE_PARTIAL_VECTORS_STYLE (node)
+              == vect_partial_vectors_len)
+       {
+         /* Need to set up a length in the prologue.  */
+         gcc_assert (num_partial_vectors > 0);
+         record_stmt_cost (cost_vec, num_partial_vectors, scalar_stmt, NULL,
+                           NULL, NULL_TREE, 0, vect_prologue);
+       }
+      else
+       {
+         gcc_assert (num_partial_vectors == 0);
+       }
+    }
+
    /* If this node or any of its children can't be vectorized, try pruning
       the tree here rather than felling the whole thing.  */
    if (!res && vect_slp_convert_to_external (vinfo, node, node_instance))
@@ -12441,3 +12478,148 @@ vect_schedule_slp (vec_info *vinfo, const vec<slp_instance> &slp_instances)
          }
      }
  }
+
+/* Record that a specific partial vector style could be used to vectorize
+   SLP_NODE if required.  */
+
+static void
+vect_slp_record_bb_style (slp_tree slp_node, vect_partial_vector_style style)
+{
+  gcc_assert (style != vect_partial_vectors_none);
+  gcc_assert (style != vect_partial_vectors_avx512);
+
+  if (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) == vect_partial_vectors_none)
+    SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) = style;
+  else
+    gcc_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) == style);
+}
+
+/* Record that a complete set of masks associated with SLP_NODE would need to
+   contain a sequence of NVECTORS masks that each control a vector of type
+   VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
+   these vector masks with the vector version of SCALAR_MASK.  */
+void
+vect_slp_record_bb_mask (slp_tree slp_node, unsigned int /* nvectors */,
+                        tree /* vectype */, tree /* scalar_mask */)
+{
+  vect_slp_record_bb_style (slp_node, vect_partial_vectors_while_ult);
+
+  /* FORNOW: this often overestimates the number of masks for costing purposes
+     because, after lowering, masks have often been eliminated, shared between
+     SLP nodes, or even shared between SLP subgraphs.  */
+  SLP_TREE_NUM_PARTIAL_VECTORS(slp_node) ++;
+}
+
+/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
+   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
+   Insert any set-up statements before GSI.  */
+
+tree
+vect_slp_get_bb_mask (slp_tree slp_node, gimple_stmt_iterator *gsi,
+                     unsigned int nvectors, tree vectype, unsigned int index)
+{
+  gcc_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
+             == vect_partial_vectors_while_ult);
+  gcc_assert (nvectors >= 1);
+  gcc_assert (index < nvectors);
+
+  const poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  const unsigned int group_size = SLP_TREE_LANES (slp_node);
+  unsigned int mask_size = group_size;
+  const tree masktype = truth_type_for (vectype);
+
+  if (nunits.is_constant ())
+    {
+      /* Only the last vector can be a partial vector.  */
+      if (index + 1 < nvectors)
+       return build_minus_one_cst (masktype);
+
+      /* Return a mask for a possibly-partial tail vector. */
+      const unsigned int const_nunits = nunits.to_constant ();
+      const unsigned int head_size = (nvectors - 1) * const_nunits;
+      gcc_assert (head_size <= group_size);
+      mask_size = group_size - head_size;
+
+      if (mask_size == const_nunits)
+       return build_minus_one_cst (masktype);
+    }
+  else
+    {
+      /* Return a mask for a single variable-length vector. */
+      gcc_assert (nvectors == 1);
+      gcc_assert (known_le (mask_size, nunits));
+    }
+
+  /* FORNOW: don't bother maintaining a set of mask constants to allow
+     sharing between nodes belonging to the same instance of bb_vec_info
+     or even within the same SLP subgraph.  */
+  gimple_seq stmts = NULL;
+  const tree cmp_type = size_type_node;
+  const tree start_index = build_zero_cst (cmp_type);
+  const tree end_index = build_int_cst (cmp_type, mask_size);
+  const tree mask = make_temp_ssa_name (masktype, NULL, "slp_mask");
+  vect_gen_while_ssa_name (&stmts, masktype, start_index, end_index, mask);

This is not a review, but I encountered an ICE when compiling for x86 with AVX512:

./gcc/xgcc -B ./gcc -O3 -march=sapphirerapids slp_pred_1.c -S

during GIMPLE pass: slp
slp_pred_1.c: In function ‘f’:
slp_pred_1.c:11:1: internal compiler error: in
vect_gen_while_ssa_name, at tree-vect-stmts.cc:14883
    11 | f (uint8_t *x)
       | ^
0x26038eb internal_error(char const*, ...)
         ../../slp_pred_tail/gcc/diagnostic-global-context.cc:787
0x9e8768 fancy_abort(char const*, int, char const*)
         ../../slp_pred_tail/gcc/diagnostics/context.cc:1813
0x8dca22 vect_gen_while_ssa_name(gimple**, tree_node*, tree_node*,
tree_node*, tree_node*)
         ../../slp_pred_tail/gcc/tree-vect-stmts.cc:14883
0x14f182a vect_slp_get_bb_mask(_slp_tree*, gimple_stmt_iterator*,
unsigned int, tree_node*, unsigned int)
         ../../slp_pred_tail/gcc/tree-vect-slp.cc:12688
0x149cab7 vectorizable_load
         ../../slp_pred_tail/gcc/tree-vect-stmts.cc:11522
0x14ad760 vect_transform_stmt(vec_info*, _stmt_vec_info*,
gimple_stmt_iterator*, _slp_tree*, _slp_instance*)
         ../../slp_pred_tail/gcc/tree-vect-stmts.cc:13581
0x14eee89 vect_schedule_slp_node
         ../../slp_pred_tail/gcc/tree-vect-slp.cc:12171
0x15123d1 vect_schedule_slp_node
         ../../slp_pred_tail/gcc/tree-vect-slp.cc:11940
0x15123d1 vect_schedule_scc
         ../../slp_pred_tail/gcc/tree-vect-slp.cc:12418
0x151236a vect_schedule_scc
         ../../slp_pred_tail/gcc/tree-vect-slp.cc:12399
0x151236a vect_schedule_scc
         ../../slp_pred_tail/gcc/tree-vect-slp.cc:12399
0x1512a49 vect_schedule_slp(vec_info*, vec<_slp_instance*, va_heap,
vl_ptr> const&)
         ../../slp_pred_tail/gcc/tree-vect-slp.cc:12563
0x15145af vect_slp_region
         ../../slp_pred_tail/gcc/tree-vect-slp.cc:10445
0x151640b vect_slp_bbs
         ../../slp_pred_tail/gcc/tree-vect-slp.cc:10557
0x15169b4 vect_slp_function(function*)
         ../../slp_pred_tail/gcc/tree-vect-slp.cc:10679
0x1521ad2 execute
         ../../slp_pred_tail/gcc/tree-vectorizer.cc:1570

It materializes BB-SLP tail masks with WHILE_ULT, which x86 doesn’t support.

True.

I assume that I need to add a guard equivalent to the existing direct_internal_fn_supported_p (IFN_WHILE_ULT, ...) call in can_produce_all_loop_masks_p, which is reached via vect_analyze_loop -> vect_analyze_loop_1 -> vect_analyze_loop_2 -> vect_verify_full_masking.
That function is only used for loop vectorisation, not for basic block vectorisation.

When debugging the slp_pred_1.c test on x86 with your suggested configuration, the vect_get_partial_vector_style helper function consistently returns vect_partial_vectors_while_ult because the condition targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode) && can_vec_mask_load_store_p (vecmode, mask_mode, is_load, NULL, elsvals) is true. This helper was my invention. With hindsight, its return type seems questionable, but the return value of vect_get_partial_vector_style is not what makes vect_slp_get_bb_mask fail; it fails because vect_slp_get_bb_mask does not distinguish between the WHILE_ULT and AVX512 scenarios.

vect_slp_record_bb_mask sets SLP_TREE_PARTIAL_VECTORS_STYLE to vect_partial_vectors_while_ult regardless of whether AVX512-style masks or WHILE_ULT masks are required. The assertion in vect_slp_get_bb_mask that SLP_TREE_PARTIAL_VECTORS_STYLE == vect_partial_vectors_while_ult is probably a contributing factor, because it appears to guarantee that it is safe to use vect_gen_while_ssa_name, which is untrue.
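
As a standalone illustration (not GCC code), the lane semantics of the two kinds of mask can be modelled as below. The function names are hypothetical, and the model assumes that lane i of .WHILE_ULT (start, end) is active when start + i < end. Under that assumption, a constant mask with the same number of leading active lanes selects exactly the same lanes, so the difference between the two scenarios lies in how the mask is materialised rather than in its value.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of .WHILE_ULT (start, end) for a vector of NUNITS
// lanes: lane I is assumed to be active when START + I < END.
std::vector<bool>
while_ult_mask (std::size_t start, std::size_t end, std::size_t nunits)
{
  std::vector<bool> mask (nunits);
  for (std::size_t i = 0; i < nunits; ++i)
    mask[i] = start + i < end;
  return mask;
}

// Hypothetical model of a constant tail mask: the first ACTIVE lanes are
// set and every remaining lane is clear.
std::vector<bool>
constant_tail_mask (std::size_t active, std::size_t nunits)
{
  std::vector<bool> mask (nunits, false);
  for (std::size_t i = 0; i < active && i < nunits; ++i)
    mask[i] = true;
  return mask;
}
```

For example, while_ult_mask (0, 16, 32) and constant_tail_mask (16, 32) both describe 16 active lanes followed by 16 inactive lanes.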

I propose this (modulo other review comments):

1. Rename vect_get_partial_vector_style to vect_get_load_store_partial_vector_style.

2. Create a new three-value enumerated type, e.g. vect_load_store_partial_vectors_style, with enumerators vect_load_store_partial_vectors_(none|len|mask).

3. Use vect_load_store_partial_vectors_style as the return type of vect_get_load_store_partial_vector_style.

4. Add a missing BB-SLP analysis step that sets the value of SLP_TREE_PARTIAL_VECTORS_STYLE based on direct_internal_fn_supported_p (IFN_WHILE_ULT, ...) and members of the SLP node. (I think Richard wants the existing code to be reused here.)

5. In vect_slp_get_bb_mask, generate the kind of mask specified by SLP_TREE_PARTIAL_VECTORS_STYLE instead of assuming WHILE_ULT.
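
The steps above could fit together roughly as in this standalone sketch. It is not GCC code: the enumerator names follow the proposal, but node_style, choose_node_style and the while_ult_supported parameter (standing in for direct_internal_fn_supported_p (IFN_WHILE_ULT, ...)) are hypothetical, and falling back to a constant mask when WHILE_ULT is unavailable is an assumption rather than something settled in this thread.

```cpp
// Hypothetical three-value result of the load/store query (steps 2 and 3).
enum vect_load_store_partial_vectors_style
{
  vect_load_store_partial_vectors_none,
  vect_load_store_partial_vectors_len,
  vect_load_store_partial_vectors_mask
};

// Hypothetical per-node style recorded by the extra analysis step (step 4).
enum class node_style { none, while_ult, constant_mask, len };

// Map the load/store query result to a per-node style.  The assumption here
// is that a mask is generated with WHILE_ULT only when the target supports
// it, and that a constant mask is used otherwise.
node_style
choose_node_style (vect_load_store_partial_vectors_style ls_style,
                   bool while_ult_supported)
{
  switch (ls_style)
    {
    case vect_load_store_partial_vectors_len:
      return node_style::len;
    case vect_load_store_partial_vectors_mask:
      return (while_ult_supported
              ? node_style::while_ult
              : node_style::constant_mask);
    default:
      return node_style::none;
    }
}
```

With a split like this, the transform phase (step 5) would only need to switch on the style recorded for the node, instead of asserting that the style is WHILE_ULT.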

After manually using a constant mask for AVX512, I encountered another
issue, this time affecting performance.
If I change slp_pred_1.c to
void
f (uint8_t *x)
{
   x[0] += 1;
   x[1] += 2;
   x[2] += 1;
   x[3] += 2;
   x[4] += 1;
   x[5] += 2;
   x[6] += 1;
   x[7] += 2;
   x[8] += 1;
   x[9] += 2;
   x[10] += 1;
   x[11] += 2;
   x[12] += 1;
   x[13] += 2;
   x[14] += 1;
   x[15] += 4;
}

(This modified version of slp_pred_1.c looks similar to slp_pred_3.c)

with -march=sapphirerapids -O3, it generates

  <bb 2> [local count: 1073741824]:
  vectp.4_51 = x_34(D);
  vect__1.5_52 = .MASK_LOAD (vectp.4_51, 8B, { -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0 });
  vect__2.6_53 = vect__1.5_52 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2,
1, 2, 1, 4, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1,
4 };
  _1 = *x_34(D);

That looks equivalent to the GIMPLE generated for -march=armv8.2-a+sve:

  <bb 2> [local count: 1073741824]:
  vectp.4_51 = x_34(D);
  slp_mask_52 = .WHILE_ULT (0, 16, { 0, ... });
  vect__1.5_53 = .MASK_LOAD (vectp.4_51, 8B, slp_mask_52, { 0, ... });
  vect__2.6_54 = vect__1.5_53 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... };
  _1 = *x_34(D);

When the target is AArch64, the masked load of 'vector([16,16]) unsigned char' is eventually optimized into an unmasked load of 128 bits, so the generated assembly language has no masks in it:

        adrp    x1, .LANCHOR0
        ldr     q31, [x0]
        ldr     q30, [x1, #:lo12:.LANCHOR0]
        add     z30.b, z30.b, z31.b
        str     q30, [x0]

But a 128-bit vector without a mask should be used here, instead of a
256-bit vector with the upper 128 bits masked off.

  <bb 2> [local count: 1073741824]:
  vectp.4_51 = x_34(D);
  vect__1.5_52 = MEM <vector(16) unsigned char> [(uint8_t *)vectp.4_51];
  vect__2.6_53 = vect__1.5_52 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2,
1, 2, 1, 4 };
  _1 = *x_34(D);
  _2 = _1 + 1;
  _3 = MEM[(uint8_t *)x_34(D) + 1B];

Similarly, for the original slp_pred_1.c, a 128-bit vector should be used
with a mask instead of a 256-bit vector.

I think the best solution would be to first determine the smallest vector type whose number of subparts is at least 1 << ceil_log2 (group_size), and only then check whether a mask or length can be used to fix up usage of that type if it is oversized.
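
A minimal standalone sketch of that selection order follows. It is not GCC code: round_up_pow2, smallest_sufficient_nunits and needs_partial_vector are hypothetical names, vector types are modelled only by their lane counts, and the candidate list is assumed to be supplied by the caller.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Smallest power of two that is >= GROUP_SIZE, i.e. the value of
// 1 << ceil_log2 (group_size).  Assumes GROUP_SIZE <= 2^63 so that the
// shift cannot overflow.
std::uint64_t
round_up_pow2 (std::uint64_t group_size)
{
  std::uint64_t lanes = 1;
  while (lanes < group_size)
    lanes <<= 1;
  return lanes;
}

// Pick the smallest candidate lane count that can hold the whole group,
// or return nothing if no candidate is large enough.
std::optional<std::uint64_t>
smallest_sufficient_nunits (std::uint64_t group_size,
                            const std::vector<std::uint64_t> &candidates)
{
  const std::uint64_t wanted = round_up_pow2 (group_size);
  std::optional<std::uint64_t> best;
  for (std::uint64_t nunits : candidates)
    if (nunits >= wanted && (!best || nunits < *best))
      best = nunits;
  return best;
}

// A mask or length is only needed when the chosen vector is oversized.
bool
needs_partial_vector (std::uint64_t group_size, std::uint64_t nunits)
{
  return group_size < nunits;
}
```

In this model, a group of 16 uint8_t lanes with candidate lane counts of 16, 32 and 64 selects 16 lanes and needs no mask, whereas a group of 12 lanes also selects 16 lanes but does need a mask or length.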

Thanks,
--
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/
