On Wed, 15 Oct 2025, Tamar Christina wrote:
> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: 14 October 2025 13:26
> > To: [email protected]
> > Cc: Tamar Christina <[email protected]>; [email protected]
> > Subject: [PATCH 1/3][RFC] Implement bool reduction vectorization
> >
> > Currently we mess up here in two places. One is pattern recognition
> > which computes a mask-precision for a bool reduction PHI that's
> > inconsistent with that of the latch definition. This is solved by
> > iterating the mask-precision computation. The second is that the
> > reduction epilogue generation and the code querying support for it
> > isn't ready for mask inputs. This should be fixed as well, mainly
> > by doing all the epilogue processing on a data type again, for now
> > at least. We probably want reduc_mask_{and,ior,xor}_scal optabs
> > so we can go the direct IFN path on masks if the target supports
> > that.
> >
> > [2/3] adds these optabs and [3/3] is a way to use them but with no
> > actual target implementation (I stubbed a x86 one for one case to
> > get the code miscompiled^Wexercised).
> >
> > I wonder if there's any feedback on the epilogue handling, esp. how
> > SVE or RVV would be able to handle vector mask reduction to a
> > scalar bool in the epilogue. Esp. whether you think there is already
> > sufficient functionality that I just didn't see.
>
> For SVE the & and | cases would use the CC registers.
>
> or_reduc:
> ptrue p3.b, all
> ptest p3, p0.b
> cset w0, any
> ret
>
> and_reduc:
> ptrue p3.b, all
> nots p3.b, p3/z, p0.b
> cset w0, none
> ret
>
> and for the ^ case we'd check the parity of the number of active
> predicate lanes.
>
> xor_reduc:
> ptrue p3.b, all
> cntp x0, p3, p0.b
> and w0, w0, 1
> ret
>
> At least for SVE the open-coded versions of this would be
>
> Or: pred != {0,..} ? 1 : 0
> And: ~pred != {0,..} ? 0 : 1
> Xor: .IFN_CNT (pred) & 1 ? 1 : 0
>
> So using a scalar COND_EXPR instead of a vector one.
Hmm, indeed. I did consider a COND_EXPR but wasn't sure about
target support for that. Looks like only SSE4 has it, and the
knowledge we're dealing with an actual bool (aka -1/0) vector
helps. For the 'and' case ~x != 0 gets simplified to x == { -1, ... }
and then we get
and:
.LFB1:
.cfi_startproc
pcmpeqd %xmm1, %xmm1
pxor %xmm1, %xmm0
ptest %xmm0, %xmm0
setne %al
ret
typedef int v4si __attribute__((vector_size(16)));
_Bool __GIMPLE ior (v4si x)
{
_Bool r;
_Bool _1;
_1 = x != _Literal (v4si) { _Literal (int) 0, _Literal (int) 0, _Literal
(int) 0, _Literal (int) 0 };
r = _1 ? _Literal (bool) 1 : _Literal (bool) 0;
return r;
}
_Bool __GIMPLE and (v4si x)
{
_Bool r;
_Bool _1;
v4si _2;
_2 = ~x;
_1 = _2 != _Literal (v4si) { _Literal (int) 0, _Literal (int) 0,
_Literal (int) 0, _Literal (int) 0 };
r = _1 ? _Literal (bool) 1 : _Literal (bool) 0;
return r;
}
> If we define popcnt to be the number of active lanes. This would
> of course only work for TYPE_PRECISION (..) == 1 booleans, but for
> others I think we can just divide by precision? Or use the new
> .IFN_COUNT_ACTIVE
> that I proposed before for early break use.
That would work as well.  So on x86 with SSE/AVX we'd like to use
movmsk, which turns a vector int mask into a gpr bitmask, where we
can then use gpr instructions for the rest.
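As a rough scalar sketch (not what the patch generates), the movmsk
idea can be modeled in plain C with GCC vector extensions; mask_bits
is a hypothetical stand-in for the movmsk instruction:

```c
#include <stdbool.h>

typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical stand-in for movmsk: collapse a -1/0 mask vector
   into a 4-bit gpr bitmask, one bit per lane.  */
static unsigned
mask_bits (v4si m)
{
  unsigned bits = 0;
  for (int i = 0; i < 4; i++)
    bits |= (unsigned) (m[i] != 0) << i;
  return bits;
}

/* The three bool reductions, done with gpr instructions on the
   bitmask: any bit set, all bits set, odd number of bits set.  */
static bool reduc_ior (v4si m) { return mask_bits (m) != 0; }
static bool reduc_and (v4si m) { return mask_bits (m) == 0xf; }
static bool reduc_xor (v4si m) { return (__builtin_popcount (mask_bits (m)) & 1) != 0; }
```

The xor case matches the SVE cntp sequence above: the reduction result
is just the parity of the number of active lanes.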
So I'm leaning towards additional optabs to help targets.  For the
low-precision case I was reminded we already have this scheme for
vec_pack_sbool_trunc (also 'sbool' is a better name than 'mask'),
which passes the output number of lanes in an extra CONST_INT argument.
So, reduc_sbool_{and,ior,xor}_scal it would be then.  The COND_EXPR
path could be used as another fallback; I guess it's cheaper than
punning to an integer type and reducing that.
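For comparison, that punning fallback could be sketched like this
(hypothetical helpers, assuming a 128-bit -1/0 mask; xor would still
need a population count, so only and/ior are shown):

```c
#include <stdbool.h>
#include <string.h>

typedef int v4si __attribute__((vector_size(16)));
typedef unsigned __int128 u128;

/* Reinterpret the 128-bit -1/0 mask vector as a single scalar
   integer and reduce with whole-register compares.  This relies on
   every lane being all-ones or all-zeros.  */
static bool
pun_ior (v4si m)
{
  u128 x;
  memcpy (&x, &m, sizeof x);
  return x != 0;
}

static bool
pun_and (v4si m)
{
  u128 x;
  memcpy (&x, &m, sizeof x);
  return x == ~(u128) 0;
}
```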
Thanks,
Richard.
> Adv. SIMD would work similarly. E.g. OR becomes
>
> or_reduc:
> umaxp v0.4s, v0.4s, v0.4s
> fmov x0, d0
> cmp x0, 0
> cset w0, ne
> ret
>
> >
> > The first patch passed bootstrap & regtest on x86_64-unknown-linux-gnu.
> > It plugs quite some holes where we didn't think of masks in reduction
> > handling and at least on x86 it works with epilogue vectorization as
> > well.
> >
> > Thanks,
> > Richard.
> >
> > PR tree-optimization/101639
> > PR tree-optimization/103495
> > * tree-vect-patterns.cc (vect_determine_mask_precision):
> > Return whether the mask precision changed.
> > (vect_determine_precisions): Iterate mask precision computation.
> > * tree-vect-loop.cc (get_initial_defs_for_reduction): Properly
> > convert non-mask initial values to a mask initial def for
> > the reduction.
> > (vect_create_epilog_for_reduction): For a mask input convert
> > it to a corresponding data type. Use a regular conversion
> > for the final convert to the scalar code type.
> > (vectorizable_reduction): Support mask reductions. Verify
> > we can compute a data vector from the mask result.
> >
> > * gcc.dg/vect/vect-reduc-bool-1.c: New testcase.
> > * gcc.dg/vect/vect-reduc-bool-2.c: New testcase.
> > * gcc.dg/vect/vect-reduc-bool-3.c: New testcase.
> > ---
> > gcc/testsuite/gcc.dg/vect/vect-reduc-bool-1.c | 32 ++++++++
> > gcc/testsuite/gcc.dg/vect/vect-reduc-bool-2.c | 32 ++++++++
> > gcc/testsuite/gcc.dg/vect/vect-reduc-bool-3.c | 32 ++++++++
> > gcc/tree-vect-loop.cc | 80 +++++++++++++-----
> > gcc/tree-vect-patterns.cc | 81 ++++++++++++-------
> > 5 files changed, 210 insertions(+), 47 deletions(-)
> > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-bool-1.c
> > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-bool-2.c
> > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-bool-3.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-bool-1.c
> > b/gcc/testsuite/gcc.dg/vect/vect-reduc-bool-1.c
> > new file mode 100644
> > index 00000000000..2460a14b664
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-bool-1.c
> > @@ -0,0 +1,32 @@
> > +#include "tree-vect.h"
> > +
> > +char p[128];
> > +
> > +bool __attribute__((noipa))
> > +f (int n)
> > +{
> > + bool r = true;
> > + for (int i = 0; i < n; ++i)
> > + r &= (p[i] != 0);
>
> All three of your testcases seem to be the same. Were they supposed to be &,
> | and ^?
>
> Thanks,
> Tamar
>
> > + return r;
> > +}
> > +
> > +int main()
> > +{
> > + check_vect ();
> > +
> > + __builtin_memset (p, 1, sizeof(p));
> > +
> > + for (int n = 0; n < 77; ++n)
> > + if (!f (n))
> > + abort ();
> > +
> > + p[0] = 0;
> > + for (int n = 1; n < 77; ++n)
> > + if (f (n))
> > + abort ();
> > +
> > + return 0;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "optimized: loop vectorized" "vect" {
> > target {
> > vect_int && vect_condition } } } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-bool-2.c
> > b/gcc/testsuite/gcc.dg/vect/vect-reduc-bool-2.c
> > new file mode 100644
> > index 00000000000..d383d3bd2f0
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-bool-2.c
> > @@ -0,0 +1,32 @@
> > +#include "tree-vect.h"
> > +
> > +short p[128];
> > +
> > +bool __attribute__((noipa))
> > +f (int n)
> > +{
> > + bool r = true;
> > + for (int i = 0; i < n; ++i)
> > + r &= (p[i] != 0);
> > + return r;
> > +}
> > +
> > +int main()
> > +{
> > + check_vect ();
> > +
> > + __builtin_memset (p, 1, sizeof(p));
> > +
> > + for (int n = 0; n < 77; ++n)
> > + if (!f (n))
> > + abort ();
> > +
> > + p[0] = 0;
> > + for (int n = 1; n < 77; ++n)
> > + if (f (n))
> > + abort ();
> > +
> > + return 0;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "optimized: loop vectorized" "vect" {
> > target {
> > vect_int && vect_condition } } } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-bool-3.c
> > b/gcc/testsuite/gcc.dg/vect/vect-reduc-bool-3.c
> > new file mode 100644
> > index 00000000000..f59ffd01a87
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-bool-3.c
> > @@ -0,0 +1,32 @@
> > +#include "tree-vect.h"
> > +
> > +int p[128];
> > +
> > +bool __attribute__((noipa))
> > +f (int n)
> > +{
> > + bool r = true;
> > + for (int i = 0; i < n; ++i)
> > + r &= (p[i] != 0);
> > + return r;
> > +}
> > +
> > +int main()
> > +{
> > + check_vect ();
> > +
> > + __builtin_memset (p, 1, sizeof(p));
> > +
> > + for (int n = 0; n < 77; ++n)
> > + if (!f (n))
> > + abort ();
> > +
> > + p[0] = 0;
> > + for (int n = 1; n < 77; ++n)
> > + if (f (n))
> > + abort ();
> > +
> > + return 0;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "optimized: loop vectorized" "vect" {
> > target {
> > vect_int && vect_condition } } } } */
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 8ff8dc2ddda..0f890fae22c 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -4908,17 +4908,16 @@ get_initial_defs_for_reduction (loop_vec_info
> > loop_vinfo,
> > if (!TYPE_VECTOR_SUBPARTS (vector_type).is_constant (&nunits))
> > nunits = group_size;
> >
> > + tree vector_elt_type = TREE_TYPE (vector_type);
> > number_of_places_left_in_vector = nunits;
> > bool constant_p = true;
> > tree_vector_builder elts (vector_type, nunits, 1);
> > elts.quick_grow (nunits);
> > gimple_seq ctor_seq = NULL;
> > if (neutral_op
> > - && !useless_type_conversion_p (TREE_TYPE (vector_type),
> > + && !useless_type_conversion_p (vector_elt_type,
> > TREE_TYPE (neutral_op)))
> > - neutral_op = gimple_convert (&ctor_seq,
> > - TREE_TYPE (vector_type),
> > - neutral_op);
> > + neutral_op = gimple_convert (&ctor_seq, vector_elt_type, neutral_op);
> > for (j = 0; j < nunits * number_of_vectors; ++j)
> > {
> > tree op;
> > @@ -4930,11 +4929,22 @@ get_initial_defs_for_reduction (loop_vec_info
> > loop_vinfo,
> > op = neutral_op;
> > else
> > {
> > - if (!useless_type_conversion_p (TREE_TYPE (vector_type),
> > + if (!useless_type_conversion_p (vector_elt_type,
> > TREE_TYPE (initial_values[i])))
> > - initial_values[i] = gimple_convert (&ctor_seq,
> > - TREE_TYPE (vector_type),
> > - initial_values[i]);
> > + {
> > + if (VECTOR_BOOLEAN_TYPE_P (vector_type))
> > + initial_values[i] = gimple_build (&ctor_seq, COND_EXPR,
> > + vector_elt_type,
> > + initial_values[i],
> > + build_all_ones_cst
> > + (vector_elt_type),
> > + build_zero_cst
> > + (vector_elt_type));
> > + else
> > + initial_values[i] = gimple_convert (&ctor_seq,
> > + vector_elt_type,
> > + initial_values[i]);
> > + }
> > op = initial_values[i];
> > }
> >
> > @@ -5555,6 +5565,23 @@ vect_create_epilog_for_reduction (loop_vec_info
> > loop_vinfo,
> > /* Shouldn't be used beyond this point. */
> > exit_bb = nullptr;
> >
> > + /* For the actual reduction work on a bool data vector instead of a
> > + mask vector. */
> > + if (VECTOR_BOOLEAN_TYPE_P (vectype))
> > + {
> > + gcc_assert (reduc_inputs.length () == 1);
> > + vectype = get_related_vectype_for_scalar_type (loop_vinfo-
> > >vector_mode,
> > + TREE_TYPE (vectype),
> > + TYPE_VECTOR_SUBPARTS
> > + (vectype));
> > + gimple_seq stmts = NULL;
> > + reduc_inputs[0] = gimple_build (&stmts, VEC_COND_EXPR, vectype,
> > + reduc_inputs[0],
> > + build_one_cst (vectype),
> > + build_zero_cst (vectype));
> > + gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> > + }
> > +
> > if (VECT_REDUC_INFO_TYPE (reduc_info) == COND_REDUCTION
> > && reduc_fn != IFN_LAST)
> > {
> > @@ -5949,8 +5976,7 @@ vect_create_epilog_for_reduction (loop_vec_info
> > loop_vinfo,
> >
> > new_temp = gimple_build (&stmts, BIT_FIELD_REF, TREE_TYPE
> > (vectype1),
> > new_temp, bitsize, bitsize_zero_node);
> > - new_temp = gimple_build (&stmts, VIEW_CONVERT_EXPR,
> > - scalar_type, new_temp);
> > + new_temp = gimple_convert (&stmts, scalar_type, new_temp);
> > gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> > scalar_results.safe_push (new_temp);
> > }
> > @@ -7023,15 +7049,6 @@ vectorizable_reduction (loop_vec_info
> > loop_vinfo,
> > tree vectype_out = SLP_TREE_VECTYPE (slp_for_stmt_info);
> > VECT_REDUC_INFO_VECTYPE (reduc_info) = vectype_out;
> >
> > - /* We do not handle mask reductions correctly in the epilogue. */
> > - if (VECTOR_BOOLEAN_TYPE_P (vectype_out))
> > - {
> > - if (dump_enabled_p ())
> > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > - "mask reduction not supported.\n");
> > - return false;
> > - }
> > -
> > gimple_match_op op;
> > if (!gimple_extract_op (stmt_info->stmt, &op))
> > gcc_unreachable ();
> > @@ -7284,6 +7301,29 @@ vectorizable_reduction (loop_vec_info
> > loop_vinfo,
> >
> > VECT_REDUC_INFO_CODE (reduc_info) = orig_code;
> >
> > + /* For now make sure to implement the epilogue reduction on bool data,
> > + not the mask type. */
> > + tree orig_vectype_out = vectype_out;
> > + if (VECTOR_BOOLEAN_TYPE_P (vectype_out))
> > + {
> > + vectype_out
> > + = get_related_vectype_for_scalar_type (loop_vinfo->vector_mode,
> > + TREE_TYPE (vectype_out),
> > + TYPE_VECTOR_SUBPARTS
> > + (orig_vectype_out));
> > + if (!vectype_out
> > + || maybe_ne (TYPE_VECTOR_SUBPARTS (vectype_out),
> > + TYPE_VECTOR_SUBPARTS (orig_vectype_out))
> > + || !expand_vec_cond_expr_p (vectype_out, orig_vectype_out))
> > + {
> > + if (dump_enabled_p ())
> > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > + "cannot turn mask into bool data vector for "
> > + "reduction epilogue.\n");
> > + return false;
> > + }
> > + }
> > +
> > reduction_type = VECT_REDUC_INFO_TYPE (reduc_info);
> > if (reduction_type == TREE_CODE_REDUCTION)
> > {
> > @@ -7404,6 +7444,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> > return false;
> > }
> >
> > + vectype_out = orig_vectype_out;
> > +
> > /* For SLP reductions, see if there is a neutral value we can use. */
> > tree neutral_op = NULL_TREE;
> > tree initial_value = NULL_TREE;
> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > index becee62a9f5..5b536bb13b3 100644
> > --- a/gcc/tree-vect-patterns.cc
> > +++ b/gcc/tree-vect-patterns.cc
> > @@ -6828,13 +6828,14 @@ possible_vector_mask_operation_p
> > (stmt_vec_info stmt_info)
> >
> > /* If STMT_INFO sets a boolean SSA_NAME, see whether we should use
> > a vector mask type instead of a normal vector type. Record the
> > - result in STMT_INFO->mask_precision. */
> > + result in STMT_INFO->mask_precision. Returns true when the
> > + precision changed. */
> >
> > -static void
> > +static bool
> > vect_determine_mask_precision (vec_info *vinfo, stmt_vec_info stmt_info)
> > {
> > if (!possible_vector_mask_operation_p (stmt_info))
> > - return;
> > + return false;
> >
> > /* If at least one boolean input uses a vector mask type,
> > pick the mask type with the narrowest elements.
> > @@ -6916,8 +6917,11 @@ vect_determine_mask_precision (vec_info *vinfo,
> > stmt_vec_info stmt_info)
> > scalar_mode mode;
> > tree vectype, mask_type;
> > if (is_a <scalar_mode> (TYPE_MODE (op0_type), &mode)
> > - && (vectype = get_vectype_for_scalar_type (vinfo, op0_type))
> > - && (mask_type = get_mask_type_for_scalar_type (vinfo,
> > op0_type))
> > + /* Do not allow this to set vinfo->vector_mode, this might
> > + disrupt the result for the next iteration. */
> > + && (vectype = get_related_vectype_for_scalar_type
> > + (vinfo->vector_mode,
> > op0_type))
> > + && (mask_type = truth_type_for (vectype))
> > && expand_vec_cmp_expr_p (vectype, mask_type, code))
> > precision = GET_MODE_BITSIZE (mode);
> > }
> > @@ -6943,19 +6947,30 @@ vect_determine_mask_precision (vec_info
> > *vinfo, stmt_vec_info stmt_info)
> > }
> > }
> >
> > - if (dump_enabled_p ())
> > + if (stmt_info->mask_precision != precision)
> > {
> > - if (precision == ~0U)
> > - dump_printf_loc (MSG_NOTE, vect_location,
> > - "using normal nonmask vectors for %G",
> > - stmt_info->stmt);
> > - else
> > - dump_printf_loc (MSG_NOTE, vect_location,
> > - "using boolean precision %d for %G",
> > - precision, stmt_info->stmt);
> > - }
> > + if (dump_enabled_p ())
> > + {
> > + if (precision == ~0U)
> > + dump_printf_loc (MSG_NOTE, vect_location,
> > + "using normal nonmask vectors for %G",
> > + stmt_info->stmt);
> > + else
> > + dump_printf_loc (MSG_NOTE, vect_location,
> > + "using boolean precision %d for %G",
> > + precision, stmt_info->stmt);
> > + }
> >
> > - stmt_info->mask_precision = precision;
> > + /* ??? We'd like to assert stmt_info->mask_precision == 0
> > + || stmt_info->mask_precision > precision, thus that we only
> > + decrease mask precisions throughout iteration, but the
> > + tcc_comparison handling above means for comparisons of bools
> > + we start with 8 but might increase in case the bools get mask
> > + precision on their own. */
> > + stmt_info->mask_precision = precision;
> > + return true;
> > + }
> > + return false;
> > }
> >
> > /* Handle vect_determine_precisions for STMT_INFO, given that we
> > @@ -6988,22 +7003,32 @@ vect_determine_precisions (vec_info *vinfo)
> >
> > DUMP_VECT_SCOPE ("vect_determine_precisions");
> >
> > - for (unsigned int i = 0; i < nbbs; i++)
> > + /* For mask precisions we have to iterate since otherwise we do not
> > + get reduction PHI precision correct. */
> > + bool changed;
> > + do
> > {
> > - basic_block bb = bbs[i];
> > - for (auto gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next
> > (&gsi))
> > - {
> > - stmt_vec_info stmt_info = vinfo->lookup_stmt (gsi.phi ());
> > - if (stmt_info && STMT_VINFO_VECTORIZABLE (stmt_info))
> > - vect_determine_mask_precision (vinfo, stmt_info);
> > - }
> > - for (auto gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > + changed = false;
> > + for (unsigned int i = 0; i < nbbs; i++)
> > {
> > - stmt_vec_info stmt_info = vinfo->lookup_stmt (gsi_stmt (gsi));
> > - if (stmt_info && STMT_VINFO_VECTORIZABLE (stmt_info))
> > - vect_determine_mask_precision (vinfo, stmt_info);
> > + basic_block bb = bbs[i];
> > + for (auto gsi = gsi_start_phis (bb);
> > + !gsi_end_p (gsi); gsi_next (&gsi))
> > + {
> > + stmt_vec_info stmt_info = vinfo->lookup_stmt (gsi.phi ());
> > + if (stmt_info && STMT_VINFO_VECTORIZABLE (stmt_info))
> > + changed |= vect_determine_mask_precision (vinfo,
> > stmt_info);
> > + }
> > + for (auto gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > + {
> > + stmt_vec_info stmt_info = vinfo->lookup_stmt (gsi_stmt (gsi));
> > + if (stmt_info && STMT_VINFO_VECTORIZABLE (stmt_info))
> > + changed |= vect_determine_mask_precision (vinfo,
> > stmt_info);
> > + }
> > }
> > }
> > + while (changed);
> > +
> > for (unsigned int i = 0; i < nbbs; i++)
> > {
> > basic_block bb = bbs[nbbs - i - 1];
> > --
> > 2.51.0
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)