> Am 07.02.2026 um 10:34 schrieb Jakub Jelinek <[email protected]>:
>
> Hi!
>
> Since r15-5563-g1c4d39ada we have an optimization to try to blend 2
> sequences of 2xVEC_PERM_EXPR + 2x binop + 1x VEC_PERM where the first two
> VEC_PERMs are permuting a single input and the last one permutes result from
> those 2 binops into 2 VEC_PERM_EXPRs from 2 inputs, 2 binops and 2 final
> VEC_PERMs.
> On the following testcase, the intended change (i.e. after patch) is
> (including DCE after it which the optimizations relies on):
> a_7 = *x_6(D);
> b_9 = *y_8(D);
> - c_10 = VEC_PERM_EXPR <a_7, a_7, { 0, 2, 0, 2 }>;
> - d_11 = VEC_PERM_EXPR <a_7, a_7, { 1, 3, 1, 3 }>;
> - e_12 = VEC_PERM_EXPR <b_9, b_9, { 0, 2, 0, 2 }>;
> - f_13 = VEC_PERM_EXPR <b_9, b_9, { 1, 3, 1, 3 }>;
> + c_10 = VEC_PERM_EXPR <a_7, b_9, { 0, 2, 4, 6 }>;
> + d_11 = VEC_PERM_EXPR <a_7, b_9, { 1, 3, 5, 7 }>;
> _1 = c_10 + d_11;
> _2 = c_10 - d_11;
> g_14 = VEC_PERM_EXPR <_1, _2, { 0, 4, 1, 5 }>;
> - _3 = e_12 + f_13;
> - _4 = e_12 - f_13;
> - h_15 = VEC_PERM_EXPR <_3, _4, { 0, 4, 1, 5 }>;
> + h_15 = VEC_PERM_EXPR <_1, _2, { 2, 6, 3, 7 }>;
> *x_6(D) = g_14;
> *y_8(D) = h_15;
> This works by first identifying the two sequences, attempting to use vect
> elem redundancies to only use at most half of the vector elements
> (in this testcase a nop because 0, 4, 1, 5 perms already use only half of
> the vector elts), remembering details of such sequences and later comparing
> them if there are at least two (up to 8 I think) and trying to merge them.
> The optimization is meant to improve SPEC x264.
> Anyway, in r15-6387-geee289131 the optimization was changed to fix some
> regressions but regressed this testcase, instead of the desirable
> { 0, 2, 4, 6 } and { 1, 3, 5, 7 } first 2 VEC_PERMs 15 branch and trunk
> uses { 0, 2, 4, 4 } and { 1, 3, 5, 5 } and on this testcase that means
> computing incorrect result.
> On this testcase, it identified the two sequences (one ending with g_14
> and one with h_15 with no changes (see above). The first one (it has
> some code to attempt to swap them if needed, but here the first one remains
> g_14) keeps using the final VEC_PERM_EXPR as is (or with whatever
> simplification recognise_vec_perm_simplify_seq performed on just that to
> reduce to at most half of nelts) and the second one is modified so that
> it uses the other elts of the two vectors.
> So, we have { 0, 4, 1, 5 } (i.e. twice first lanes and twice second lanes)
> from the first sequence and look up unused lanes (third and fourth) to
> transform the other { 0, 4, 1, 5 } to, and find that is { 2, 6, 3, 7 }.
> So far good. But the next operation is to compute the new selectors
> for the first 2 VEC_PERM_EXPRs, which are changed from single input to
> two input ones. For that, the code correctly uses the VECTOR_CST elts
> unmodified for the lanes used by the first sequence (in this
> testcase first/second lanes), so { 0, 2, X, X } and { 1, 3, X, X }
> and then need to find out what to use for the needs of the second sequence.
> Here is what it does currently:
> for (i = 0; i < nelts; i++)
> {
> bool use_seq1 = lane_assignment[i] != 2;
> unsigned int l1, l2;
>
> if (use_seq1)
> {
> /* Just reuse the selector indices. */
> tree s1 = gimple_assign_rhs3 (seq1->v_1_stmt);
> tree s2 = gimple_assign_rhs3 (seq1->v_2_stmt);
> l1 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s1, i));
> l2 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s2, i));
> }
> else
> {
> /* We moved the lanes for seq2, so we need to adjust for that. */
> tree s1 = gimple_assign_rhs3 (seq2->v_1_stmt);
> tree s2 = gimple_assign_rhs3 (seq2->v_2_stmt);
>
> unsigned int j = 0;
> for (; j < i; j++)
> {
> unsigned int sel_new;
> sel_new = seq2_stmt_sel_perm[j].to_constant ();
> sel_new %= nelts;
> if (sel_new == i)
> break;
> }
>
> /* This should not happen. Test anyway to guarantee correctness. */
> if (j == i)
> return false;
>
> l1 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s1, j));
> l2 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s2, j));
> }
>
> seq1_v_1_stmt_sel_perm.quick_push (l1 + (use_seq1 ? 0 : nelts));
> seq1_v_2_stmt_sel_perm.quick_push (l2 + (use_seq1 ? 0 : nelts));
> }
> seq2_stmt_sel_perm is the newly computed { 2, 6, 3, 7 } selector and
> seq1->v_{1,2}_stmt are def stmts of {c_10,d_11} and seq2->v_{1,2}_stmt
> are def stmts of {e_12,f_13}. For i 0 and 1 it is use_seq1 and
> correct, then for i 2 the loop checks first seq2_stmt_sel_perm[0],
> it is 2 % 4, equal to i, so picks up VECTOR_CST_ELTS (s{1,2}, 2),
> which happens to be correct in this case, for i 3 the loop loops until
> seq2_stmt_sel_perm[2] which is 3 % 4, stops and picks the wrong
> VECTOR_CST_ELTS (s{1,2}, 2) which has the same value as
> VECTOR_CST_ELTS (s{1,2}, 0), when the correct value would be in this
> case either 1 or 3 (due to the duplication).
> What the loop should do for !use_seq1 is to take the lane transformations
> into account, we've changed { 0, 4, 1, 5 } to { 2, 6, 3, 7 }, so instead
> of using lanes 0, 0, 1, 1 we now use lanes 2, 2, 3, 3 (x / 4 is about
> which input it is picked from, here + or -). So, for 2 which got remapped
> from 0 we want to use 0 and for 3 which got remapped from 1 we want to use
> 1.
> The function uses an auto_vec lane_assignment with values 0 (unused lane,
> so far or altogether), 1 (used by first sequence) and 2 (used by second
> sequence). When we store in there 2, we know exactly which lane we are
> remapping to which lane, so instead of computing it again the following
> patch stores there 2 + l_orig, such that value >= 2 means second lane
> and lane_assignment[i] - 2 in that case is the lane that got remapped to i.
> And then the last loop doesn't need to recompute anything and can just use
> the remembered transformation.
> The rest of the changes (hunks 1-5 and 7) are just random small fixes I've
> noticed while trying to understand the code. The real fix is
> - lane_assignment[lane] = 2;
> + lane_assignment[lane] = 2 + l_orig;
> and
> - bool use_seq1 = lane_assignment[i] != 2;
> + bool use_seq1 = lane_assignment[i] < 2;
> and the rest of the last hunk. Also, the last loop was kind of assuming
> VEC_PERM_EXPR canonicalization happened and for single input perm the
> selector elts are never >= nelts, I've added %= nelts just to be sure.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
Ok
Thanks,
Richard
> 2026-02-07 Jakub Jelinek <[email protected]>
>
> PR tree-optimization/123672
> * tree-ssa-forwprop.cc (recognise_vec_perm_simplify_seq): Use std::swap
> instead of fetching gimple_assign_rhs{1,2} again. Change type of lanes
> vector from auto_vec<unsigned int> to auto_vec<bool> and store true
> instead of 1 into it. Fix comment typo and formatting fix.
> (can_blend_vec_perm_simplify_seqs_p): Put end of comment on the same
> line as the last sentence in it.
> (calc_perm_vec_perm_simplify_seqs): Change lane_assignment type from
> auto_vec<int> to auto_vec<unsigned> and store 2 + l_orig into it
> instead of true. Fix comment typo and formatting fix. Set use_seq1
> to line_assignment[i] < 2 instead of line_assignment[i] != 2. Replace
> bogus computation of index for !use_seq with using
> line_assignment[i] - 2. Set l1 to l1 % nelts and similarly for l2.
>
> * gcc.dg/pr123672.c: New test.
>
> --- gcc/tree-ssa-forwprop.cc.jj 2026-02-05 11:14:57.189729922 +0100
> +++ gcc/tree-ssa-forwprop.cc 2026-02-06 19:27:25.408278359 +0100
> @@ -4617,8 +4617,7 @@ recognise_vec_perm_simplify_seq (gassign
> if (commutative_tree_code (gimple_assign_rhs_code (v_x_stmt)))
> {
> /* Keep v_x_1 the first operand for non-commutative operators. */
> - v_x_1 = gimple_assign_rhs2 (v_x_stmt);
> - v_x_2 = gimple_assign_rhs1 (v_x_stmt);
> + std::swap (v_x_1, v_x_2);
> if (v_x_1 != v_y_1 || v_x_2 != v_y_2)
> return false;
> }
> @@ -4661,7 +4660,7 @@ recognise_vec_perm_simplify_seq (gassign
>
> /* Create the new selector. */
> vec_perm_builder new_sel_perm (nelts, nelts, 1);
> - auto_vec<unsigned int> lanes (nelts);
> + auto_vec<bool> lanes (nelts);
> lanes.quick_grow_cleared (nelts);
> for (unsigned int i = 0; i < nelts; i++)
> {
> @@ -4687,7 +4686,7 @@ recognise_vec_perm_simplify_seq (gassign
> new_sel_perm.quick_push (l + offs * nelts);
>
> /* Mark lane as used. */
> - lanes[l] = 1;
> + lanes[l] = true;
> }
>
> /* Count how many lanes are need. */
> @@ -4699,12 +4698,12 @@ recognise_vec_perm_simplify_seq (gassign
> if (cnt > nelts / 2)
> return false;
>
> - /* Check if the resulting permuation is cheap. */
> + /* Check if the resulting permutation is cheap. */
> vec_perm_indices new_indices (new_sel_perm, 2, nelts);
> tree vectype = TREE_TYPE (gimple_assign_lhs (stmt));
> machine_mode vmode = TYPE_MODE (vectype);
> if (!can_vec_perm_const_p (vmode, vmode, new_indices, false))
> - return false;
> + return false;
>
> *seq = XNEW (struct _vec_perm_simplify_seq);
> (*seq)->stmt = stmt;
> @@ -4794,8 +4793,7 @@ can_blend_vec_perm_simplify_seqs_p (vec_
> seq1->v_x_stmt and seq1->v_y_stmt are before it.
>
> Note, that we don't need to check the BBs here, because all
> - statements of both sequences have to be in the same BB.
> - */
> + statements of both sequences have to be in the same BB. */
>
> tree seq2_v_in = gimple_assign_rhs1 (seq2->v_1_stmt);
> if (TREE_CODE (seq2_v_in) != SSA_NAME)
> @@ -4843,7 +4841,7 @@ calc_perm_vec_perm_simplify_seqs (vec_pe
> {
> unsigned int i;
> unsigned int nelts = seq1->nelts;
> - auto_vec<int> lane_assignment;
> + auto_vec<unsigned int> lane_assignment;
> lane_assignment.create (nelts);
>
> /* Mark all lanes as free. */
> @@ -4855,7 +4853,7 @@ calc_perm_vec_perm_simplify_seqs (vec_pe
> unsigned int l = TREE_INT_CST_LOW (VECTOR_CST_ELT (seq1->new_sel, i));
> l %= nelts;
> lane_assignment[l] = 1;
> -}
> + }
>
> /* Allocate lanes for seq2 and calculate selector for seq2->stmt. */
> vec_perm_builder seq2_stmt_sel_perm (nelts, nelts, 1);
> @@ -4896,14 +4894,14 @@ calc_perm_vec_perm_simplify_seqs (vec_pe
> }
>
> /* Allocate lane. */
> - lane_assignment[lane] = 2;
> + lane_assignment[lane] = 2 + l_orig;
> new_sel = lane + offs * nelts;
> }
>
> seq2_stmt_sel_perm.quick_push (new_sel);
> }
>
> - /* Check if the resulting permuation is cheap. */
> + /* Check if the resulting permutation is cheap. */
> seq2_stmt_indices->new_vector (seq2_stmt_sel_perm, 2, nelts);
> tree vectype = TREE_TYPE (gimple_assign_lhs (seq2->stmt));
> machine_mode vmode = TYPE_MODE (vectype);
> @@ -4915,7 +4913,7 @@ calc_perm_vec_perm_simplify_seqs (vec_pe
> vec_perm_builder seq1_v_2_stmt_sel_perm (nelts, nelts, 1);
> for (i = 0; i < nelts; i++)
> {
> - bool use_seq1 = lane_assignment[i] != 2;
> + bool use_seq1 = lane_assignment[i] < 2;
> unsigned int l1, l2;
>
> if (use_seq1)
> @@ -4931,25 +4929,12 @@ calc_perm_vec_perm_simplify_seqs (vec_pe
> /* We moved the lanes for seq2, so we need to adjust for that. */
> tree s1 = gimple_assign_rhs3 (seq2->v_1_stmt);
> tree s2 = gimple_assign_rhs3 (seq2->v_2_stmt);
> -
> - unsigned int j = 0;
> - for (; j < i; j++)
> - {
> - unsigned int sel_new;
> - sel_new = seq2_stmt_sel_perm[j].to_constant ();
> - sel_new %= nelts;
> - if (sel_new == i)
> - break;
> - }
> -
> - /* This should not happen. Test anyway to guarantee correctness. */
> - if (j == i)
> - return false;
> -
> - l1 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s1, j));
> - l2 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s2, j));
> + l1 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s1, lane_assignment[i] - 2));
> + l2 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s2, lane_assignment[i] - 2));
> }
>
> + l1 %= nelts;
> + l2 %= nelts;
> seq1_v_1_stmt_sel_perm.quick_push (l1 + (use_seq1 ? 0 : nelts));
> seq1_v_2_stmt_sel_perm.quick_push (l2 + (use_seq1 ? 0 : nelts));
> }
> --- gcc/testsuite/gcc.dg/pr123672.c.jj 2026-02-06 19:51:07.648379517 +0100
> +++ gcc/testsuite/gcc.dg/pr123672.c 2026-02-06 19:50:37.082892783 +0100
> @@ -0,0 +1,33 @@
> +/* PR tree-optimization/123672 */
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fdump-tree-forwprop1-details" } */
> +/* { dg-additional-options "-msse2" { target i?86-*-* x86_64-*-* } } */
> +/* { dg-final { scan-tree-dump "Vec perm simplify sequences have been
> blended" "forwprop1" { target { aarch64*-*-* i?86-*-* x86_64-*-* } } } } */
> +
> +typedef int V __attribute__((vector_size (4 * sizeof (int))));
> +
> +[[gnu::noipa]] void
> +foo (V *x, V *y)
> +{
> + V a = *x;
> + V b = *y;
> + V c = __builtin_shufflevector (a, a, 0, 2, 0, 2);
> + V d = __builtin_shufflevector (a, a, 1, 3, 1, 3);
> + V e = __builtin_shufflevector (b, b, 0, 2, 0, 2);
> + V f = __builtin_shufflevector (b, b, 1, 3, 1, 3);
> + V g = __builtin_shufflevector (c + d, c - d, 0, 4, 1, 5);
> + V h = __builtin_shufflevector (e + f, e - f, 0, 4, 1, 5);
> + *x = g;
> + *y = h;
> +}
> +
> +int
> +main ()
> +{
> + V a = { 1, 21, 2, 32 };
> + V b = { 3, 43, 4, 54 };
> + foo (&a, &b);
> + if (a[0] != 22 || a[1] != -20 || a[2] != 34 || a[3] != -30
> + || b[0] != 46 || b[1] != -40 || b[2] != 58 || b[3] != -50)
> + __builtin_abort ();
> +}
>
> Jakub
>