https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123672

--- Comment #11 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <[email protected]>:

https://gcc.gnu.org/g:7f4476239b1f8337a88844fb6dd98a9b1906c1d7

commit r16-7384-g7f4476239b1f8337a88844fb6dd98a9b1906c1d7
Author: Jakub Jelinek <[email protected]>
Date:   Sat Feb 7 18:45:13 2026 +0100

    forwprop: Fix up calc_perm_vec_perm_simplify_seqs [PR123672]

    Since r15-5563-g1c4d39ada we have an optimization to try to blend 2
    sequences of 2xVEC_PERM_EXPR + 2x binop + 1x VEC_PERM where the first two
    VEC_PERMs are permuting a single input and the last one permutes result
from
    those 2 binops into 2 VEC_PERM_EXPRs from 2 inputs, 2 binops and 2 final
    VEC_PERMs.
    On the following testcase, the intended change (i.e. after patch) is
    (including DCE after it which the optimizations relies on):
       a_7 = *x_6(D);
       b_9 = *y_8(D);
    -  c_10 = VEC_PERM_EXPR <a_7, a_7, { 0, 2, 0, 2 }>;
    -  d_11 = VEC_PERM_EXPR <a_7, a_7, { 1, 3, 1, 3 }>;
    -  e_12 = VEC_PERM_EXPR <b_9, b_9, { 0, 2, 0, 2 }>;
    -  f_13 = VEC_PERM_EXPR <b_9, b_9, { 1, 3, 1, 3 }>;
    +  c_10 = VEC_PERM_EXPR <a_7, b_9, { 0, 2, 4, 6 }>;
    +  d_11 = VEC_PERM_EXPR <a_7, b_9, { 1, 3, 5, 7 }>;
       _1 = c_10 + d_11;
       _2 = c_10 - d_11;
       g_14 = VEC_PERM_EXPR <_1, _2, { 0, 4, 1, 5 }>;
    -  _3 = e_12 + f_13;
    -  _4 = e_12 - f_13;
    -  h_15 = VEC_PERM_EXPR <_3, _4, { 0, 4, 1, 5 }>;
    +  h_15 = VEC_PERM_EXPR <_1, _2, { 2, 6, 3, 7 }>;
       *x_6(D) = g_14;
       *y_8(D) = h_15;
    This works by first identifying the two sequences, attempting to use vect
    elem redundancies to only use at most half of the vector elements
    (in this testcase a nop because 0, 4, 1, 5 perms already use only half of
    the vector elts), remembering details of such sequences and later comparing
    them if there are at least two (up to 8 I think) and trying to merge them.
    The optimization is meant to improve SPEC x264.
    Anyway, in r15-6387-geee289131 the optimization was changed to fix some
    regressions but regressed this testcase, instead of the desirable
    { 0, 2, 4, 6 } and { 1, 3, 5, 7 } first 2 VEC_PERMs 15 branch and trunk
    uses { 0, 2, 4, 4 } and { 1, 3, 5, 5 } and on this testcase that means
    computing incorrect result.
    On this testcase, it identified the two sequences (one ending with g_14
    and one with h_15 with no changes (see above).  The first one (it has
    some code to attempt to swap them if needed, but here the first one remains
    g_14) keeps using the final VEC_PERM_EXPR as is (or with whatever
    simplification recognise_vec_perm_simplify_seq performed on just that to
    reduce to at most half of nelts) and the second one is modified so that
    it uses the other elts of the two vectors.
    So, we have { 0, 4, 1, 5 } (i.e. twice first lanes and twice second lanes)
    from the first sequence and look up unused lanes (third and fourth) to
    transform the other { 0, 4, 1, 5 } to, and find that is { 2, 6, 3, 7 }.
    So far good.  But the next operation is to compute the new selectors
    for the first 2 VEC_PERM_EXPRs, which are changed from single input to
    two input ones.  For that, the code correctly uses the VECTOR_CST elts
    unmodified for the lanes used by the first sequence (in this
    testcase first/second lanes), so { 0, 2, X, X } and { 1, 3, X, X }
    and then need to find out what to use for the needs of the second sequence.
    Here is what it does currently:
      for (i = 0; i < nelts; i++)
        {
          bool use_seq1 = lane_assignment[i] != 2;
          unsigned int l1, l2;

          if (use_seq1)
            {
              /* Just reuse the selector indices.  */
              tree s1 = gimple_assign_rhs3 (seq1->v_1_stmt);
              tree s2 = gimple_assign_rhs3 (seq1->v_2_stmt);
              l1 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s1, i));
              l2 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s2, i));
            }
          else
            {
              /* We moved the lanes for seq2, so we need to adjust for that. 
*/
              tree s1 = gimple_assign_rhs3 (seq2->v_1_stmt);
              tree s2 = gimple_assign_rhs3 (seq2->v_2_stmt);

              unsigned int j = 0;
              for (; j < i; j++)
                {
                  unsigned int sel_new;
                  sel_new = seq2_stmt_sel_perm[j].to_constant ();
                  sel_new %= nelts;
                  if (sel_new == i)
                    break;
                }

              /* This should not happen.  Test anyway to guarantee correctness.
 */
              if (j == i)
                return false;

              l1 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s1, j));
              l2 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s2, j));
            }

          seq1_v_1_stmt_sel_perm.quick_push (l1 + (use_seq1 ? 0 : nelts));
          seq1_v_2_stmt_sel_perm.quick_push (l2 + (use_seq1 ? 0 : nelts));
        }
    seq2_stmt_sel_perm is the newly computed { 2, 6, 3, 7 } selector and
    seq1->v_{1,2}_stmt are def stmts of {c_10,d_11} and seq2->v_{1,2}_stmt
    are def stmts of {e_12,f_13}.  For i 0 and 1 it is use_seq1 and
    correct, then for i 2 the loop checks first seq2_stmt_sel_perm[0],
    it is 2 % 4, equal to i, so picks up VECTOR_CST_ELTS (s{1,2}, 2),
    which happens to be correct in this case, for i 3 the loop loops until
    seq2_stmt_sel_perm[2] which is 3 % 4, stops and picks the wrong
    VECTOR_CST_ELTS (s{1,2}, 2) which has the same value as
    VECTOR_CST_ELTS (s{1,2}, 0), when the correct value would be in this
    case either 1 or 3 (due to the duplication).
    What the loop should do for !use_seq1 is to take the lane transformations
    into account, we've changed { 0, 4, 1, 5 } to { 2, 6, 3, 7 }, so instead
    of using lanes 0, 0, 1, 1 we now use lanes 2, 2, 3, 3 (x / 4 is about
    which input it is picked from, here + or -).  So, for 2 which got remapped
    from 0 we want to use 0 and for 3 which got remapped from 1 we want to use
    1.
    The function uses an auto_vec lane_assignment with values 0 (unused lane,
    so far or altogether), 1 (used by first sequence) and 2 (used by second
    sequence).  When we store in there 2, we know exactly which lane we are
    remapping to which lane, so instead of computing it again the following
    patch stores there 2 + l_orig, such that value >= 2 means second lane
    and lane_assignment[i] - 2 in that case is the lane that got remapped to i.
    And then the last loop doesn't need to recompute anything and can just use
    the remembered transformation.
    The rest of the changes (hunks 1-5 and 7) are just random small fixes I've
    noticed while trying to understand the code.  The real fix is
    - lane_assignment[lane] = 2;
    + lane_assignment[lane] = 2 + l_orig;
    and
    - bool use_seq1 = lane_assignment[i] != 2;
    + bool use_seq1 = lane_assignment[i] < 2;
    and the rest of the last hunk.  Also, the last loop was kind of assuming
    VEC_PERM_EXPR canonicalization happened and for single input perm the
    selector elts are never >= nelts, I've added %= nelts just to be sure.

    2026-02-07  Jakub Jelinek  <[email protected]>

            PR tree-optimization/123672
            * tree-ssa-forwprop.cc (recognise_vec_perm_simplify_seq): Use
std::swap
            instead of fetching gimple_assign_rhs{1,2} again.  Change type of
lanes
            vector from auto_vec<unsigned int> to auto_vec<bool> and store true
            instead of 1 into it.  Fix comment typo and formatting fix.
            (can_blend_vec_perm_simplify_seqs_p): Put end of comment on the
same
            line as the last sentence in it.
            (calc_perm_vec_perm_simplify_seqs): Change lane_assignment type
from
            auto_vec<int> to auto_vec<unsigned> and store 2 + l_orig into it
            instead of true.  Fix comment typo and formatting fix.  Set
use_seq1
            to line_assignment[i] < 2 instead of line_assignment[i] != 2. 
Replace
            bogus computation of index for !use_seq with using
            line_assignment[i] - 2.  Set l1 to l1 % nelts and similarly for l2.

            * gcc.dg/pr123672.c: New test.

Reply via email to