On Tue, Feb 17, 2026 at 4:32 PM Richard Biener
<[email protected]> wrote:
>
> On Tue, Feb 17, 2026 at 8:37 AM Jakub Jelinek <[email protected]> wrote:
> >
> > Hi!
> >
> > With r16-531 we've regressed
> > FAIL: gcc.target/i386/pr108938-3.c scan-assembler-times bswap[\t ]+ 3
> > on ia32 and also made 2 separate regressions in the same testcase
> > on x86_64-linux (which in scan-assembler-times cancel out; previously
> > we were generating one 64-bit bswap + rotate in one function and
> > one 32-bit bswap + rotate in another function, now we emit 2 32-bit bswaps
> > in the first one and really horrible code in the second one).
> >
> > The following patch fixes the latter function by emitting 32-bit bswap
> > + 32-bit rotate on both ia32 and x86_64.  This fixes the
> > above FAIL (and introduces
> > FAIL: gcc.target/i386/pr108938-3.c scan-assembler-times bswap[\t ]+ 2
> > on x86_64).
> >
> > The problem is that the vectorizer now uses VEC_PACK_TRUNC_EXPR and
> > bswap/store_merging was only able to handle vectors in a CONSTRUCTOR.
> >
> > The patch adds handling of VEC_PACK_TRUNC_EXPR if its operands are
> > CONSTRUCTORs.  Without a testcase, I wasn't confident enough to write
> > BYTES_BIG_ENDIAN support, for CONSTRUCTOR { A, B, C, D } we make
> > DCBA out of it on little endian but ABCD on big endian and that would
> > need to be combined with picking up the most significant halves of each
> > element.
>
> I think vector lane order is little endian, but I'm often confused by details
> (like the _HI/LO codes being subject to endianess, but ISTR not BIT_FIELD_REF
> lane extracts).  So I agree, a FIXME is safer at this point.
>
> > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> LGTM.
>
> I had hoped both CTOR and VEC_PACK_TRUNC_EXPR could be handled
> by the generic expression handling machinery, but indeed we also have to
> support them as roots.  At least if/when we want to enhance bswap to detect
> permutes of one/two vectors, generating a VEC_PERM instead of just bswap
> or nop.  That's of course for the future.
>
> Thanks for working on this,
> Richard.
>
> > I'll look incrementally at the other function.
> >
> > 2026-02-17  Jakub Jelinek  <[email protected]>
> >
> >         PR target/120233
> >         * gimple-ssa-store-merging.cc (find_bswap_or_nop_2): New function.
> >         (find_bswap_or_nop): Move CONSTRUCTOR handling to above function,
> >         call it instead of find_bswap_or_nop_1.
> >         (bswap_replace): Handle VEC_PACK_TRUNC_EXPR like CONSTRUCTOR.
> >         (maybe_optimize_vector_constructor): Likewise.
> >         (pass_optimize_bswap::execute): Likewise.
> >         (get_status_for_store_merging): Likewise.
> >         (pass_store_merging::execute): Likewise.
> >
> > --- gcc/gimple-ssa-store-merging.cc.jj  2026-01-02 09:56:10.184336228 +0100
> > +++ gcc/gimple-ssa-store-merging.cc     2026-02-16 23:09:55.047593495 +0100
> > @@ -814,6 +814,145 @@ find_bswap_or_nop_1 (gimple *stmt, struc
> >    return NULL;
> >  }
> >
> > +/* Like find_bswap_or_nop_1, but also handles CONSTRUCTOR and
> > +   VEC_PACK_TRUNC_EXPR.  */
> > +
> > +gimple *
> > +find_bswap_or_nop_2 (gimple *stmt, struct symbolic_number *n, int limit)
> > +{
> > +  if (gimple *ret = find_bswap_or_nop_1 (stmt, n, limit))
> > +    return ret;
> > +
> > +  if (!limit
> > +      || !is_gimple_assign (stmt)
> > +      || BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN
> > +      || stmt_can_throw_internal (cfun, stmt))
> > +    return NULL;
> > +
> > +  tree rhs1 = gimple_assign_rhs1 (stmt);
> > +  if (gimple_assign_rhs_code (stmt) == VEC_PACK_TRUNC_EXPR
> > +      && TREE_CODE (rhs1) == SSA_NAME)
> > +    {
> > +      if (BYTES_BIG_ENDIAN)
> > +       return NULL; /* For now... */
> > +      gimple *rhs1_stmt = SSA_NAME_DEF_STMT (rhs1);
> > +      tree rhs2 = gimple_assign_rhs2 (stmt);
> > +      if (TREE_CODE (rhs2) != SSA_NAME)
> > +       return NULL;
> > +      gimple *rhs2_stmt = SSA_NAME_DEF_STMT (rhs2);
> > +      struct symbolic_number n1, n2;
> > +
> > +      gimple *source_stmt1 = find_bswap_or_nop_2 (rhs1_stmt, &n1, limit - 
> > 1);
> > +      if (!source_stmt1)
> > +       return NULL;
> > +      gimple *source_stmt2 = find_bswap_or_nop_2 (rhs2_stmt, &n2, limit - 
> > 1);
> > +      if (!source_stmt2)
> > +       return NULL;
> > +
> > +      if (TYPE_PRECISION (n1.type) != TYPE_PRECISION (n2.type))
> > +       return NULL;
> > +      if (n1.vuse != n2.vuse)
> > +       return NULL;
> > +      auto nn2 = n2.n;
> > +      n2.n = 0;
> > +      /* We need to take into account number of elements of each vector,
> > +        which perform_symbolic_merge doesn't know.  So, handle it as
> > +        two separate BIT_IOR_EXPR merges, each time with one operand
> > +        with changed mastk to all 0s, and then merge here.  */
> > +      gimple *source_stmt
> > +       = perform_symbolic_merge (source_stmt1, &n1, source_stmt2, &n2, n,
> > +                                 BIT_IOR_EXPR);
> > +      if (!source_stmt)
> > +       return NULL;
> > +      n2.n = nn2;
> > +      auto nn1 = n->n;
> > +      n1.n = 0;
> > +      gimple *source_stmt3
> > +       = perform_symbolic_merge (source_stmt1, &n1, source_stmt2, &n2, n,
> > +                                 BIT_IOR_EXPR);
> > +      gcc_assert (source_stmt == source_stmt3);
> > +      nn2 = n->n;
> > +      tree lhs = gimple_assign_lhs (stmt);
> > +      int eltsize
> > +       = TYPE_PRECISION (TREE_TYPE (TREE_TYPE (lhs))) / BITS_PER_UNIT;
> > +      int nelts = TYPE_VECTOR_SUBPARTS (TREE_TYPE (lhs)).to_constant ();
> > +      int size = eltsize * nelts;
> > +      int hsize = size / 2;
> > +      n->n = 0;
> > +      if (!BYTES_BIG_ENDIAN)
> > +       for (int i = 0; i < size; i++)
> > +         n->n |= ((((i < hsize ? nn1 : nn2)
> > +                    >> (((i % hsize) / eltsize * 2 * eltsize
> > +                         + (i % eltsize))) * BITS_PER_MARKER) & 
> > MARKER_MASK)
> > +                  << (i * BITS_PER_MARKER));
> > +      else
> > +       gcc_unreachable ();
> > +      return source_stmt;
> > +    }
> > +
> > +  if (gimple_assign_rhs_code (stmt) != CONSTRUCTOR)
> > +    return NULL;
> > +
> > +  tree type_size = TYPE_SIZE_UNIT (TREE_TYPE (gimple_get_lhs (stmt)));
> > +  unsigned HOST_WIDE_INT sz = tree_to_uhwi (type_size) * BITS_PER_UNIT;
> > +  if (sz != 16 && sz != 32 && sz != 64)
> > +    return NULL;
> > +  if (CONSTRUCTOR_NELTS (rhs1) == 0)
> > +    return NULL;
> > +  tree eltype = TREE_TYPE (TREE_TYPE (rhs1));
> > +  unsigned HOST_WIDE_INT eltsz = int_size_in_bytes (eltype) * 
> > BITS_PER_UNIT;
> > +  if (TYPE_PRECISION (eltype) != eltsz)
> > +    return NULL;
> > +  constructor_elt *elt;
> > +  unsigned int i;
> > +  tree type = build_nonstandard_integer_type (sz, 1);
> > +  gimple *ins_stmt = NULL;
> > +  FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (rhs1), i, elt)
> > +    {
> > +      if (TREE_CODE (elt->value) != SSA_NAME
> > +         || !INTEGRAL_TYPE_P (TREE_TYPE (elt->value)))
> > +       return NULL;
> > +      struct symbolic_number n1;
> > +      gimple *source_stmt
> > +       = find_bswap_or_nop_1 (SSA_NAME_DEF_STMT (elt->value), &n1, limit - 
> > 1);
> > +
> > +      if (!source_stmt)
> > +       return NULL;
> > +
> > +      n1.type = type;
> > +      if (!n1.base_addr)
> > +       n1.range = sz / BITS_PER_UNIT;
> > +
> > +      if (i == 0)
> > +       {
> > +         ins_stmt = source_stmt;
> > +         *n = n1;
> > +       }
> > +      else
> > +       {
> > +         if (n->vuse != n1.vuse)
> > +           return NULL;
> > +
> > +         struct symbolic_number n0 = *n;
> > +
> > +         if (!BYTES_BIG_ENDIAN)
> > +           {
> > +             if (!do_shift_rotate (LSHIFT_EXPR, &n1, i * eltsz))
> > +               return NULL;
> > +           }
> > +         else if (!do_shift_rotate (LSHIFT_EXPR, &n0, eltsz))
> > +           return NULL;
> > +         ins_stmt
> > +           = perform_symbolic_merge (ins_stmt, &n0, source_stmt,
> > +                                     &n1, n, BIT_IOR_EXPR);
> > +
> > +         if (!ins_stmt)
> > +           return NULL;
> > +       }
> > +    }
> > +  return ins_stmt;
> > +}
> > +
> >  /* Helper for find_bswap_or_nop and try_coalesce_bswap to compute
> >     *CMPXCHG, *CMPNOP and adjust *N.  */
> >
> > @@ -950,72 +1089,9 @@ find_bswap_or_nop (gimple *stmt, struct
> >       in libgcc, and for initial shift/and operation of the src operand.  */
> >    int limit = tree_to_uhwi (type_size);
> >    limit += 2 * (1 + (int) ceil_log2 ((unsigned HOST_WIDE_INT) limit));
> > -  gimple *ins_stmt = find_bswap_or_nop_1 (stmt, n, limit);
> > -
> > +  gimple *ins_stmt = find_bswap_or_nop_2 (stmt, n, limit);
> >    if (!ins_stmt)
> > -    {
> > -      if (gimple_assign_rhs_code (stmt) != CONSTRUCTOR
> > -         || BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN)
> > -       return NULL;
> > -      unsigned HOST_WIDE_INT sz = tree_to_uhwi (type_size) * BITS_PER_UNIT;
> > -      if (sz != 16 && sz != 32 && sz != 64)
> > -       return NULL;
> > -      tree rhs = gimple_assign_rhs1 (stmt);
> > -      if (CONSTRUCTOR_NELTS (rhs) == 0)
> > -       return NULL;
> > -      tree eltype = TREE_TYPE (TREE_TYPE (rhs));
> > -      unsigned HOST_WIDE_INT eltsz
> > -       = int_size_in_bytes (eltype) * BITS_PER_UNIT;
> > -      if (TYPE_PRECISION (eltype) != eltsz)
> > -       return NULL;
> > -      constructor_elt *elt;
> > -      unsigned int i;
> > -      tree type = build_nonstandard_integer_type (sz, 1);
> > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (rhs), i, elt)
> > -       {
> > -         if (TREE_CODE (elt->value) != SSA_NAME
> > -             || !INTEGRAL_TYPE_P (TREE_TYPE (elt->value)))
> > -           return NULL;
> > -         struct symbolic_number n1;
> > -         gimple *source_stmt
> > -           = find_bswap_or_nop_1 (SSA_NAME_DEF_STMT (elt->value), &n1,
> > -                                  limit - 1);
> > -
> > -         if (!source_stmt)
> > -           return NULL;
> > -
> > -         n1.type = type;
> > -         if (!n1.base_addr)
> > -           n1.range = sz / BITS_PER_UNIT;
> > -
> > -         if (i == 0)
> > -           {
> > -             ins_stmt = source_stmt;
> > -             *n = n1;
> > -           }
> > -         else
> > -           {
> > -             if (n->vuse != n1.vuse)
> > -               return NULL;
> > -
> > -             struct symbolic_number n0 = *n;
> > -
> > -             if (!BYTES_BIG_ENDIAN)
> > -               {
> > -                 if (!do_shift_rotate (LSHIFT_EXPR, &n1, i * eltsz))
> > -                   return NULL;
> > -               }
> > -             else if (!do_shift_rotate (LSHIFT_EXPR, &n0, eltsz))
> > -               return NULL;
> > -             ins_stmt
> > -               = perform_symbolic_merge (ins_stmt, &n0, source_stmt, &n1, 
> > n,
> > -                                         BIT_IOR_EXPR);
> > -
> > -             if (!ins_stmt)
> > -               return NULL;
> > -           }
> > -       }
> > -    }
> > +    return NULL;
> >
> >    uint64_t cmpxchg, cmpnop;
> >    uint64_t orig_range = n->range * BITS_PER_UNIT;
> > @@ -1177,7 +1253,8 @@ bswap_replace (gimple_stmt_iterator gsi,
> >    if (cur_stmt)
> >      {
> >        tgt = gimple_assign_lhs (cur_stmt);
> > -      if (gimple_assign_rhs_code (cur_stmt) == CONSTRUCTOR
> > +      if ((gimple_assign_rhs_code (cur_stmt) == CONSTRUCTOR
> > +          || gimple_assign_rhs_code (cur_stmt) == VEC_PACK_TRUNC_EXPR)
> >           && tgt
> >           && VECTOR_TYPE_P (TREE_TYPE (tgt)))
> >         conv_code = VIEW_CONVERT_EXPR;
> > @@ -1457,8 +1534,8 @@ bswap_replace (gimple_stmt_iterator gsi,
> >    return tgt;
> >  }
> >
> > -/* Try to optimize an assignment CUR_STMT with CONSTRUCTOR on the rhs
> > -   using bswap optimizations.  CDI_DOMINATORS need to be
> > +/* Try to optimize an assignment CUR_STMT with 
> > CONSTRUCTOR/VEC_PACK_TRUNC_EXPR
> > +   on the rhs using bswap optimizations.  CDI_DOMINATORS need to be
> >     computed on entry.  Return true if it has been optimized and
> >     TODO_update_ssa is needed.  */
> >
> > @@ -1469,8 +1546,15 @@ maybe_optimize_vector_constructor (gimpl
> >    struct symbolic_number n;
> >    bool bswap;
> >
> > -  gcc_assert (is_gimple_assign (cur_stmt)
> > -             && gimple_assign_rhs_code (cur_stmt) == CONSTRUCTOR);
> > +  gcc_assert (is_gimple_assign (cur_stmt));
> > +  switch (gimple_assign_rhs_code (cur_stmt))
> > +    {
> > +    case CONSTRUCTOR:
> > +    case VEC_PACK_TRUNC_EXPR:
> > +      break;
> > +    default:
> > +      gcc_unreachable ();
> > +    }
> >
> >    tree rhs = gimple_assign_rhs1 (cur_stmt);
> >    if (!VECTOR_TYPE_P (TREE_TYPE (rhs))
> > @@ -1613,6 +1697,7 @@ pass_optimize_bswap::execute (function *
> >             case PLUS_EXPR:
> >               break;
> >             case CONSTRUCTOR:
> > +           case VEC_PACK_TRUNC_EXPR:
> >               {
> >                 tree rhs = gimple_assign_rhs1 (cur_stmt);
> >                 if (VECTOR_TYPE_P (TREE_TYPE (rhs))
> > @@ -5491,7 +5576,8 @@ get_status_for_store_merging (basic_bloc
> >         break;
> >
> >        if (is_gimple_assign (stmt)
> > -         && gimple_assign_rhs_code (stmt) == CONSTRUCTOR)
> > +         && (gimple_assign_rhs_code (stmt) == CONSTRUCTOR
> > +             || gimple_assign_rhs_code (stmt) == VEC_PACK_TRUNC_EXPR))
> >         {
> >           tree rhs = gimple_assign_rhs1 (stmt);
> >           if (VECTOR_TYPE_P (TREE_TYPE (rhs))
> > @@ -5578,7 +5664,8 @@ pass_store_merging::execute (function *f
> >             }
> >
> >           if (is_gimple_assign (stmt)
> > -             && gimple_assign_rhs_code (stmt) == CONSTRUCTOR
> > +             && (gimple_assign_rhs_code (stmt) == CONSTRUCTOR
> > +                 || gimple_assign_rhs_code (stmt) == VEC_PACK_TRUNC_EXPR)
> >               && maybe_optimize_vector_constructor (stmt))
> >             continue;
> >
> >
> >         Jakub
> >

commit 4b71cafc8447e09ee41aff02acb5b26e8b112466
Author: Jakub Jelinek <[email protected]>
Date:   Tue Feb 17 11:43:43 2026 +0100

    bswap: Handle VEC_PACK_TRUNC_EXPR [PR120233]

compiles

void
foo2 (char* a, short* __restrict b)
{
  a[0] = b[0] >> 8;
  a[1] = b[0];
  a[2] = b[1] >> 8;
  a[3] = b[1];
}

into

movl (%rsi), %eax
bswap %eax
roll $16, %eax
movl %eax, (%rdi)
ret

instead of

movzwl (%rsi), %eax
movzwl 2(%rsi), %edx
movl %eax, %ecx
sall $16, %eax
sarw $8, %cx
movzwl %cx, %ecx
orl %ecx, %eax
movd %eax, %xmm0
movl %edx, %eax
sall $16, %edx
sarw $8, %ax
movdqa %xmm0, %xmm2
movzwl %ax, %eax
orl %eax, %edx
movd %edx, %xmm1
punpcklbw %xmm1, %xmm2
punpcklbw %xmm1, %xmm0
pshufd $65, %xmm2, %xmm2
punpcklbw %xmm2, %xmm0
movd %xmm0, (%rdi)
ret

Update gcc.target/i386/pr108938-3.c to also scan 3 bswaps for x86-64.

PR target/120233
* gcc.target/i386/pr108938-3.c: Also scan 3 bswaps for x86-64.

I am checking it in.

-- 
H.J.
From bc2f99ef2b6551b89886ef41d5f3554c715534b0 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <[email protected]>
Date: Thu, 19 Feb 2026 08:03:35 +0800
Subject: [PATCH] x86: Update gcc.target/i386/pr108938-3.c

commit 4b71cafc8447e09ee41aff02acb5b26e8b112466
Author: Jakub Jelinek <[email protected]>
Date:   Tue Feb 17 11:43:43 2026 +0100

    bswap: Handle VEC_PACK_TRUNC_EXPR [PR120233]

compiles

void
foo2 (char* a, short* __restrict b)
{
  a[0] = b[0] >> 8;
  a[1] = b[0];
  a[2] = b[1] >> 8;
  a[3] = b[1];
}

into

	movl	(%rsi), %eax
	bswap	%eax
	roll	$16, %eax
	movl	%eax, (%rdi)
	ret

instead of

	movzwl	(%rsi), %eax
	movzwl	2(%rsi), %edx
	movl	%eax, %ecx
	sall	$16, %eax
	sarw	$8, %cx
	movzwl	%cx, %ecx
	orl	%ecx, %eax
	movd	%eax, %xmm0
	movl	%edx, %eax
	sall	$16, %edx
	sarw	$8, %ax
	movdqa	%xmm0, %xmm2
	movzwl	%ax, %eax
	orl	%eax, %edx
	movd	%edx, %xmm1
	punpcklbw	%xmm1, %xmm2
	punpcklbw	%xmm1, %xmm0
	pshufd	$65, %xmm2, %xmm2
	punpcklbw	%xmm2, %xmm0
	movd	%xmm0, (%rdi)
	ret

Update gcc.target/i386/pr108938-3.c to also scan 3 bswaps for x86-64.

	PR target/120233
	* gcc.target/i386/pr108938-3.c: Also scan 3 bswaps for x86-64.

Signed-off-by: H.J. Lu <[email protected]>
---
 gcc/testsuite/gcc.target/i386/pr108938-3.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.target/i386/pr108938-3.c b/gcc/testsuite/gcc.target/i386/pr108938-3.c
index 47293d49bb9..cd87314e57f 100644
--- a/gcc/testsuite/gcc.target/i386/pr108938-3.c
+++ b/gcc/testsuite/gcc.target/i386/pr108938-3.c
@@ -1,7 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -ftree-vectorize -mno-movbe -msse2 -mno-avx" } */
-/* { dg-final { scan-assembler-times "bswap\[\t ]+" 2 { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler-times "bswap\[\t ]+" 3 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "bswap\[\t ]+" 3 } } */
 
 void
 foo1 (char* a, unsigned int* __restrict b)
-- 
2.53.0

Reply via email to