[gcc r15-4878] simplify-rtx: Simplify ROTATE:HI (X:HI, 8) into BSWAP:HI (X)

Kyrylo Tkachov via Gcc-cvs Mon, 04 Nov 2024 01:21:14 -0800

https://gcc.gnu.org/g:f1d16cd9236e0d59c04018e2dccc09dd736bf1df


commit r15-4878-gf1d16cd9236e0d59c04018e2dccc09dd736bf1df
Author: Kyrylo Tkachov <ktkac...@nvidia.com>
Date:   Thu Oct 17 06:39:57 2024 -0700

    simplify-rtx: Simplify ROTATE:HI (X:HI, 8) into BSWAP:HI (X)
    
    With recent patch to improve detection of vector rotates at RTL level
    combine now tries matching a V8HImode rotate by 8 in the example in the
    testcase.  We can teach AArch64 to emit a REV16 instruction for such a rotate
    but really this operation corresponds to the RTL code BSWAP, for which we
    already have the right patterns.  BSWAP is arguably a simpler representation
    than ROTATE here because it has only one operand, so let's teach simplify-rtx
    to generate it.
    
    With this patch the testcase now generates the simplest form:
    .L2:
            ldr     q31, [x1, x0]
            rev16   v31.16b, v31.16b
            str     q31, [x0, x2]
            add     x0, x0, 16
            cmp     x0, 2048
            bne     .L2
    
    instead of the previous:
    .L2:
            ldr     q31, [x1, x0]
            shl     v30.8h, v31.8h, 8
            usra    v30.8h, v31.8h, 8
            str     q30, [x0, x2]
            add     x0, x0, 16
            cmp     x0, 2048
            bne     .L2
    
    IMO ideally the bswap detection would have been done during vectorisation
    time and used the expanders for that, but teaching simplify-rtx to do this
    transformation is fairly straightforward and, unlike at tree level, we have
    the native RTL BSWAP code.  This change is not enough to generate the
    equivalent sequence in SVE, but that is something that should be tackled
    separately.
    
    Bootstrapped and tested on aarch64-none-linux-gnu.
    
    Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
    
    gcc/
    
            * simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
            Simplify (rotate:HI x:HI, 8) -> (bswap:HI x:HI).
    
    gcc/testsuite/
    
            * gcc.target/aarch64/rot_to_bswap.c: New test.

Diff:
---
 gcc/simplify-rtx.cc                             |  8 ++++++++
 gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c | 23 +++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 0ff72638d85f..751c908113ef 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -4328,6 +4328,14 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
                                      mode, op0, new_amount_rtx);
        }
 #endif
+      /* ROTATE/ROTATERT:HI (X:HI, 8) is BSWAP:HI (X).  Other combinations
+        such as SImode with a count of 16 do not correspond to RTL BSWAP
+        semantics.  */
+      tem = unwrap_const_vec_duplicate (trueop1);
+      if (GET_MODE_UNIT_BITSIZE (mode) == (2 * BITS_PER_UNIT)
+         && CONST_INT_P (tem) && INTVAL (tem) == BITS_PER_UNIT)
+       return simplify_gen_unary (BSWAP, mode, op0, mode);
+
       /* FALLTHRU */
     case ASHIFTRT:
       if (trueop1 == CONST0_RTX (mode))
diff --git a/gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c b/gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c
new file mode 100644
index 000000000000..f5b002da8853
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/rot_to_bswap.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param aarch64-autovec-preference=asimd-only" } */
+
+#pragma GCC target "+nosve"
+
+
+#define N 1024
+
+unsigned short in_s[N];
+unsigned short out_s[N];
+
+void
+foo16 (void)
+{
+  for (unsigned i = 0; i < N; i++)
+  {
+    unsigned short x = in_s[i];
+    out_s[i] = (x >> 8) | (x << 8);
+  }
+}
+
+/* { dg-final { scan-assembler {\trev16\tv([123])?[0-9]\.16b, v([123])?[0-9]\.16b} } } */
+
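
The equivalence the new simplify-rtx comment relies on can be checked at the C level. The sketch below is not part of the commit: the helper names (rotl16_by_8, rotl32_by_16, check_rotate_is_bswap16) are hypothetical, and it assumes a GCC-compatible compiler that provides __builtin_bswap16 and __builtin_bswap32. It illustrates that rotating a 16-bit value by 8 swaps its two bytes, while rotating a 32-bit value by 16 only swaps the halfwords and so is not a full byte swap.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rotate a 16-bit value left by 8 bits.
   For a 16-bit value, rotating right by 8 gives the same result.  */
static uint16_t
rotl16_by_8 (uint16_t x)
{
  return (uint16_t) ((x << 8) | (x >> 8));
}

/* Hypothetical helper: rotate a 32-bit value left by 16 bits.  */
static uint32_t
rotl32_by_16 (uint32_t x)
{
  return (x << 16) | (x >> 16);
}

/* Exhaustively compare the 16-bit rotate against the compiler's
   byte-swap builtin (assumed: GCC or Clang).  */
static void
check_rotate_is_bswap16 (void)
{
  for (uint32_t v = 0; v <= 0xFFFFu; v++)
    assert (rotl16_by_8 ((uint16_t) v) == __builtin_bswap16 ((uint16_t) v));
}
```

Calling check_rotate_is_bswap16 () from a test driver walks all 65536 inputs. For the 32-bit case, rotl32_by_16 (0x11223344) yields 0x33441122 whereas __builtin_bswap32 (0x11223344) yields 0x44332211, which matches the comment's remark that SImode with a count of 16 does not have BSWAP semantics.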
