The following loop

char b = 41;
int main() {
  signed char a[31];
#pragma GCC novector
  for (int c = 0; c < 31; ++c)
    a[c] = c * c + c % 5;
  {
    signed char *d = a;
#pragma GCC novector
    for (int c = 0; c < 31; ++c, b += -16)
      d[c] += b;
  }
  for (int c = 0; c < 31; ++c) {
    signed char e = c * c + c % 5 + 41 + c * -16;
    if (a[c] != e)
      __builtin_abort();
  }
}

compiled with -O2 -ftree-vectorize -msve-vector-bits=256 -march=armv8.2-a+sve

generates

        ptrue   p6.b, vl32
        add     x2, x2, :lo12:.LC0
        add     w5, w5, 16
        ld1rw   z25.s, p6/z, [x2]
        strb    w5, [x6, #:lo12:.LANCHOR0]
        mov     w0, 0
        mov     p7.b, p6.b
        mov     w2, 31
        index   z30.s, #0, #1
        mov     z26.s, #5
        mov     z27.b, #41
.L6:
        mov     z29.d, z30.d
        movprfx z28, z30
        add     z28.b, z28.b, #240
        mad     z29.b, p6/m, z28.b, z27.b
        mov     w3, w0
        movprfx z31, z30
        smulh   z31.s, p6/m, z31.s, z25.s
        add     w0, w0, 8
        asr     z31.s, z31.s, #1
        msb     z31.s, p6/m, z26.s, z30.s
        add     z31.b, z31.b, z29.b
        ld1b    z29.s, p7/z, [x1]
        cmpne   p7.b, p7/z, z31.b, z29.b
        b.any   .L15
        add     x1, x1, 8
        add     z30.s, z30.s, #8
        whilelo p7.s, w0, w2
        b.any   .L6

This code uses a predicate with all bits set (i.e. all lanes active) for the
first iteration, which causes the cmpne to set the wrong CC flags.  The second
iteration uses

        whilelo p7.s, w0, w2

which gives the correct mask layout going forward.

This is due to the CSE'ing code that tries to share predicates as much as
possible.  In aarch64_expand_mov_immediate, during predicate generation, we do:

          /* Only the low bit of each .H, .S and .D element is defined,
             so we can set the upper bits to whatever we like.  If the
             predicate is all-true in MODE, prefer to set all the undefined
             bits as well, so that we can share a single .B predicate for
             all modes.  */
          if (imm == CONSTM1_RTX (mode))
            imm = CONSTM1_RTX (VNx16BImode);

which essentially maps all predicates to .b unless the predicate is created
outside the immediate expansion code.

It creates the sparse predicate for the VNx4QI data mode from a VNx16QI one and
then applies a "conversion" operation.  The conversion results in a simple copy:

        mov     p7.b, p6.b

because in the data model for partial vectors the upper lanes are *don't care*,
so computations using this vector are fine.  However, for comparisons, or any
operation that sets flags, the predicate value does matter; otherwise we get the
wrong flags, as seen above.

Additionally, we have no way to distinguish, based on the predicate alone,
whether an operation is partial or not, i.e. we have no "partial predicate"
modes.

There are two ways to solve this:

1. Restore the ptest for partial vector compares.  This would slow down the
   loop, though, and introduce a second ptrue .s, VL8 predicate.

2. Disable the sharing of partial vector predicates.  This allows us to remove
   the ptest.  Since the ptest would introduce a second predicate here anyway,
   I'm leaning towards disabling sharing between partial and full predicates.

For the patch I ended up going with 1.  The reason is that unsharing the
predicate ends up pessimizing loops that operate on full vectors only (which
are the majority of cases).

I also don't fully understand all the places where we depend on this sharing
(about 3600 ACLE tests fail their assembler scans).  I suspect one way to deal
with this is to perform the sharing on GIMPLE using the new isel hook and to
stop forcing it manually during RTL constant expansion.  In GIMPLE it's easy to
follow compares from a back-edge to figure out whether it's safe to share the
predicate.

I also tried changing cond_vec_cbranch_any to perform an AND with the proper
partial predicate.  Unfortunately, folding doesn't realize that the AND on the
latch edge is useless (e.g.  whilelo p7.s, w0, w2 makes it a no-op) and that
the AND should be moved outside the loop.  This is, again, something isel
should be able to do.

I also tried changing the

        mov     p7.b, p6.b

into an AND.  While this worked, folding didn't realize that the AND can be
eliminated, and it also pessimizes actual register copies.

So for now I just undo ptest elimination for partial vectors for GCC 16 and will
revisit it for GCC 17 when we extend ptest elimination.

Unless someone has any other ideas?

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

        PR target/124162
        * config/aarch64/aarch64-sve.md (cond_vec_cbranch_any,
        cond_vec_cbranch_all): Drop partial vectors support.

gcc/testsuite/ChangeLog:

        PR target/124162
        * gcc.target/aarch64/sve/vect-early-break-cbranch_16.c: New test.

---
diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 97fd95169590c8e074422c29e81043238e128a4c..019630eb8d21941b0ca718838ae6ef23e8aaedab 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -9877,12 +9877,12 @@ (define_expand "<optab><mode>"
 ;; instead and about the ptest.
 (define_expand "<optab><mode>"
   [(set (pc)
-	(unspec:SVE_I
+	(unspec:SVE_FULL_I
 	  [(if_then_else
 	    (match_operator 0 "aarch64_comparison_operator"
 	      [(match_operand:<VPRED> 1 "register_operand")
-	       (match_operand:SVE_I 2 "register_operand")
-	       (match_operand:SVE_I 3 "aarch64_simd_reg_or_zero")])
+	       (match_operand:SVE_FULL_I 2 "register_operand")
+	       (match_operand:SVE_FULL_I 3 "aarch64_simd_reg_or_zero")])
 	    (label_ref (match_operand 4 ""))
 	    (pc))]
 	 COND_CBRANCH_CMP))]
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c
new file mode 100644
index 0000000000000000000000000000000000000000..9bb251f5f4766b3cb09f46b08c5f8c3dd38f613e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c
@@ -0,0 +1,22 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-Ofast --param aarch64-autovec-preference=sve-only" } */
+/* { dg-require-effective-target lp64 } */
+
+char b = 41;
+int main() {
+  signed char a[31];
+#pragma GCC novector
+  for (int c = 0; c < 31; ++c)
+    a[c] = c * c + c % 5;
+  {
+    signed char *d = a;
+#pragma GCC novector
+    for (int c = 0; c < 31; ++c, b += -16)
+      d[c] += b;
+  }
+  for (int c = 0; c < 31; ++c) {
+    signed char e = c * c + c % 5 + 41 + c * -16;
+    if (a[c] != e)
+      __builtin_abort();
+  }
+}

