https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124162
--- Comment #4 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <[email protected]>:

https://gcc.gnu.org/g:cd13223e7760d55601f440f4605db33bafd1a6d2

commit r16-7753-gcd13223e7760d55601f440f4605db33bafd1a6d2
Author: Tamar Christina <[email protected]>
Date:   Fri Feb 27 07:41:20 2026 +0000

    AArch64: Don't enable ptest elimination for partial vectors [PR124162]

    The following loop

      char b = 41;
      int main() {
        signed char a[31];
      #pragma GCC novector
        for (int c = 0; c < 31; ++c)
          a[c] = c * c + c % 5;
        {
          signed char *d = a;
      #pragma GCC novector
          for (int c = 0; c < 31; ++c, b += -16)
            d[c] += b;
        }
        for (int c = 0; c < 31; ++c) {
          signed char e = c * c + c % 5 + 41 + c * -16;
          if (a[c] != e)
            __builtin_abort();
        }
      }

    compiled with -O2 -ftree-vectorize -msve-vector-bits=256
    -march=armv8.2-a+sve generates

            ptrue   p6.b, vl32
            add     x2, x2, :lo12:.LC0
            add     w5, w5, 16
            ld1rw   z25.s, p6/z, [x2]
            strb    w5, [x6, #:lo12:.LANCHOR0]
            mov     w0, 0
            mov     p7.b, p6.b
            mov     w2, 31
            index   z30.s, #0, #1
            mov     z26.s, #5
            mov     z27.b, #41
    .L6:
            mov     z29.d, z30.d
            movprfx z28, z30
            add     z28.b, z28.b, #240
            mad     z29.b, p6/m, z28.b, z27.b
            mov     w3, w0
            movprfx z31, z30
            smulh   z31.s, p6/m, z31.s, z25.s
            add     w0, w0, 8
            asr     z31.s, z31.s, #1
            msb     z31.s, p6/m, z26.s, z30.s
            add     z31.b, z31.b, z29.b
            ld1b    z29.s, p7/z, [x1]
            cmpne   p7.b, p7/z, z31.b, z29.b
            b.any   .L15
            add     x1, x1, 8
            add     z30.s, z30.s, #8
            whilelo p7.s, w0, w2
            b.any   .L6

    This uses a predicate for the first iteration in which all bits are 1,
    i.e. all lanes active.  That causes the result of the cmpne to set the
    wrong CC flags.  The second iteration uses whilelo p7.s, w0, w2, which
    gives the correct mask layout going forward.

    This is due to the CSE'ing code that tries to share predicates as much
    as possible.  In aarch64_expand_mov_immediate we do during predicate
    generation

      /* Only the low bit of each .H, .S and .D element is defined,
         so we can set the upper bits to whatever we like.
         If the predicate is all-true in MODE, prefer to set all the
         undefined bits as well, so that we can share a single .B
         predicate for all modes.  */
      if (imm == CONSTM1_RTX (mode))
        imm = CONSTM1_RTX (VNx16BImode);

    which essentially maps all predicates to .b unless the predicate is
    created outside the immediate expansion code.  It creates the sparse
    predicate for data lane VNx4QI from a VNx16QI and then has a
    "conversion" operation.  The conversion operation results in a simple
    copy:

      mov p7.b, p6.b

    because in the data model for partial vectors the upper lanes are
    *don't care*.  So computations using this vector are fine.  However
    for comparisons, or any operations setting flags, the predicate value
    does matter, otherwise we get the wrong flags as above.  Additionally
    we have no way to distinguish based on the predicate alone whether
    the operation is partial or not, i.e. we have no "partial predicate"
    modes.

    There are two ways to solve this:

    1. Restore the ptest for partial vector compares.  This would slow
       down the loop though and introduce a second ptrue .s, VL8
       predicate.
    2. Disable the sharing of partial vector predicates.  This allows us
       to remove the ptest.

    Since the ptest would introduce a second predicate here anyway I was
    leaning towards disabling sharing between partial and full predicates,
    but for the patch I ended up going with 1.  The reason is that
    unsharing the predicate ends up pessimizing loops that operate on full
    vectors only (which are the majority of cases).  I also don't fully
    understand all the places we depend on this sharing (and about 3600
    ACLE tests fail assembler scans).

    I suspect one way to deal with this is to perform the sharing on
    GIMPLE using the new isel hook and make RTL constant expansion not
    manually force it, since in GIMPLE it's easy to follow compares from a
    back-edge to figure out whether it's safe to share the predicate.
    I also tried changing it so that during cond_vec_cbranch_any we
    perform an AND with the proper partial predicate.  But unfortunately
    folding doesn't realize that the AND on the latch edge is useless
    (e.g. whilelo p7.s, w0, w2 makes it a no-op) and that the AND should
    be moved outside the loop.  This is something that, again, isel
    should be able to do.

    I also tried changing the mov p7.b, p6.b into an AND; while this
    worked, folding didn't quite see that the AND can be eliminated, and
    it also pessimizes actual register copies.

    So for now I just undo ptest elimination for partial vectors for
    GCC 16 and will revisit it for GCC 17 when we extend ptest
    elimination.

    gcc/ChangeLog:

            PR target/124162
            * config/aarch64/aarch64-sve.md (cond_vec_cbranch_any,
            cond_vec_cbranch_all): Drop partial vectors support.

    gcc/testsuite/ChangeLog:

            PR target/124162
            * gcc.target/aarch64/sve/vect-early-break-cbranch_16.c: New
            test.
