https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125242

            Bug ID: 125242
           Summary: Potential regression in the vectorization capability
                    for loops containing an early exit condition (break)
                    when targeting SVE
           Product: gcc
           Version: 16.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bug_hunters at yeah dot net
  Target Milestone: ---

GCC 16.1.0 appears to have regressed in vectorizing loops that contain an early
exit condition (break) when targeting SVE. GCC 15.2.0 vectorizes the test case
below using 16‑byte vectors, as shown in the logs. GCC 16.1.0 does not, and
reports the following diagnostic:
> can't safely apply code motion to dependencies to vectorize the early exit. 
> patt_25 = x_9 < 0.0; may trap.

The generated scalar code (shown below) confirms that the loop is no longer
vectorized, which could lead to a noticeable performance drop on SVE‑capable
hardware, especially for loops with a large trip count where the early exit is
rarely taken.

We kindly ask the GCC development team whether there is any planned improvement
or possible tweak in GCC 16 to re‑enable vectorization for such patterns. For
instance, could the early exit be safely speculated (e.g., by masking or
predication) when the condition is known not to trap? We understand the
complexity of code motion around potentially trapping operations, but we would
greatly appreciate any insights or suggestions for a workaround. We also wonder
if this regression is intentional or if it might be addressed in a future
release.

Thank you very much for your continuous efforts on GCC, and we look forward to
your advice.

---

Test case:
```
#include <stdint.h>

void foo(int32_t N, const float * a, float * out) 
{
    float x;
    float z;

    for (int i = 0; i < 102400; i += 1) {
        x = a[i];
        if (x < 0.0f) {
            break;
        }
        z = x + x;
        out[i] = z;
    }
}
```
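
A possible workaround to experiment with, offered as an untested sketch rather
than a confirmed fix: the diagnostic flags `x < 0.0` as possibly trapping, and
under the default `-ftrapping-math` the relational operator `<` raises the
invalid exception when an operand is a NaN. The C99 macro `isless` from
`<math.h>` performs the same comparison quietly. The function name `foo_quiet`
is hypothetical, and we have not verified that GCC 16.1.0 vectorizes this
variant, nor whether compiling the original with `-fno-trapping-math` has the
same effect.
```c
#include <math.h>
#include <stdint.h>

/* Hypothetical variant of foo() from the test case above.
 * isless(x, y) yields the same result as (x < y), but does not raise
 * the invalid floating-point exception for quiet NaN operands.
 * Assumption (not verified): this removes the "may trap" obstacle. */
void foo_quiet(int32_t N, const float *a, float *out)
{
    (void)N; /* unused; kept only to match the original signature */

    for (int i = 0; i < 102400; i += 1) {
        float x = a[i];
        if (isless(x, 0.0f)) {
            break; /* early exit on the first negative element */
        }
        out[i] = x + x;
    }
}
```

For non-NaN inputs the stored results are identical to those of the original
`foo`; only the exception behaviour of the comparison differs.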

Compilation options
```
-O3 -S -march=armv9-a+sve -ftree-vectorize -fopt-info-vec-all
```

gcc 15.2.0 logs (https://godbolt.org/z/n3qdb63s6)
```
<source>:8:23: optimized: loop vectorized using 16 byte vectors
<source>:8:23: optimized:  loop versioned for vectorization because of possible
aliasing
<source>:3:6: note: vectorized 1 loops in function.
<source>:3:6: note: ***** Analysis failed with vector mode VNx4SF
<source>:3:6: note: ***** The result for vector mode VNx16QI would be the same
<source>:3:6: note: ***** The result for vector mode VNx8QI would be the same
<source>:3:6: note: ***** The result for vector mode VNx4QI would be the same
<source>:3:6: note: ***** Re-trying analysis with vector mode VNx2QI
<source>:3:6: note: ***** Analysis failed with vector mode VNx2QI
<source>:3:6: note: ***** Re-trying analysis with vector mode V16QI
<source>:3:6: note: ***** Analysis failed with vector mode V16QI
<source>:3:6: note: ***** The result for vector mode V8QI would be the same
<source>:3:6: note: ***** The result for vector mode V4HI would be the same
<source>:3:6: note: ***** Re-trying analysis with vector mode V2SI
<source>:3:6: note: ***** Analysis failed with vector mode V2SI
Compiler returned: 0
```

gcc 16.1.0 logs (https://godbolt.org/z/G7fe3j7ff)
```
<source>:8:23: missed: couldn't vectorize loop
<source>:3:6: missed: can't safely apply code motion to dependencies to
vectorize the early exit. patt_25 = x_9 < 0.0;
 may trap.
<source>:3:6: note: vectorized 0 loops in function.
<source>:16:1: note: ***** Analysis failed with vector mode VNx4SF
<source>:16:1: note: ***** Skipping vector mode VNx16QI, which would repeat the
analysis for VNx4SF
Compiler returned: 0
```

gcc 16.1.0 output
```
foo(int, float const*, float*):
        mov     x0, 0
.L3:
        ldr     s31, [x1, x0]
        fcmpe   s31, #0.0
        bmi     .L1
        fadd    s31, s31, s31
        str     s31, [x2, x0]
        add     x0, x0, 4
        cmp     x0, 409600
        bne     .L3
.L1:
        ret
```
