https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120996

--- Comment #15 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am looking at the current code generation for the reduced testcase in comment
#8, and from the looks of it, trunk should be faster than 15.2.0.

GCC15:
```
        fcmpe   s31, #0.0
        bmi     .L7
        fmsub   s0, s9, s31, s8
.L7:
        fcmpe   s31, s15
        bmi     .L8
        movi    v31.2s, #0
        fmul    s0, s0, s31
.L8:
        fmul    s0, s0, s13
```

trunk:
```
        fcmpe   s31, #0.0
        bmi     .L4
        fcmpe   s31, s9
        fmsub   s31, s10, s31, s8
        bmi     .L27
        movi    v30.2s, #0
        fmul    s31, s31, s30
.L27:
        fmul    s0, s31, s15
.L4:

```

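The fully not-taken path (both branches fall through) is the same: all 8
instructions execute in both versions.
The shortest path is better on trunk, since its first taken branch goes straight to .L4.
The medium path (first branch not taken, second taken) seems to be the same: 6
instructions in both.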
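A minimal sketch that counts the instructions executed on each path through the two snippets above, as a check of the path lengths quoted here. The string-based listing format and the `path_length` helper are hypothetical. Operands are dropped because only control flow matters, each `bmi` is resolved from a list of taken/not-taken decisions, and counting stops where the snippet ends, so the trunk `.L4` path covers only what is shown.

```python
# Hypothetical encoding of the two listings: operands dropped, labels end in ":".
GCC15 = ["fcmpe", "bmi .L7", "fmsub", ".L7:", "fcmpe", "bmi .L8",
         "movi", "fmul", ".L8:", "fmul"]
TRUNK = ["fcmpe", "bmi .L4", "fcmpe", "fmsub", "bmi .L27",
         "movi", "fmul", ".L27:", "fmul", ".L4:"]


def path_length(listing, decisions):
    """Count instructions executed; decisions[i] says whether the i-th bmi is taken."""
    labels = {item[:-1]: i for i, item in enumerate(listing) if item.endswith(":")}
    taken = iter(decisions)
    pc = count = 0
    while pc < len(listing):
        item = listing[pc]
        if item.endswith(":"):  # labels are not instructions
            pc += 1
            continue
        count += 1
        op, _, target = item.partition(" ")
        if op == "bmi" and next(taken):
            pc = labels[target]  # branch taken: jump to the label
        else:
            pc += 1  # branch not taken, or not a branch: fall through
    return count


print(path_length(GCC15, [False, False]), path_length(TRUNK, [False, False]))  # 8 8
print(path_length(GCC15, [False, True]), path_length(TRUNK, [False, True]))    # 6 6
print(path_length(GCC15, [True, True]), path_length(TRUNK, [True]))            # 5 2
```

GCC 15 has no 2-instruction path: when its first branch is taken it still executes the second compare and branch, so its short path is at least 5 instructions.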

The only other difference I can see is possibly the alignment of the branch targets.

Counting instructions after the .p2align 5,,15:
First branch target: .L7 (GCC 15) is at instruction 16, while .L4 (trunk) is at instruction 22.
Second branch target: .L8 (GCC 15) is at instruction 20, while .L27 (trunk) is at instruction 21 (this one).
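To make those offsets concrete, a minimal sketch converting each branch target's instruction offset into a byte offset and its position within a fetch block. A64 instructions are a fixed 4 bytes. Everything else is an assumption: counting is 0-based from the `.p2align 5,,15` point, that point really is 32-byte aligned (`.p2align 5,,15` only pads when 15 bytes or fewer are needed), and the 16- and 32-byte fetch-block sizes are hypothetical, since the real fetch width depends on the core.

```python
INSN_BYTES = 4  # A64 instructions are fixed-width, 4 bytes each

# Instruction offsets of the branch targets, counted from the .p2align 5,,15
# point (assumed 0-based and 32-byte aligned).
TARGETS = {"GCC15 .L7": 16, "trunk .L4": 22, "GCC15 .L8": 20, "trunk .L27": 21}


def placement(insn_index, block_bytes):
    """Return (byte offset, offset within a fetch block, insns left in that block)."""
    offset = insn_index * INSN_BYTES
    within = offset % block_bytes
    return offset, within, (block_bytes - within) // INSN_BYTES


for label, idx in TARGETS.items():
    print(label, placement(idx, 16), placement(idx, 32))
# GCC15 .L7 (64, 0, 4) (64, 0, 8)
# trunk .L4 (88, 8, 2) (88, 24, 2)
# GCC15 .L8 (80, 0, 4) (80, 16, 4)
# trunk .L27 (84, 4, 3) (84, 20, 3)
```

Under these assumptions, the GCC 15 targets .L7 and .L8 both start a 16-byte block, while trunk's .L4 and .L27 land partway into one.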

I wonder if the problem is due to the alignment of .L27 here.

In which case, this is all by accident, and it is hard to predict what the
microarchitecture is doing.
