https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120996
--- Comment #15 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am looking at the current code generation for the reduced testcase in comment
#8 and from the looks of it trunk should be faster than 15.2.0.
GCC15:
```
fcmpe s31, #0.0
bmi .L7
fmsub s0, s9, s31, s8
.L7:
fcmpe s31, s15
bmi .L8
movi v31.2s, #0
fmul s0, s0, s31
.L8:
fmul s0, s0, s13
```
trunk:
```
fcmpe s31, #0.0
bmi .L4
fcmpe s31, s9
fmsub s31, s10, s31, s8
bmi .L27
movi v30.2s, #0
fmul s31, s31, s30
.L27:
fmul s0, s31, s15
.L4:
```
The not-taken path is the same: all instructions execute.
The shortest path is better on trunk.
The medium path (not taken, then taken) seems to be the same: 6 instructions.
The only difference in my mind is maybe the alignment of where the branches go
(counting instructions after the .p2align 5,,15):
.L7 is on the 16-instruction boundary,
while .L4 is on the 22-instruction boundary.
.L8 is on the 20-instruction boundary, while .L27 (this one) is on the 21.
I wonder if the problem is due to the alignment of .L27 here.
In which case this is all by accident, and it is harder to predict what the
micro-architecture is doing.
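
To make the alignment numbers concrete, here is a rough sketch that converts the
instruction counts above into byte offsets. It assumes the counts are measured
from a point that the .p2align 5,,15 really did align to 32 bytes, and the
16-byte and 32-byte boundaries checked are only illustrative guesses at what a
core's fetch logic might care about; the helper names are made up.

```python
INSN_BYTES = 4  # A64 instructions have a fixed 4-byte encoding

# Instruction counts after the .p2align 5,,15 point, as quoted above.
LABELS = {".L7": 16, ".L8": 20, ".L4": 22, ".L27": 21}


def byte_offset(insn_index):
    """Byte offset of a label that sits insn_index instructions in."""
    return insn_index * INSN_BYTES


def offset_within(insn_index, boundary):
    """How far past the previous `boundary`-byte boundary the label lands.

    Assumption: the starting point itself is aligned to `boundary` bytes.
    """
    return byte_offset(insn_index) % boundary


if __name__ == "__main__":
    for name, idx in LABELS.items():
        print(name, byte_offset(idx),
              offset_within(idx, 16), offset_within(idx, 32))
```

Under those assumptions both GCC 15 targets (.L7 and .L8) land on a 16-byte
boundary, while the trunk targets .L4 and .L27 land 8 and 4 bytes past one.
Whether that matters depends on the core.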