https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115029
--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
So the only change between GCC 13 and 14 that might matter is how
{ 4, 4, 4, 4 } and { -2113396605, -2113396605, -2113396605, -2113396605 }
are formed in the front part of stress_cpu_fft. I suspect that since the loops
are small enough, the front part of stress_cpu_fft takes enough time to make a
difference.
And it looks like, depending on the micro-arch, loading from memory (L1 most
likely in this case) is slightly faster than creating the value in a GPR and
moving it into a vector register.
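
To make the two strategies concrete, here is a rough source-level sketch using
the GCC vector extensions. It is not taken from stress-ng; the names
splat_table, splat_from_memory, splat_from_scalar and splat_lanes_equal are
made up for illustration. The compiler is free to pick either code sequence
for either function (or to constant-fold both), so this only shows the two
shapes being compared, not what GCC 13 or 14 actually emits.

```c
#include <string.h>

/* GCC/Clang vector extension: four 32-bit ints in one 16-byte vector. */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical constant-pool entry: the whole vector sits in memory. */
static const int splat_table[4] = {
    -2113396605, -2113396605, -2113396605, -2113396605
};

/* Strategy A: form the vector with a single load from memory
   (most likely an L1 hit when executed repeatedly). */
v4si splat_from_memory(void)
{
    v4si v;
    memcpy(&v, splat_table, sizeof v);
    return v;
}

/* Strategy B: start from a scalar (typically held in a GPR) and
   duplicate it into every lane of the vector register. */
v4si splat_from_scalar(int x)
{
    v4si v = { x, x, x, x };
    return v;
}

/* Helper: returns 1 if all four lanes of a and b match, else 0. */
int splat_lanes_equal(v4si a, v4si b)
{
    for (int i = 0; i < 4; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Both functions produce the same value; comparing the output of -O2 -S for each
on the target in question should show whether a load or a GPR-to-vector
move/dup is chosen.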