https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
Wilco <wdijkstr at arm dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wdijkstr at arm dot com

--- Comment #6 from Wilco <wdijkstr at arm dot com> ---
I ran the assembler examples on A57 hardware with identical input. The FMADD code is ~20% faster irrespective of the size of the input. This is not a surprise given that the FMADD latency is lower than the combined FADD and FMUL latency. Neither the alignment of the loop nor the scheduling matters at all, as the FMADD latency dominates by far. With serious optimization this code could run 4-5 times as fast and would only be limited by memory bandwidth on datasets larger than L2.

So this particular example shows issues in LLVM, not in GCC.
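
As a rough illustration of why a loop like this is latency-bound, here is a minimal C sketch. It is not the code attached to this bug: the function names dot_single and dot_four are hypothetical, and the choice of four accumulators is an assumption made for illustration, not a tuned value for A57. The first loop carries one dependency chain through the accumulator, so each multiply-add has to wait for the previous one. The second loop keeps four independent chains in flight, which is one common way to hide that latency.

```c
#include <stddef.h>

/* One accumulator: every multiply-add depends on the previous result,
   so throughput is bounded by the latency of the multiply-add chain.
   A compiler may contract a[i] * b[i] + acc into a fused multiply-add
   (e.g. FMADD on AArch64) when floating-point contraction is enabled. */
double dot_single(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Four accumulators (hypothetical unroll factor): four independent
   dependency chains, so several multiply-adds can overlap in the
   pipeline. Note that this changes the order of summation. */
double dot_four(const double *a, const double *b, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    /* Handle the remaining 0-3 elements. */
    for (; i < n; i++)
        acc0 += a[i] * b[i];

    return (acc0 + acc1) + (acc2 + acc3);
}
```

Because the second version reassociates the sum, its result can differ from the first in the last bits for general floating-point input. That is why a compiler will not normally make this transformation on its own without a flag such as -ffast-math.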