https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438
Hongtao Liu <liuhongt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> >this may cause significant performance regression of some nested loops.
>
> I suspect it depends on the micro-arch for the x86 target.
>
> What are you running the test on?
>
> .p2align 6
> .L3:
>
> I notice GCC aligns only the inner loop to 64 byte boundary while clang/LLVM
> aligns each loop (inner and outer) loops to 16 byte boundary.

It aligns the inner loop to a cache line, which depends on the micro-arch.

Yes, we were worried about such a regression before, but didn't observe any in
SPEC or other workloads.
Maybe we can provide an option to control the optimization while still turning
it on by default (and also try to implement a heuristic to prevent regressions
in such inner loops).
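
For reference, a minimal sketch of the kind of nested loop being discussed. This is not the testcase from the report; the function name `sum_rows` and its parameters are made up for illustration. Whether the inner loop head gets `.p2align 6` depends on the GCC version and the `-mtune` target.

```c
#include <stddef.h>

/* Hypothetical example, not taken from the bug report:
   sum a rows x cols matrix stored in row-major order. */
long sum_rows(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t i = 0; i < rows; i++) {       /* outer loop */
        const int *row = m + i * cols;
        for (size_t j = 0; j < cols; j++)     /* small inner loop */
            total += row[j];
    }
    return total;
}
```

Compiling with `gcc -O2 -S` and searching the assembly for `.p2align` shows which loop heads were aligned. The existing `-falign-loops=n` option changes the requested loop alignment and can be used to compare against the 16-byte alignment mentioned for clang/LLVM.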