--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 43896
r254011 with peeling disabled
The other differences look like RA/scheduling; in the end the stack frame in the
new rev. is 32 bytes larger (up from $4800 to $4832). Disabling the 2nd
scheduling pass doesn't have any nice effects, btw.
All the spills in the code certainly make for bad code, so I'm not sure that
trying to fix things by somehow re-introducing the peeling for alignment
makes the most sense...
Looking for an opportunity to distribute the loop might make more sense,
eventually more explicitly "spilling" shared intermediate results to
memory in distribution. The source is quite unwieldy and dependences
are not obvious here.
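
To illustrate what distributing with an explicitly "spilled" intermediate
could look like, here is a minimal sketch. It is not taken from the testcase
in this PR: the function names, the array names and the arithmetic are all
hypothetical, and it assumes the arrays do not overlap.

```c
#include <stddef.h>

/* Fused form: a single loop computes the shared intermediate t and
   both results, so all of these values are live at the same time.
   All names here are hypothetical.  */
void
fused (const double *restrict a, const double *restrict b,
       double *restrict x, double *restrict y, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      double t = a[i] * b[i] + a[i];   /* shared intermediate */
      x[i] = t * t - b[i];
      y[i] = t + a[i] * a[i];
    }
}

/* Distributed form: the shared intermediate is stored to a
   caller-provided temporary array (the explicit "spill"), and each
   consumer gets its own, smaller loop.  Assumes tmp has room for n
   elements and overlaps none of the other arrays.  */
void
distributed (const double *restrict a, const double *restrict b,
             double *restrict x, double *restrict y,
             double *restrict tmp, size_t n)
{
  for (size_t i = 0; i < n; i++)
    tmp[i] = a[i] * b[i] + a[i];

  for (size_t i = 0; i < n; i++)
    x[i] = tmp[i] * tmp[i] - b[i];

  for (size_t i = 0; i < n; i++)
    y[i] = tmp[i] + a[i] * a[i];
}
```

Each distributed loop keeps fewer values live at once, which is the property
that might reduce register pressure. Whether that pays off for the loop in
this PR depends on the extra memory traffic and on the dependences actually
allowing the split.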