https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #19 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro from comment #16)
> (In reply to Wilco from comment #15)
> > Using -Ofast is not any different from -O3 -ffast-math when compiling
> > non-Fortran code. As comment 10 shows, both loops are vectorized, however
> > LLVM unrolls twice and uses multiple accumulators while GCC doesn't.
>
> You're right. LLVM produces:
>
> .LBB0_1:                                // %vector.body
>                                         // =>This Inner Loop Header: Depth=1
>         add     x11, x9, x8
>         add     x12, x10, x8
>         ldp     q2, q3, [x11]
>         ldp     q4, q5, [x12]
>         add     x8, x8, #32             // =32
>         fmla    v0.2d, v2.2d, v4.2d
>         fmla    v1.2d, v3.2d, v5.2d
>         cmp     x8, #128, lsl #12       // =524288
>         b.ne    .LBB0_1
>
> And GCC:
>
> .L3:
>         ldr     q2, [x2, x0]
>         add     w1, w1, 1
>         ldr     q1, [x3, x0]
>         cmp     w1, w4
>         add     x0, x0, 16
>         fmla    v0.2d, v2.2d, v1.2d
>         bcc     .L3
>
> > I still don't see what this has to do with A57. You should open a generic
> > bug about GCC not applying basic loop optimizations with -O3 (in fact
> > limited unrolling is useful even for -O2).
>
> Indeed, but I think that there's still a code-generation opportunity for A57
> here.
>
> Note above that the registers are loaded in pairs by LLVM, while GCC, when
> it unrolls the loop, more aggressively BTW, each vector is loaded
> individually:

Load/store pair optimization should be committed soon:
https://gcc.gnu.org/ml/gcc-patches/2014-10/msg02005.html

> .L3:
>         ldr     q28, [x15, x16]
>         add     x17, x16, 16
>         ldr     q29, [x14, x16]
>         add     x0, x16, 32
>         ldr     q30, [x15, x17]
>         add     x18, x16, 48
>         ldr     q31, [x14, x17]
>         add     x1, x16, 64
> ...
>         fmla    v27.2d, v28.2d, v29.2d
> ...
>         fmla    v27.2d, v30.2d, v31.2d
> ...     # Rest of 8x unroll
>         bcc     .L3
>
> It also goes without saying that this code could also benefit from the
> post-increment addressing mode.

Yes I've noticed bad addressing like that and fixes are in progress. It's an
issue in iv-opt - even without post-increment enabled the obvious addressing
mode to use is immediate offset.
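
For readers following the "multiple accumulators" point, here is a minimal C sketch of the difference at source level. The test case's source is not quoted in this comment, so the sketch assumes a plain double-precision dot product, which is what the paired loads and fmla instructions suggest. The function names dot_one_acc and dot_two_acc are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed shape of the reduction: one accumulator, so each
   multiply-add has to wait for the previous one to finish. */
double dot_one_acc(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-written version of "unroll twice, two accumulators":
   sum0 and sum1 form independent dependency chains that are
   only combined after the loop. */
double dot_two_acc(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum0 = 0.0, sum1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                      /* leftover element when n is odd */
        sum0 += a[i] * b[i];
    return sum0 + sum1;
}
```

Splitting the sum changes the order of the floating-point additions, so the two functions can differ in the last bits for general inputs. That is why a compiler needs reassociation to be permitted (as -Ofast or -ffast-math does) before it can make this transformation itself.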