[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

wdijkstr at arm dot com Wed, 22 Oct 2014 16:59:45 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503


--- Comment #19 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro from comment #16)
> (In reply to Wilco from comment #15)
> > Using -Ofast is not any different from -O3 -ffast-math when compiling
> > non-Fortran code. As comment 10 shows, both loops are vectorized, however
> > LLVM unrolls twice and uses multiple accumulators while GCC doesn't.
> 
> You're right.  LLVM produces:
> 
> .LBB0_1:                                // %vector.body
>                                         // =>This Inner Loop Header: Depth=1
>         add      x11, x9, x8
>         add      x12, x10, x8
>         ldp      q2, q3, [x11]
>         ldp      q4, q5, [x12]
>         add      x8, x8, #32             // =32
>         fmla     v0.2d, v2.2d, v4.2d
>         fmla     v1.2d, v3.2d, v5.2d
>         cmp      x8, #128, lsl #12      // =524288
>         b.ne    .LBB0_1
> 
> And GCC:
> 
> .L3:
>         ldr     q2, [x2, x0]
>         add     w1, w1, 1
>         ldr     q1, [x3, x0]
>         cmp     w1, w4
>         add     x0, x0, 16
>         fmla    v0.2d, v2.2d, v1.2d
>         bcc     .L3
> 
> > I still don't see what this has to do with A57. You should open a generic
> > bug about GCC not applying basic loop optimizations with -O3 (in fact
> > limited unrolling is useful even for -O2).
> 
> Indeed, but I think that there's still a code-generation opportunity for A57
> here.
> 
> Note above that LLVM loads the registers in pairs, while GCC, when it
> unrolls the loop (more aggressively, BTW), loads each vector
> individually:

Load/store pair optimization should be committed soon:
https://gcc.gnu.org/ml/gcc-patches/2014-10/msg02005.html

> .L3:
>         ldr     q28, [x15, x16]
>         add     x17, x16, 16
>         ldr     q29, [x14, x16]
>         add     x0, x16, 32
>         ldr     q30, [x15, x17]
>         add     x18, x16, 48
>         ldr     q31, [x14, x17]
>         add     x1, x16, 64
>         ...
>         fmla    v27.2d, v28.2d, v29.2d
>         ...
>         fmla    v27.2d, v30.2d, v31.2d
>         ...     # Rest of 8x unroll
>         bcc     .L3
> 
> It also goes without saying that this code could also benefit from the
> post-increment addressing mode.

Yes, I've noticed bad addressing like that, and fixes are in progress. It's an
issue in iv-opt: even without post-increment enabled, the obvious addressing
mode to use is immediate offset.
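
For illustration, here is a rough source-level sketch of what unrolling twice
with multiple accumulators amounts to. The test case itself is not shown in
this comment, so the kernel is assumed to be a plain dot-product reduction;
the name dot_unrolled2 and its signature are hypothetical. Splitting the sum
across two accumulators reorders the floating-point additions, which is why
the compiler only does this under -ffast-math (or -Ofast).

```c
#include <stddef.h>

/* Hypothetical dot-product kernel (an assumption, not the bug's test case),
   unrolled by two with one accumulator per unrolled copy. */
double dot_unrolled2(const double *a, const double *b, size_t n)
{
    double acc0 = 0.0;  /* partial sum over even-indexed elements */
    double acc1 = 0.0;  /* partial sum over odd-indexed elements */
    size_t i = 0;

    /* Main loop: the two multiply-adds are independent of each other, so
       the second does not have to wait for the first to complete. */
    for (; i + 2 <= n; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }

    /* Remainder: at most one element is left when n is odd. */
    for (; i < n; i++)
        acc0 += a[i] * b[i];

    /* Combine the partial sums once, after the loop. */
    return acc0 + acc1;
}
```

Within one iteration each load sits at a fixed constant distance from the
first load of the same array, so one base register per array plus an
immediate offset should be enough to address all of them.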
