[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Thomas Preud'homme changed:

           What      |Removed  |Added
           Status    |WAITING  |RESOLVED
           CC        |         |thopre01 at gcc dot gnu.org
           Resolution|---      |FIXED

--- Comment #26 from Thomas Preud'homme ---
Fixed as of r222512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #25 from Thomas Preud'homme thopre01 at gcc dot gnu.org ---
Author: thopre01
Date: Tue Apr 28 08:10:44 2015
New Revision: 222512

URL: https://gcc.gnu.org/viewcvs?rev=222512&root=gcc&view=rev
Log:
2015-04-28  Thomas Preud'homme  thomas.preudho...@arm.com

    gcc/
    PR target/63503
    * config.gcc: Add cortex-a57-fma-steering.o to extra_objs for
    aarch64-*-*.
    * config/aarch64/t-aarch64: Add a rule for cortex-a57-fma-steering.o.
    * config/aarch64/aarch64.h (AARCH64_FL_USE_FMA_STEERING_PASS): Define.
    (AARCH64_TUNE_FMA_STEERING): Likewise.
    * config/aarch64/aarch64-cores.def: Set
    AARCH64_FL_USE_FMA_STEERING_PASS for cores with dynamic steering of
    FMUL/FMADD instructions.
    * config/aarch64/aarch64.c (aarch64_register_fma_steering): Declare.
    (aarch64_override_options): Include cortex-a57-fma-steering.h.  Call
    aarch64_register_fma_steering () if AARCH64_TUNE_FMA_STEERING is true.
    * config/aarch64/cortex-a57-fma-steering.h: New file.
    * config/aarch64/cortex-a57-fma-steering.c: Likewise.

Added:
    trunk/gcc/config/aarch64/cortex-a57-fma-steering.c
    trunk/gcc/config/aarch64/cortex-a57-fma-steering.h
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config.gcc
    trunk/gcc/config/aarch64/aarch64-cores.def
    trunk/gcc/config/aarch64/aarch64.c
    trunk/gcc/config/aarch64/aarch64.h
    trunk/gcc/config/aarch64/t-aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #21 from Evandro e.menezes at samsung dot com ---
(In reply to ramana.radhakrish...@arm.com from comment #20)
> What's the kind of performance delta you see if you managed to unroll
> the loop just a wee bit ? Probably not much looking at the code
> produced here.

Comparing the cycle counts on Juno when running the program from the
matrix multiplication test above built with -Ofast and unrolling:

-fno-unroll-loops: 592000
-funroll-loops --param max-unroll-times=2: 594000
-funroll-loops --param max-unroll-times=4: 592000
-funroll-loops: 59 (implies --param max-unroll-times=8)
-funroll-loops --param max-unroll-times=16: 581000

It seems to me that without effective iv-opt in place, loops have to be
unrolled too aggressively to make any difference in this case, greatly
sacrificing code size.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #22 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro from comment #21)
> (In reply to ramana.radhakrish...@arm.com from comment #20)
> > What's the kind of performance delta you see if you managed to unroll
> > the loop just a wee bit ? Probably not much looking at the code
> > produced here.
>
> Comparing the cycle counts on Juno when running the program from the
> matrix multiplication test above built with -Ofast and unrolling:
>
> -fno-unroll-loops: 592000
> -funroll-loops --param max-unroll-times=2: 594000
> -funroll-loops --param max-unroll-times=4: 592000
> -funroll-loops: 59 (implies --param max-unroll-times=8)
> -funroll-loops --param max-unroll-times=16: 581000
>
> It seems to me that without effective iv-opt in place, loops have to be
> unrolled too aggressively to make any difference in this case, greatly
> sacrificing code size.

Unrolling alone isn't good enough in sum reductions. As I mentioned
before, GCC doesn't enable any of the useful loop optimizations by
default. So add -fvariable-expansion-in-unroller to get a good speedup
with unrolling. Again these are all generic GCC issues.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #23 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #22)
> Unrolling alone isn't good enough in sum reductions. As I mentioned
> before, GCC doesn't enable any of the useful loop optimizations by
> default. So add -fvariable-expansion-in-unroller to get a good speedup
> with unrolling. Again these are all generic GCC issues.

Adding -fvariable-expansion-in-unroller when using -funroll-loops
results in practically the same code being emitted.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #24 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro from comment #23)
> (In reply to Wilco from comment #22)
> > Unrolling alone isn't good enough in sum reductions. As I mentioned
> > before, GCC doesn't enable any of the useful loop optimizations by
> > default. So add -fvariable-expansion-in-unroller to get a good
> > speedup with unrolling. Again these are all generic GCC issues.
>
> Adding -fvariable-expansion-in-unroller when using -funroll-loops
> results in practically the same code being emitted.

Correct, all it does is cut the dependency chain of the accumulates. But
that's enough to get the speedup.
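[Editorial note: what "cutting the dependency chain of the accumulates" means can be sketched by hand in C. This is an illustrative rewrite of the sum-reduction loop discussed in the thread, not the compiler's actual transformation; the function names are made up.]

```c
#include <stddef.h>

/* Naive reduction: every multiply-add depends on the previous sum, so
   the loop runs at the latency of one fused multiply-add per iteration. */
double dot_naive(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled 2x with two accumulators (what variable expansion does):
   the two chains are independent, so their multiply-adds can overlap. */
double dot_expanded(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        s0 += a[i] * b[i];
    return s0 + s1;         /* reassociated sum: this is why -ffast-math
                               (or -fassociative-math) is required */
}
```

The two versions compute the sum in a different order, which is exactly the reassociation that makes the transformation invalid under strict IEEE semantics and ties it to -ffast-math in this thread.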
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #20 from ramana.radhakrishnan at arm dot com ---
On 23/10/14 00:28, e.menezes at samsung dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
>
> --- Comment #16 from Evandro e.menezes at samsung dot com ---
> (In reply to Wilco from comment #15)
> > Using -Ofast is not any different from -O3 -ffast-math when compiling
> > non-Fortran code. As comment 10 shows, both loops are vectorized,
> > however LLVM unrolls twice and uses multiple accumulators while GCC
> > doesn't.
>
> You're right.  LLVM produces:
>
> .LBB0_1:                                // %vector.body
>                                         // =>This Inner Loop Header: Depth=1
>         add     x11, x9, x8
>         add     x12, x10, x8
>         ldp     q2, q3, [x11]
>         ldp     q4, q5, [x12]
>         add     x8, x8, #32             // =32
>         fmla    v0.2d, v2.2d, v4.2d
>         fmla    v1.2d, v3.2d, v5.2d
>         cmp     x8, #128, lsl #12       // =524288
>         b.ne    .LBB0_1
>
> And GCC:
>
> .L3:
>         ldr     q2, [x2, x0]
>         add     w1, w1, 1
>         ldr     q1, [x3, x0]
>         cmp     w1, w4
>         add     x0, x0, 16
>         fmla    v0.2d, v2.2d, v1.2d
>         bcc     .L3
>
> > I still don't see what this has to do with A57. You should open a
> > generic bug about GCC not applying basic loop optimizations with -O3
> > (in fact limited unrolling is useful even for -O2).
>
> Indeed, but I think that there's still a code-generation opportunity
> for A57 here.

What you mention is a general code generation improvement for AArch64.
There's nothing Cortex-A57 specific about it. In the AArch64 backend, we
think architecture first and then micro-architecture.

> Note above that the registers are loaded in pairs by LLVM, while GCC,
> when it unrolls the loop (more aggressively, BTW), loads each vector
> individually:
>
> .L3:
>         ldr     q28, [x15, x16]
>         add     x17, x16, 16
>         ldr     q29, [x14, x16]
>         add     x0, x16, 32
>         ldr     q30, [x15, x17]
>         add     x18, x16, 48
>         ldr     q31, [x14, x17]
>         add     x1, x16, 64
>         ...
>         fmla    v27.2d, v28.2d, v29.2d
>         ...
>         fmla    v27.2d, v30.2d, v31.2d
>         ...
>         # Rest of 8x unroll
>         bcc     .L3
>
> It also goes without saying that this code could also benefit from the
> post-increment addressing mode.

What's the kind of performance delta you see if you managed to unroll
the loop just a wee bit ? Probably not much looking at the code produced
here.

Ramana
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #13 from Wilco wdijkstr at arm dot com ---
(In reply to Andrew Pinski from comment #11)
> (In reply to Wilco from comment #10)
> > The loops shown are not the correct inner loops for those options -
> > with -ffast-math they are vectorized. LLVM unrolls 2x but GCC
> > doesn't. So the question is why GCC doesn't unroll vectorized loops
> > like LLVM?
>
> Because unrolling is not enabled at -O3.  Try adding -funroll-loops.

Isn't it odd that GCC doesn't even do the most basic unrolling at its
maximum optimization setting, but it does do vectorization? Note
-funroll-loops is not sufficient either: for this particular loop you
also need -fvariable-expansion-in-unroller, which isn't enabled at -O3
either, plus setting the associated param to 4 or 8. So GCC is certainly
capable of generating quality code for this example, it just doesn't do
so by default - unlike LLVM.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #14 from Evandro Menezes e.menezes at samsung dot com ---
Compiling the test-case above with just -O2, I can reproduce the code I
mentioned initially and easily measure the cycle count to run it on
target using perf. The binary created by GCC runs in about 447000 user
cycles and the one created by LLVM in about 499000 user cycles. IOW,
fused multiply-add is a win on A57.

Looking further into why Geekbench's {D,S}GEMM performs worse with GCC
than with LLVM, both using -Ofast, it turns out that GCC fails to
vectorize the loop in gemm_block_kernel, while LLVM does.

I should've done a more detailed analysis before submitting this bug,
sorry.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #15 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro Menezes from comment #14)
> Compiling the test-case above with just -O2, I can reproduce the code
> I mentioned initially and easily measure the cycle count to run it on
> target using perf. The binary created by GCC runs in about 447000 user
> cycles and the one created by LLVM in about 499000 user cycles. IOW,
> fused multiply-add is a win on A57.
>
> Looking further why Geekbench's {D,S}GEMM performs worse with GCC than
> with LLVM, both using -Ofast, GCC fails to vectorize the loop in
> gemm_block_kernel, while LLVM does.
>
> I should've done a more detailed analysis in this issue before
> submitting this bug, sorry.

Using -Ofast is not any different from -O3 -ffast-math when compiling
non-Fortran code. As comment 10 shows, both loops are vectorized,
however LLVM unrolls twice and uses multiple accumulators while GCC
doesn't.

I still don't see what this has to do with A57. You should open a
generic bug about GCC not applying basic loop optimizations with -O3 (in
fact limited unrolling is useful even for -O2).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #16 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #15)
> Using -Ofast is not any different from -O3 -ffast-math when compiling
> non-Fortran code. As comment 10 shows, both loops are vectorized,
> however LLVM unrolls twice and uses multiple accumulators while GCC
> doesn't.

You're right.  LLVM produces:

.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        add     x11, x9, x8
        add     x12, x10, x8
        ldp     q2, q3, [x11]
        ldp     q4, q5, [x12]
        add     x8, x8, #32             // =32
        fmla    v0.2d, v2.2d, v4.2d
        fmla    v1.2d, v3.2d, v5.2d
        cmp     x8, #128, lsl #12       // =524288
        b.ne    .LBB0_1

And GCC:

.L3:
        ldr     q2, [x2, x0]
        add     w1, w1, 1
        ldr     q1, [x3, x0]
        cmp     w1, w4
        add     x0, x0, 16
        fmla    v0.2d, v2.2d, v1.2d
        bcc     .L3

> I still don't see what this has to do with A57. You should open a
> generic bug about GCC not applying basic loop optimizations with -O3
> (in fact limited unrolling is useful even for -O2).

Indeed, but I think that there's still a code-generation opportunity for
A57 here.

Note above that the registers are loaded in pairs by LLVM, while GCC,
when it unrolls the loop (more aggressively, BTW), loads each vector
individually:

.L3:
        ldr     q28, [x15, x16]
        add     x17, x16, 16
        ldr     q29, [x14, x16]
        add     x0, x16, 32
        ldr     q30, [x15, x17]
        add     x18, x16, 48
        ldr     q31, [x14, x17]
        add     x1, x16, 64
        ...
        fmla    v27.2d, v28.2d, v29.2d
        ...
        fmla    v27.2d, v30.2d, v31.2d
        ...
        # Rest of 8x unroll
        bcc     .L3

It also goes without saying that this code could also benefit from the
post-increment addressing mode.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #17 from Evandro e.menezes at samsung dot com ---
Created attachment 33785
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33785&action=edit
Simple matrix multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Evandro e.menezes at samsung dot com changed:

           What                    |Removed |Added
  Attachment #33774 is obsolete   |0       |1

--- Comment #18 from Evandro e.menezes at samsung dot com ---
Created attachment 33786
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33786&action=edit
Simple test-case
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #19 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro from comment #16)
> (In reply to Wilco from comment #15)
> > Using -Ofast is not any different from -O3 -ffast-math when compiling
> > non-Fortran code. As comment 10 shows, both loops are vectorized,
> > however LLVM unrolls twice and uses multiple accumulators while GCC
> > doesn't.
>
> You're right.  LLVM produces:
>
> .LBB0_1:                                // %vector.body
>                                         // =>This Inner Loop Header: Depth=1
>         add     x11, x9, x8
>         add     x12, x10, x8
>         ldp     q2, q3, [x11]
>         ldp     q4, q5, [x12]
>         add     x8, x8, #32             // =32
>         fmla    v0.2d, v2.2d, v4.2d
>         fmla    v1.2d, v3.2d, v5.2d
>         cmp     x8, #128, lsl #12       // =524288
>         b.ne    .LBB0_1
>
> And GCC:
>
> .L3:
>         ldr     q2, [x2, x0]
>         add     w1, w1, 1
>         ldr     q1, [x3, x0]
>         cmp     w1, w4
>         add     x0, x0, 16
>         fmla    v0.2d, v2.2d, v1.2d
>         bcc     .L3
>
> > I still don't see what this has to do with A57. You should open a
> > generic bug about GCC not applying basic loop optimizations with -O3
> > (in fact limited unrolling is useful even for -O2).
>
> Indeed, but I think that there's still a code-generation opportunity
> for A57 here.
>
> Note above that the registers are loaded in pairs by LLVM, while GCC,
> when it unrolls the loop (more aggressively, BTW), loads each vector
> individually:

Load/store pair optimization should be committed soon:
https://gcc.gnu.org/ml/gcc-patches/2014-10/msg02005.html

> .L3:
>         ldr     q28, [x15, x16]
>         add     x17, x16, 16
>         ldr     q29, [x14, x16]
>         add     x0, x16, 32
>         ldr     q30, [x15, x17]
>         add     x18, x16, 48
>         ldr     q31, [x14, x17]
>         add     x1, x16, 64
>         ...
>         fmla    v27.2d, v28.2d, v29.2d
>         ...
>         fmla    v27.2d, v30.2d, v31.2d
>         ...
>         # Rest of 8x unroll
>         bcc     .L3
>
> It also goes without saying that this code could also benefit from the
> post-increment addressing mode.

Yes, I've noticed bad addressing like that and fixes are in progress.
It's an issue in iv-opt - even without post-increment enabled, the
obvious addressing mode to use is immediate offset.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #10 from Wilco wdijkstr at arm dot com ---
The loops shown are not the correct inner loops for those options - with
-ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So the
question is why GCC doesn't unroll vectorized loops like LLVM?

GCC:

.L24:
        ldr     q3, [x13, x5]
        add     x6, x6, 1
        ldr     q2, [x16, x5]
        cmp     x6, x12
        add     x5, x5, 16
        fmla    v1.2d, v3.2d, v2.2d
        bcc     .L24

LLVM:

.LBB2_12:
        ldur    q2, [x8, #-16]
        ldr     q3, [x8], #32
        ldur    q4, [x21, #-16]
        ldr     q5, [x21], #32
        fmla    v1.2d, v2.2d, v4.2d
        fmla    v0.2d, v3.2d, v5.2d
        sub     x30, x30, #4            // =4
        cbnz    x30, .LBB2_12
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #11 from Andrew Pinski pinskia at gcc dot gnu.org ---
(In reply to Wilco from comment #10)
> The loops shown are not the correct inner loops for those options -
> with -ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't.
> So the question is why GCC doesn't unroll vectorized loops like LLVM?

Because unrolling is not enabled at -O3.  Try adding -funroll-loops.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #12 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33774
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33774&action=edit
Simple test-case
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #8 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Ramana Radhakrishnan from comment #7)
> As Evandro doesn't mention flags it's hard to say whether there really
> is a problem here or not.

Both GCC and LLVM were given -O3 -ffast-math.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #9 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Wilco from comment #6)
> I ran the assembler examples on A57 hardware with identical input. The
> FMADD code is ~20% faster irrespective of the size of the input. This
> is not a surprise given that the FMADD latency is lower than the FADD
> and FMUL latency.

I ran the same Geekbench binaries on A53 and the result is about the
same between the GCC and the LLVM code, if with a slight (< 1%)
advantage for GCC.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Wilco wdijkstr at arm dot com changed:

           What |Removed |Added
           CC   |        |wdijkstr at arm dot com

--- Comment #6 from Wilco wdijkstr at arm dot com ---
I ran the assembler examples on A57 hardware with identical input. The
FMADD code is ~20% faster irrespective of the size of the input. This is
not a surprise given that the FMADD latency is lower than the FADD and
FMUL latency. The alignment of the loop and the scheduling don't matter
at all, as the FMADD latency dominates by far - with serious
optimization this code could run 4-5 times as fast and would only be
limited by memory bandwidth on datasets larger than L2. So this
particular example shows issues in LLVM, not in GCC.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Ramana Radhakrishnan ramana at gcc dot gnu.org changed:

           What            |Removed     |Added
           Status          |UNCONFIRMED |WAITING
           Last reconfirmed|            |2014-10-10
           Ever confirmed  |0           |1

--- Comment #7 from Ramana Radhakrishnan ramana at gcc dot gnu.org ---
(In reply to Wilco from comment #6)
> I ran the assembler examples on A57 hardware with identical input. The
> FMADD code is ~20% faster irrespective of the size of the input. This
> is not a surprise given that the FMADD latency is lower than the FADD
> and FMUL latency. The alignment of the loop and the scheduling don't
> matter at all, as the FMADD latency dominates by far - with serious
> optimization this code could run 4-5 times as fast and would only be
> limited by memory bandwidth on datasets larger than L2. So this
> particular example shows issues in LLVM, not in GCC.

The difference as to why LLVM puts out an fma while we don't is probably
because of default language standards: GCC defaults to GNU89 while LLVM
defaults to C99. If you used -std=c99 with GCC as well, you'd get the
same sequence as LLVM.

As Evandro doesn't mention flags, it's hard to say whether there really
is a problem here or not.

I only know of a separate gotcha with fmadds, which is unfortunate, but
that's not relevant to this discussion:
http://comments.gmane.org/gmane.comp.compilers.llvm.cvs/200282

This probably needs more analysis than it has received so far.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #1 from Andrew Pinski pinskia at gcc dot gnu.org ---
This might be true for A57, but for our chip (ThunderX) using fused
multiply-add is better. The other question here is: are there denormals
happening? That might cause some performance differences between using
fmadd and fmul/fadd. On most normal processors, using fused multiply-add
is an improvement as well.

Can you attach the preprocessed source and the options you are using?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #2 from Andrew Pinski pinskia at gcc dot gnu.org ---
The other possibility is that it is the fusion of the cmp and branch
which is causing the improvement. Can you manually edit the assembly,
swap the cmp and fmadd in the GCC output, and try again?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #3 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Andrew Pinski from comment #1)
> The other question here is: are there denormals happening? That might
> cause some performance differences between using fmadd and fmul/fadd.

Nope, no denormals.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #4 from Evandro Menezes e.menezes at samsung dot com ---
Here's a simplified code to reproduce these results:

double sum(double *A, double *B, int n)
{
   int i;
   double res = 0;

   for (i = 0; i < n; i++)
      res += A[i] * B[i];

   return res;
}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #5 from Andrew Pinski pinskia at gcc dot gnu.org ---
Also, how sure are you that it is the fused multiply-add and not the
scheduling of the instructions? As I mentioned, try swapping the cmp and
fmadd; you might get a performance boost.