[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2016-03-07 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Thomas Preud'homme  changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 CC||thopre01 at gcc dot gnu.org
 Resolution|--- |FIXED

--- Comment #26 from Thomas Preud'homme  ---
Fixed as of r222512

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2015-04-28 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #25 from Thomas Preud'homme thopre01 at gcc dot gnu.org ---
Author: thopre01
Date: Tue Apr 28 08:10:44 2015
New Revision: 222512

URL: https://gcc.gnu.org/viewcvs?rev=222512&root=gcc&view=rev
Log:
2015-04-28  Thomas Preud'homme  thomas.preudho...@arm.com

gcc/
PR target/63503
* config.gcc: Add cortex-a57-fma-steering.o to extra_objs for
aarch64-*-*.
* config/aarch64/t-aarch64: Add a rule for cortex-a57-fma-steering.o.
* config/aarch64/aarch64.h (AARCH64_FL_USE_FMA_STEERING_PASS): Define.
(AARCH64_TUNE_FMA_STEERING): Likewise.
* config/aarch64/aarch64-cores.def: Set
AARCH64_FL_USE_FMA_STEERING_PASS for cores with dynamic steering of
FMUL/FMADD instructions.
* config/aarch64/aarch64.c (aarch64_register_fma_steering): Declare.
(aarch64_override_options): Include cortex-a57-fma-steering.h. Call
aarch64_register_fma_steering () if AARCH64_TUNE_FMA_STEERING is true.
* config/aarch64/cortex-a57-fma-steering.h: New file.
* config/aarch64/cortex-a57-fma-steering.c: Likewise.

Added:
trunk/gcc/config/aarch64/cortex-a57-fma-steering.c
trunk/gcc/config/aarch64/cortex-a57-fma-steering.h
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config.gcc
trunk/gcc/config/aarch64/aarch64-cores.def
trunk/gcc/config/aarch64/aarch64.c
trunk/gcc/config/aarch64/aarch64.h
trunk/gcc/config/aarch64/t-aarch64


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #21 from Evandro e.menezes at samsung dot com ---
(In reply to ramana.radhakrish...@arm.com from comment #20)
 What's the kind of performance delta you see if you managed to unroll 
 the loop just a wee bit ? Probably not much looking at the code produced 
 here.

Comparing the cycle counts on Juno when running the program from the matrix
multiplication test above built with -Ofast and unrolling:

-fno-unroll-loops: 592000
-funroll-loops --param max-unroll-times=2: 594000
-funroll-loops --param max-unroll-times=4: 592000
-funroll-loops: 59 (implies --param max-unroll-times=8)
-funroll-loops --param max-unroll-times=16: 581000

It seems to me that without effective iv-opt in place, loops have to be
unrolled too aggressively to make any difference in this case, greatly
sacrificing code size.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #22 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro from comment #21)
 (In reply to ramana.radhakrish...@arm.com from comment #20)
  What's the kind of performance delta you see if you managed to unroll 
  the loop just a wee bit ? Probably not much looking at the code produced 
  here.
 
 Comparing the cycle counts on Juno when running the program from the matrix
 multiplication test above built with -Ofast and unrolling:
 
 -fno-unroll-loops: 592000
 -funroll-loops --param max-unroll-times=2: 594000
 -funroll-loops --param max-unroll-times=4: 592000
 -funroll-loops: 59 (implies --param max-unroll-times=8)
 -funroll-loops --param max-unroll-times=16: 581000
 
 It seems to me that without effective iv-opt in place, loops have to be
 unrolled too aggressively to make any difference in this case, greatly
 sacrificing code size.

Unrolling alone isn't good enough in sum reductions. As I mentioned before, GCC
doesn't enable any of the useful loop optimizations by default. So add
-fvariable-expansion-in-unroller to get a good speedup with unrolling. Again
these are all generic GCC issues.
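
To make Wilco's suggestion concrete, here is a sketch in C of the transformation -fvariable-expansion-in-unroller performs on a sum reduction (the function names are mine; splitting the accumulator is only legal under -ffast-math, since it reassociates the sum):

```c
#include <assert.h>

/* Single-accumulator reduction: every multiply-add depends on the
   previous one, so the loop is bound by the fmadd latency. */
static double dot_single(const double *a, const double *b, int n)
{
    double res = 0.0;
    for (int i = 0; i < n; i++)
        res += a[i] * b[i];
    return res;
}

/* What variable expansion does in spirit: expand the accumulator into
   independent copies so the multiply-adds can overlap in the pipeline,
   then combine the partial sums after the loop. */
static double dot_expanded(const double *a, const double *b, int n)
{
    double r0 = 0.0, r1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        r0 += a[i]     * b[i];
        r1 += a[i + 1] * b[i + 1];
    }
    for (; i < n; i++)          /* remainder iterations */
        r0 += a[i] * b[i];
    return r0 + r1;
}
```

Both versions compute the same mathematical sum; only the association order differs, which is why the transformation is gated behind -ffast-math.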


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #23 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #22)
 Unrolling alone isn't good enough in sum reductions. As I mentioned before,
 GCC doesn't enable any of the useful loop optimizations by default. So add
 -fvariable-expansion-in-unroller to get a good speedup with unrolling. Again
 these are all generic GCC issues.

Adding -fvariable-expansion-in-unroller when using -funroll-loops results in
practically the same code being emitted.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-28 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #24 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro from comment #23)
 (In reply to Wilco from comment #22)
  Unrolling alone isn't good enough in sum reductions. As I mentioned before,
  GCC doesn't enable any of the useful loop optimizations by default. So add
  -fvariable-expansion-in-unroller to get a good speedup with unrolling. Again
  these are all generic GCC issues.
 
 Adding -fvariable-expansion-in-unroller when using -funroll-loops results in
 practically the same code being emitted.

Correct, all it does is cut the dependency chain of the accumulates. But that's
enough to get the speedup.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-23 Thread ramana.radhakrishnan at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #20 from ramana.radhakrishnan at arm dot com ---
On 23/10/14 00:28, e.menezes at samsung dot com wrote:
 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

 --- Comment #16 from Evandro e.menezes at samsung dot com ---
 (In reply to Wilco from comment #15)
 Using -Ofast is not any different from -O3 -ffast-math when compiling
 non-Fortran code. As comment 10 shows, both loops are vectorized, however
 LLVM unrolls twice and uses multiple accumulators while GCC doesn't.

 You're right.  LLVM produces:

 .LBB0_1:// %vector.body
  // =>This Inner Loop Header: Depth=1
  add  x11, x9, x8
  add  x12, x10, x8
  ldp  q2, q3, [x11]
  ldp  q4, q5, [x12]
  add  x8, x8, #32 // =32
  fmla v0.2d, v2.2d, v4.2d
  fmla v1.2d, v3.2d, v5.2d
  cmp  x8, #128, lsl #12  // =524288
  b.ne .LBB0_1

 And GCC:

 .L3:
  ldr q2, [x2, x0]
  add w1, w1, 1
  ldr q1, [x3, x0]
  cmp w1, w4
  add x0, x0, 16
  fmla v0.2d, v2.2d, v1.2d
  bcc .L3

 I still don't see what this has to do with A57. You should open a generic
 bug about GCC not applying basic loop optimizations with -O3 (in fact
 limited unrolling is useful even for -O2).

 Indeed, but I think that there's still a code-generation opportunity for A57
 here.

What you mention is a general code generation improvement for AArch64.

There's nothing Cortex-A57 specific about it. In the AArch64 backend, we 
think architecture and then micro-architecture.


 Note above that the registers are loaded in pairs by LLVM, while GCC, when it
 unrolls the loop, more aggressively BTW, each vector is loaded individually:

 .L3:
  ldr q28, [x15, x16]
  add x17, x16, 16
  ldr q29, [x14, x16]
  add x0, x16, 32
  ldr q30, [x15, x17]
  add x18, x16, 48
  ldr q31, [x14, x17]
  add x1, x16, 64
  ...
  fmla v27.2d, v28.2d, v29.2d
  ...
  fmla v27.2d, v30.2d, v31.2d
  ... # Rest of 8x unroll
  bcc .L3

 It also goes without saying that this code could also benefit from the
 post-increment addressing mode.


What's the kind of performance delta you see if you managed to unroll 
the loop just a wee bit ? Probably not much looking at the code produced 
here.

Ramana




[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #13 from Wilco wdijkstr at arm dot com ---
(In reply to Andrew Pinski from comment #11)
 (In reply to Wilco from comment #10)
  The loops shown are not the correct inner loops for those options - with
  -ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So the
  question is why GCC doesn't unroll vectorized loops like LLVM?
 
 Because unrolling is not enabled at -O3.  Try adding -funroll-loops.

Isn't it odd that GCC doesn't even do the most basic unrolling at its maximum
optimization setting? But it does do vectorization?

Note -funroll-loops is not sufficient either, you need
-fvariable-expansion-in-unroller as well for this particular loop which also
isn't enabled at -O3. Plus setting the associated param to 4 or 8.

So GCC is certainly capable of generating quality code for this example, it
just doesn't do so by default - unlike LLVM.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #14 from Evandro Menezes e.menezes at samsung dot com ---
Compiling the test-case above with just -O2, I can reproduce the code I
mentioned initially and easily measure the cycle count to run it on target
using perf.

The binary created by GCC runs in about 447000 user cycles and the one created
by LLVM, in about 499000 user cycles.  IOW, fused multiply-add is a win on A57.

Looking further why Geekbench's {D,S}GEMM performs worse with GCC than with
LLVM, both using -Ofast, GCC fails to vectorize the loop in
gemm_block_kernel, while LLVM does.

I should've done a more detailed analysis in this issue before submitting this
bug, sorry.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #15 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro Menezes from comment #14)
 Compiling the test-case above with just -O2, I can reproduce the code I
 mentioned initially and easily measure the cycle count to run it on target
 using perf.
 
 The binary created by GCC runs in about 447000 user cycles and the one
 created by LLVM, in about 499000 user cycles.  IOW, fused multiply-add is a
 win on A57.
 
 Looking further why Geekbench's {D,S}GEMM performs worse with GCC than with
 LLVM, both using -Ofast, GCC fails to vectorize the loop in
 gemm_block_kernel, while LLVM does.
   
 I should've done a more detailed analysis in this issue before submitting
 this bug, sorry.

Using -Ofast is not any different from -O3 -ffast-math when compiling
non-Fortran code. As comment 10 shows, both loops are vectorized, however LLVM
unrolls twice and uses multiple accumulators while GCC doesn't.

I still don't see what this has to do with A57. You should open a generic bug
about GCC not applying basic loop optimizations with -O3 (in fact limited
unrolling is useful even for -O2).


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #16 from Evandro e.menezes at samsung dot com ---
(In reply to Wilco from comment #15)
 Using -Ofast is not any different from -O3 -ffast-math when compiling
 non-Fortran code. As comment 10 shows, both loops are vectorized, however
 LLVM unrolls twice and uses multiple accumulators while GCC doesn't.

You're right.  LLVM produces:

.LBB0_1:// %vector.body
// =>This Inner Loop Header: Depth=1
add  x11, x9, x8
add  x12, x10, x8
ldp  q2, q3, [x11]
ldp  q4, q5, [x12]
add  x8, x8, #32 // =32
fmla v0.2d, v2.2d, v4.2d
fmla v1.2d, v3.2d, v5.2d
cmp  x8, #128, lsl #12  // =524288
b.ne .LBB0_1

And GCC:

.L3:
ldr q2, [x2, x0]
add w1, w1, 1
ldr q1, [x3, x0]
cmp w1, w4
add x0, x0, 16
fmla v0.2d, v2.2d, v1.2d
bcc .L3

 I still don't see what this has to do with A57. You should open a generic
 bug about GCC not applying basic loop optimizations with -O3 (in fact
 limited unrolling is useful even for -O2).

Indeed, but I think that there's still a code-generation opportunity for A57
here.

Note above that the registers are loaded in pairs by LLVM, while GCC, when it
unrolls the loop (more aggressively, BTW), loads each vector individually:

.L3:
ldr q28, [x15, x16]
add x17, x16, 16
ldr q29, [x14, x16]
add x0, x16, 32
ldr q30, [x15, x17]
add x18, x16, 48
ldr q31, [x14, x17]
add x1, x16, 64
...
fmla v27.2d, v28.2d, v29.2d
...
fmla v27.2d, v30.2d, v31.2d
... # Rest of 8x unroll
bcc .L3

It also goes without saying that this code could also benefit from the
post-increment addressing mode.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #17 from Evandro e.menezes at samsung dot com ---
Created attachment 33785
  -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33785&action=edit
Simple matrix multiplication


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Evandro e.menezes at samsung dot com changed:

   What|Removed |Added

  Attachment #33774|0   |1
is obsolete||

--- Comment #18 from Evandro e.menezes at samsung dot com ---
Created attachment 33786
  -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33786&action=edit
Simple test-case


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-22 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #19 from Wilco wdijkstr at arm dot com ---
(In reply to Evandro from comment #16)
 (In reply to Wilco from comment #15)
  Using -Ofast is not any different from -O3 -ffast-math when compiling
  non-Fortran code. As comment 10 shows, both loops are vectorized, however
  LLVM unrolls twice and uses multiple accumulators while GCC doesn't.
 
 You're right.  LLVM produces:
 
 .LBB0_1:// %vector.body
 // =>This Inner Loop Header: Depth=1
 add  x11, x9, x8
 add  x12, x10, x8
 ldp  q2, q3, [x11]
 ldp  q4, q5, [x12]
 add  x8, x8, #32 // =32
 fmla v0.2d, v2.2d, v4.2d
 fmla v1.2d, v3.2d, v5.2d
 cmp  x8, #128, lsl #12  // =524288
 b.ne .LBB0_1
 
 And GCC:
 
 .L3:
 ldr q2, [x2, x0]
 add w1, w1, 1
 ldr q1, [x3, x0]
 cmp w1, w4
 add x0, x0, 16
 fmla v0.2d, v2.2d, v1.2d
 bcc .L3
 
  I still don't see what this has to do with A57. You should open a generic
  bug about GCC not applying basic loop optimizations with -O3 (in fact
  limited unrolling is useful even for -O2).
 
 Indeed, but I think that there's still a code-generation opportunity for A57
 here.
 
 Note above that the registers are loaded in pairs by LLVM, while GCC, when
 it unrolls the loop, more aggressively BTW, each vector is loaded
 individually:

Load/store pair optimization should be committed soon:
https://gcc.gnu.org/ml/gcc-patches/2014-10/msg02005.html

 .L3:
 ldr q28, [x15, x16]
 add x17, x16, 16
 ldr q29, [x14, x16]
 add x0, x16, 32
 ldr q30, [x15, x17]
 add x18, x16, 48
 ldr q31, [x14, x17]
 add x1, x16, 64
 ...
 fmla v27.2d, v28.2d, v29.2d
 ...
 fmla v27.2d, v30.2d, v31.2d
 ... # Rest of 8x unroll
 bcc .L3
 
 It also goes without saying that this code could also benefit from the
 post-increment addressing mode.

Yes I've noticed bad addressing like that and fixes are in progress. It's an
issue in iv-opt - even without post-increment enabled the obvious addressing
mode to use is immediate offset.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-21 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #10 from Wilco wdijkstr at arm dot com ---
The loops shown are not the correct inner loops for those options - with
-ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So the
question is why GCC doesn't unroll vectorized loops like LLVM?

GCC:

.L24:
ldr  q3, [x13, x5]
add  x6, x6, 1
ldr  q2, [x16, x5]
cmp  x6, x12
add  x5, x5, 16
fmla v1.2d, v3.2d, v2.2d
bcc  .L24

LLVM:

.LBB2_12:
ldur q2, [x8, #-16]
ldr  q3, [x8], #32
ldur q4, [x21, #-16]
ldr  q5, [x21], #32
fmla v1.2d, v2.2d, v4.2d
fmla v0.2d, v3.2d, v5.2d
sub  x30, x30, #4 // =4
cbnz x30, .LBB2_12


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-21 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #11 from Andrew Pinski pinskia at gcc dot gnu.org ---
(In reply to Wilco from comment #10)
 The loops shown are not the correct inner loops for those options - with
 -ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So the
 question is why GCC doesn't unroll vectorized loops like LLVM?

Because unrolling is not enabled at -O3.  Try adding -funroll-loops.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-21 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #12 from Evandro Menezes e.menezes at samsung dot com ---
Created attachment 33774
  -- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33774&action=edit
Simple test-case


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-14 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #8 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Ramana Radhakrishnan from comment #7)
 As Evandro doesn't mention flags it's hard to say whether there really is a
 problem here or not.

Both GCC and LLVM were given -O3 -ffast-math.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-14 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #9 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Wilco from comment #6)
 I ran the assembler examples on A57 hardware with identical input. The FMADD
 code is ~20% faster irrespectively of the size of the input. This is not a
 surprise given that the FMADD latency is lower than the FADD and FMUL
 latency.

I ran the same Geekbench binaries on A53 and the result is about the same
between the GCC and the LLVM code, if with a slight (< 1%) advantage for GCC.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-10 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Wilco wdijkstr at arm dot com changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #6 from Wilco wdijkstr at arm dot com ---
I ran the assembler examples on A57 hardware with identical input. The FMADD
code is ~20% faster irrespective of the size of the input. This is not a
surprise given that the FMADD latency is lower than the FADD and FMUL latency.

Neither the alignment of the loop nor the scheduling matters at all, as the
FMADD latency dominates by far - with serious optimization this code could run 4-5
times as fast and would only be limited by memory bandwidth on datasets larger
than L2.

So this particular example shows issues in LLVM, not in GCC.
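
Wilco's latency argument can be put in back-of-envelope form. The sketch below uses illustrative latency/throughput values, not measured Cortex-A57 timings:

```c
#include <assert.h>

/* Crude cost model of a sum reduction: n fused multiply-adds that all
   feed one accumulator form a single dependency chain bound by the
   fmadd latency; k independent accumulators overlap the chains until
   the issue throughput (or memory bandwidth) becomes the limit.
   lat and tput are placeholder cycle counts, not A57 numbers. */
static long reduction_cycles(long n, long lat, long tput, long k)
{
    long chain = (n / k) * lat;   /* longest dependency chain */
    long issue = n * tput;        /* cannot issue faster than this */
    return chain > issue ? chain : issue;
}
```

With lat=7 and tput=1, a single accumulator costs ~7 cycles per element while four accumulators approach ~1.75, which is consistent with the estimate that the code could run 4-5 times as fast.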


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-10 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Ramana Radhakrishnan ramana at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |WAITING
   Last reconfirmed||2014-10-10
 Ever confirmed|0   |1

--- Comment #7 from Ramana Radhakrishnan ramana at gcc dot gnu.org ---

(In reply to Wilco from comment #6)
 I ran the assembler examples on A57 hardware with identical input. The FMADD
 code is ~20% faster irrespectively of the size of the input. This is not a
 surprise given that the FMADD latency is lower than the FADD and FMUL
 latency.
 
 The alignment of the loop or scheduling don't matter at all as the FMADD
 latency dominates by far - with serious optimization this code could run 4-5
 times as fast and would only be limited by memory bandwidth on datasets
 larger than L2.
 
 So this particular example shows issues in LLVM, not in GCC.

The difference as to why LLVM puts out an fma while we don't is probably down to
default language standards. GCC defaults to GNU89 while LLVM defaults to
C99. If you used -std=c99 with GCC as well, you'd get the same sequence as LLVM.
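
The language mode matters because C99 makes "contraction" of a*b + c into a fused multiply-add an explicitly controllable behavior. A minimal sketch (GCC's practical control is the -ffp-contract= option; its support for the pragma itself has historically been incomplete):

```c
#include <assert.h>

/* C99 7.12.2: the FP_CONTRACT pragma controls whether a*b + c may be
   evaluated as one fused multiply-add with a single rounding. */
#pragma STDC FP_CONTRACT ON

double mac(double a, double b, double c)
{
    /* With contraction allowed, the compiler may emit a single fmadd
       here; with -ffp-contract=off it must use separate fmul/fadd. */
    return a * b + c;
}
```

Either way the result here is the same up to one rounding; the performance question is whether one instruction or two get emitted.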

As Evandro doesn't mention flags it's hard to say whether there really is a
problem here or not.

I only know of a separate gotcha with fmadds which is unfortunate but that's
not relevant to this discussion. 

http://comments.gmane.org/gmane.comp.compilers.llvm.cvs/200282

This probably needs more analysis than the current state.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #1 from Andrew Pinski pinskia at gcc dot gnu.org ---
This might be true for A57 but for our chip (ThunderX), using fused
multiply-add is better.

The other question here: are there denormals happening?  That might cause some
performance differences between using fmadd and fmul/fadd.

On most normal processors using fused multiply-add is an improvement also.

Can you attach the preprocessed source and what options you are using?


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #2 from Andrew Pinski pinskia at gcc dot gnu.org ---
The other possibility is that the fusion of the cmp and branch is what's causing
the improvement.

Can you manually edit the assembly and swap the cmp and fmadd in the GCC output
and try again?


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #3 from Evandro Menezes e.menezes at samsung dot com ---
(In reply to Andrew Pinski from comment #1)
 The other question here are there denormals happening?  That might cause
 some performance differences between using fmadd and fmul/fadd.

Nope, no denormals.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread e.menezes at samsung dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #4 from Evandro Menezes e.menezes at samsung dot com ---
Here's a simplified code to reproduce these results:

double sum(double *A, double *B, int n) 
{
  int i;
  double res = 0;

  for (i = 0; i < n; i++)
    res += A[i] * B[i];

  return res;
}


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-09 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #5 from Andrew Pinski pinskia at gcc dot gnu.org ---
Also how sure are you that it is the fused multiply-add and not the scheduling
of the instructions?  As I mentioned, try swapping the cmp and fmadd; you might
get a performance boost.