On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas <rackd...@gmail.com> wrote:
> Hi,
>
> First of all, does LLVM essentially fma or muladd expressions like
> `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one explicitly
> use `muladd` and `fma` on these types of instructions (is there a macro
> for making this easier)?

Yes, LLVM will use fma machine instructions -- but only if they lead to
the same round-off error as using separate multiply and add
instructions. If you do not care about the details of conforming to the
IEEE standard, then you can use the `@fastmath` macro, which enables
several optimizations, including this one. This is described in the
manual:
http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations

> Secondly, I am wondering if my setup is not applying these operations
> correctly. Here's my test code:
>
> f(x) = 2.0x + 3.0
> g(x) = muladd(x, 2.0, 3.0)
> h(x) = fma(x, 2.0, 3.0)
>
> @code_llvm f(4.0)
> @code_llvm g(4.0)
> @code_llvm h(4.0)
>
> @code_native f(4.0)
> @code_native g(4.0)
> @code_native h(4.0)
>
> *Computer 1*
>
> Julia Version 0.5.0-rc4+0
> Commit 9c76c3e* (2016-09-09 01:43 UTC)
> Platform Info:
>   System: Linux (x86_64-redhat-linux)
>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>   LAPACK: libopenblasp.so.0
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)

This looks good: the "broadwell" architecture that LLVM uses should
imply the respective optimizations. Try with `@fastmath`.
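As a quick sketch of the suggestion (added here for illustration, not part of the original exchange; the name `fm` is assumed):

```julia
f(x)  = 2.0x + 3.0
fm(x) = @fastmath 2.0x + 3.0   # permits contracting the mul+add into one fma

# Inspect the generated code; on an FMA-capable CPU the @fastmath
# version may lower to a single vfmadd instruction:
# @code_native fm(4.0)

println(f(4.0))   # 11.0
println(fm(4.0))  # 11.0
```

Both functions return the same value for this input; only the rounding guarantees (and the generated instructions) may differ.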
-erik

> (the COPR nightly on CentOS 7) with
>
> [crackauc@crackauc2 ~]$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    1
> Core(s) per socket:    8
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 79
> Model name:            Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
> Stepping:              1
> CPU MHz:               1200.000
> BogoMIPS:              6392.58
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              25600K
> NUMA node0 CPU(s):     0-7
> NUMA node1 CPU(s):     8-15
>
> I get the output
>
> define double @julia_f_72025(double) #0 {
> top:
>   %1 = fmul double %0, 2.000000e+00
>   %2 = fadd double %1, 3.000000e+00
>   ret double %2
> }
>
> define double @julia_g_72027(double) #0 {
> top:
>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>   ret double %1
> }
>
> define double @julia_h_72029(double) #0 {
> top:
>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>   ret double %1
> }
>
>         .text
> Filename: fmatest.jl
>         pushq   %rbp
>         movq    %rsp, %rbp
> Source line: 1
>         addsd   %xmm0, %xmm0
>         movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
>         addsd   (%rax), %xmm0
>         popq    %rbp
>         retq
>         nopl    (%rax,%rax)
>         .text
> Filename: fmatest.jl
>         pushq   %rbp
>         movq    %rsp, %rbp
> Source line: 2
>         addsd   %xmm0, %xmm0
>         movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
>         addsd   (%rax), %xmm0
>         popq    %rbp
>         retq
>         nopl    (%rax,%rax)
>         .text
> Filename: fmatest.jl
>         pushq   %rbp
>         movq    %rsp, %rbp
>         movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
> Source line: 3
>         movsd   (%rax), %xmm1           # xmm1 = mem[0],zero
>         movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
>         movsd   (%rax), %xmm2           # xmm2 = mem[0],zero
>         movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
>         popq    %rbp
>         jmpq    *%rax
>         nopl    (%rax)
>
> It looks like explicit muladd or not ends up at the same native code,
> but is that native code actually
> doing an fma? The fma native is different, but from a discussion on the
> Gitter it seems that might be a software FMA? This computer is set up
> with the BIOS setting as LAPACK optimized or something like that, so is
> that messing with something?
>
> *Computer 2*
>
> Julia Version 0.6.0-dev.557
> Commit c7a4897 (2016-09-08 17:50 UTC)
> Platform Info:
>   System: NT (x86_64-w64-mingw32)
>   CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>   LAPACK: libopenblas64_
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>
> On a 4770K i7, Windows 10, I get the output
>
> ; Function Attrs: uwtable
> define double @julia_f_66153(double) #0 {
> top:
>   %1 = fmul double %0, 2.000000e+00
>   %2 = fadd double %1, 3.000000e+00
>   ret double %2
> }
>
> ; Function Attrs: uwtable
> define double @julia_g_66157(double) #0 {
> top:
>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>   ret double %1
> }
>
> ; Function Attrs: uwtable
> define double @julia_h_66158(double) #0 {
> top:
>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>   ret double %1
> }
>
>         .text
> Filename: console
>         pushq   %rbp
>         movq    %rsp, %rbp
> Source line: 1
>         addsd   %xmm0, %xmm0
>         movabsq $534749456, %rax        # imm = 0x1FDFA110
>         addsd   (%rax), %xmm0
>         popq    %rbp
>         retq
>         nopl    (%rax,%rax)
>         .text
> Filename: console
>         pushq   %rbp
>         movq    %rsp, %rbp
> Source line: 2
>         addsd   %xmm0, %xmm0
>         movabsq $534749584, %rax        # imm = 0x1FDFA190
>         addsd   (%rax), %xmm0
>         popq    %rbp
>         retq
>         nopl    (%rax,%rax)
>         .text
> Filename: console
>         pushq   %rbp
>         movq    %rsp, %rbp
>         movabsq $534749712, %rax        # imm = 0x1FDFA210
> Source line: 3
>         movsd   dcabs164_(%rax), %xmm1  # xmm1 = mem[0],zero
>         movabsq $534749720, %rax        # imm = 0x1FDFA218
>         movsd   (%rax), %xmm2           # xmm2 = mem[0],zero
>         movabsq $fma, %rax
>         popq    %rbp
>         jmpq    *%rax
>         nop
>
> This seems to be similar to the first result.
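One way to probe the question numerically (a sketch added for illustration, not from the original message): a true fused multiply-add, whether implemented in hardware or as a correct software fallback, rounds only once, so it retains a residual that separate multiply-and-add throws away.

```julia
# 0.1 is not exactly representable in Float64, so x*10.0 carries a
# tiny error that a plain multiply rounds away before the subtraction:
x = 0.1
naive = x * 10.0 - 1.0      # x*10.0 rounds to exactly 1.0, so this is 0.0
fused = fma(x, 10.0, -1.0)  # single rounding keeps the residual: 2.0^-54
println(naive)  # 0.0
println(fused)  # 5.551115123125783e-17
```

If `fma` returned 0.0 here, it would not be fused at all; the jump to a `fma` symbol in the native code above is consistent with a software fallback that still produces the correctly rounded result.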
--
Erik Schnetter <schnet...@gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/