On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas <rackd...@gmail.com>
wrote:

> Hi,
>   First of all, does LLVM automatically apply fma or muladd to expressions
> like `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one explicitly
> use `muladd` and `fma` on these types of expressions (is there a macro for
> making this easier)?
>

Yes, LLVM will use fma machine instructions -- but only if they lead to the
same round-off error as using separate multiply and add instructions. If
you do not care about the details of conforming to the IEEE standard, then
you can use the `@fastmath` macro, which enables several optimizations,
including this one. This is described in the manual:
<http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>
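
To illustrate why the fused and separate forms cannot be swapped freely,
here is a minimal sketch. The values `x` and `c` are picked only for this
illustration (they are not from the test code in this thread) so that the
rounding of the intermediate product becomes visible:

```julia
# Illustration only: values chosen so that the two roundings differ.
x = 1.0 + eps()          # 1 + 2^-52
c = 1.0 + 2 * eps()      # what x*x rounds to in Float64

separate = x * x - c       # product is rounded first, then the subtraction
fused    = fma(x, x, -c)   # exact product, one rounding at the very end

println(separate)          # the low-order bits of the product were lost
println(fused == 2.0^-104) # the fused form keeps them
```

The separate form loses the `2^-104` term when the product is rounded, while
`fma` retains it. Since the results differ, LLVM may only substitute one for
the other when asked to, for example via `muladd` or `@fastmath`.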


>   Secondly, I am wondering if my setup is not applying these operations
> correctly. Here's my test code:
>
> f(x) = 2.0x + 3.0
> g(x) = muladd(x,2.0, 3.0)
> h(x) = fma(x,2.0, 3.0)
>
> @code_llvm f(4.0)
> @code_llvm g(4.0)
> @code_llvm h(4.0)
>
> @code_native f(4.0)
> @code_native g(4.0)
> @code_native h(4.0)
>
> *Computer 1*
>
> Julia Version 0.5.0-rc4+0
> Commit 9c76c3e* (2016-09-09 01:43 UTC)
> Platform Info:
>   System: Linux (x86_64-redhat-linux)
>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>   LAPACK: libopenblasp.so.0
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>

This looks good: the "broadwell" architecture that LLVM targets should
enable the corresponding fma instructions. Try with `@fastmath`.
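
A minimal sketch of what that could look like, reusing the expressions from
the message. The names `f_fast`, `dot4` and `dot4_fast` are made up here for
illustration. Whether an fma instruction actually shows up in the native code
depends on the LLVM version and the target CPU, so treat this as something to
experiment with rather than a guaranteed result:

```julia
# Hypothetical variants of the test functions; the names are made up for this sketch.
f_fast(x) = @fastmath 2.0x + 3.0   # fast-math flags permit contraction into an fma

# The four-term expression from the question, written with nested muladd.
# muladd leaves the choice between fused and separate operations to the compiler.
dot4(a1, x1, a2, x2, a3, x3, a4, x4) =
    muladd(a1, x1, muladd(a2, x2, muladd(a3, x3, a4 * x4)))

# The same expression under @fastmath, which also permits reassociation.
dot4_fast(a1, x1, a2, x2, a3, x3, a4, x4) =
    @fastmath a1*x1 + a2*x2 + a3*x3 + a4*x4

# At the REPL, compare the generated code with the listings in this thread:
#   @code_llvm f_fast(4.0)
#   @code_native f_fast(4.0)
```

Note that multiplying by 2.0 is exact, so LLVM can replace it with `x + x`
(the `addsd %xmm0, %xmm0` in the listings); a multiplier such as 2.5 may make
the comparison more telling.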

-erik





> (the COPR nightly on CentOS7) with
>
> [crackauc@crackauc2 ~]$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    1
> Core(s) per socket:    8
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 79
> Model name:            Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
> Stepping:              1
> CPU MHz:               1200.000
> BogoMIPS:              6392.58
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              25600K
> NUMA node0 CPU(s):     0-7
> NUMA node1 CPU(s):     8-15
>
>
>
> I get the output
>
> define double @julia_f_72025(double) #0 {
> top:
>   %1 = fmul double %0, 2.000000e+00
>   %2 = fadd double %1, 3.000000e+00
>   ret double %2
> }
>
> define double @julia_g_72027(double) #0 {
> top:
>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>   ret double %1
> }
>
> define double @julia_h_72029(double) #0 {
> top:
>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>   ret double %1
> }
> .text
> Filename: fmatest.jl
> pushq %rbp
> movq %rsp, %rbp
> Source line: 1
> addsd %xmm0, %xmm0
> movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
> addsd (%rax), %xmm0
> popq %rbp
> retq
> nopl (%rax,%rax)
> .text
> Filename: fmatest.jl
> pushq %rbp
> movq %rsp, %rbp
> Source line: 2
> addsd %xmm0, %xmm0
> movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
> addsd (%rax), %xmm0
> popq %rbp
> retq
> nopl (%rax,%rax)
> .text
> Filename: fmatest.jl
> pushq %rbp
> movq %rsp, %rbp
> movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
> Source line: 3
> movsd (%rax), %xmm1           # xmm1 = mem[0],zero
> movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
> movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
> popq %rbp
> jmpq *%rax
> nopl (%rax)
>
> It looks like explicit muladd or not ends up at the same native code, but
> is that native code actually doing an fma? The fma native is different, but
> from a discussion on the Gitter it seems that might be a software FMA? This
> computer is set up with the BIOS setting as LAPACK optimized or something
> like that, so is that messing with something?
>
> *Computer 2*
>
> Julia Version 0.6.0-dev.557
> Commit c7a4897 (2016-09-08 17:50 UTC)
> Platform Info:
>   System: NT (x86_64-w64-mingw32)
>   CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
>   WORD_SIZE: 64
>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>   LAPACK: libopenblas64_
>   LIBM: libopenlibm
>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>
>
> on a 4770k i7, Windows 10, I get the output
>
> ; Function Attrs: uwtable
> define double @julia_f_66153(double) #0 {
> top:
>   %1 = fmul double %0, 2.000000e+00
>   %2 = fadd double %1, 3.000000e+00
>   ret double %2
> }
>
> ; Function Attrs: uwtable
> define double @julia_g_66157(double) #0 {
> top:
>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>   ret double %1
> }
>
> ; Function Attrs: uwtable
> define double @julia_h_66158(double) #0 {
> top:
>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>   ret double %1
> }
> .text
> Filename: console
> pushq %rbp
> movq %rsp, %rbp
> Source line: 1
> addsd %xmm0, %xmm0
> movabsq $534749456, %rax        # imm = 0x1FDFA110
> addsd (%rax), %xmm0
> popq %rbp
> retq
> nopl (%rax,%rax)
> .text
> Filename: console
> pushq %rbp
> movq %rsp, %rbp
> Source line: 2
> addsd %xmm0, %xmm0
> movabsq $534749584, %rax        # imm = 0x1FDFA190
> addsd (%rax), %xmm0
> popq %rbp
> retq
> nopl (%rax,%rax)
> .text
> Filename: console
> pushq %rbp
> movq %rsp, %rbp
> movabsq $534749712, %rax        # imm = 0x1FDFA210
> Source line: 3
> movsd dcabs164_(%rax), %xmm1  # xmm1 = mem[0],zero
> movabsq $534749720, %rax        # imm = 0x1FDFA218
> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
> movabsq $fma, %rax
> popq %rbp
> jmpq *%rax
> nop
>
> This seems to be similar to the first result.
>
>


-- 
Erik Schnetter <schnet...@gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/
