Still no FMA?

julia> k(x) = @fastmath 2.4x + 3.0
WARNING: Method definition k(Any) in module Main at REPL[14]:1 overwritten at REPL[23]:1.
k (generic function with 1 method)

julia> @code_llvm k(4.0)

; Function Attrs: uwtable
define double @julia_k_66737(double) #0 {
top:
  %1 = fmul fast double %0, 2.400000e+00
  %2 = fadd fast double %1, 3.000000e+00
  ret double %2
}

julia> @code_native k(4.0)
        .text
Filename: REPL[23]
        pushq   %rbp
        movq    %rsp, %rbp
        movabsq $568231032, %rax        # imm = 0x21DE8478
Source line: 1
        vmulsd  (%rax), %xmm0, %xmm0
        movabsq $568231040, %rax        # imm = 0x21DE8480
        vaddsd  (%rax), %xmm0, %xmm0
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)
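For comparison, here is the same computation with an explicit muladd (k2 is just a scratch name for illustration). On this Haswell-class chip I would expect the native code to show a single vfmadd instruction if fusion is actually available, rather than the vmulsd/vaddsd pair above:

julia> k2(x) = muladd(2.4, x, 3.0)   # muladd: fuse if the hardware supports it
k2 (generic function with 1 method)

julia> @code_native k2(4.0)          # expect a vfmadd213sd (or similar) here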
On Wednesday, September 21, 2016 at 6:29:44 PM UTC-7, Erik Schnetter wrote:
>
> On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas <rack...@gmail.com> wrote:
>
>> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg, and now
>> I get results where g and h apply muladd/fma in the native code, but a new
>> function k, which is `@fastmath` applied inside of f, does not apply
>> muladd/fma.
>>
>> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>>
>> Should I open an issue?
>
> In your case, LLVM apparently thinks that `x + x + 3` is faster to
> calculate than `2x + 3`. If you use a less round number than `2` multiplying
> `x`, you might see a different behaviour.
>
> -erik
>
>> Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding
>> for some reason, so I may need to just build from source.
>>
>> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter wrote:
>>>
>>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas <rack...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> First of all, does LLVM automatically fma or muladd expressions like
>>>> `a1*x1 + a2*x2 + a3*x3 + a4*x4`, or is it required that one explicitly
>>>> use `muladd` and `fma` on these kinds of expressions (is there a macro
>>>> for making this easier)?
>>>
>>> Yes, LLVM will use fma machine instructions -- but only if they lead to
>>> the same round-off error as using separate multiply and add instructions.
>>> If you do not care about the details of conforming to the IEEE standard,
>>> then you can use the `@fastmath` macro, which enables several
>>> optimizations, including this one. This is described in the manual
>>> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
>>>
>>>> Secondly, I am wondering if my setup is not applying these operations
>>>> correctly. Here's my test code:
>>>>
>>>> f(x) = 2.0x + 3.0
>>>> g(x) = muladd(x, 2.0, 3.0)
>>>> h(x) = fma(x, 2.0, 3.0)
>>>>
>>>> @code_llvm f(4.0)
>>>> @code_llvm g(4.0)
>>>> @code_llvm h(4.0)
>>>>
>>>> @code_native f(4.0)
>>>> @code_native g(4.0)
>>>> @code_native h(4.0)
>>>>
>>>> *Computer 1*
>>>>
>>>> Julia Version 0.5.0-rc4+0
>>>> Commit 9c76c3e* (2016-09-09 01:43 UTC)
>>>> Platform Info:
>>>>   System: Linux (x86_64-redhat-linux)
>>>>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>>>   WORD_SIZE: 64
>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>>   LAPACK: libopenblasp.so.0
>>>>   LIBM: libopenlibm
>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>>>
>>> This looks good; the "broadwell" architecture that LLVM uses should
>>> imply the respective optimizations. Try with `@fastmath`.
>>>
>>> -erik
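(On the "is there a macro" question above: as far as I know there is no general macro for fusing an arbitrary sum of products, but it can be written by hand with nested muladd calls; a sketch, with dot4 as an illustrative name:

# each muladd may lower to a single vfmadd instruction on FMA hardware
dot4(a1, x1, a2, x2, a3, x3, a4, x4) =
    muladd(a1, x1, muladd(a2, x2, muladd(a3, x3, a4 * x4)))

For polynomial evaluation specifically, Base.@evalpoly already generates this kind of muladd chain.)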
>>>> (the COPR nightly on CentOS 7), with:
>>>>
>>>> [crackauc@crackauc2 ~]$ lscpu
>>>> Architecture:          x86_64
>>>> CPU op-mode(s):        32-bit, 64-bit
>>>> Byte Order:            Little Endian
>>>> CPU(s):                16
>>>> On-line CPU(s) list:   0-15
>>>> Thread(s) per core:    1
>>>> Core(s) per socket:    8
>>>> Socket(s):             2
>>>> NUMA node(s):          2
>>>> Vendor ID:             GenuineIntel
>>>> CPU family:            6
>>>> Model:                 79
>>>> Model name:            Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>>> Stepping:              1
>>>> CPU MHz:               1200.000
>>>> BogoMIPS:              6392.58
>>>> Virtualization:        VT-x
>>>> L1d cache:             32K
>>>> L1i cache:             32K
>>>> L2 cache:              256K
>>>> L3 cache:              25600K
>>>> NUMA node0 CPU(s):     0-7
>>>> NUMA node1 CPU(s):     8-15
>>>>
>>>> I get the output:
>>>>
>>>> define double @julia_f_72025(double) #0 {
>>>> top:
>>>>   %1 = fmul double %0, 2.000000e+00
>>>>   %2 = fadd double %1, 3.000000e+00
>>>>   ret double %2
>>>> }
>>>>
>>>> define double @julia_g_72027(double) #0 {
>>>> top:
>>>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>>>>   ret double %1
>>>> }
>>>>
>>>> define double @julia_h_72029(double) #0 {
>>>> top:
>>>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>>>>   ret double %1
>>>> }
>>>>
>>>>         .text
>>>> Filename: fmatest.jl
>>>>         pushq   %rbp
>>>>         movq    %rsp, %rbp
>>>> Source line: 1
>>>>         addsd   %xmm0, %xmm0
>>>>         movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
>>>>         addsd   (%rax), %xmm0
>>>>         popq    %rbp
>>>>         retq
>>>>         nopl    (%rax,%rax)
>>>>         .text
>>>> Filename: fmatest.jl
>>>>         pushq   %rbp
>>>>         movq    %rsp, %rbp
>>>> Source line: 2
>>>>         addsd   %xmm0, %xmm0
>>>>         movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
>>>>         addsd   (%rax), %xmm0
>>>>         popq    %rbp
>>>>         retq
>>>>         nopl    (%rax,%rax)
>>>>         .text
>>>> Filename: fmatest.jl
>>>>         pushq   %rbp
>>>>         movq    %rsp, %rbp
>>>>         movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
>>>> Source line: 3
>>>>         movsd   (%rax), %xmm1           # xmm1 = mem[0],zero
>>>>         movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
>>>>         movsd   (%rax), %xmm2           # xmm2 = mem[0],zero
>>>>         movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
>>>>         popq    %rbp
>>>>         jmpq    *%rax
>>>>         nopl    (%rax)
>>>>
>>>> It looks like explicit muladd or not, it ends up as the same native
>>>> code, but is that native code actually doing an fma? The fma native code
>>>> is different, but from a discussion on the Gitter it seems that might be
>>>> a software FMA? This computer is set up with a BIOS setting like "LAPACK
>>>> optimized" or something along those lines, so could that be interfering
>>>> with something?
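(One way to check whether that fma is really fused -- hardware or software -- is to evaluate it at a point where fused and unfused rounding give different answers; a sketch:

a = 1.0 + 2.0^-27
a*a - 1.0        # the product is rounded before the subtraction: exactly 2^-26
fma(a, a, -1.0)  # a true fused multiply-add keeps the low-order term: exactly 2^-26 + 2^-54

If the two agree, the fma silently fell back to a plain multiply-and-add. This won't distinguish a hardware vfmadd from a correct software routine, though; for that, timing the call is probably the easiest tell.)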
>>>> *Computer 2*
>>>>
>>>> Julia Version 0.6.0-dev.557
>>>> Commit c7a4897 (2016-09-08 17:50 UTC)
>>>> Platform Info:
>>>>   System: NT (x86_64-w64-mingw32)
>>>>   CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
>>>>   WORD_SIZE: 64
>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>>   LAPACK: libopenblas64_
>>>>   LIBM: libopenlibm
>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>>>>
>>>> On a 4770K i7 under Windows 10, I get the output:
>>>>
>>>> ; Function Attrs: uwtable
>>>> define double @julia_f_66153(double) #0 {
>>>> top:
>>>>   %1 = fmul double %0, 2.000000e+00
>>>>   %2 = fadd double %1, 3.000000e+00
>>>>   ret double %2
>>>> }
>>>>
>>>> ; Function Attrs: uwtable
>>>> define double @julia_g_66157(double) #0 {
>>>> top:
>>>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>>>>   ret double %1
>>>> }
>>>>
>>>> ; Function Attrs: uwtable
>>>> define double @julia_h_66158(double) #0 {
>>>> top:
>>>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>>>>   ret double %1
>>>> }
>>>>
>>>>         .text
>>>> Filename: console
>>>>         pushq   %rbp
>>>>         movq    %rsp, %rbp
>>>> Source line: 1
>>>>         addsd   %xmm0, %xmm0
>>>>         movabsq $534749456, %rax        # imm = 0x1FDFA110
>>>>         addsd   (%rax), %xmm0
>>>>         popq    %rbp
>>>>         retq
>>>>         nopl    (%rax,%rax)
>>>>         .text
>>>> Filename: console
>>>>         pushq   %rbp
>>>>         movq    %rsp, %rbp
>>>> Source line: 2
>>>>         addsd   %xmm0, %xmm0
>>>>         movabsq $534749584, %rax        # imm = 0x1FDFA190
>>>>         addsd   (%rax), %xmm0
>>>>         popq    %rbp
>>>>         retq
>>>>         nopl    (%rax,%rax)
>>>>         .text
>>>> Filename: console
>>>>         pushq   %rbp
>>>>         movq    %rsp, %rbp
>>>>         movabsq $534749712, %rax        # imm = 0x1FDFA210
>>>> Source line: 3
>>>>         movsd   dcabs164_(%rax), %xmm1  # xmm1 = mem[0],zero
>>>>         movabsq $534749720, %rax        # imm = 0x1FDFA218
>>>>         movsd   (%rax), %xmm2           # xmm2 = mem[0],zero
>>>>         movabsq $fma, %rax
>>>>         popq    %rbp
>>>>         jmpq    *%rax
>>>>         nop
>>>>
>>>> This seems to be similar to the first result.
>>>
>>> --
>>> Erik Schnetter <schn...@gmail.com>
>>> http://www.perimeterinstitute.ca/personal/eschnetter/
>
> --
> Erik Schnetter <schn...@gmail.com>
> http://www.perimeterinstitute.ca/personal/eschnetter/
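(A closing note on the difference visible in the listings above, based on my reading rather than official guidance: muladd lowers to llvm.fmuladd, which lets the backend fuse or split as it sees fit, so on a chip without usable FMA it costs nothing extra; fma lowers to llvm.fma, which must deliver the correctly rounded fused result, so without a hardware instruction it becomes a call into a software routine -- which is what the `movabsq $fma` / `jmpq *%rax` tail call in the h listing looks like. As a rule of thumb:

# use muladd when you want speed and don't care whether it fuses
g(x) = muladd(x, 2.0, 3.0)   # llvm.fmuladd: fused only if profitable

# use fma only when the algorithm needs the exact fused rounding
h(x) = fma(x, 2.0, 3.0)      # llvm.fma: correct result even via a slow software call

)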