Still no FMA?

julia> k(x) = @fastmath 2.4x + 3.0
WARNING: Method definition k(Any) in module Main at REPL[14]:1 overwritten at REPL[23]:1.
k (generic function with 1 method)

julia> @code_llvm k(4.0)

; Function Attrs: uwtable
define double @julia_k_66737(double) #0 {
top:
  %1 = fmul fast double %0, 2.400000e+00
  %2 = fadd fast double %1, 3.000000e+00
  ret double %2
}

julia> @code_native k(4.0)
        .text
Filename: REPL[23]
        pushq   %rbp
        movq    %rsp, %rbp
        movabsq $568231032, %rax        # imm = 0x21DE8478
Source line: 1
        vmulsd  (%rax), %xmm0, %xmm0
        movabsq $568231040, %rax        # imm = 0x21DE8480
        vaddsd  (%rax), %xmm0, %xmm0
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)
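
As an illustrative check (a sketch, not part of the original test): if k had
compiled to a fused multiply-add, its result would match fma bit-for-bit even
at inputs where fused and unfused rounding differ. With the vmulsd/vaddsd
sequence above, it should disagree at such inputs:

k(x) = @fastmath 2.4x + 3.0   # same k as above

# find an input where one rounding (fma) and two roundings (mul, then add)
# give different last bits; such inputs are common, so this almost surely hits
xs = rand(10^6)
i = findfirst(x -> fma(2.4, x, 3.0) != 2.4x + 3.0, xs)
x = xs[i]

k(x) == fma(2.4, x, 3.0)   # false here, matching the mul/add in the assembly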
On Wednesday, September 21, 2016 at 6:29:44 PM UTC-7, Erik Schnetter wrote:
>
> On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas <rack...@gmail.com> wrote:
>
>> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg, and now 
>> I get results where g and h use muladd/fma in the native code, but a new 
>> function k, which wraps the body of f in `@fastmath`, still does not.
>>
>> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>>
>> Should I open an issue?
>>
>
> In your case, LLVM apparently thinks that `x + x + 3` is faster to 
> calculate than `2x + 3`. If you use a less round number than `2` as the 
> multiplier of `x`, you might see a different behaviour.
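>
> To see the strength reduction concretely (a sketch added for illustration; 
> compare the `addsd %xmm0, %xmm0` in the earlier f output with the `vmulsd` 
> above):
>
> f2(x)  = 2.0x + 3.0
> f24(x) = 2.4x + 3.0
>
> @code_native f2(4.0)    # backend rewrites 2.0*x as x + x (`addsd %xmm0, %xmm0`)
> @code_native f24(4.0)   # keeps the multiply (`mulsd`), leaving something to fuse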
>
> -erik
>
>
> Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding 
>> for some reason, so I may need to just build from source.
>>
>> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter 
>> wrote:
>>>
>>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas <rack...@gmail.com> 
>>> wrote:
>>>
>>>> Hi,
>>>>   First of all, does LLVM essentially fma or muladd expressions like 
>>>> `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one explicitly use 
>>>> `muladd` and `fma` on these types of instructions (is there a macro for 
>>>> making this easier)?
>>>>
>>>
>>> Yes, LLVM will use fma machine instructions -- but by default only when 
>>> doing so does not change the round-off behaviour relative to separate 
>>> multiply and add instructions. If you do not care about conforming to the 
>>> IEEE standard in every detail, you can use the `@fastmath` macro, which 
>>> enables several optimizations, including this one. This is described in 
>>> the manual: 
>>> http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations
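>>>
>>> As a sketch of the different contracts (added for clarity; this follows 
>>> the documented semantics of `muladd`, `fma`, and `@fastmath`):
>>>
>>> a*b + c            # IEEE semantics: two rounded operations, never fused
>>> muladd(a, b, c)    # may fuse: the compiler picks whichever is faster
>>> fma(a, b, c)       # must fuse: one rounding, software fallback if needed
>>> @fastmath a*b + c  # fast-math flags permit LLVM to contract to an fma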
>>>
>>>
>>>> Secondly, I am wondering if my setup is not applying these operations 
>>>> correctly. Here's my test code:
>>>>
>>>> f(x) = 2.0x + 3.0
>>>> g(x) = muladd(x, 2.0, 3.0)
>>>> h(x) = fma(x, 2.0, 3.0)
>>>>
>>>> @code_llvm f(4.0)
>>>> @code_llvm g(4.0)
>>>> @code_llvm h(4.0)
>>>>
>>>> @code_native f(4.0)
>>>> @code_native g(4.0)
>>>> @code_native h(4.0)
>>>>
>>>> *Computer 1*
>>>>
>>>> Julia Version 0.5.0-rc4+0
>>>> Commit 9c76c3e* (2016-09-09 01:43 UTC)
>>>> Platform Info:
>>>>   System: Linux (x86_64-redhat-linux)
>>>>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>>>   WORD_SIZE: 64
>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>>   LAPACK: libopenblasp.so.0
>>>>   LIBM: libopenlibm
>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>>>>
>>>
>>> This looks good; the "broadwell" target that LLVM uses should imply the 
>>> respective optimizations. Try with `@fastmath`.
>>>
>>> -erik
>>>
>>>> (the COPR nightly on CentOS7) with 
>>>>
>>>> [crackauc@crackauc2 ~]$ lscpu
>>>> Architecture:          x86_64
>>>> CPU op-mode(s):        32-bit, 64-bit
>>>> Byte Order:            Little Endian
>>>> CPU(s):                16
>>>> On-line CPU(s) list:   0-15
>>>> Thread(s) per core:    1
>>>> Core(s) per socket:    8
>>>> Socket(s):             2
>>>> NUMA node(s):          2
>>>> Vendor ID:             GenuineIntel
>>>> CPU family:            6
>>>> Model:                 79
>>>> Model name:            Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>>> Stepping:              1
>>>> CPU MHz:               1200.000
>>>> BogoMIPS:              6392.58
>>>> Virtualization:        VT-x
>>>> L1d cache:             32K
>>>> L1i cache:             32K
>>>> L2 cache:              256K
>>>> L3 cache:              25600K
>>>> NUMA node0 CPU(s):     0-7
>>>> NUMA node1 CPU(s):     8-15
>>>>
>>>>
>>>>
>>>> I get the output
>>>>
>>>> define double @julia_f_72025(double) #0 {
>>>> top:
>>>>   %1 = fmul double %0, 2.000000e+00
>>>>   %2 = fadd double %1, 3.000000e+00
>>>>   ret double %2
>>>> }
>>>>
>>>> define double @julia_g_72027(double) #0 {
>>>> top:
>>>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, 
>>>> double 3.000000e+00)
>>>>   ret double %1
>>>> }
>>>>
>>>> define double @julia_h_72029(double) #0 {
>>>> top:
>>>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 
>>>> 3.000000e+00)
>>>>   ret double %1
>>>> }
>>>> .text
>>>> Filename: fmatest.jl
>>>> pushq %rbp
>>>> movq %rsp, %rbp
>>>> Source line: 1
>>>> addsd %xmm0, %xmm0
>>>> movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
>>>> addsd (%rax), %xmm0
>>>> popq %rbp
>>>> retq
>>>> nopl (%rax,%rax)
>>>> .text
>>>> Filename: fmatest.jl
>>>> pushq %rbp
>>>> movq %rsp, %rbp
>>>> Source line: 2
>>>> addsd %xmm0, %xmm0
>>>> movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
>>>> addsd (%rax), %xmm0
>>>> popq %rbp
>>>> retq
>>>> nopl (%rax,%rax)
>>>> .text
>>>> Filename: fmatest.jl
>>>> pushq %rbp
>>>> movq %rsp, %rbp
>>>> movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
>>>> Source line: 3
>>>> movsd (%rax), %xmm1           # xmm1 = mem[0],zero
>>>> movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
>>>> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
>>>> movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
>>>> popq %rbp
>>>> jmpq *%rax
>>>> nopl (%rax)
>>>>
>>>> It looks like I end up with the same native code whether or not I use an 
>>>> explicit muladd -- but is that native code actually doing an fma? The 
>>>> native code for fma is different, but from a discussion on Gitter it 
>>>> seems that might be a software FMA? This computer is set up with a BIOS 
>>>> setting called "LAPACK optimized" or something like that, so is that 
>>>> messing with something?
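>>>>
>>>> One way to probe this numerically (an added sketch, independent of the 
>>>> BIOS question): a true fma rounds exactly once, whether it is done in 
>>>> hardware or in software, so it is distinguishable from mul-then-add in 
>>>> the last bit:
>>>>
>>>> x = 1.0 + 2.0^-27
>>>> fma(x, x, -1.0)   # 2^-26 + 2^-54: the exact residual, rounded once
>>>> x*x - 1.0         # 2^-26: the 2^-54 term is lost when x*x rounds first
>>>>
>>>> A software fallback gives the same (correct) answer, just slower, so 
>>>> timing fma against muladd is the practical way to tell hardware from 
>>>> software.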
>>>>
>>>> *Computer 2*
>>>>
>>>> Julia Version 0.6.0-dev.557
>>>> Commit c7a4897 (2016-09-08 17:50 UTC)
>>>> Platform Info:
>>>>   System: NT (x86_64-w64-mingw32)
>>>>   CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
>>>>   WORD_SIZE: 64
>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>>   LAPACK: libopenblas64_
>>>>   LIBM: libopenlibm
>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>>>>
>>>>
>>>> on a 4770k i7, Windows 10, I get the output
>>>>
>>>> ; Function Attrs: uwtable
>>>> define double @julia_f_66153(double) #0 {
>>>> top:
>>>>   %1 = fmul double %0, 2.000000e+00
>>>>   %2 = fadd double %1, 3.000000e+00
>>>>   ret double %2
>>>> }
>>>>
>>>> ; Function Attrs: uwtable
>>>> define double @julia_g_66157(double) #0 {
>>>> top:
>>>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, 
>>>> double 3.000000e+00)
>>>>   ret double %1
>>>> }
>>>>
>>>> ; Function Attrs: uwtable
>>>> define double @julia_h_66158(double) #0 {
>>>> top:
>>>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 
>>>> 3.000000e+00)
>>>>   ret double %1
>>>> }
>>>> .text
>>>> Filename: console
>>>> pushq %rbp
>>>> movq %rsp, %rbp
>>>> Source line: 1
>>>> addsd %xmm0, %xmm0
>>>> movabsq $534749456, %rax        # imm = 0x1FDFA110
>>>> addsd (%rax), %xmm0
>>>> popq %rbp
>>>> retq
>>>> nopl (%rax,%rax)
>>>> .text
>>>> Filename: console
>>>> pushq %rbp
>>>> movq %rsp, %rbp
>>>> Source line: 2
>>>> addsd %xmm0, %xmm0
>>>> movabsq $534749584, %rax        # imm = 0x1FDFA190
>>>> addsd (%rax), %xmm0
>>>> popq %rbp
>>>> retq
>>>> nopl (%rax,%rax)
>>>> .text
>>>> Filename: console
>>>> pushq %rbp
>>>> movq %rsp, %rbp
>>>> movabsq $534749712, %rax        # imm = 0x1FDFA210
>>>> Source line: 3
>>>> movsd dcabs164_(%rax), %xmm1  # xmm1 = mem[0],zero
>>>> movabsq $534749720, %rax        # imm = 0x1FDFA218
>>>> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
>>>> movabsq $fma, %rax
>>>> popq %rbp
>>>> jmpq *%rax
>>>> nop
>>>>
>>>> This seems to be similar to the first result.
>>>>
>>>>
>>>
>>>
>>> -- 
>>> Erik Schnetter <schn...@gmail.com> 
>>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>>
>>
>
>
> -- 
> Erik Schnetter <schn...@gmail.com> 
> http://www.perimeterinstitute.ca/personal/eschnetter/
>
