On Wed, Sep 21, 2016 at 9:33 PM, Yichao Yu <yyc1...@gmail.com> wrote:
> On Wed, Sep 21, 2016 at 9:29 PM, Erik Schnetter <schnet...@gmail.com> wrote:
>> On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas <rackd...@gmail.com>
>> wrote:
>>>
>>> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg, and now
>>> I get results where g and h apply muladd/fma in the native code, but a new
>>> function k, which applies `@fastmath` inside of f, does not.
>>>
>>> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
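>>>
>>> (Roughly, k is something like `k(x) = @fastmath 2.0x + 3.0` -- f's body
>>> wrapped in `@fastmath`. That's a hypothetical reconstruction; see the gist
>>> for the exact code.)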
>>>
>>> Should I open an issue?
>>
>>
>> In your case, LLVM apparently thinks that `x + x + 3` is faster to calculate
>> than `2x+3`. If you use a less round number than `2` multiplying `x`, you
>> might see a different behaviour.
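>>
>> For instance (untested sketch):
>>
>> f2(x) = 2.1x + 3.0
>> @code_native f2(4.0)  # the mul can no longer be turned into an add,
>>                       # so you can see whether LLVM contracts it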
>
> I've personally never seen LLVM create an fma from a mul and an add. We
> might not have the relevant LLVM passes enabled, if LLVM is capable of
> doing this at all.

Interestingly, both clang and gcc keep the mul and add with `-Ofast
-ffast-math -mavx2`, and turn it into an fma with `-mavx512f`. This is true
even when the call is in a loop (since switching between SSE and AVX is
costly), so I'd say either the compilers are right that the fma instruction
gives no speed advantage in this case, or it's a missing LLVM/GCC
optimization rather than a Julia one.
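
For reference, a rough Julia analogue of the loop experiment (untested
sketch; the function name is mine):

function sumf(xs)
    s = 0.0
    for x in xs
        s = @fastmath s + 2.0*x + 3.0
    end
    return s
end

@code_native sumf(rand(100))  # look for vfmadd...sd on an fma-capable CPU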

>
>>
>> -erik
>>
>>
>>> Note that this is on v0.6 Windows. On Linux the sysimg isn't rebuilding
>>> for some reason, so I may need to just build from source.
>>>
>>> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik Schnetter
>>> wrote:
>>>>
>>>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas <rack...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>   First of all, does LLVM automatically fma or muladd expressions like
>>>>> `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is one required to explicitly use
>>>>> `muladd` and `fma` on these kinds of expressions (and is there a macro
>>>>> for making this easier)?
>>>>
>>>>
>>>> Yes, LLVM will use fma machine instructions -- but only if they lead to
>>>> the same round-off error as using separate multiply and add instructions.
>>>> If you do not care about the details of conforming to the IEEE standard,
>>>> then you can use the `@fastmath` macro, which enables several
>>>> optimizations, including this one. This is described in the manual
>>>> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
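>>>>
>>>> For example (a minimal sketch; the exact output depends on your CPU
>>>> target):
>>>>
>>>> f_fast(x) = @fastmath 2.0x + 3.0
>>>> @code_llvm f_fast(4.0)   # the fast-math flags permit forming fmuladd/fma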
>>>>
>>>>
>>>>>   Secondly, I am wondering if my setup is not applying these operations
>>>>> correctly. Here's my test code:
>>>>>
>>>>> f(x) = 2.0x + 3.0
>>>>> g(x) = muladd(x,2.0, 3.0)
>>>>> h(x) = fma(x,2.0, 3.0)
>>>>>
>>>>> @code_llvm f(4.0)
>>>>> @code_llvm g(4.0)
>>>>> @code_llvm h(4.0)
>>>>>
>>>>> @code_native f(4.0)
>>>>> @code_native g(4.0)
>>>>> @code_native h(4.0)
>>>>>
>>>>> Computer 1
>>>>>
>>>>> Julia Version 0.5.0-rc4+0
>>>>> Commit 9c76c3e* (2016-09-09 01:43 UTC)
>>>>> Platform Info:
>>>>>   System: Linux (x86_64-redhat-linux)
>>>>>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>>>>   WORD_SIZE: 64
>>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>>>   LAPACK: libopenblasp.so.0
>>>>>   LIBM: libopenlibm
>>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>>>>
>>>>
>>>> This looks good; the "broadwell" architecture that LLVM uses should imply
>>>> the respective optimizations. Try with `@fastmath`.
>>>>
>>>> -erik
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> (the COPR nightly on CentOS 7) with
>>>>>
>>>>> [crackauc@crackauc2 ~]$ lscpu
>>>>> Architecture:          x86_64
>>>>> CPU op-mode(s):        32-bit, 64-bit
>>>>> Byte Order:            Little Endian
>>>>> CPU(s):                16
>>>>> On-line CPU(s) list:   0-15
>>>>> Thread(s) per core:    1
>>>>> Core(s) per socket:    8
>>>>> Socket(s):             2
>>>>> NUMA node(s):          2
>>>>> Vendor ID:             GenuineIntel
>>>>> CPU family:            6
>>>>> Model:                 79
>>>>> Model name:            Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>>>>> Stepping:              1
>>>>> CPU MHz:               1200.000
>>>>> BogoMIPS:              6392.58
>>>>> Virtualization:        VT-x
>>>>> L1d cache:             32K
>>>>> L1i cache:             32K
>>>>> L2 cache:              256K
>>>>> L3 cache:              25600K
>>>>> NUMA node0 CPU(s):     0-7
>>>>> NUMA node1 CPU(s):     8-15
>>>>>
>>>>>
>>>>>
>>>>> I get the output
>>>>>
>>>>> define double @julia_f_72025(double) #0 {
>>>>> top:
>>>>>   %1 = fmul double %0, 2.000000e+00
>>>>>   %2 = fadd double %1, 3.000000e+00
>>>>>   ret double %2
>>>>> }
>>>>>
>>>>> define double @julia_g_72027(double) #0 {
>>>>> top:
>>>>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00,
>>>>> double 3.000000e+00)
>>>>>   ret double %1
>>>>> }
>>>>>
>>>>> define double @julia_h_72029(double) #0 {
>>>>> top:
>>>>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double
>>>>> 3.000000e+00)
>>>>>   ret double %1
>>>>> }
>>>>> .text
>>>>> Filename: fmatest.jl
>>>>> pushq %rbp
>>>>> movq %rsp, %rbp
>>>>> Source line: 1
>>>>> addsd %xmm0, %xmm0
>>>>> movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
>>>>> addsd (%rax), %xmm0
>>>>> popq %rbp
>>>>> retq
>>>>> nopl (%rax,%rax)
>>>>> .text
>>>>> Filename: fmatest.jl
>>>>> pushq %rbp
>>>>> movq %rsp, %rbp
>>>>> Source line: 2
>>>>> addsd %xmm0, %xmm0
>>>>> movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
>>>>> addsd (%rax), %xmm0
>>>>> popq %rbp
>>>>> retq
>>>>> nopl (%rax,%rax)
>>>>> .text
>>>>> Filename: fmatest.jl
>>>>> pushq %rbp
>>>>> movq %rsp, %rbp
>>>>> movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
>>>>> Source line: 3
>>>>> movsd (%rax), %xmm1           # xmm1 = mem[0],zero
>>>>> movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
>>>>> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
>>>>> movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
>>>>> popq %rbp
>>>>> jmpq *%rax
>>>>> nopl (%rax)
>>>>>
>>>>> It looks like I end up with the same native code whether or not I use an
>>>>> explicit muladd, but is that native code actually doing an fma? The
>>>>> native code for fma is different, but from a discussion on Gitter it
>>>>> seems that might be a software FMA. This computer is set up with a BIOS
>>>>> setting like "LAPACK optimized" or something like that, so is that
>>>>> messing with something?
>>>>>
>>>>> Computer 2
>>>>>
>>>>> Julia Version 0.6.0-dev.557
>>>>> Commit c7a4897 (2016-09-08 17:50 UTC)
>>>>> Platform Info:
>>>>>   System: NT (x86_64-w64-mingw32)
>>>>>   CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
>>>>>   WORD_SIZE: 64
>>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>>>   LAPACK: libopenblas64_
>>>>>   LIBM: libopenlibm
>>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>>>>>
>>>>>
>>>>> on a 4770k i7, Windows 10, I get the output
>>>>>
>>>>> ; Function Attrs: uwtable
>>>>> define double @julia_f_66153(double) #0 {
>>>>> top:
>>>>>   %1 = fmul double %0, 2.000000e+00
>>>>>   %2 = fadd double %1, 3.000000e+00
>>>>>   ret double %2
>>>>> }
>>>>>
>>>>> ; Function Attrs: uwtable
>>>>> define double @julia_g_66157(double) #0 {
>>>>> top:
>>>>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00,
>>>>> double 3.000000e+00)
>>>>>   ret double %1
>>>>> }
>>>>>
>>>>> ; Function Attrs: uwtable
>>>>> define double @julia_h_66158(double) #0 {
>>>>> top:
>>>>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double
>>>>> 3.000000e+00)
>>>>>   ret double %1
>>>>> }
>>>>> .text
>>>>> Filename: console
>>>>> pushq %rbp
>>>>> movq %rsp, %rbp
>>>>> Source line: 1
>>>>> addsd %xmm0, %xmm0
>>>>> movabsq $534749456, %rax        # imm = 0x1FDFA110
>>>>> addsd (%rax), %xmm0
>>>>> popq %rbp
>>>>> retq
>>>>> nopl (%rax,%rax)
>>>>> .text
>>>>> Filename: console
>>>>> pushq %rbp
>>>>> movq %rsp, %rbp
>>>>> Source line: 2
>>>>> addsd %xmm0, %xmm0
>>>>> movabsq $534749584, %rax        # imm = 0x1FDFA190
>>>>> addsd (%rax), %xmm0
>>>>> popq %rbp
>>>>> retq
>>>>> nopl (%rax,%rax)
>>>>> .text
>>>>> Filename: console
>>>>> pushq %rbp
>>>>> movq %rsp, %rbp
>>>>> movabsq $534749712, %rax        # imm = 0x1FDFA210
>>>>> Source line: 3
>>>>> movsd dcabs164_(%rax), %xmm1  # xmm1 = mem[0],zero
>>>>> movabsq $534749720, %rax        # imm = 0x1FDFA218
>>>>> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
>>>>> movabsq $fma, %rax
>>>>> popq %rbp
>>>>> jmpq *%rax
>>>>> nop
>>>>>
>>>>> This seems to be similar to the first result.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Erik Schnetter <schn...@gmail.com>
>>>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>
>>
>>
>>
>> --
>> Erik Schnetter <schnet...@gmail.com>
>> http://www.perimeterinstitute.ca/personal/eschnetter/
