It should. Yes, please open an issue.

-erik

On Thu, Sep 22, 2016 at 7:46 PM, Chris Rackauckas <rackd...@gmail.com>
wrote:

> So, in the end, is `@fastmath` supposed to be adding FMA? Should I open an
> issue?
>
> On Wednesday, September 21, 2016 at 7:11:14 PM UTC-7, Yichao Yu wrote:
>>
>> On Wed, Sep 21, 2016 at 9:49 PM, Erik Schnetter <schn...@gmail.com>
>> wrote:
>> > I confirm that I can't get Julia to synthesize a `vfmadd` instruction
>> > either... Sorry for sending you on a wild goose chase.
>>
>> -march=haswell does the trick for C (both clang and gcc).
>> The necessary bits for the machine IR optimization (this is not an
>> LLVM IR optimization pass) to do this are the llc option
>> -mcpu=haswell and the function attribute unsafe-fp-math=true.
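>>
>> The Julia-side half (whether the `fast` flags make it into the IR at
>> all) can be checked directly; a minimal sketch, with `fm` as an
>> example name:
>>
>> fm(x) = @fastmath 2.1x + 3.0
>> @code_llvm fm(4.0)   # the fmul/fadd should carry the `fast` flag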
>>
>> >
>> > -erik
>> >
>> > On Wed, Sep 21, 2016 at 9:33 PM, Yichao Yu <yyc...@gmail.com> wrote:
>> >>
>> >> On Wed, Sep 21, 2016 at 9:29 PM, Erik Schnetter <schn...@gmail.com>
>> >> wrote:
>> >> > On Wed, Sep 21, 2016 at 9:22 PM, Chris Rackauckas
>> >> > <rack...@gmail.com> wrote:
>> >> >>
>> >> >> I'm not seeing `@fastmath` apply fma/muladd. I rebuilt the sysimg
>> >> >> and now I get results where g and h apply muladd/fma in the native
>> >> >> code, but a new function k, which is f with `@fastmath` applied
>> >> >> inside, does not apply muladd/fma.
>> >> >>
>> >> >>
>> >> >> https://gist.github.com/ChrisRackauckas/b239e33b4b52bcc28f3922c673a25910
>> >> >>
>> >> >> Should I open an issue?
>> >> >
>> >> >
>> >> > In your case, LLVM apparently thinks that `x + x + 3` is faster to
>> >> > calculate than `2x+3`. If you use a less round number than `2`
>> >> > multiplying `x`, you might see a different behaviour.
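>> >> >
>> >> > For example (a sketch; `f2` is just an example name):
>> >> >
>> >> > f2(x) = 2.1x + 3.0
>> >> > @code_native f2(4.0)   # 2.1x can't be strength-reduced to x + x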
>> >>
>> >> I've personally never seen LLVM create fma from mul and add. We might
>> >> not have the LLVM passes enabled, if LLVM is capable of doing this at
>> >> all.
>> >>
>> >> >
>> >> > -erik
>> >> >
>> >> >
>> >> >> Note that this is on v0.6 Windows. On Linux the sysimg isn't
>> >> >> rebuilding for some reason, so I may need to just build from
>> >> >> source.
>> >> >>
>> >> >> On Wednesday, September 21, 2016 at 6:22:06 AM UTC-7, Erik
>> >> >> Schnetter wrote:
>> >> >>>
>> >> >>> On Wed, Sep 21, 2016 at 1:56 AM, Chris Rackauckas
>> >> >>> <rack...@gmail.com> wrote:
>> >> >>>>
>> >> >>>> Hi,
>> >> >>>>   First of all, does LLVM automatically fma or muladd expressions
>> >> >>>> like `a1*x1 + a2*x2 + a3*x3 + a4*x4`? Or is it required that one
>> >> >>>> explicitly use `muladd` and `fma` on these types of expressions
>> >> >>>> (is there a macro for making this easier)?
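>> >> >>>> For concreteness, explicit use on such a sum would be nested
>> >> >>>> muladd calls, along these lines (a sketch):
>> >> >>>>
>> >> >>>> dot4(a1, x1, a2, x2, a3, x3, a4, x4) =
>> >> >>>>     muladd(a1, x1, muladd(a2, x2, muladd(a3, x3, a4*x4)))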
>> >> >>>
>> >> >>>
>> >> >>> Yes, LLVM will use fma machine instructions -- but only if they
>> >> >>> lead to the same round-off error as using separate multiply and
>> >> >>> add instructions. If you do not care about the details of
>> >> >>> conforming to the IEEE standard, then you can use the `@fastmath`
>> >> >>> macro that enables several optimizations, including this one.
>> >> >>> This is described in the manual
>> >> >>> <http://docs.julialang.org/en/release-0.5/manual/performance-tips/#performance-annotations>.
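>> >> >>>
>> >> >>> The reason this needs a flag: fma rounds once, whereas separate
>> >> >>> multiply and add round twice, so the results can differ in the
>> >> >>> last bits. A small sketch:
>> >> >>>
>> >> >>> x = 1.0 + 2.0^-30
>> >> >>> fma(x, x, -1.0) - (x*x - 1.0)   # nonzero (about 8.7e-19)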
>> >> >>>
>> >> >>>
>> >> >>>>   Secondly, I am wondering if my setup is not applying these
>> >> >>>> operations correctly. Here's my test code:
>> >> >>>>
>> >> >>>> f(x) = 2.0x + 3.0
>> >> >>>> g(x) = muladd(x,2.0, 3.0)
>> >> >>>> h(x) = fma(x,2.0, 3.0)
>> >> >>>>
>> >> >>>> @code_llvm f(4.0)
>> >> >>>> @code_llvm g(4.0)
>> >> >>>> @code_llvm h(4.0)
>> >> >>>>
>> >> >>>> @code_native f(4.0)
>> >> >>>> @code_native g(4.0)
>> >> >>>> @code_native h(4.0)
>> >> >>>>
>> >> >>>> Computer 1
>> >> >>>>
>> >> >>>> Julia Version 0.5.0-rc4+0
>> >> >>>> Commit 9c76c3e* (2016-09-09 01:43 UTC)
>> >> >>>> Platform Info:
>> >> >>>>   System: Linux (x86_64-redhat-linux)
>> >> >>>>   CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>> >> >>>>   WORD_SIZE: 64
>> >> >>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>> >> >>>>   LAPACK: libopenblasp.so.0
>> >> >>>>   LIBM: libopenlibm
>> >> >>>>   LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
>> >> >>>
>> >> >>>
>> >> >>> This looks good: the "broadwell" architecture that LLVM uses
>> >> >>> should imply the respective optimizations. Try with `@fastmath`.
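>> >> >>>
>> >> >>> For example (a sketch; `fm` is just an example name):
>> >> >>>
>> >> >>> fm(x) = @fastmath 2.1x + 3.0
>> >> >>> @code_native fm(4.0)   # on Haswell/Broadwell, look for vfmadd213sd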
>> >> >>>
>> >> >>> -erik
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>>
>> >> >>>> (the COPR nightly on CentOS7) with
>> >> >>>>
>> >> >>>> [crackauc@crackauc2 ~]$ lscpu
>> >> >>>> Architecture:          x86_64
>> >> >>>> CPU op-mode(s):        32-bit, 64-bit
>> >> >>>> Byte Order:            Little Endian
>> >> >>>> CPU(s):                16
>> >> >>>> On-line CPU(s) list:   0-15
>> >> >>>> Thread(s) per core:    1
>> >> >>>> Core(s) per socket:    8
>> >> >>>> Socket(s):             2
>> >> >>>> NUMA node(s):          2
>> >> >>>> Vendor ID:             GenuineIntel
>> >> >>>> CPU family:            6
>> >> >>>> Model:                 79
>> >> >>>> Model name:            Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
>> >> >>>> Stepping:              1
>> >> >>>> CPU MHz:               1200.000
>> >> >>>> BogoMIPS:              6392.58
>> >> >>>> Virtualization:        VT-x
>> >> >>>> L1d cache:             32K
>> >> >>>> L1i cache:             32K
>> >> >>>> L2 cache:              256K
>> >> >>>> L3 cache:              25600K
>> >> >>>> NUMA node0 CPU(s):     0-7
>> >> >>>> NUMA node1 CPU(s):     8-15
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> I get the output
>> >> >>>>
>> >> >>>> define double @julia_f_72025(double) #0 {
>> >> >>>> top:
>> >> >>>>   %1 = fmul double %0, 2.000000e+00
>> >> >>>>   %2 = fadd double %1, 3.000000e+00
>> >> >>>>   ret double %2
>> >> >>>> }
>> >> >>>>
>> >> >>>> define double @julia_g_72027(double) #0 {
>> >> >>>> top:
>> >> >>>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>> >> >>>>   ret double %1
>> >> >>>> }
>> >> >>>>
>> >> >>>> define double @julia_h_72029(double) #0 {
>> >> >>>> top:
>> >> >>>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>> >> >>>>   ret double %1
>> >> >>>> }
>> >> >>>> .text
>> >> >>>> Filename: fmatest.jl
>> >> >>>> pushq %rbp
>> >> >>>> movq %rsp, %rbp
>> >> >>>> Source line: 1
>> >> >>>> addsd %xmm0, %xmm0
>> >> >>>> movabsq $139916162906520, %rax  # imm = 0x7F40C5303998
>> >> >>>> addsd (%rax), %xmm0
>> >> >>>> popq %rbp
>> >> >>>> retq
>> >> >>>> nopl (%rax,%rax)
>> >> >>>> .text
>> >> >>>> Filename: fmatest.jl
>> >> >>>> pushq %rbp
>> >> >>>> movq %rsp, %rbp
>> >> >>>> Source line: 2
>> >> >>>> addsd %xmm0, %xmm0
>> >> >>>> movabsq $139916162906648, %rax  # imm = 0x7F40C5303A18
>> >> >>>> addsd (%rax), %xmm0
>> >> >>>> popq %rbp
>> >> >>>> retq
>> >> >>>> nopl (%rax,%rax)
>> >> >>>> .text
>> >> >>>> Filename: fmatest.jl
>> >> >>>> pushq %rbp
>> >> >>>> movq %rsp, %rbp
>> >> >>>> movabsq $139916162906776, %rax  # imm = 0x7F40C5303A98
>> >> >>>> Source line: 3
>> >> >>>> movsd (%rax), %xmm1           # xmm1 = mem[0],zero
>> >> >>>> movabsq $139916162906784, %rax  # imm = 0x7F40C5303AA0
>> >> >>>> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
>> >> >>>> movabsq $139925776008800, %rax  # imm = 0x7F43022C8660
>> >> >>>> popq %rbp
>> >> >>>> jmpq *%rax
>> >> >>>> nopl (%rax)
>> >> >>>>
>> >> >>>> It looks like explicit muladd or not ends up at the same native
>> >> >>>> code, but is that native code actually doing an fma? The fma
>> >> >>>> native code is different, but from a discussion on the Gitter it
>> >> >>>> seems that might be a software FMA? This computer is set up with
>> >> >>>> the BIOS setting as LAPACK optimized or something like that, so
>> >> >>>> is that messing with something?
>> >> >>>>
>> >> >>>> Computer 2
>> >> >>>>
>> >> >>>> Julia Version 0.6.0-dev.557
>> >> >>>> Commit c7a4897 (2016-09-08 17:50 UTC)
>> >> >>>> Platform Info:
>> >> >>>>   System: NT (x86_64-w64-mingw32)
>> >> >>>>   CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
>> >> >>>>   WORD_SIZE: 64
>> >> >>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
>> >> >>>>   LAPACK: libopenblas64_
>> >> >>>>   LIBM: libopenlibm
>> >> >>>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>> >> >>>>
>> >> >>>>
>> >> >>>> on a 4770k i7, Windows 10, I get the output
>> >> >>>>
>> >> >>>> ; Function Attrs: uwtable
>> >> >>>> define double @julia_f_66153(double) #0 {
>> >> >>>> top:
>> >> >>>>   %1 = fmul double %0, 2.000000e+00
>> >> >>>>   %2 = fadd double %1, 3.000000e+00
>> >> >>>>   ret double %2
>> >> >>>> }
>> >> >>>>
>> >> >>>> ; Function Attrs: uwtable
>> >> >>>> define double @julia_g_66157(double) #0 {
>> >> >>>> top:
>> >> >>>>   %1 = call double @llvm.fmuladd.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>> >> >>>>   ret double %1
>> >> >>>> }
>> >> >>>>
>> >> >>>> ; Function Attrs: uwtable
>> >> >>>> define double @julia_h_66158(double) #0 {
>> >> >>>> top:
>> >> >>>>   %1 = call double @llvm.fma.f64(double %0, double 2.000000e+00, double 3.000000e+00)
>> >> >>>>   ret double %1
>> >> >>>> }
>> >> >>>> .text
>> >> >>>> Filename: console
>> >> >>>> pushq %rbp
>> >> >>>> movq %rsp, %rbp
>> >> >>>> Source line: 1
>> >> >>>> addsd %xmm0, %xmm0
>> >> >>>> movabsq $534749456, %rax        # imm = 0x1FDFA110
>> >> >>>> addsd (%rax), %xmm0
>> >> >>>> popq %rbp
>> >> >>>> retq
>> >> >>>> nopl (%rax,%rax)
>> >> >>>> .text
>> >> >>>> Filename: console
>> >> >>>> pushq %rbp
>> >> >>>> movq %rsp, %rbp
>> >> >>>> Source line: 2
>> >> >>>> addsd %xmm0, %xmm0
>> >> >>>> movabsq $534749584, %rax        # imm = 0x1FDFA190
>> >> >>>> addsd (%rax), %xmm0
>> >> >>>> popq %rbp
>> >> >>>> retq
>> >> >>>> nopl (%rax,%rax)
>> >> >>>> .text
>> >> >>>> Filename: console
>> >> >>>> pushq %rbp
>> >> >>>> movq %rsp, %rbp
>> >> >>>> movabsq $534749712, %rax        # imm = 0x1FDFA210
>> >> >>>> Source line: 3
>> >> >>>> movsd (%rax), %xmm1           # xmm1 = mem[0],zero
>> >> >>>> movabsq $534749720, %rax        # imm = 0x1FDFA218
>> >> >>>> movsd (%rax), %xmm2           # xmm2 = mem[0],zero
>> >> >>>> movabsq $fma, %rax
>> >> >>>> popq %rbp
>> >> >>>> jmpq *%rax
>> >> >>>> nop
>> >> >>>>
>> >> >>>> This seems to be similar to the first result.
>> >> >>>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> Erik Schnetter <schn...@gmail.com>
>> >> >>> http://www.perimeterinstitute.ca/personal/eschnetter/
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Erik Schnetter <schn...@gmail.com>
>> >> > http://www.perimeterinstitute.ca/personal/eschnetter/
>> >
>> >
>> >
>> >
>> > --
>> > Erik Schnetter <schn...@gmail.com>
>> > http://www.perimeterinstitute.ca/personal/eschnetter/
>>
>


-- 
Erik Schnetter <schnet...@gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/
