hey guys,
so attached you'll find text files with the @code_native output for the
instructions
- r * x[1,:]
- cis(dotprods)
- sum(imexp) * sum(conj(imexp))
for julia 0.5.
The hardware I run on is a Haswell i5 machine, a Haswell i7 machine, and an
Ivy Bridge i5 machine. It turned out that on the Haswell i5 machine the code
also runs fast. Only the Haswell i7 machine is the slow one. This really drove
me nuts. First I thought it was the OS, then the architecture, and now it's
just from i5 to i7... Anyway, I don't know anything about x86 assembly,
but the julia 0.4.5 code is the same on all machines. However, for the dot
product, the 0.5 code already has 2 different instructions on the i5 vs.
the i7 (lines 44 & 47). The same goes for the cis call (line 149...). And the
Ivy Bridge i5 code is similar to the Haswell i5. I also included versioninfo()
at the top of each file. So you could just look at a vimdiff of the julia 0.5
files... Can anyone make sense of this?
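A minimal sketch of how such per-machine dumps could be produced for diffing. It assumes current Julia, where code_native and versioninfo live in InteractiveUtils and accept an IO argument. The helper name dump_native and the output file name are hypothetical, and the scalar signatures are stand-ins for the array calls listed above:

```julia
using InteractiveUtils  # provides code_native and versioninfo on Julia >= 0.7

# Hypothetical helper: write versioninfo() plus the native code of two scalar
# kernels into one file, so files from different machines can be compared
# with diff or vimdiff.
function dump_native(path::AbstractString)
    open(path, "w") do io
        versioninfo(io)
        println(io, "--- cis(::Float64) ---")
        code_native(io, cis, (Float64,))
        println(io, "--- *(::Complex{Float64}, ::Complex{Float64}) ---")
        code_native(io, *, (Complex{Float64}, Complex{Float64}))
    end
    return path
end

# One file per host (the naming scheme is an assumption, pick any).
dump_native(joinpath(tempdir(), "native_" * gethostname() * ".txt"))
```

Running this on each machine and diffing the resulting files should show the same kind of differences as the attached outputs, although absolute addresses will always differ between runs.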
I still have to test the binary tarballs. If I remove the cis() call, the
difference is hard to tell: the loop is ~10 times faster and takes more or
less 5 ms on all machines. For the whole loop with the cis() call, the
difference from i5 to i7 is ~50 ms on the i5 vs. 90 ms on the i7.
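For anyone who wants to reproduce the timing, here is a minimal sketch of a loop built from the three calls above. The function name kernel, the array sizes, and the accumulation are assumptions for illustration, not the original code. It uses current Julia syntax, so cis.(dotprods) stands in for the cis(dotprods) call used on 0.5:

```julia
using LinearAlgebra  # matrix-vector product; only needs loading on Julia >= 0.7

# Hypothetical reconstruction of the timed loop.
function kernel(r::Matrix{Float64}, x::Matrix{Float64})
    acc = 0.0
    for i in 1:size(x, 1)
        dotprods = r * x[i, :]      # matrix-vector product (dispatches to BLAS gemv!)
        imexp = cis.(dotprods)      # exp(im * t) for every element t
        acc += real(sum(imexp) * sum(conj(imexp)))
    end
    return acc
end

r = rand(200, 3)   # assumed sizes, the thread does not state the real ones
x = rand(100, 3)
kernel(r, x)                   # warm-up call so compilation is not timed
t = @elapsed kernel(r, x)
println("loop time: ", t * 1e3, " ms")
```

Commenting out the cis line (and summing dotprods instead) gives the variant without the external libm calls, which is the comparison described above.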
Shall I also post the julia 0.4 code?
cheers, Johannes
On Thursday, March 31, 2016 at 10:27:11 AM UTC+2, Milan Bouchet-Valat wrote:
>
> Le mercredi 30 mars 2016 à 15:16 -0700, Johannes Wagner a écrit :
> >
> >
> > > Le mercredi 30 mars 2016 à 04:43 -0700, Johannes Wagner a écrit :
> > > > Sorry for not having expressed myself clearly, I meant the latest
> > > > version of fedora to work fine (24 development). I always used the
> > > > latest julia nightly available on the copr nalimilan repo. Right now
> > > > that is: 0.5.0-dev+3292, Commit 9d527c5*, all use
> > > > LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
> > > >
> > > > peakflops on all machines (hardware identical) is ~1.2..1.5e11.
> > > >
> > > > Fedora 22&23 with julia 0.5 is ~50% slower then 0.4, only on fedora
> > > > 24 julia 0.5 is faster compared to julia 0.4.
> > > Could you try to find a simple code to reproduce the problem? In
> > > particular, it would be useful to check whether this comes from
> > > OpenBLAS differences or whether it also happens with pure Julia code
> > > (typical operations which depend on BLAS are matrix multiplication, as
> > > well as most of linear algebra). Normally, 0.4 and 0.5 should use the
> > > same BLAS, but who knows...
> > well thats what I did, and the 3 simple calls inside the loop are
> > more or less same speed. only the whole loop seems slower. See my
> > code sample from answer march 8th (code gets in same proportions
> > faster when exp(im .* dotprods) is replaced by cis(dotprods) ).
> > So I don't know what I can do then...
> Sorry, somehow I had missed that message. This indeed looks like a code
> generation issue in Julia/LLVM.
>
> > > Can you also confirm that all versioninfo() fields are the same for all
> > > three machines, both for 0.4 and 0.5? We must envision the possibility
> > > that the differences actually come from 0.4.
> > ohoh, right! just noticed that my fedora 24 machine was an ivy bridge
> > which works fast:
> >
> > Julia Version 0.5.0-dev+3292
> > Commit 9d527c5* (2016-03-28 06:55 UTC)
> > Platform Info:
> > System: Linux (x86_64-redhat-linux)
> > CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
> > WORD_SIZE: 64
> > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
> > LAPACK: libopenblasp.so.0
> > LIBM: libopenlibm
> > LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)
> >
> > and the other ones with fed22/23 are haswell, which work slow:
> >
> > Julia Version 0.5.0-dev+3292
> > Commit 9d527c5* (2016-03-28 06:55 UTC)
> > Platform Info:
> > System: Linux (x86_64-redhat-linux)
> > CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
> > WORD_SIZE: 64
> > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
> > LAPACK: libopenblasp.so.0
> > LIBM: libopenlibm
> > LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
> >
> > I just booted an fedora 23 on the ivy bridge machine and it's also fast.
> >
> > Now if I use julia 0.45 on both architectures:
> >
> > Julia Version 0.4.5
> > Commit 2ac304d* (2016-03-18 00:58 UTC)
> > Platform Info:
> > System: Linux (x86_64-redhat-linux)
> > CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
> > WORD_SIZE: 64
> > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
> > LAPACK: libopenblasp.so.0
> > LIBM: libopenlibm
> > LLVM: libLLVM-3.3
> >
> > and:
> >
> > Julia Version 0.4.5
> > Commit 2ac304d* (2016-03-18 00:58 UTC)
> > Platform Info:
> > System: Linux (x86_64-redhat-linux)
> > CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
> > WORD_SIZE: 64
> > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
> > LAPACK: libopenblasp.so.0
> > LIBM: libopenlibm
> > LLVM: libLLVM-3.3
> >
> > there is no speed difference apart from the ~10% or so from the
> > faster haswell machine. So could perhaps be haswell hardware target
> > specific with the change from llvm 3.3 to 3.7.1? Is there anything
> > else I could provide?
> This is certainly an interesting finding. Could you paste somewhere the
> output of @code_native for your function on Sandybridge vs. Haswell,
> for both 0.4 and 0.5?
>
> It would also be useful to check whether the same difference appears if
> you use the generic binary tarballs from http://julialang.org/downloads
> .
>
> Finally, do you get the same result if you remove the call to exp()
> from the loop? (This is the only external function, so it shouldn't be
> affected by changes in Julia.)
>
>
> Regards
>
>
> > Best, Johannes
> >
> > > Regards
> >
> >
> > > > Le mercredi 16 mars 2016 à 09:25 -0700, Johannes Wagner a écrit :
> > > > > just a little update. Tested some other fedoras: Fedora 22 with llvm
> > > > > 3.8 is also slow with julia 0.5, whereas a fedora 24 branch with llvm
> > > > > 3.7 is faster on julia 0.5 compared to julia 0.4, as it should be
> > > > > (speedup from inner loop parts translated into speedup to whole
> > > > > function).
> > > > > >
> > > > > don't know if anyone cares about that... At least the latest version
> > > > > seems to work fine, hope it stays like this into the final fedora 24
> > > > What's the "latest version"? git built from source or RPM nightlies?
> > > > With which LLVM version for each?
> > > >
> > > > If from the RPMs, I've switched them to LLVM 3.8 for a few days, and
> > > > went back to 3.7 because of a build failure. So that might explain the
> > > > difference. You can install the last version which built with LLVM 3.8
> > > > manually from here:
> > > >
> https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-23-x86_64/00167549-julia/
>
>
> > > >
> > > > It would be interesting to compare it with the latest nightly with 3.7.
> > > >
> > > >
> > > > Regards
> > > >
> > > >
> > > >
> > > > > > hey guys,
> > > > > > I just experienced something weird. I have some code that runs fine
> > > > > > on 0.43, then I updated to 0.5dev to test the new Arrays, run same
> > > > > > code and noticed it got about ~50% slower. Then I downgraded back
> > > > > > to 0.43, ran the old code, but speed remained slow. I noticed while
> > > > > > reinstalling 0.43, openblas-threads didn't get installed along with
> > > > > > it. So I manually installed it, but no change.
> > > > > > Does anyone have an idea what could be going on? LLVM on fedora23 is
> > > > > > 3.7
> > > > > >
> > > > > > Cheers, Johannes
> > > > > >
>
julia> versioninfo()
Julia Version 0.5.0-dev+3372
Commit 7f177aa* (2016-04-02 12:18 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
WORD_SIZE: 64
BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblasp.so.0
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
julia> @code_native r * x[1,:]
.text
Filename: matmul.jl
Source line: 0
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r12
pushq %rbx
subq $64, %rsp
movq %rsi, %r14
movq %rdi, %rbx
movq $0, -72(%rbp)
movq $0, -64(%rbp)
movq $0, -56(%rbp)
movq $0, -48(%rbp)
movq $0, -40(%rbp)
movq $10, -88(%rbp)
movabsq $jl_tls_states, %r15
movq (%r15), %rax
movq %rax, -80(%rbp)
leaq -88(%rbp), %rax
movq %rax, (%r15)
movq 24(%rbx), %r12
Source line: 196
movabsq $jl_gc_alloc_1w, %rax
callq *%rax
movabsq $140507545133280, %rdi # imm = 0x7FCA7650E0E0
movq %rdi, -8(%rax)
movq %r12, (%rax)
movq %rax, -72(%rbp)
addq $2353776, %rdi # imm = 0x23EA70
movabsq $jl_new_array, %rcx
movq %rax, %rsi
callq *%rcx
movq %rax, -64(%rbp)
Source line: 88
movq %rbx, -56(%rbp)
movq %r14, -48(%rbp)
movabsq $"gemv!", %r8
movl $78, %esi
movq %rax, %rdi
movq %rbx, %rdx
movq %r14, %rcx
callq *%r8
movq %rax, -40(%rbp)
movq -80(%rbp), %rcx
movq %rcx, (%r15)
addq $64, %rsp
popq %rbx
popq %r12
popq %r14
popq %r15
popq %rbp
retq
nopw (%rax,%rax)
julia> @code_native cis(dotprods)
.text
Filename: operators.jl
Source line: 0
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $120, %rsp
movq %rdi, %r15
xorl %r14d, %r14d
movq $0, -104(%rbp)
movq $0, -96(%rbp)
movq $0, -88(%rbp)
movq $0, -80(%rbp)
movq $0, -72(%rbp)
movq $0, -64(%rbp)
movq $0, -56(%rbp)
movq $0, -48(%rbp)
movq $16, -120(%rbp)
movabsq $jl_tls_states, %rcx
movq (%rcx), %rax
movq %rax, -112(%rbp)
leaq -120(%rbp), %rax
movq %rax, (%rcx)
Source line: 476
movq 8(%r15), %rax
Source line: 83
cmpq $0, %rax
cmovgq %rax, %r14
decq %r14
jo L610
incq %r14
jo L635
leaq -80(%rbp), %r12
leaq -56(%rbp), %r13
movabsq $140454161630160, %rbx # imm = 0x7FBE086943D0
Source line: 303
movq %rbx, -56(%rbp)
movabsq $jl_box_int64, %rax
movq %r14, %rdi
callq *%rax
movq %rax, -48(%rbp)
leaq 32248(%rbx), %rdi
movabsq $140462779530752, %rax # imm = 0x7FC00A13FE00
movl $2, %edx
movq %r13, %rsi
callq *%rax
movq %rax, -104(%rbp)
leaq 31823512(%rbx), %rcx
movq %rcx, -80(%rbp)
movq %rbx, -72(%rbp)
movq %rax, -64(%rbp)
movabsq $jl_apply_generic, %rax
movl $3, %esi
movq %r12, %rdi
movq %rbx, %r12
callq *%rax
movabsq $jl_alloc_array_1d, %rcx
movq %rax, -96(%rbp)
movq (%rax), %rsi
leaq 17535392(%r12), %rdi
callq *%rcx
movq %rax, -152(%rbp)
movq %rax, -88(%rbp)
cmpq $0, %r14
je L484
xorl %r13d, %r13d
xorl %ebx, %ebx
nopl (%rax)
L320:
cmpq 8(%r15), %rbx
jae L523
movq (%r15), %rax
movsd (%rax,%rbx,8), %xmm0 # xmm0 = mem[0],zero
Source line: 320
movsd %xmm0, -128(%rbp)
leaq -272840624(%r12), %rax
callq *%rax
movsd -128(%rbp), %xmm1 # xmm1 = mem[0],zero
movsd %xmm0, -136(%rbp)
ucomisd %xmm1, %xmm1
setp %al
ucomisd %xmm0, %xmm0
setnp %cl
orb %al, %cl
testb $1, %cl
je L560
ucomisd %xmm1, %xmm1
setp -137(%rbp)
leaq -272820608(%r12), %rax
movapd %xmm1, %xmm0
callq *%rax
ucomisd %xmm0, %xmm0
setnp %al
orb -137(%rbp), %al
testb $1, %al
je L585
Source line: 303
incq %rbx
Source line: 4
movq -152(%rbp), %rax
movq (%rax), %rax
movsd %xmm0, 8(%rax,%r13)
movsd -136(%rbp), %xmm0 # xmm0 = mem[0],zero
movsd %xmm0, (%rax,%r13)
Source line: 303
addq $16, %r13
cmpq %rbx, %r14
jne L320
Source line: 4
L484:
movq -112(%rbp), %rax
movabsq $jl_tls_states, %rcx
movq %rax, (%rcx)
movq -152(%rbp), %rax
leaq -40(%rbp), %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
Source line: 303
L523:
movq %rsp, %rsi
addq $-16, %rsi
movq %rsi, %rsp
addq $1, %rbx
movq %rbx, (%rsi)
movabsq $jl_bounds_error_ints, %rax
movl $1, %edx
movq %r15, %rdi
callq *%rax
Source line: 320
L560:
movabsq $jl_domain_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
L585:
movabsq $jl_domain_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
Source line: 83
L610:
movabsq $jl_overflow_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
L635:
movabsq $jl_overflow_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
nopw %cs:(%rax,%rax)
julia> @code_native sum(imexp) * sum(conj(imexp))
.text
Filename: complex.jl
Source line: 0
pushq %rbp
movq %rsp, %rbp
Source line: 124
movsd (%rsi), %xmm0 # xmm0 = mem[0],zero
movsd 8(%rsi), %xmm1 # xmm1 = mem[0],zero
movsd (%rdx), %xmm2 # xmm2 = mem[0],zero
movsd 8(%rdx), %xmm3 # xmm3 = mem[0],zero
movapd %xmm0, %xmm4
mulsd %xmm2, %xmm4
movapd %xmm1, %xmm5
mulsd %xmm3, %xmm5
subsd %xmm5, %xmm4
mulsd %xmm3, %xmm0
mulsd %xmm2, %xmm1
addsd %xmm0, %xmm1
movsd %xmm1, 8(%rdi)
movsd %xmm4, (%rdi)
movq %rdi, %rax
popq %rbp
retq
nopw %cs:(%rax,%rax)
julia> versioninfo()
Julia Version 0.5.0-dev+3372
Commit 7f177aa* (2016-04-02 12:18 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
WORD_SIZE: 64
BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblasp.so.0
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
julia> @code_native r * x[1,:]
.text
Filename: matmul.jl
Source line: 0
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r12
pushq %rbx
subq $64, %rsp
movq %rsi, %r14
movq %rdi, %rbx
movq $0, -72(%rbp)
movq $0, -64(%rbp)
movq $0, -56(%rbp)
movq $0, -48(%rbp)
movq $0, -40(%rbp)
movq $10, -88(%rbp)
movabsq $jl_tls_states, %r15
movq (%r15), %rax
movq %rax, -80(%rbp)
leaq -88(%rbp), %rax
movq %rax, (%r15)
movq 24(%rbx), %r12
Source line: 196
movabsq $jl_gc_alloc_1w, %rax
callq *%rax
movabsq $140656226482736, %rdi # imm = 0x7FED146A3A30
leaq 661568(%rdi), %rcx
movq %rcx, -8(%rax)
movq %r12, (%rax)
movq %rax, -72(%rbp)
movabsq $jl_new_array, %rcx
movq %rax, %rsi
callq *%rcx
movq %rax, -64(%rbp)
Source line: 88
movq %rbx, -56(%rbp)
movq %r14, -48(%rbp)
movabsq $"gemv!", %r8
movl $78, %esi
movq %rax, %rdi
movq %rbx, %rdx
movq %r14, %rcx
callq *%r8
movq %rax, -40(%rbp)
movq -80(%rbp), %rcx
movq %rcx, (%r15)
addq $64, %rsp
popq %rbx
popq %r12
popq %r14
popq %r15
popq %rbp
retq
nopw (%rax,%rax)
julia> @code_native cis(dotprods)
.text
Filename: operators.jl
Source line: 0
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $136, %rsp
xorl %r14d, %r14d
movq $0, -112(%rbp)
movq $0, -104(%rbp)
movq $0, -96(%rbp)
movq $0, -88(%rbp)
movq $0, -80(%rbp)
movq $0, -72(%rbp)
movq $0, -64(%rbp)
movq $0, -56(%rbp)
movq $0, -48(%rbp)
movq $20, -136(%rbp)
movabsq $jl_tls_states, %rcx
movq (%rcx), %rax
movq %rax, -128(%rbp)
leaq -136(%rbp), %rax
movq %rax, (%rcx)
Source line: 476
movq %rdi, -120(%rbp)
movq 8(%rdi), %rax
Source line: 83
cmpq $0, %rax
cmovgq %rax, %r14
decq %r14
jo L650
movq %rdi, -168(%rbp)
incq %r14
jo L675
leaq -80(%rbp), %r15
leaq -56(%rbp), %r12
movabsq $140073459172304, %rbx # imm = 0x7F6564C6C3D0
Source line: 303
movq %rbx, -56(%rbp)
movabsq $jl_box_int64, %rax
movq %r14, %rdi
callq *%rax
movq %rax, -48(%rbp)
leaq 32552(%rbx), %rdi
movabsq $140082077063248, %rax # imm = 0x7F6766715850
movl $2, %edx
movq %r12, %rsi
callq *%rax
movq %rax, -112(%rbp)
leaq 33473648(%rbx), %rcx
movq %rcx, -80(%rbp)
movq %rbx, -72(%rbp)
movq %rax, -64(%rbp)
movabsq $jl_apply_generic, %rax
movl $3, %esi
movq %r15, %rdi
movq %rbx, %r12
callq *%rax
movabsq $jl_alloc_array_1d, %rcx
movq %rax, -104(%rbp)
movq (%rax), %rsi
leaq 10193440(%r12), %rdi
callq *%rcx
movq %rax, -160(%rbp)
movq %rax, -96(%rbp)
cmpq $0, %r14
je L527
xorl %r13d, %r13d
xorl %ebx, %ebx
nopw %cs:(%rax,%rax)
L352:
movq -168(%rbp), %rdi
movq %rdi, -88(%rbp)
cmpq 8(%rdi), %rbx
jae L566
movq (%rdi), %rax
movsd (%rax,%rbx,8), %xmm0 # xmm0 = mem[0],zero
Source line: 320
movsd %xmm0, -144(%rbp)
leaq -60767328(%r12), %rax
callq *%rax
movsd -144(%rbp), %xmm1 # xmm1 = mem[0],zero
movsd %xmm0, -152(%rbp)
ucomisd %xmm1, %xmm1
setp %al
ucomisd %xmm0, %xmm0
setnp %cl
orb %al, %cl
testb $1, %cl
je L600
ucomisd %xmm1, %xmm1
setp %r15b
leaq -60747440(%r12), %rax
movapd %xmm1, %xmm0
callq *%rax
ucomisd %xmm0, %xmm0
setnp %al
orb %r15b, %al
testb $1, %al
je L625
Source line: 303
incq %rbx
Source line: 4
movq -160(%rbp), %rax
movq (%rax), %rax
movsd %xmm0, 8(%rax,%r13)
movsd -152(%rbp), %xmm0 # xmm0 = mem[0],zero
movsd %xmm0, (%rax,%r13)
Source line: 303
addq $16, %r13
cmpq %rbx, %r14
jne L352
Source line: 4
L527:
movq -128(%rbp), %rax
movabsq $jl_tls_states, %rcx
movq %rax, (%rcx)
movq -160(%rbp), %rax
leaq -40(%rbp), %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
Source line: 303
L566:
movq %rsp, %rsi
addq $-16, %rsi
movq %rsi, %rsp
addq $1, %rbx
movq %rbx, (%rsi)
movabsq $jl_bounds_error_ints, %rax
movl $1, %edx
callq *%rax
Source line: 320
L600:
movabsq $jl_domain_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
L625:
movabsq $jl_domain_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
Source line: 83
L650:
movabsq $jl_overflow_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
L675:
movabsq $jl_overflow_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
nopl (%rax)
julia> @code_native sum(imexp) * sum(conj(imexp))
.text
Filename: complex.jl
Source line: 0
pushq %rbp
movq %rsp, %rbp
Source line: 124
movsd (%rsi), %xmm0 # xmm0 = mem[0],zero
movsd 8(%rsi), %xmm1 # xmm1 = mem[0],zero
movsd (%rdx), %xmm2 # xmm2 = mem[0],zero
movsd 8(%rdx), %xmm3 # xmm3 = mem[0],zero
movapd %xmm0, %xmm4
mulsd %xmm2, %xmm4
movapd %xmm1, %xmm5
mulsd %xmm3, %xmm5
subsd %xmm5, %xmm4
mulsd %xmm3, %xmm0
mulsd %xmm2, %xmm1
addsd %xmm0, %xmm1
movsd %xmm1, 8(%rdi)
movsd %xmm4, (%rdi)
movq %rdi, %rax
popq %rbp
retq
nopw %cs:(%rax,%rax)
julia> versioninfo()
Julia Version 0.5.0-dev+3390
Commit a9e7e86* (2016-04-04 12:47 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
WORD_SIZE: 64
BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblasp.so.0
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)
julia> @code_native r * x[1,:]
.text
Filename: matmul.jl
Source line: 0
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r12
pushq %rbx
subq $48, %rsp
movq %rsi, %r14
movq %rdi, %r12
movq $0, -56(%rbp)
movq $0, -48(%rbp)
movq $0, -40(%rbp)
movq $6, -72(%rbp)
movabsq $jl_tls_states, %r15
movq (%r15), %rax
movq %rax, -64(%rbp)
leaq -72(%rbp), %rax
movq %rax, (%r15)
movq 24(%r12), %rbx
Source line: 196
movabsq $jl_gc_alloc_1w, %rax
callq *%rax
movabsq $140467867165008, %rdi # imm = 0x7FC139532150
movq %rdi, -8(%rax)
movq %rbx, (%rax)
movq %rax, -56(%rbp)
addq $2290240, %rdi # imm = 0x22F240
movabsq $jl_new_array, %rcx
movq %rax, %rsi
callq *%rcx
movq %rax, -48(%rbp)
Source line: 88
movabsq $"gemv!", %rbx
movl $78, %esi
movq %rax, %rdi
movq %r12, %rdx
movq %r14, %rcx
callq *%rbx
movq %rax, -40(%rbp)
movq -64(%rbp), %rcx
movq %rcx, (%r15)
addq $48, %rsp
popq %rbx
popq %r12
popq %r14
popq %r15
popq %rbp
retq
nop
julia> @code_native cis(dotprods)
.text
Filename: operators.jl
Source line: 0
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $120, %rsp
movq %rdi, %r15
xorl %r14d, %r14d
movq $0, -104(%rbp)
movq $0, -96(%rbp)
movq $0, -88(%rbp)
movq $0, -80(%rbp)
movq $0, -72(%rbp)
movq $0, -64(%rbp)
movq $0, -56(%rbp)
movq $0, -48(%rbp)
movq $16, -120(%rbp)
movabsq $jl_tls_states, %rcx
movq (%rcx), %rax
movq %rax, -112(%rbp)
leaq -120(%rbp), %rax
movq %rax, (%rcx)
Source line: 476
movq 8(%r15), %rax
Source line: 83
cmpq $0, %rax
cmovgq %rax, %r14
decq %r14
jo L610
incq %r14
jo L635
leaq -80(%rbp), %r12
leaq -56(%rbp), %r13
movabsq $140543751914448, %rbx # imm = 0x7FD2E46883D0
Source line: 303
movq %rbx, -56(%rbp)
movabsq $jl_box_int64, %rax
movq %r14, %rdi
callq *%rax
movq %rax, -48(%rbp)
leaq 32528(%rbx), %rdi
movabsq $140552369804176, %rax # imm = 0x7FD4E6131390
movl $2, %edx
movq %r13, %rsi
callq *%rax
movq %rax, -104(%rbp)
leaq 26527528(%rbx), %rcx
movq %rcx, -80(%rbp)
movq %rbx, -72(%rbp)
movq %rax, -64(%rbp)
movabsq $jl_apply_generic, %rax
movl $3, %esi
movq %r12, %rdi
movq %rbx, %r12
callq *%rax
movabsq $jl_alloc_array_1d, %rcx
movq %rax, -96(%rbp)
movq (%rax), %rsi
leaq 2314368(%r12), %rdi
callq *%rcx
movq %rax, -152(%rbp)
movq %rax, -88(%rbp)
cmpq $0, %r14
je L484
xorl %r13d, %r13d
xorl %ebx, %ebx
nopl (%rax)
L320:
cmpq 8(%r15), %rbx
jae L523
movq (%r15), %rax
movsd (%rax,%rbx,8), %xmm0 # xmm0 = mem[0],zero
Source line: 320
movsd %xmm0, -128(%rbp)
leaq -335968048(%r12), %rax
callq *%rax
movsd -128(%rbp), %xmm1 # xmm1 = mem[0],zero
movsd %xmm0, -136(%rbp)
ucomisd %xmm1, %xmm1
setp %al
ucomisd %xmm0, %xmm0
setnp %cl
orb %al, %cl
testb $1, %cl
je L560
ucomisd %xmm1, %xmm1
setp -137(%rbp)
leaq -335948544(%r12), %rax
movapd %xmm1, %xmm0
callq *%rax
ucomisd %xmm0, %xmm0
setnp %al
orb -137(%rbp), %al
testb $1, %al
je L585
Source line: 303
incq %rbx
Source line: 4
movq -152(%rbp), %rax
movq (%rax), %rax
movsd %xmm0, 8(%rax,%r13)
movsd -136(%rbp), %xmm0 # xmm0 = mem[0],zero
movsd %xmm0, (%rax,%r13)
Source line: 303
addq $16, %r13
cmpq %rbx, %r14
jne L320
Source line: 4
L484:
movq -112(%rbp), %rax
movabsq $jl_tls_states, %rcx
movq %rax, (%rcx)
movq -152(%rbp), %rax
leaq -40(%rbp), %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
Source line: 303
L523:
movq %rsp, %rsi
addq $-16, %rsi
movq %rsi, %rsp
addq $1, %rbx
movq %rbx, (%rsi)
movabsq $jl_bounds_error_ints, %rax
movl $1, %edx
movq %r15, %rdi
callq *%rax
Source line: 320
L560:
movabsq $jl_domain_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
L585:
movabsq $jl_domain_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
Source line: 83
L610:
movabsq $jl_overflow_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
L635:
movabsq $jl_overflow_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw, %rax
callq *%rax
nopw %cs:(%rax,%rax)
julia> @code_native sum(imexp) * sum(conj(imexp))
.text
Filename: complex.jl
Source line: 0
pushq %rbp
movq %rsp, %rbp
Source line: 124
movsd (%rsi), %xmm0 # xmm0 = mem[0],zero
movsd 8(%rsi), %xmm1 # xmm1 = mem[0],zero
movsd (%rdx), %xmm2 # xmm2 = mem[0],zero
movsd 8(%rdx), %xmm3 # xmm3 = mem[0],zero
movapd %xmm0, %xmm4
mulsd %xmm2, %xmm4
movapd %xmm1, %xmm5
mulsd %xmm3, %xmm5
subsd %xmm5, %xmm4
mulsd %xmm3, %xmm0
mulsd %xmm2, %xmm1
addsd %xmm0, %xmm1
movsd %xmm1, 8(%rdi)
movsd %xmm4, (%rdi)
movq %rdi, %rax
popq %rbp
retq
nopw %cs:(%rax,%rax)