Hey Milan,
consider the following code:
Pkg.clone("git://github.com/kbarbary/TimeIt.jl.git")
using TimeIt
v = rand(3)
r = rand(6000,3)
x = linspace(0.0, 10.0, 500) * (v./sqrt(sumabs2(v)))'
dotprods = r * x[2,:]
imexp = cis(dotprods)
sumprod = sum(imexp) * sum(conj(imexp))
f(r, x) = r * x[2,:]
g(r, x) = r * x'
h(imexp) = sum(imexp) * sum(conj(imexp))
function s(r, x)
    result = zeros(size(x,1))
    for i = 1:size(x,1)
        imexp = cis(r * x[i,:])
        result[i] = sum(imexp) * sum(conj(imexp))
    end
    return result
end
@timeit zeros(size(x,1))
@timeit f(r,x)
@timeit g(r,x)
@timeit cis(dotprods)
@timeit h(imexp)
@timeit s(r,x)
@code_native f(r,x)
@code_native g(r,x)
@code_native cis(dotprods)
@code_native h(imexp)
@code_native s(r,x)
I attached the output of the last call, @code_native s(r,x), as text files, both
for the binary tarball and for the latest nalimilan update. For the whole
function s, the generated code looks essentially the same everywhere.
Yet s(r,x) is the one call that is considerably slower on the i7 than on the i5,
whereas all the other timed calls run at more or less the same speed on both.
Here are the timings, in the same order as the @timeit calls above (each was run
repeatedly so that compile time is not included, in particular for the last one):
i7:
1000000 loops, best of 3: 871.68 ns per loop
10000 loops, best of 3: 10.84 µs per loop
100 loops, best of 3: 5.19 ms per loop
10000 loops, best of 3: 71.35 µs per loop
10000 loops, best of 3: 26.65 µs per loop
1 loops, best of 3: 159.99 ms per loop
i5:
100000 loops, best of 3: 1.01 µs per loop
10000 loops, best of 3: 10.93 µs per loop
100 loops, best of 3: 5.09 ms per loop
10000 loops, best of 3: 75.93 µs per loop
10000 loops, best of 3: 29.23 µs per loop
1 loops, best of 3: 103.70 ms per loop
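As a rough cross-check of the figures above, the per-call timings can be summed into an expected loop time. This is only a sketch for a current Julia version (`round(x, digits=...)` is not the 0.5 syntax). It assumes 500 iterations (the rows of x) and that each iteration costs one f(r,x), one cis(dotprods) and one h(imexp) call. The names `per_call_us`, `measured_ms` and `expected_ms` are made up for illustration.

```julia
# Per-call timings from the lists above, in microseconds,
# in the order f(r,x), cis(dotprods), h(imexp).
# zeros(...) is called only once per s(r,x) and is negligible here.
per_call_us = Dict("i7" => (10.84, 71.35, 26.65),
                   "i5" => (10.93, 75.93, 29.23))

# Measured time of one s(r,x) call, in milliseconds.
measured_ms = Dict("i7" => 159.99, "i5" => 103.70)

niter = 500  # assumption: one loop iteration per row of x

# Expected loop time if the loop cost were just the sum of its parts.
expected_ms(cpu) = niter * sum(per_call_us[cpu]) / 1000

for cpu in ("i7", "i5")
    e = expected_ms(cpu)
    println(cpu, ": expected ", round(e, digits=1), " ms, measured ",
            measured_ms[cpu], " ms, ratio ", round(measured_ms[cpu] / e, digits=2))
end
```

By this estimate the three calls account for roughly 54 ms (i7) and 58 ms (i5) of the loop, so the measured totals are about 2.9 and 1.8 times the sum of their parts.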
So judging by the individual calls made inside s(r,x), the i7 should be faster,
but s(r,x) as a whole is slower there (about 160 ms vs. 104 ms). I'm still
clueless and don't know how to pin this down any further...
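One way to narrow this down might be to time each stage inside the loop itself rather than in isolation. The following is a minimal sketch, not the code that produced the numbers above. `s_staged` is a hypothetical instrumented copy of s, written for a current Julia version (broadcast `cis.` and `conj.`; on the 0.5-dev builds the plain `cis(...)` and `conj(...)` calls would be used), and `real(...)` only makes the Complex-to-Float64 conversion explicit.

```julia
# Hypothetical instrumented variant of s(r, x): same computation, but the
# wall-clock time of each stage is accumulated separately.
function s_staged(r, x)
    n = size(x, 1)
    result = zeros(n)
    t = zeros(4)  # seconds in: row indexing, r * row, cis, sum/conj + store
    for i = 1:n
        t0 = time_ns()
        row = x[i, :]
        t1 = time_ns()
        dotprods = r * row
        t2 = time_ns()
        imexp = cis.(dotprods)
        t3 = time_ns()
        result[i] = real(sum(imexp) * sum(conj.(imexp)))
        t4 = time_ns()
        t[1] += (t1 - t0) / 1e9
        t[2] += (t2 - t1) / 1e9
        t[3] += (t3 - t2) / 1e9
        t[4] += (t4 - t3) / 1e9
    end
    return result, t
end

# Same shapes as in the example above, built with current-Julia functions.
v = rand(3)
r = rand(6000, 3)
x = collect(range(0.0, stop=10.0, length=500)) * (v ./ sqrt(sum(abs2, v)))'

s_staged(r, x)                 # first call includes compilation
result, t = s_staged(r, x)     # second call gives the per-stage totals
println("seconds per stage (index, multiply, cis, sum): ", t)
```

Running this a few times on both machines and comparing `t` would show which stage, if any, accounts for the extra time on the i7.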
cheers, Johannes
On Monday, April 4, 2016 at 10:48:40 PM UTC+2, Milan Bouchet-Valat wrote:
>
> On Monday, April 4, 2016 at 10:36 -0700, Johannes Wagner wrote:
> > hey guys,
> > so attached you find text files with @code_native output for the
> > instructions
> > - r * x[1,:]
> > - cis(imexp)
> > - sum(imexp) * sum(conj(imexp))
> >
> > for julia 0.5.
> >
> > Hardware I run on is a Haswell i5 machine, a Haswell i7 machine, and
> > an Ivy Bridge i5 machine. It turned out that on a Haswell i5 machine the
> > code also runs fast. Only the Haswell i7 machine is the slow one. This
> > really drove me nuts. First I thought it was the OS, then the
> > architecture, and now it's just i5 vs. i7... Anyway, I don't
> > know anything about x86 assembly, but the Julia 0.4.5 code is the same
> > on all machines. However, for the dot product, the 0.5 code already
> > has 2 different instructions on the i5 vs. the i7 (lines 44 and 47).
> > The same goes for the cis call (line 149...). And the Ivy Bridge i5 code
> > is similar to the Haswell i5. I also included versioninfo() at the top
> > of the file. So you could just look at a vimdiff of the Julia 0.5
> > files... Can anyone make sense of this?
> I'm definitely not an expert in assembly, but that additional leaq
> instruction on line 44, and the additional movq instructions on lines
> 111, 151 and 152, really look weird.
>
> Could you do the same test with the binary tarballs? If the difference
> persists, you should open an issue on GitHub to track this.
>
> BTW, please wrap the first call in a function to ensure it is
> specialized for the argument types, i.e.:
>
> f(r, x) = r * x[1,:]
> @code_native f(r, x)
>
> Also, please check whether you still see the difference with this code:
> g(r, x) = r * x
> @code_native g(r, x[1,:])
>
> What are the types of r and x? Could you provide a simple reproducible
> example with dummy values?
>
> > I will still test the binary tarballs. If I remove the cis() call,
> > the difference is hard to tell: the loop is ~10 times faster and takes
> > more or less 5 ms on all machines. For the whole loop with the cis()
> > call, the time goes from ~50 ms on the i5 to ~90 ms on the i7.
> >
> > Shall I also post the julia 0.4 code?
> If it's identical for all machines, I don't think it's needed.
>
>
> Regards
>
>
> > cheers, Johannes
> >
> >
> >
> > > On Wednesday, March 30, 2016 at 15:16 -0700, Johannes Wagner wrote:
> > > >
> > > >
> > > > > On Wednesday, March 30, 2016 at 04:43 -0700, Johannes Wagner wrote:
> > > > > > Sorry for not having expressed myself clearly, I meant that the
> > > > > > latest version of Fedora works fine (24 development). I always
> > > > > > used the latest Julia nightly available on the copr nalimilan
> > > > > > repo. Right now that is: 0.5.0-dev+3292, Commit 9d527c5*, all use
> > > > > > LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
> > > > > >
> > > > > > peakflops on all machines (hardware identical) is ~1.2..1.5e11.
> > > > > >
> > > > > > Fedora 22&23 with Julia 0.5 is ~50% slower than 0.4; only on
> > > > > > Fedora 24 is Julia 0.5 faster than Julia 0.4.
> > > > > Could you try to find a simple code to reproduce the problem? In
> > > > > particular, it would be useful to check whether this comes from
> > > > > OpenBLAS differences or whether it also happens with pure Julia
> > > > > code (typical operations which depend on BLAS are matrix
> > > > > multiplication, as well as most of linear algebra). Normally, 0.4
> > > > > and 0.5 should use the same BLAS, but who knows...
> > > > Well, that's what I did, and the 3 simple calls inside the loop are
> > > > more or less the same speed. Only the whole loop seems slower. See my
> > > > code sample from my answer of March 8th (the code gets faster in the
> > > > same proportions when exp(im .* dotprods) is replaced by
> > > > cis(dotprods)). So I don't know what else I can do...
> > > Sorry, somehow I had missed that message. This indeed looks like a
> > > code generation issue in Julia/LLVM.
> > >
> > > > > Can you also confirm that all versioninfo() fields are the same
> > > > > for all three machines, both for 0.4 and 0.5? We must consider the
> > > > > possibility that the differences actually come from 0.4.
> > > > Oh, right! I just noticed that my Fedora 24 machine is an Ivy Bridge,
> > > > which runs fast:
> > > >
> > > > Julia Version 0.5.0-dev+3292
> > > > Commit 9d527c5* (2016-03-28 06:55 UTC)
> > > > Platform Info:
> > > > System: Linux (x86_64-redhat-linux)
> > > > CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
> > > > WORD_SIZE: 64
> > > > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
> > > > LAPACK: libopenblasp.so.0
> > > > LIBM: libopenlibm
> > > > LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)
> > > >
> > > > and the other ones, with Fedora 22/23, are Haswell, which run slowly:
> > > >
> > > > Julia Version 0.5.0-dev+3292
> > > > Commit 9d527c5* (2016-03-28 06:55 UTC)
> > > > Platform Info:
> > > > System: Linux (x86_64-redhat-linux)
> > > > CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
> > > > WORD_SIZE: 64
> > > > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
> > > > LAPACK: libopenblasp.so.0
> > > > LIBM: libopenlibm
> > > > LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
> > > >
> > > > I just booted Fedora 23 on the Ivy Bridge machine and it's also fast.
> > > >
> > > > Now if I use Julia 0.4.5 on both architectures:
> > > >
> > > > Julia Version 0.4.5
> > > > Commit 2ac304d* (2016-03-18 00:58 UTC)
> > > > Platform Info:
> > > > System: Linux (x86_64-redhat-linux)
> > > > CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
> > > > WORD_SIZE: 64
> > > > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
> > > > LAPACK: libopenblasp.so.0
> > > > LIBM: libopenlibm
> > > > LLVM: libLLVM-3.3
> > > >
> > > > and:
> > > >
> > > > Julia Version 0.4.5
> > > > Commit 2ac304d* (2016-03-18 00:58 UTC)
> > > > Platform Info:
> > > > System: Linux (x86_64-redhat-linux)
> > > > CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
> > > > WORD_SIZE: 64
> > > > BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
> > > > LAPACK: libopenblasp.so.0
> > > > LIBM: libopenlibm
> > > > LLVM: libLLVM-3.3
> > > >
> > > > there is no speed difference apart from the ~10% or so expected from
> > > > the faster Haswell machine. So could this perhaps be specific to the
> > > > Haswell hardware target, appearing with the change from LLVM 3.3 to
> > > > 3.7.1? Is there anything else I could provide?
> > > This is certainly an interesting finding. Could you paste somewhere
> > > the output of @code_native for your function on Sandybridge vs.
> > > Haswell, for both 0.4 and 0.5?
> > >
> > > It would also be useful to check whether the same difference appears
> > > if you use the generic binary tarballs
> > > (http://julialang.org/downloads).
> > >
> > > Finally, do you get the same result if you remove the call to exp()
> > > from the loop? (This is the only external function, so it shouldn't
> > > be affected by changes in Julia.)
> > >
> > >
> > > Regards
> > >
> > >
> > > > Best, Johannes
> > > >
> > > > > Regards
> > > >
> > > >
> > > > > > On Wednesday, March 16, 2016 at 09:25 -0700, Johannes Wagner wrote:
> > > > > > > Just a little update. I tested some other Fedoras: Fedora 22
> > > > > > > with LLVM 3.8 is also slow with Julia 0.5, whereas a Fedora 24
> > > > > > > branch with LLVM 3.7 is faster on Julia 0.5 compared to Julia
> > > > > > > 0.4, as it should be (the speedup from the inner loop parts
> > > > > > > translates into a speedup of the whole function).
> > > > > > >
> > > > > > > I don't know if anyone cares about that... At least the latest
> > > > > > > version seems to work fine; I hope it stays like this in the
> > > > > > > final Fedora 24.
> > > > > > What's the "latest version"? git built from source or RPM
> > > > > > nightlies? With which LLVM version for each?
> > > > > >
> > > > > > If from the RPMs, I've switched them to LLVM 3.8 for a few days,
> > > > > > and went back to 3.7 because of a build failure. So that might
> > > > > > explain the difference. You can install the last version which
> > > > > > built with LLVM 3.8 manually from here:
> > > > > > https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-23-x86_64/00167549-julia/
> > > > > >
> > > > > > It would be interesting to compare it with the latest nightly
> > > > > > with 3.7.
> > > > > >
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > >
> > > > > >
> > > > > > > > Hey guys,
> > > > > > > > I just experienced something weird. I have some code that
> > > > > > > > runs fine on 0.4.3; then I updated to 0.5dev to test the new
> > > > > > > > Arrays, ran the same code, and noticed it got about 50%
> > > > > > > > slower. Then I downgraded back to 0.4.3 and ran the old code,
> > > > > > > > but the speed remained slow. I noticed that while
> > > > > > > > reinstalling 0.4.3, openblas-threads didn't get installed
> > > > > > > > along with it. So I installed it manually, but no change.
> > > > > > > > Does anyone have an idea what could be going on? LLVM on
> > > > > > > > Fedora 23 is 3.7.
> > > > > > > >
> > > > > > > > Cheers, Johannes
> > > > > > > >
>
julia> versioninfo()
Julia Version 0.5.0-dev+3404
Commit db2c6ab* (2016-04-05 07:45 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
WORD_SIZE: 64
BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblasp.so.0
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
julia> @code_native s(r,x)
.text
Filename: none
Source line: 0
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $248, %rsp
movq %rsi, -272(%rbp)
movq %rdi, -280(%rbp)
movabsq $139623632765904, %r12 # imm = 0x7EFCA90883D0
leaq -128(%rbp), %rbx
leaq -104(%rbp), %r14
movq $0, -232(%rbp)
movq $0, -224(%rbp)
movq $0, -216(%rbp)
movq $0, -208(%rbp)
movq $0, -200(%rbp)
movq $0, -192(%rbp)
movq $0, -184(%rbp)
movq $0, -176(%rbp)
movq $0, -168(%rbp)
movq $0, -160(%rbp)
movq $0, -152(%rbp)
movq $0, -144(%rbp)
movq $0, -136(%rbp)
movq $0, -128(%rbp)
movq $0, -120(%rbp)
movq $0, -112(%rbp)
movq $0, -96(%rbp)
movq $36, -248(%rbp)
movabsq $jl_tls_states, %rcx
movq (%rcx), %rax
movq %rax, -240(%rbp)
leaq -248(%rbp), %rax
movq %rax, (%rcx)
movq 24(%rsi), %r15
Source line: 303
movq %r15, -264(%rbp)
movq %r12, -104(%rbp)
movabsq $jl_box_int64, %rax
movq %r15, %rdi
callq *%rax
movq %rax, -96(%rbp)
leaq 31904(%r12), %rdi
movabsq $139632250667936, %rax # imm = 0x7EFEAAB343A0
movl $2, %edx
movq %r14, %rsi
callq *%rax
movq %rax, -232(%rbp)
leaq 29142760(%r12), %rcx
movq %rcx, -128(%rbp)
movq %r12, -120(%rbp)
movq %rax, -112(%rbp)
movabsq $jl_apply_generic, %rax
movl $3, %esi
movq %rbx, %rdi
callq *%rax
movabsq $jl_alloc_array_1d, %rcx
movq %rax, -224(%rbp)
movq (%rax), %rsi
leaq 1563008(%r12), %rdi
callq *%rcx
movq %rax, -216(%rbp)
movabsq $"fill!", %rcx
xorpd %xmm0, %xmm0
movq %rax, %rdi
callq *%rcx
movq %rax, %rbx
movq %rbx, -208(%rbp)
Source line: 83
cmpq $1, %r15
jl L916
xorl %r15d, %r15d
Source line: 5
movabsq $mapreduce, %r13
nopw %cs:(%rax,%rax)
L480:
xorl %ecx, %ecx
Source line: 131
addq $1, %r15
cmpq $1, %r15
jl L502
cmpq -264(%rbp), %r15
setle %cl
L502:
xorl %eax, %eax
andb $1, %cl
movb %cl, -250(%rbp)
movb -250(%rbp), %cl
andb $1, %cl
je L526
movb $1, %al
Source line: 132
L526:
andb $1, %al
movb %al, -249(%rbp)
movb -249(%rbp), %al
andb $1, %al
jne L612
movabsq $jl_gc_alloc_2w, %rax
callq *%rax
movq %rax, -200(%rbp)
leaq 3601136(%r12), %rcx
movq %rcx, -8(%rax)
movq %r15, (%rax)
leaq 1361216(%r12), %rcx
movq %rcx, 8(%rax)
movq -272(%rbp), %rdi
movq %rax, %rsi
movabsq $throw_boundserror, %rax
callq *%rax
Source line: 215
L612:
leaq 5600400(%r12), %rax
movq %rax, -128(%rbp)
movq -272(%rbp), %rax
movq %rax, -120(%rbp)
movq %r15, %rdi
Source line: 303
movabsq $jl_box_int64, %rax
Source line: 215
callq *%rax
movq %rax, -112(%rbp)
leaq 1361216(%r12), %rax
movq %rax, -104(%rbp)
leaq 40224456(%r12), %rdi
movl $4, %edx
leaq -128(%rbp), %rsi
movabsq $_unsafe_getindex, %rax
callq *%rax
movq %rax, -192(%rbp)
movq -280(%rbp), %rdi
movq %rax, %rsi
movabsq $"*", %rax
callq *%rax
movq %rax, -184(%rbp)
movq %rax, %rdi
movabsq $cis, %rax
callq *%rax
movq %rax, %r14
movq %r14, -176(%rbp)
Source line: 5
movq %r14, -168(%rbp)
leaq -72(%rbp), %rdi
movq %r14, %rsi
callq *%r13
movq %r14, -160(%rbp)
movq %r14, %rdi
movabsq $conj, %rax
callq *%rax
movq %rax, -152(%rbp)
leaq -88(%rbp), %rdi
Source line: 246
movq %rax, %rsi
callq *%r13
Source line: 124
movsd -72(%rbp), %xmm0 # xmm0 = mem[0],zero
movsd -88(%rbp), %xmm1 # xmm1 = mem[0],zero
movapd %xmm0, %xmm2
mulsd %xmm1, %xmm2
movsd -64(%rbp), %xmm3 # xmm3 = mem[0],zero
movsd -80(%rbp), %xmm4 # xmm4 = mem[0],zero
movapd %xmm3, %xmm5
mulsd %xmm4, %xmm5
subsd %xmm5, %xmm2
mulsd %xmm4, %xmm0
mulsd %xmm1, %xmm3
addsd %xmm0, %xmm3
movsd %xmm2, -56(%rbp)
movsd %xmm3, -48(%rbp)
movq %rbx, -144(%rbp)
movq %rbx, %rdi
leaq -56(%rbp), %rsi
movq %r15, %rdx
movabsq $"setindex!", %rax
callq *%rax
Source line: 83
cmpq %r15, -264(%rbp)
jne L480
Source line: 7
L916:
movq %rbx, -136(%rbp)
movq -240(%rbp), %rax
movabsq $jl_tls_states, %rcx
movq %rax, (%rcx)
movq %rbx, %rax
addq $248, %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
nopw %cs:(%rax,%rax)
julia> versioninfo()
Julia Version 0.5.0-dev+3313
Commit 5e01b1a (2016-03-29 15:14 UTC)
Platform Info:
System: Linux (x86_64-unknown-linux-gnu)
CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
julia> @code_native s(r,x)
.text
Filename: none
Source line: 0
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $312, %rsp # imm = 0x138
movq %rsi, -304(%rbp)
movq %rdi, -320(%rbp)
movabsq $jl_new_array, %r13
movabsq $140660019585240, %r12 # imm = 0x7FEDF68060D8
leaq -128(%rbp), %r15
movq $0, -272(%rbp)
movq $0, -264(%rbp)
movq $0, -256(%rbp)
movq $0, -248(%rbp)
movq $0, -240(%rbp)
movq $0, -232(%rbp)
movq $0, -224(%rbp)
movq $0, -216(%rbp)
movq $0, -208(%rbp)
movq $0, -200(%rbp)
movq $0, -192(%rbp)
movq $0, -184(%rbp)
movq $0, -176(%rbp)
movq $0, -168(%rbp)
movq $0, -160(%rbp)
movq $0, -152(%rbp)
movq $0, -144(%rbp)
movq $0, -136(%rbp)
movq $0, -128(%rbp)
movq $0, -120(%rbp)
movq $0, -112(%rbp)
movq $0, -96(%rbp)
movq $46, -288(%rbp)
movabsq $jl_tls_states, %rcx
movq (%rcx), %rax
movq %rax, -280(%rbp)
leaq -288(%rbp), %rax
movq %rax, (%rcx)
movq 24(%rsi), %r14
Source line: 307
movq %r14, -312(%rbp)
leaq -2137352(%r12), %rbx
movq %rbx, -104(%rbp)
movabsq $jl_box_int64, %rax
movq %r14, %rdi
callq *%rax
movq %rax, -96(%rbp)
leaq -2101408(%r12), %rdi
movabsq $convert, %rax
movl $2, %edx
leaq -104(%rbp), %rsi
callq *%rax
movq %rax, -272(%rbp)
leaq 20617392(%r12), %rcx
movq %rcx, -128(%rbp)
movq %rbx, -120(%rbp)
movq %rax, -112(%rbp)
movabsq $jl_apply_generic, %rax
movl $3, %esi
movq %r15, %rdi
callq *%rax
movq %rax, -264(%rbp)
movq (%rax), %rsi
leaq 599832(%r12), %rdi
movq %rdi, -328(%rbp)
leaq 224(%r13), %rax
callq *%rax
movq %rax, -256(%rbp)
movabsq $"fill!", %rcx
xorpd %xmm0, %xmm0
movq %rax, %rdi
callq *%rcx
movq %rax, %rbx
movq %rbx, -336(%rbp)
movq %rbx, -248(%rbp)
Source line: 83
cmpq $1, %r14
jl L1121
xorl %r15d, %r15d
movq -320(%rbp), %rax
movq 24(%rax), %rax
Source line: 124
movq %rax, -344(%rbp)
cmpq $0, %rbx
je L1204
Source line: 5
movabsq $mapreduce, %r13
nop
L576:
xorl %ecx, %ecx
Source line: 131
addq $1, %r15
cmpq $1, %r15
jl L598
cmpq -312(%rbp), %r15
setle %cl
L598:
xorl %eax, %eax
andb $1, %cl
movb %cl, -290(%rbp)
movb -290(%rbp), %cl
andb $1, %cl
je L636
movb $1, %al
movq -304(%rbp), %rcx
movq %rcx, -240(%rbp)
Source line: 132
L636:
andb $1, %al
movb %al, -289(%rbp)
movb -289(%rbp), %al
andb $1, %al
jne L724
movq -304(%rbp), %rbx
movq %rbx, -232(%rbp)
movabsq $jl_gc_alloc_2w, %rax
callq *%rax
movq %rax, -224(%rbp)
leaq 86433224(%r12), %rcx
movq %rcx, -8(%rax)
movq %r15, (%rax)
movq %r12, 8(%rax)
movq %rbx, %rdi
movq %rax, %rsi
movabsq $throw_boundserror, %rax
callq *%rax
Source line: 215
L724:
movq -304(%rbp), %rax
movq %rax, -120(%rbp)
leaq -8296(%r12), %rax
movq %rax, -128(%rbp)
movq %r15, %rdi
Source line: 307
movabsq $jl_box_int64, %rax
Source line: 215
callq *%rax
movq %rax, -112(%rbp)
movq %r12, -104(%rbp)
leaq 22250520(%r12), %rdi
movl $4, %edx
leaq -128(%rbp), %rsi
movabsq $_unsafe_getindex, %rax
callq *%rax
movq %rax, %rbx
movq %rbx, -216(%rbp)
Source line: 196
movabsq $__pool_alloc, %rax
callq *%rax
leaq -2111064(%r12), %rcx
movq %rcx, -8(%rax)
movq -344(%rbp), %rcx
movq %rcx, (%rax)
movq %rax, -208(%rbp)
movq -328(%rbp), %rdi
movq %rax, %rsi
movabsq $jl_new_array, %rax
callq *%rax
movq %rax, -200(%rbp)
movq -320(%rbp), %rdx
Source line: 88
movq %rdx, -192(%rbp)
movl $78, %esi
movq %rax, %rdi
movq %rbx, %rcx
movabsq $"gemv!", %rax
callq *%rax
movq %rax, -184(%rbp)
movq %rax, %rdi
movabsq $cis, %rax
callq *%rax
movq %rax, %r14
movq %r14, -176(%rbp)
Source line: 5
cmpq $0, %r14
je L1577
movq %r14, -168(%rbp)
leaq -72(%rbp), %rdi
movq %r14, %rsi
callq *%r13
movq %r14, -160(%rbp)
movq %r14, %rdi
movabsq $conj, %rax
callq *%rax
movq %rax, -152(%rbp)
leaq -88(%rbp), %rdi
Source line: 246
movq %rax, %rsi
callq *%r13
Source line: 124
movsd -72(%rbp), %xmm0 # xmm0 = mem[0],zero
movsd -88(%rbp), %xmm1 # xmm1 = mem[0],zero
movsd -64(%rbp), %xmm2 # xmm2 = mem[0],zero
movsd -80(%rbp), %xmm3 # xmm3 = mem[0],zero
movapd %xmm2, %xmm4
mulsd %xmm3, %xmm4
mulsd %xmm0, %xmm3
mulsd %xmm1, %xmm0
subsd %xmm4, %xmm0
mulsd %xmm1, %xmm2
addsd %xmm3, %xmm2
movsd %xmm0, -56(%rbp)
movsd %xmm2, -48(%rbp)
movq -336(%rbp), %rbx
movq %rbx, -144(%rbp)
movq %rbx, %rdi
leaq -56(%rbp), %rsi
movq %r15, %rdx
movabsq $"setindex!", %rax
callq *%rax
Source line: 83
cmpq %r15, -312(%rbp)
jne L576
Source line: 7
L1121:
cmpq $0, %rbx
je L1175
movq %rbx, -136(%rbp)
movq -280(%rbp), %rax
movabsq $jl_tls_states, %rcx
movq %rax, (%rcx)
movq %rbx, %rax
addq $312, %rsp # imm = 0x138
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
L1175:
movabsq $jl_new_array, %rdi
addq $23823232, %rdi # imm = 0x16B8380
movabsq $jl_undefined_var_error, %rax
callq *%rax
L1204:
xorl %eax, %eax
Source line: 131
cmpq $0, -312(%rbp)
setg %cl
andb $1, %cl
movb %cl, -290(%rbp)
movb -290(%rbp), %cl
andb $1, %cl
je L1253
movb $1, %al
movq -304(%rbp), %rcx
movq %rcx, -240(%rbp)
Source line: 132
L1253:
andb $1, %al
movb %al, -289(%rbp)
movb -289(%rbp), %al
andb $1, %al
jne L1345
movq -304(%rbp), %rbx
movq %rbx, -232(%rbp)
movabsq $jl_gc_alloc_2w, %rax
callq *%rax
movq %rax, -224(%rbp)
leaq 86433224(%r12), %rcx
movq %rcx, -8(%rax)
movq $1, (%rax)
movq %r12, 8(%rax)
movabsq $throw_boundserror, %rcx
movq %rbx, %rdi
movq %rax, %rsi
callq *%rcx
Source line: 215
L1345:
movq -304(%rbp), %rax
movq %rax, -120(%rbp)
leaq -8296(%r12), %rax
movq %rax, -128(%rbp)
movl $1, %edi
Source line: 307
movabsq $jl_box_int64, %rax
Source line: 215
callq *%rax
movq %rax, -112(%rbp)
movq %r12, -104(%rbp)
leaq 22250520(%r12), %rdi
movabsq $_unsafe_getindex, %rax
movl $4, %edx
leaq -128(%rbp), %rsi
callq *%rax
movq %rax, %r14
movq %r14, -216(%rbp)
Source line: 196
movabsq $__pool_alloc, %rax
callq *%rax
leaq -2111064(%r12), %rcx
movq %rcx, -8(%rax)
movq -344(%rbp), %rcx
movq %rcx, (%rax)
movq %rax, -208(%rbp)
movq -328(%rbp), %rdi
movq %rax, %rsi
movabsq $jl_new_array, %rax
callq *%rax
movq %rax, -200(%rbp)
movq -320(%rbp), %rdx
Source line: 88
movq %rdx, -192(%rbp)
movabsq $"gemv!", %rbx
movl $78, %esi
movq %rax, %rdi
movq %r14, %rcx
callq *%rbx
movq %rax, -184(%rbp)
movabsq $cis, %rcx
movq %rax, %rdi
callq *%rcx
movq %rax, %r14
movq %r14, -176(%rbp)
Source line: 5
cmpq $0, %r14
jne L1599
L1577:
addq $-2742760, %r12 # imm = 0xFFFFFFFFFFD62618
movabsq $jl_undefined_var_error, %rax
movq %r12, %rdi
callq *%rax
L1599:
movq %r14, -168(%rbp)
movabsq $mapreduce, %rbx
leaq -72(%rbp), %rdi
movq %r14, %rsi
callq *%rbx
movq %r14, -160(%rbp)
movabsq $conj, %rax
movq %r14, %rdi
callq *%rax
movq %rax, -152(%rbp)
leaq -88(%rbp), %rdi
Source line: 246
movq %rax, %rsi
callq *%rbx
Source line: 124
movsd -72(%rbp), %xmm0 # xmm0 = mem[0],zero
movsd -88(%rbp), %xmm1 # xmm1 = mem[0],zero
movsd -64(%rbp), %xmm2 # xmm2 = mem[0],zero
movsd -80(%rbp), %xmm3 # xmm3 = mem[0],zero
movapd %xmm2, %xmm4
mulsd %xmm3, %xmm4
mulsd %xmm0, %xmm3
mulsd %xmm1, %xmm0
subsd %xmm4, %xmm0
mulsd %xmm1, %xmm2
addsd %xmm3, %xmm2
movsd %xmm0, -56(%rbp)
movsd %xmm2, -48(%rbp)
movabsq $jl_new_array, %rdi
addq $23823232, %rdi # imm = 0x16B8380
movabsq $jl_undefined_var_error, %rax
callq *%rax
nopw %cs:(%rax,%rax)
julia> versioninfo()
Julia Version 0.5.0-dev+3390
Commit a9e7e86* (2016-04-04 12:47 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
WORD_SIZE: 64
BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblasp.so.0
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
julia> @code_native s(r,x)
.text
Filename: none
Source line: 0
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $248, %rsp
movq %rsi, -272(%rbp)
movq %rdi, -280(%rbp)
movabsq $140241110008784, %r12 # imm = 0x7F8C6D8B83D0
leaq -128(%rbp), %rbx
leaq -104(%rbp), %r14
movq $0, -232(%rbp)
movq $0, -224(%rbp)
movq $0, -216(%rbp)
movq $0, -208(%rbp)
movq $0, -200(%rbp)
movq $0, -192(%rbp)
movq $0, -184(%rbp)
movq $0, -176(%rbp)
movq $0, -168(%rbp)
movq $0, -160(%rbp)
movq $0, -152(%rbp)
movq $0, -144(%rbp)
movq $0, -136(%rbp)
movq $0, -128(%rbp)
movq $0, -120(%rbp)
movq $0, -112(%rbp)
movq $0, -96(%rbp)
movq $36, -248(%rbp)
movabsq $jl_tls_states, %rcx
movq (%rcx), %rax
movq %rax, -240(%rbp)
leaq -248(%rbp), %rax
movq %rax, (%rcx)
movq 24(%rsi), %r15
Source line: 303
movq %r15, -264(%rbp)
movq %r12, -104(%rbp)
movabsq $jl_box_int64, %rax
movq %r15, %rdi
callq *%rax
movq %rax, -96(%rbp)
leaq 32248(%r12), %rdi
movabsq $140249727909376, %rax # imm = 0x7F8E6F363E00
movl $2, %edx
movq %r14, %rsi
callq *%rax
movq %rax, -232(%rbp)
leaq 31823512(%r12), %rcx
movq %rcx, -128(%rbp)
movq %r12, -120(%rbp)
movq %rax, -112(%rbp)
movabsq $jl_apply_generic, %rax
movl $3, %esi
movq %rbx, %rdi
callq *%rax
movabsq $jl_alloc_array_1d, %rcx
movq %rax, -224(%rbp)
movq (%rax), %rsi
leaq 535328(%r12), %rdi
callq *%rcx
movq %rax, -216(%rbp)
movabsq $"fill!", %rcx
xorpd %xmm0, %xmm0
movq %rax, %rdi
callq *%rcx
movq %rax, %rbx
movq %rbx, -208(%rbp)
Source line: 83
cmpq $1, %r15
jl L916
xorl %r15d, %r15d
Source line: 5
movabsq $mapreduce, %r13
nopw %cs:(%rax,%rax)
L480:
xorl %ecx, %ecx
Source line: 131
addq $1, %r15
cmpq $1, %r15
jl L502
cmpq -264(%rbp), %r15
setle %cl
L502:
xorl %eax, %eax
andb $1, %cl
movb %cl, -250(%rbp)
movb -250(%rbp), %cl
andb $1, %cl
je L526
movb $1, %al
Source line: 132
L526:
andb $1, %al
movb %al, -249(%rbp)
movb -249(%rbp), %al
andb $1, %al
jne L612
movabsq $jl_gc_alloc_2w, %rax
callq *%rax
movq %rax, -200(%rbp)
leaq 6511328(%r12), %rcx
movq %rcx, -8(%rax)
movq %r15, (%rax)
leaq 34784(%r12), %rcx
movq %rcx, 8(%rax)
movq -272(%rbp), %rdi
movq %rax, %rsi
movabsq $throw_boundserror, %rax
callq *%rax
Source line: 215
L612:
leaq 4163848(%r12), %rax
movq %rax, -128(%rbp)
movq -272(%rbp), %rax
movq %rax, -120(%rbp)
movq %r15, %rdi
Source line: 303
movabsq $jl_box_int64, %rax
Source line: 215
callq *%rax
movq %rax, -112(%rbp)
leaq 34784(%r12), %rax
movq %rax, -104(%rbp)
leaq 24790232(%r12), %rdi
movl $4, %edx
leaq -128(%rbp), %rsi
movabsq $_unsafe_getindex, %rax
callq *%rax
movq %rax, -192(%rbp)
movq -280(%rbp), %rdi
movq %rax, %rsi
movabsq $"*", %rax
callq *%rax
movq %rax, -184(%rbp)
movq %rax, %rdi
movabsq $cis, %rax
callq *%rax
movq %rax, %r14
movq %r14, -176(%rbp)
Source line: 5
movq %r14, -168(%rbp)
leaq -72(%rbp), %rdi
movq %r14, %rsi
callq *%r13
movq %r14, -160(%rbp)
movq %r14, %rdi
movabsq $conj, %rax
callq *%rax
movq %rax, -152(%rbp)
leaq -88(%rbp), %rdi
Source line: 246
movq %rax, %rsi
callq *%r13
Source line: 124
movsd -72(%rbp), %xmm0 # xmm0 = mem[0],zero
movsd -88(%rbp), %xmm1 # xmm1 = mem[0],zero
movapd %xmm0, %xmm2
mulsd %xmm1, %xmm2
movsd -64(%rbp), %xmm3 # xmm3 = mem[0],zero
movsd -80(%rbp), %xmm4 # xmm4 = mem[0],zero
movapd %xmm3, %xmm5
mulsd %xmm4, %xmm5
subsd %xmm5, %xmm2
mulsd %xmm4, %xmm0
mulsd %xmm1, %xmm3
addsd %xmm0, %xmm3
movsd %xmm2, -56(%rbp)
movsd %xmm3, -48(%rbp)
movq %rbx, -144(%rbp)
movq %rbx, %rdi
leaq -56(%rbp), %rsi
movq %r15, %rdx
movabsq $"setindex!", %rax
callq *%rax
Source line: 83
cmpq %r15, -264(%rbp)
jne L480
Source line: 7
L916:
movq %rbx, -136(%rbp)
movq -240(%rbp), %rax
movabsq $jl_tls_states, %rcx
movq %rax, (%rcx)
movq %rbx, %rax
addq $248, %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
nopw %cs:(%rax,%rax)
julia> versioninfo()
Julia Version 0.5.0-dev+3313
Commit 5e01b1a (2016-03-29 15:14 UTC)
Platform Info:
System: Linux (x86_64-unknown-linux-gnu)
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
julia> @code_native s(r,x)
.text
Filename: none
Source line: 0
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $248, %rsp
movq %rsi, -272(%rbp)
movq %rdi, -280(%rbp)
movabsq $140465848796112, %r12 # imm = 0x7FC0C10543D0
leaq -128(%rbp), %rbx
leaq -104(%rbp), %r14
movq $0, -232(%rbp)
movq $0, -224(%rbp)
movq $0, -216(%rbp)
movq $0, -208(%rbp)
movq $0, -200(%rbp)
movq $0, -192(%rbp)
movq $0, -184(%rbp)
movq $0, -176(%rbp)
movq $0, -168(%rbp)
movq $0, -160(%rbp)
movq $0, -152(%rbp)
movq $0, -144(%rbp)
movq $0, -136(%rbp)
movq $0, -128(%rbp)
movq $0, -120(%rbp)
movq $0, -112(%rbp)
movq $0, -96(%rbp)
movq $36, -248(%rbp)
movabsq $jl_tls_states, %rcx
movq (%rcx), %rax
movq %rax, -240(%rbp)
leaq -248(%rbp), %rax
movq %rax, (%rcx)
movq 24(%rsi), %r15
Source line: 303
movq %r15, -264(%rbp)
movq %r12, -104(%rbp)
movabsq $jl_box_int64, %rax
movq %r15, %rdi
callq *%rax
movq %rax, -96(%rbp)
leaq 32016(%r12), %rdi
movabsq $140474466699184, %rax # imm = 0x7FC2C2B007B0
movl $2, %edx
movq %r14, %rsi
callq *%rax
movq %rax, -232(%rbp)
leaq 33698936(%r12), %rcx
movq %rcx, -128(%rbp)
movq %r12, -120(%rbp)
movq %rax, -112(%rbp)
movabsq $jl_apply_generic, %rax
movl $3, %esi
movq %rbx, %rdi
callq *%rax
movabsq $jl_alloc_array_1d, %rcx
movq %rax, -224(%rbp)
movq (%rax), %rsi
leaq 885120(%r12), %rdi
callq *%rcx
movq %rax, -216(%rbp)
movabsq $"fill!", %rcx
xorpd %xmm0, %xmm0
movq %rax, %rdi
callq *%rcx
movq %rax, %rbx
movq %rbx, -208(%rbp)
Source line: 83
cmpq $1, %r15
jl L916
xorl %r15d, %r15d
Source line: 5
movabsq $mapreduce, %r13
nopw %cs:(%rax,%rax)
L480:
xorl %ecx, %ecx
Source line: 131
addq $1, %r15
cmpq $1, %r15
jl L502
cmpq -264(%rbp), %r15
setle %cl
L502:
xorl %eax, %eax
andb $1, %cl
movb %cl, -250(%rbp)
movb -250(%rbp), %cl
andb $1, %cl
je L526
movb $1, %al
Source line: 132
L526:
andb $1, %al
movb %al, -249(%rbp)
movb -249(%rbp), %al
andb $1, %al
jne L612
movabsq $jl_gc_alloc_2w, %rax
callq *%rax
movq %rax, -200(%rbp)
leaq 3568704(%r12), %rcx
movq %rcx, -8(%rax)
movq %r15, (%rax)
leaq 41808(%r12), %rcx
movq %rcx, 8(%rax)
movq -272(%rbp), %rdi
movq %rax, %rsi
movabsq $throw_boundserror, %rax
callq *%rax
Source line: 215
L612:
leaq 1480216(%r12), %rax
movq %rax, -128(%rbp)
movq -272(%rbp), %rax
movq %rax, -120(%rbp)
movq %r15, %rdi
Source line: 303
movabsq $jl_box_int64, %rax
Source line: 215
callq *%rax
movq %rax, -112(%rbp)
leaq 41808(%r12), %rax
movq %rax, -104(%rbp)
leaq 22558176(%r12), %rdi
movl $4, %edx
leaq -128(%rbp), %rsi
movabsq $_unsafe_getindex, %rax
callq *%rax
movq %rax, -192(%rbp)
movq -280(%rbp), %rdi
movq %rax, %rsi
movabsq $"*", %rax
callq *%rax
movq %rax, -184(%rbp)
movq %rax, %rdi
movabsq $cis, %rax
callq *%rax
movq %rax, %r14
movq %r14, -176(%rbp)
Source line: 5
movq %r14, -168(%rbp)
leaq -72(%rbp), %rdi
movq %r14, %rsi
callq *%r13
movq %r14, -160(%rbp)
movq %r14, %rdi
movabsq $conj, %rax
callq *%rax
movq %rax, -152(%rbp)
leaq -88(%rbp), %rdi
Source line: 246
movq %rax, %rsi
callq *%r13
Source line: 124
movsd -72(%rbp), %xmm0 # xmm0 = mem[0],zero
movsd -88(%rbp), %xmm1 # xmm1 = mem[0],zero
movapd %xmm0, %xmm2
mulsd %xmm1, %xmm2
movsd -64(%rbp), %xmm3 # xmm3 = mem[0],zero
movsd -80(%rbp), %xmm4 # xmm4 = mem[0],zero
movapd %xmm3, %xmm5
mulsd %xmm4, %xmm5
subsd %xmm5, %xmm2
mulsd %xmm4, %xmm0
mulsd %xmm1, %xmm3
addsd %xmm0, %xmm3
movsd %xmm2, -56(%rbp)
movsd %xmm3, -48(%rbp)
movq %rbx, -144(%rbp)
movq %rbx, %rdi
leaq -56(%rbp), %rsi
movq %r15, %rdx
movabsq $"setindex!", %rax
callq *%rax
Source line: 83
cmpq %r15, -264(%rbp)
jne L480
Source line: 7
L916:
movq %rbx, -136(%rbp)
movq -240(%rbp), %rax
movabsq $jl_tls_states, %rcx
movq %rax, (%rcx)
movq %rbx, %rax
addq $248, %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
nopw %cs:(%rax,%rax)
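
Since the listings above differ in absolute addresses and address-derived offsets even where the instructions match, it can help to mask those before comparing two dumps. A minimal sketch, assuming a current Julia version and that each dump has been read into a string (for example with `read(path, String)`). `normalize_line` and `differing_lines` are hypothetical helper names, and the masking rules (immediates of six or more digits, offsets relative to %r12) are a guess based on the listings above.

```julia
# Mask the parts of one @code_native line that change from session to session.
function normalize_line(line::AbstractString)
    line = replace(line, r"#.*$" => "")                    # trailing "# imm = 0x..." comments
    line = replace(line, r"\$\d{6,}" => "\$IMM")           # large immediates (absolute addresses)
    line = replace(line, r"-?\d+\(%r12\)" => "OFF(%r12)")  # offsets from the base address in %r12
    return strip(line)
end

# Return the indices of lines that still differ after masking, plus a flag
# telling whether both listings have the same number of lines.
function differing_lines(a::AbstractString, b::AbstractString)
    la = [normalize_line(l) for l in split(a, '\n')]
    lb = [normalize_line(l) for l in split(b, '\n')]
    n = min(length(la), length(lb))
    diffs = [i for i in 1:n if la[i] != lb[i]]
    return diffs, length(la) == length(lb)
end

# Two lines taken from the first and third listings above.
a = "movabsq \$139623632765904, %r12 # imm = 0x7EFCA90883D0\nleaq 31904(%r12), %rdi"
b = "movabsq \$140241110008784, %r12 # imm = 0x7F8C6D8B83D0\nleaq 32248(%r12), %rdi"
diffs, same_length = differing_lines(a, b)
println(isempty(diffs) && same_length ? "identical after masking" : "differs at lines $diffs")
```

Small immediates such as stack-frame sizes are deliberately left alone, so a change like `subq $248` vs. `subq $312` still shows up as a real difference.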