hey Milan,
so consider the following code:

Pkg.clone("git://github.com/kbarbary/TimeIt.jl.git")
using TimeIt

v = rand(3)
r = rand(6000,3)
x = linspace(0.0, 10.0, 500) * (v./sqrt(sumabs2(v)))'

dotprods = r * x[2,:]
imexp    = cis(dotprods)
sumprod  = sum(imexp) * sum(conj(imexp))

f(r, x) = r * x[2,:]    
g(r, x) = r * x'
h(imexp)    = sum(imexp) * sum(conj(imexp))

function s(r, x)
        result = zeros(size(x,1))
        for i = 1:size(x,1)
                imexp    = cis(r * x[i,:])
                result[i]= sum(imexp) * sum(conj(imexp))
        end
        return result
end

@timeit zeros(size(x,1))
@timeit f(r,x)
@timeit g(r,x)
@timeit cis(dotprods)
@timeit h(imexp)
@timeit s(r,x)

@code_native f(r,x)
@code_native g(r,x)
@code_native cis(dotprods)
@code_native h(imexp)
@code_native s(r,x)

and I attached the output of the last @code_native s(r,x) as text files for
the binary tarball, as well as for the latest nalimilan update. For the whole
function s, the exported code actually looks the same everywhere.
But s(r,x) is the one that is considerably slower on the i7 than on the i5,
whereas all the other timed calls run at more or less the same speed on the
i5 and the i7. Here are the timings in the same order as above (all run
repeatedly, so that the last one does not include compile time):

i7:
1000000 loops, best of 3: 871.68 ns per loop
10000 loops, best of 3: 10.84 µs per loop
100 loops, best of 3: 5.19 ms per loop
10000 loops, best of 3: 71.35 µs per loop
10000 loops, best of 3: 26.65 µs per loop
1 loops, best of 3: 159.99 ms per loop

i5:
100000 loops, best of 3: 1.01 µs per loop
10000 loops, best of 3: 10.93 µs per loop
100 loops, best of 3: 5.09 ms per loop
10000 loops, best of 3: 75.93 µs per loop
10000 loops, best of 3: 29.23 µs per loop
1 loops, best of 3: 103.70 ms per loop

So based on the calls made inside s(r,x), the i7 should be faster, but
s(r,x) as a whole is slower. Still clueless... And I don't know how to pin
this down further...
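
One way to pin this down further might be to time the individual stages
inside the loop and add them up over all iterations, instead of timing each
call once outside the loop. The following is only a sketch under
assumptions: the name s_staged and the four stage labels are made up here,
it uses time_ns() instead of TimeIt, and it writes cis.(...) and
sum(abs2, v) in place of the vectorized cis and sumabs2 used above.

```julia
# Hypothetical instrumented variant of s(r, x): same arithmetic, but the
# time spent in each stage is accumulated over all iterations.
function s_staged(r, x)
    n = size(x, 1)
    result = zeros(n)
    t = zeros(4)                  # seconds in: row copy, r * row, cis, reduction
    for i = 1:n
        t0 = time_ns()
        row = x[i, :]             # copy of row i as a vector
        t1 = time_ns()
        dotprods = r * row        # matrix-vector product
        t2 = time_ns()
        imexp = cis.(dotprods)    # elementwise exp(im * dotprods)
        t3 = time_ns()
        # sum(conj(z)) is the conjugate of sum(z), so the imaginary part of
        # the product is zero and only the real part is stored
        result[i] = real(sum(imexp) * sum(conj(imexp)))
        t4 = time_ns()
        t[1] += (t1 - t0) / 1e9
        t[2] += (t2 - t1) / 1e9
        t[3] += (t3 - t2) / 1e9
        t[4] += (t4 - t3) / 1e9
    end
    return result, t
end

v = rand(3)
r = rand(6000, 3)
# 500 points from 0 to 10 along the unit vector v / |v|
x = (collect(0:499) .* (10.0 / 499)) * (v ./ sqrt(sum(abs2, v)))'

s_staged(r, x)                    # first call only to compile
result, t = s_staged(r, x)
for (name, secs) in zip(["x[i,:]", "r * row", "cis", "sum * sum(conj)"], t)
    println(name, ": ", secs * 1e3, " ms")
end
```

If one of the four totals differs clearly between the i5 and the i7, that
stage would be the place to look. If they all match while the total for
s(r,x) still differs, the difference is presumably somewhere between the
calls (allocation or GC), but that is a guess.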

cheers, Johannes
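
PS: to compare the attached dumps without being distracted by addresses,
something along these lines might help. It is only a sketch: normalize_asm
and asm_diff are names invented here, the masking rule (any run of five or
more digits counts as an address or offset) is an assumption, and it is
written in current Julia syntax (raw strings, replace with pat => repl),
not 0.5 syntax.

```julia
# Hypothetical helper: strip everything from a @code_native dump that is
# expected to change from one session or build to the next.
function normalize_asm(text)
    lines = String[]
    for line in split(text, '\n')
        line = replace(line, r"#.*$" => "")     # drop "# imm = 0x..." comments
        line = replace(line, r"\d{5,}" => "N")  # mask addresses and large offsets
        line = strip(line)
        isempty(line) || push!(lines, line)
    end
    return lines
end

# Returns the (index, line_a, line_b) triples that differ after masking,
# plus the difference in the number of lines.
function asm_diff(a, b)
    la, lb = normalize_asm(a), normalize_asm(b)
    n = min(length(la), length(lb))
    diffs = [(i, la[i], lb[i]) for i in 1:n if la[i] != lb[i]]
    return diffs, length(la) - length(lb)
end

# Three lines in the style of the attached dumps (raw strings, because of "$")
build_a = raw"""
        movabsq $139623632765904, %r12  # imm = 0x7EFCA90883D0
        leaq    31904(%r12), %rdi
        movabsq $jl_alloc_array_1d, %rcx
"""
build_b = raw"""
        movabsq $140241110008784, %r12  # imm = 0x7F8C6D8B83D0
        leaq    32248(%r12), %rdi
        movabsq $jl_new_array, %r13
"""

diffs, extra = asm_diff(build_a, build_b)
for (i, a, b) in diffs
    println("line ", i, ":\n  ", a, "\n  ", b)
end
```

In this small example only the third line is reported, since the first two
differ only in the masked numbers.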




On Monday, April 4, 2016 at 10:48:40 PM UTC+2, Milan Bouchet-Valat wrote:
>
> On Monday, April 4, 2016 at 10:36 -0700, Johannes Wagner wrote:
> > hey guys, 
> > so attached you find text files with @code_native output for the 
> > instructions  
> > - r * x[1,:] 
> > - cis(imexp) 
> > - sum(imexp) * sum(conj(imexp)) 
> > 
> > for julia 0.5.  
> > 
> > Hardware I run on is a Haswell i5 machine, a Haswell i7 machine, and
> > an IvyBridge i5 machine. It turned out that on the Haswell i5 machine
> > the code also runs fast. Only the Haswell i7 machine is the slow one.
> > This really drove me nuts. First I thought it was the OS, then the
> > architecture, and now it's just from i5 to i7... Anyway, I don't
> > know anything about x86 assembly, but the julia 0.4.5 code is the same
> > on all machines. However, for the dot product, the 0.5 code already
> > has 2 different instructions on the i5 vs. the i7 (lines 44 & 47).
> > The same goes for the cis call (line 149...). And the IvyBridge i5
> > code is similar to the Haswell i5. I also included versioninfo() at
> > the top of the file. So you could just look at a vimdiff of the
> > julia 0.5 files... Can anyone make sense of this?
> I'm definitely not an expert in assembly, but that additional leaq 
> instruction on line 44, and the additional movq instructions on line 
> 111, 151 and 152 really look weird 
>
> Could you do the same test with the binary tarballs? If the difference 
> persists, you should open an issue on GitHub to track this. 
>
> BTW, please wrap the first call in a function to ensure it is
> specialized for the argument types, i.e.:
>
> f(r, x) = r * x[1,:] 
> @code_native f(r, x) 
>
> Also, please check whether you still see the difference with this code: 
> g(r, x) = r * x 
> @code_native g(r, x[1,:]) 
>
> What are the types of r and x? Could you provide a simple reproducible 
> example with dummy values? 
>
> > I will still test the binary tarballs. If I remove the cis() call,
> > the difference is hard to tell: the loop is ~10 times faster and takes
> > more or less 5 ms on all machines. For the whole loop with the cis()
> > call, the time goes from ~50 ms on the i5 to ~90 ms on the i7.
> > 
> > Shall I also post the julia 0.4 code? 
> If it's identical for all machines, I don't think it's needed. 
>
>
> Regards 
>
>
> > cheers, Johannes 
> > 
> > 
> > 
> > > On Wednesday, March 30, 2016 at 15:16 -0700, Johannes Wagner wrote:
> > > >  
> > > >  
> > > > > On Wednesday, March 30, 2016 at 04:43 -0700, Johannes Wagner wrote:
> > > > > > Sorry for not having expressed myself clearly, I meant that the
> > > > > > latest version of fedora works fine (24 development). I always used
> > > > > > the latest julia nightly available on the copr nalimilan repo. Right
> > > > > > now that is: 0.5.0-dev+3292, Commit 9d527c5*, all use
> > > > > > LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
> > > > > >
> > > > > > peakflops on all machines (hardware identical) is ~1.2..1.5e11.
> > > > > >
> > > > > > Fedora 22&23 with julia 0.5 is ~50% slower than 0.4, only on fedora
> > > > > > 24 julia 0.5 is faster compared to julia 0.4.
> > > > > Could you try to find a simple code to reproduce the problem? In
> > > > > particular, it would be useful to check whether this comes from
> > > > > OpenBLAS differences or whether it also happens with pure Julia code
> > > > > (typical operations which depend on BLAS are matrix multiplication, as
> > > > > well as most of linear algebra). Normally, 0.4 and 0.5 should use the
> > > > > same BLAS, but who knows...
> > > > well that's what I did, and the 3 simple calls inside the loop are
> > > > more or less the same speed. Only the whole loop seems slower. See my
> > > > code sample from my answer of March 8th (the code gets faster in the
> > > > same proportions when exp(im .* dotprods) is replaced by cis(dotprods)).
> > > > So I don't know what I can do then...
> > > Sorry, somehow I had missed that message. This indeed looks like a code
> > > generation issue in Julia/LLVM.
> > > 
> > > > > Can you also confirm that all versioninfo() fields are the same for all
> > > > > three machines, both for 0.4 and 0.5? We must envision the possibility
> > > > > that the differences actually come from 0.4.
> > > > ohoh, right! just noticed that my fedora 24 machine was an ivy bridge
> > > > which works fast:
> > > >  
> > > > Julia Version 0.5.0-dev+3292  
> > > > Commit 9d527c5* (2016-03-28 06:55 UTC)  
> > > > Platform Info:  
> > > >   System: Linux (x86_64-redhat-linux)  
> > > >   CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz  
> > > >   WORD_SIZE: 64  
> > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)  
> > > >   LAPACK: libopenblasp.so.0  
> > > >   LIBM: libopenlibm  
> > > >   LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)  
> > > >  
> > > > and the other ones with fed22/23 are haswell, which work slow:  
> > > >  
> > > > Julia Version 0.5.0-dev+3292  
> > > > Commit 9d527c5* (2016-03-28 06:55 UTC)  
> > > > Platform Info:  
> > > >   System: Linux (x86_64-redhat-linux)  
> > > >   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz  
> > > >   WORD_SIZE: 64  
> > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)  
> > > >   LAPACK: libopenblasp.so.0  
> > > >   LIBM: libopenlibm  
> > > >   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)  
> > > >  
> > > > I just booted a fedora 23 on the ivy bridge machine and it's also fast.
> > > >
> > > > Now if I use julia 0.4.5 on both architectures:
> > > >  
> > > > Julia Version 0.4.5  
> > > > Commit 2ac304d* (2016-03-18 00:58 UTC)  
> > > > Platform Info:  
> > > >   System: Linux (x86_64-redhat-linux)  
> > > >   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz  
> > > >   WORD_SIZE: 64  
> > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)  
> > > >   LAPACK: libopenblasp.so.0  
> > > >   LIBM: libopenlibm  
> > > >   LLVM: libLLVM-3.3  
> > > >  
> > > > and:  
> > > >  
> > > > Julia Version 0.4.5  
> > > > Commit 2ac304d* (2016-03-18 00:58 UTC)  
> > > > Platform Info:  
> > > >   System: Linux (x86_64-redhat-linux)  
> > > >   CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz  
> > > >   WORD_SIZE: 64  
> > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)  
> > > >   LAPACK: libopenblasp.so.0  
> > > >   LIBM: libopenlibm  
> > > >   LLVM: libLLVM-3.3  
> > > >  
> > > > there is no speed difference apart from the ~10% or so from the  
> > > > faster haswell machine. So could perhaps be haswell hardware target  
> > > > specific with the change from llvm 3.3 to 3.7.1? Is there anything  
> > > > else I could provide?  
> > > This is certainly an interesting finding. Could you paste somewhere the
> > > output of @code_native for your function on Sandybridge vs. Haswell,
> > > for both 0.4 and 0.5?
> > >
> > > It would also be useful to check whether the same difference appears if
> > > you use the generic binary tarballs from http://julialang.org/downloads.
> > >
> > > Finally, do you get the same result if you remove the call to exp()
> > > from the loop? (This is the only external function, so it shouldn't be
> > > affected by changes in Julia.)
> > > 
> > > 
> > > Regards  
> > > 
> > > 
> > > > Best, Johannes  
> > > >  
> > > > >  Regards   
> > > >  
> > > >  
> > > > > > > On Wednesday, March 16, 2016 at 09:25 -0700, Johannes Wagner wrote:
> > > > > > > > just a little update. Tested some other fedoras: Fedora 22 with llvm
> > > > > > > > 3.8 is also slow with julia 0.5, whereas a fedora 24 branch with llvm
> > > > > > > > 3.7 is faster on julia 0.5 compared to julia 0.4, as it should be
> > > > > > > > (speedup from inner loop parts translated into speedup to whole
> > > > > > > > function).
> > > > > > > >
> > > > > > > > don't know if anyone cares about that... At least the latest version
> > > > > > > > seems to work fine, hope it stays like this into the final fedora 24
> > > > > > > What's the "latest version"? git built from source or RPM nightlies?
> > > > > > > With which LLVM version for each?
> > > > > > >
> > > > > > > If from the RPMs, I've switched them to LLVM 3.8 for a few days, and
> > > > > > > went back to 3.7 because of a build failure. So that might explain the
> > > > > > > difference. You can install the last version which built with LLVM 3.8
> > > > > > > manually from here:
> > > > > > > https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-23-x86_64/00167549-julia/
> > > > > > >
> > > > > > > It would be interesting to compare it with the latest nightly with 3.7.
> > > > > >   
> > > > > >   
> > > > > > Regards    
> > > > > >   
> > > > > >   
> > > > > >   
> > > > > > > > > hey guys,
> > > > > > > > > I just experienced something weird. I have some code that runs fine
> > > > > > > > > on 0.4.3, then I updated to 0.5dev to test the new Arrays, ran the same
> > > > > > > > > code and noticed it got about ~50% slower. Then I downgraded back
> > > > > > > > > to 0.4.3, ran the old code, but speed remained slow. I noticed while
> > > > > > > > > reinstalling 0.4.3, openblas-threads didn't get installed along with
> > > > > > > > > it. So I manually installed it, but no change.
> > > > > > > > > Does anyone have an idea what could be going on? LLVM on fedora23 is
> > > > > > > > > 3.7
> > > > > > > >    
> > > > > > > > Cheers, Johannes    
> > > > > > > >    
>
julia> versioninfo()
Julia Version 0.5.0-dev+3404
Commit db2c6ab* (2016-04-05 07:45 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)



julia> @code_native s(r,x)
        .text
Filename: none
Source line: 0
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        subq    $248, %rsp
        movq    %rsi, -272(%rbp)
        movq    %rdi, -280(%rbp)
        movabsq $139623632765904, %r12  # imm = 0x7EFCA90883D0
        leaq    -128(%rbp), %rbx
        leaq    -104(%rbp), %r14
        movq    $0, -232(%rbp)
        movq    $0, -224(%rbp)
        movq    $0, -216(%rbp)
        movq    $0, -208(%rbp)
        movq    $0, -200(%rbp)
        movq    $0, -192(%rbp)
        movq    $0, -184(%rbp)
        movq    $0, -176(%rbp)
        movq    $0, -168(%rbp)
        movq    $0, -160(%rbp)
        movq    $0, -152(%rbp)
        movq    $0, -144(%rbp)
        movq    $0, -136(%rbp)
        movq    $0, -128(%rbp)
        movq    $0, -120(%rbp)
        movq    $0, -112(%rbp)
        movq    $0, -96(%rbp)
        movq    $36, -248(%rbp)
        movabsq $jl_tls_states, %rcx
        movq    (%rcx), %rax
        movq    %rax, -240(%rbp)
        leaq    -248(%rbp), %rax
        movq    %rax, (%rcx)
        movq    24(%rsi), %r15
Source line: 303
        movq    %r15, -264(%rbp)
        movq    %r12, -104(%rbp)
        movabsq $jl_box_int64, %rax
        movq    %r15, %rdi
        callq   *%rax
        movq    %rax, -96(%rbp)
        leaq    31904(%r12), %rdi
        movabsq $139632250667936, %rax  # imm = 0x7EFEAAB343A0
        movl    $2, %edx
        movq    %r14, %rsi
        callq   *%rax
        movq    %rax, -232(%rbp)
        leaq    29142760(%r12), %rcx
        movq    %rcx, -128(%rbp)
        movq    %r12, -120(%rbp)
        movq    %rax, -112(%rbp)
        movabsq $jl_apply_generic, %rax
        movl    $3, %esi
        movq    %rbx, %rdi
        callq   *%rax
        movabsq $jl_alloc_array_1d, %rcx
        movq    %rax, -224(%rbp)
        movq    (%rax), %rsi
        leaq    1563008(%r12), %rdi
        callq   *%rcx
        movq    %rax, -216(%rbp)
        movabsq $"fill!", %rcx
        xorpd   %xmm0, %xmm0
        movq    %rax, %rdi
        callq   *%rcx
        movq    %rax, %rbx
        movq    %rbx, -208(%rbp)
Source line: 83
        cmpq    $1, %r15
        jl      L916
        xorl    %r15d, %r15d
Source line: 5
        movabsq $mapreduce, %r13
        nopw    %cs:(%rax,%rax)
L480:
        xorl    %ecx, %ecx
Source line: 131
        addq    $1, %r15
        cmpq    $1, %r15
        jl      L502
        cmpq    -264(%rbp), %r15
        setle   %cl
L502:
        xorl    %eax, %eax
        andb    $1, %cl
        movb    %cl, -250(%rbp)
        movb    -250(%rbp), %cl
        andb    $1, %cl
        je      L526
        movb    $1, %al
Source line: 132
L526:
        andb    $1, %al
        movb    %al, -249(%rbp)
        movb    -249(%rbp), %al
        andb    $1, %al
        jne     L612
        movabsq $jl_gc_alloc_2w, %rax
        callq   *%rax
        movq    %rax, -200(%rbp)
        leaq    3601136(%r12), %rcx
        movq    %rcx, -8(%rax)
        movq    %r15, (%rax)
        leaq    1361216(%r12), %rcx
        movq    %rcx, 8(%rax)
        movq    -272(%rbp), %rdi
        movq    %rax, %rsi
        movabsq $throw_boundserror, %rax
        callq   *%rax
Source line: 215
L612:
        leaq    5600400(%r12), %rax
        movq    %rax, -128(%rbp)
        movq    -272(%rbp), %rax
        movq    %rax, -120(%rbp)
        movq    %r15, %rdi
Source line: 303
        movabsq $jl_box_int64, %rax
Source line: 215
        callq   *%rax
        movq    %rax, -112(%rbp)
        leaq    1361216(%r12), %rax
        movq    %rax, -104(%rbp)
        leaq    40224456(%r12), %rdi
        movl    $4, %edx
        leaq    -128(%rbp), %rsi
        movabsq $_unsafe_getindex, %rax
        callq   *%rax
        movq    %rax, -192(%rbp)
        movq    -280(%rbp), %rdi
        movq    %rax, %rsi
        movabsq $"*", %rax
        callq   *%rax
        movq    %rax, -184(%rbp)
        movq    %rax, %rdi
        movabsq $cis, %rax
        callq   *%rax
        movq    %rax, %r14
        movq    %r14, -176(%rbp)
Source line: 5
        movq    %r14, -168(%rbp)
        leaq    -72(%rbp), %rdi
        movq    %r14, %rsi
        callq   *%r13
        movq    %r14, -160(%rbp)
        movq    %r14, %rdi
        movabsq $conj, %rax
        callq   *%rax
        movq    %rax, -152(%rbp)
        leaq    -88(%rbp), %rdi
Source line: 246
        movq    %rax, %rsi
        callq   *%r13
Source line: 124
        movsd   -72(%rbp), %xmm0        # xmm0 = mem[0],zero
        movsd   -88(%rbp), %xmm1        # xmm1 = mem[0],zero
        movapd  %xmm0, %xmm2
        mulsd   %xmm1, %xmm2
        movsd   -64(%rbp), %xmm3        # xmm3 = mem[0],zero
        movsd   -80(%rbp), %xmm4        # xmm4 = mem[0],zero
        movapd  %xmm3, %xmm5
        mulsd   %xmm4, %xmm5
        subsd   %xmm5, %xmm2
        mulsd   %xmm4, %xmm0
        mulsd   %xmm1, %xmm3
        addsd   %xmm0, %xmm3
        movsd   %xmm2, -56(%rbp)
        movsd   %xmm3, -48(%rbp)
        movq    %rbx, -144(%rbp)
        movq    %rbx, %rdi
        leaq    -56(%rbp), %rsi
        movq    %r15, %rdx
        movabsq $"setindex!", %rax
        callq   *%rax
Source line: 83
        cmpq    %r15, -264(%rbp)
        jne     L480
Source line: 7
L916:
        movq    %rbx, -136(%rbp)
        movq    -240(%rbp), %rax
        movabsq $jl_tls_states, %rcx
        movq    %rax, (%rcx)
        movq    %rbx, %rax
        addq    $248, %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)



julia> versioninfo()
Julia Version 0.5.0-dev+3313
Commit 5e01b1a (2016-03-29 15:14 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)



julia> @code_native s(r,x)
        .text
Filename: none
Source line: 0
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        subq    $312, %rsp              # imm = 0x138
        movq    %rsi, -304(%rbp)
        movq    %rdi, -320(%rbp)
        movabsq $jl_new_array, %r13
        movabsq $140660019585240, %r12  # imm = 0x7FEDF68060D8
        leaq    -128(%rbp), %r15
        movq    $0, -272(%rbp)
        movq    $0, -264(%rbp)
        movq    $0, -256(%rbp)
        movq    $0, -248(%rbp)
        movq    $0, -240(%rbp)
        movq    $0, -232(%rbp)
        movq    $0, -224(%rbp)
        movq    $0, -216(%rbp)
        movq    $0, -208(%rbp)
        movq    $0, -200(%rbp)
        movq    $0, -192(%rbp)
        movq    $0, -184(%rbp)
        movq    $0, -176(%rbp)
        movq    $0, -168(%rbp)
        movq    $0, -160(%rbp)
        movq    $0, -152(%rbp)
        movq    $0, -144(%rbp)
        movq    $0, -136(%rbp)
        movq    $0, -128(%rbp)
        movq    $0, -120(%rbp)
        movq    $0, -112(%rbp)
        movq    $0, -96(%rbp)
        movq    $46, -288(%rbp)
        movabsq $jl_tls_states, %rcx
        movq    (%rcx), %rax
        movq    %rax, -280(%rbp)
        leaq    -288(%rbp), %rax
        movq    %rax, (%rcx)
        movq    24(%rsi), %r14
Source line: 307
        movq    %r14, -312(%rbp)
        leaq    -2137352(%r12), %rbx
        movq    %rbx, -104(%rbp)
        movabsq $jl_box_int64, %rax
        movq    %r14, %rdi
        callq   *%rax
        movq    %rax, -96(%rbp)
        leaq    -2101408(%r12), %rdi
        movabsq $convert, %rax
        movl    $2, %edx
        leaq    -104(%rbp), %rsi
        callq   *%rax
        movq    %rax, -272(%rbp)
        leaq    20617392(%r12), %rcx
        movq    %rcx, -128(%rbp)
        movq    %rbx, -120(%rbp)
        movq    %rax, -112(%rbp)
        movabsq $jl_apply_generic, %rax
        movl    $3, %esi
        movq    %r15, %rdi
        callq   *%rax
        movq    %rax, -264(%rbp)
        movq    (%rax), %rsi
        leaq    599832(%r12), %rdi
        movq    %rdi, -328(%rbp)
        leaq    224(%r13), %rax
        callq   *%rax
        movq    %rax, -256(%rbp)
        movabsq $"fill!", %rcx
        xorpd   %xmm0, %xmm0
        movq    %rax, %rdi
        callq   *%rcx
        movq    %rax, %rbx
        movq    %rbx, -336(%rbp)
        movq    %rbx, -248(%rbp)
Source line: 83
        cmpq    $1, %r14
        jl      L1121
        xorl    %r15d, %r15d
        movq    -320(%rbp), %rax
        movq    24(%rax), %rax
Source line: 124
        movq    %rax, -344(%rbp)
        cmpq    $0, %rbx
        je      L1204
Source line: 5
        movabsq $mapreduce, %r13
        nop
L576:
        xorl    %ecx, %ecx
Source line: 131
        addq    $1, %r15
        cmpq    $1, %r15
        jl      L598
        cmpq    -312(%rbp), %r15
        setle   %cl
L598:
        xorl    %eax, %eax
        andb    $1, %cl
        movb    %cl, -290(%rbp)
        movb    -290(%rbp), %cl
        andb    $1, %cl
        je      L636
        movb    $1, %al
        movq    -304(%rbp), %rcx
        movq    %rcx, -240(%rbp)
Source line: 132
L636:
        andb    $1, %al
        movb    %al, -289(%rbp)
        movb    -289(%rbp), %al
        andb    $1, %al
        jne     L724
        movq    -304(%rbp), %rbx
        movq    %rbx, -232(%rbp)
        movabsq $jl_gc_alloc_2w, %rax
        callq   *%rax
        movq    %rax, -224(%rbp)
        leaq    86433224(%r12), %rcx
        movq    %rcx, -8(%rax)
        movq    %r15, (%rax)
        movq    %r12, 8(%rax)
        movq    %rbx, %rdi
        movq    %rax, %rsi
        movabsq $throw_boundserror, %rax
        callq   *%rax
Source line: 215
L724:
        movq    -304(%rbp), %rax
        movq    %rax, -120(%rbp)
        leaq    -8296(%r12), %rax
        movq    %rax, -128(%rbp)
        movq    %r15, %rdi
Source line: 307
        movabsq $jl_box_int64, %rax
Source line: 215
        callq   *%rax
        movq    %rax, -112(%rbp)
        movq    %r12, -104(%rbp)
        leaq    22250520(%r12), %rdi
        movl    $4, %edx
        leaq    -128(%rbp), %rsi
        movabsq $_unsafe_getindex, %rax
        callq   *%rax
        movq    %rax, %rbx
        movq    %rbx, -216(%rbp)
Source line: 196
        movabsq $__pool_alloc, %rax
        callq   *%rax
        leaq    -2111064(%r12), %rcx
        movq    %rcx, -8(%rax)
        movq    -344(%rbp), %rcx
        movq    %rcx, (%rax)
        movq    %rax, -208(%rbp)
        movq    -328(%rbp), %rdi
        movq    %rax, %rsi
        movabsq $jl_new_array, %rax
        callq   *%rax
        movq    %rax, -200(%rbp)
        movq    -320(%rbp), %rdx
Source line: 88
        movq    %rdx, -192(%rbp)
        movl    $78, %esi
        movq    %rax, %rdi
        movq    %rbx, %rcx
        movabsq $"gemv!", %rax
        callq   *%rax
        movq    %rax, -184(%rbp)
        movq    %rax, %rdi
        movabsq $cis, %rax
        callq   *%rax
        movq    %rax, %r14
        movq    %r14, -176(%rbp)
Source line: 5
        cmpq    $0, %r14
        je      L1577
        movq    %r14, -168(%rbp)
        leaq    -72(%rbp), %rdi
        movq    %r14, %rsi
        callq   *%r13
        movq    %r14, -160(%rbp)
        movq    %r14, %rdi
        movabsq $conj, %rax
        callq   *%rax
        movq    %rax, -152(%rbp)
        leaq    -88(%rbp), %rdi
Source line: 246
        movq    %rax, %rsi
        callq   *%r13
Source line: 124
        movsd   -72(%rbp), %xmm0        # xmm0 = mem[0],zero
        movsd   -88(%rbp), %xmm1        # xmm1 = mem[0],zero
        movsd   -64(%rbp), %xmm2        # xmm2 = mem[0],zero
        movsd   -80(%rbp), %xmm3        # xmm3 = mem[0],zero
        movapd  %xmm2, %xmm4
        mulsd   %xmm3, %xmm4
        mulsd   %xmm0, %xmm3
        mulsd   %xmm1, %xmm0
        subsd   %xmm4, %xmm0
        mulsd   %xmm1, %xmm2
        addsd   %xmm3, %xmm2
        movsd   %xmm0, -56(%rbp)
        movsd   %xmm2, -48(%rbp)
        movq    -336(%rbp), %rbx
        movq    %rbx, -144(%rbp)
        movq    %rbx, %rdi
        leaq    -56(%rbp), %rsi
        movq    %r15, %rdx
        movabsq $"setindex!", %rax
        callq   *%rax
Source line: 83
        cmpq    %r15, -312(%rbp)
        jne     L576
Source line: 7
L1121:
        cmpq    $0, %rbx
        je      L1175
        movq    %rbx, -136(%rbp)
        movq    -280(%rbp), %rax
        movabsq $jl_tls_states, %rcx
        movq    %rax, (%rcx)
        movq    %rbx, %rax
        addq    $312, %rsp              # imm = 0x138
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
L1175:
        movabsq $jl_new_array, %rdi
        addq    $23823232, %rdi         # imm = 0x16B8380
        movabsq $jl_undefined_var_error, %rax
        callq   *%rax
L1204:
        xorl    %eax, %eax
Source line: 131
        cmpq    $0, -312(%rbp)
        setg    %cl
        andb    $1, %cl
        movb    %cl, -290(%rbp)
        movb    -290(%rbp), %cl
        andb    $1, %cl
        je      L1253
        movb    $1, %al
        movq    -304(%rbp), %rcx
        movq    %rcx, -240(%rbp)
Source line: 132
L1253:
        andb    $1, %al
        movb    %al, -289(%rbp)
        movb    -289(%rbp), %al
        andb    $1, %al
        jne     L1345
        movq    -304(%rbp), %rbx
        movq    %rbx, -232(%rbp)
        movabsq $jl_gc_alloc_2w, %rax
        callq   *%rax
        movq    %rax, -224(%rbp)
        leaq    86433224(%r12), %rcx
        movq    %rcx, -8(%rax)
        movq    $1, (%rax)
        movq    %r12, 8(%rax)
        movabsq $throw_boundserror, %rcx
        movq    %rbx, %rdi
        movq    %rax, %rsi
        callq   *%rcx
Source line: 215
L1345:
        movq    -304(%rbp), %rax
        movq    %rax, -120(%rbp)
        leaq    -8296(%r12), %rax
        movq    %rax, -128(%rbp)
        movl    $1, %edi
Source line: 307
        movabsq $jl_box_int64, %rax
Source line: 215
        callq   *%rax
        movq    %rax, -112(%rbp)
        movq    %r12, -104(%rbp)
        leaq    22250520(%r12), %rdi
        movabsq $_unsafe_getindex, %rax
        movl    $4, %edx
        leaq    -128(%rbp), %rsi
        callq   *%rax
        movq    %rax, %r14
        movq    %r14, -216(%rbp)
Source line: 196
        movabsq $__pool_alloc, %rax
        callq   *%rax
        leaq    -2111064(%r12), %rcx
        movq    %rcx, -8(%rax)
        movq    -344(%rbp), %rcx
        movq    %rcx, (%rax)
        movq    %rax, -208(%rbp)
        movq    -328(%rbp), %rdi
        movq    %rax, %rsi
        movabsq $jl_new_array, %rax
        callq   *%rax
        movq    %rax, -200(%rbp)
        movq    -320(%rbp), %rdx
Source line: 88
        movq    %rdx, -192(%rbp)
        movabsq $"gemv!", %rbx
        movl    $78, %esi
        movq    %rax, %rdi
        movq    %r14, %rcx
        callq   *%rbx
        movq    %rax, -184(%rbp)
        movabsq $cis, %rcx
        movq    %rax, %rdi
        callq   *%rcx
        movq    %rax, %r14
        movq    %r14, -176(%rbp)
Source line: 5
        cmpq    $0, %r14
        jne     L1599
L1577:
        addq    $-2742760, %r12         # imm = 0xFFFFFFFFFFD62618
        movabsq $jl_undefined_var_error, %rax
        movq    %r12, %rdi
        callq   *%rax
L1599:
        movq    %r14, -168(%rbp)
        movabsq $mapreduce, %rbx
        leaq    -72(%rbp), %rdi
        movq    %r14, %rsi
        callq   *%rbx
        movq    %r14, -160(%rbp)
        movabsq $conj, %rax
        movq    %r14, %rdi
        callq   *%rax
        movq    %rax, -152(%rbp)
        leaq    -88(%rbp), %rdi
Source line: 246
        movq    %rax, %rsi
        callq   *%rbx
Source line: 124
        movsd   -72(%rbp), %xmm0        # xmm0 = mem[0],zero
        movsd   -88(%rbp), %xmm1        # xmm1 = mem[0],zero
        movsd   -64(%rbp), %xmm2        # xmm2 = mem[0],zero
        movsd   -80(%rbp), %xmm3        # xmm3 = mem[0],zero
        movapd  %xmm2, %xmm4
        mulsd   %xmm3, %xmm4
        mulsd   %xmm0, %xmm3
        mulsd   %xmm1, %xmm0
        subsd   %xmm4, %xmm0
        mulsd   %xmm1, %xmm2
        addsd   %xmm3, %xmm2
        movsd   %xmm0, -56(%rbp)
        movsd   %xmm2, -48(%rbp)
        movabsq $jl_new_array, %rdi
        addq    $23823232, %rdi         # imm = 0x16B8380
        movabsq $jl_undefined_var_error, %rax
        callq   *%rax
        nopw    %cs:(%rax,%rax)

julia> versioninfo()
Julia Version 0.5.0-dev+3390
Commit a9e7e86* (2016-04-04 12:47 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)



julia> @code_native s(r,x)
        .text
Filename: none
Source line: 0
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        subq    $248, %rsp
        movq    %rsi, -272(%rbp)
        movq    %rdi, -280(%rbp)
        movabsq $140241110008784, %r12  # imm = 0x7F8C6D8B83D0
        leaq    -128(%rbp), %rbx
        leaq    -104(%rbp), %r14
        movq    $0, -232(%rbp)
        movq    $0, -224(%rbp)
        movq    $0, -216(%rbp)
        movq    $0, -208(%rbp)
        movq    $0, -200(%rbp)
        movq    $0, -192(%rbp)
        movq    $0, -184(%rbp)
        movq    $0, -176(%rbp)
        movq    $0, -168(%rbp)
        movq    $0, -160(%rbp)
        movq    $0, -152(%rbp)
        movq    $0, -144(%rbp)
        movq    $0, -136(%rbp)
        movq    $0, -128(%rbp)
        movq    $0, -120(%rbp)
        movq    $0, -112(%rbp)
        movq    $0, -96(%rbp)
        movq    $36, -248(%rbp)
        movabsq $jl_tls_states, %rcx
        movq    (%rcx), %rax
        movq    %rax, -240(%rbp)
        leaq    -248(%rbp), %rax
        movq    %rax, (%rcx)
        movq    24(%rsi), %r15
Source line: 303
        movq    %r15, -264(%rbp)
        movq    %r12, -104(%rbp)
        movabsq $jl_box_int64, %rax
        movq    %r15, %rdi
        callq   *%rax
        movq    %rax, -96(%rbp)
        leaq    32248(%r12), %rdi
        movabsq $140249727909376, %rax  # imm = 0x7F8E6F363E00
        movl    $2, %edx
        movq    %r14, %rsi
        callq   *%rax
        movq    %rax, -232(%rbp)
        leaq    31823512(%r12), %rcx
        movq    %rcx, -128(%rbp)
        movq    %r12, -120(%rbp)
        movq    %rax, -112(%rbp)
        movabsq $jl_apply_generic, %rax
        movl    $3, %esi
        movq    %rbx, %rdi
        callq   *%rax
        movabsq $jl_alloc_array_1d, %rcx
        movq    %rax, -224(%rbp)
        movq    (%rax), %rsi
        leaq    535328(%r12), %rdi
        callq   *%rcx
        movq    %rax, -216(%rbp)
        movabsq $"fill!", %rcx
        xorpd   %xmm0, %xmm0
        movq    %rax, %rdi
        callq   *%rcx
        movq    %rax, %rbx
        movq    %rbx, -208(%rbp)
Source line: 83
        cmpq    $1, %r15
        jl      L916
        xorl    %r15d, %r15d
Source line: 5
        movabsq $mapreduce, %r13
        nopw    %cs:(%rax,%rax)
L480:
        xorl    %ecx, %ecx
Source line: 131
        addq    $1, %r15
        cmpq    $1, %r15
        jl      L502
        cmpq    -264(%rbp), %r15
        setle   %cl
L502:
        xorl    %eax, %eax
        andb    $1, %cl
        movb    %cl, -250(%rbp)
        movb    -250(%rbp), %cl
        andb    $1, %cl
        je      L526
        movb    $1, %al
Source line: 132
L526:
        andb    $1, %al
        movb    %al, -249(%rbp)
        movb    -249(%rbp), %al
        andb    $1, %al
        jne     L612
        movabsq $jl_gc_alloc_2w, %rax
        callq   *%rax
        movq    %rax, -200(%rbp)
        leaq    6511328(%r12), %rcx
        movq    %rcx, -8(%rax)
        movq    %r15, (%rax)
        leaq    34784(%r12), %rcx
        movq    %rcx, 8(%rax)
        movq    -272(%rbp), %rdi
        movq    %rax, %rsi
        movabsq $throw_boundserror, %rax
        callq   *%rax
Source line: 215
L612:
        leaq    4163848(%r12), %rax
        movq    %rax, -128(%rbp)
        movq    -272(%rbp), %rax
        movq    %rax, -120(%rbp)
        movq    %r15, %rdi
Source line: 303
        movabsq $jl_box_int64, %rax
Source line: 215
        callq   *%rax
        movq    %rax, -112(%rbp)
        leaq    34784(%r12), %rax
        movq    %rax, -104(%rbp)
        leaq    24790232(%r12), %rdi
        movl    $4, %edx
        leaq    -128(%rbp), %rsi
        movabsq $_unsafe_getindex, %rax
        callq   *%rax
        movq    %rax, -192(%rbp)
        movq    -280(%rbp), %rdi
        movq    %rax, %rsi
        movabsq $"*", %rax
        callq   *%rax
        movq    %rax, -184(%rbp)
        movq    %rax, %rdi
        movabsq $cis, %rax
        callq   *%rax
        movq    %rax, %r14
        movq    %r14, -176(%rbp)
Source line: 5
        movq    %r14, -168(%rbp)
        leaq    -72(%rbp), %rdi
        movq    %r14, %rsi
        callq   *%r13
        movq    %r14, -160(%rbp)
        movq    %r14, %rdi
        movabsq $conj, %rax
        callq   *%rax
        movq    %rax, -152(%rbp)
        leaq    -88(%rbp), %rdi
Source line: 246
        movq    %rax, %rsi
        callq   *%r13
Source line: 124
        movsd   -72(%rbp), %xmm0        # xmm0 = mem[0],zero
        movsd   -88(%rbp), %xmm1        # xmm1 = mem[0],zero
        movapd  %xmm0, %xmm2
        mulsd   %xmm1, %xmm2
        movsd   -64(%rbp), %xmm3        # xmm3 = mem[0],zero
        movsd   -80(%rbp), %xmm4        # xmm4 = mem[0],zero
        movapd  %xmm3, %xmm5
        mulsd   %xmm4, %xmm5
        subsd   %xmm5, %xmm2
        mulsd   %xmm4, %xmm0
        mulsd   %xmm1, %xmm3
        addsd   %xmm0, %xmm3
        movsd   %xmm2, -56(%rbp)
        movsd   %xmm3, -48(%rbp)
        movq    %rbx, -144(%rbp)
        movq    %rbx, %rdi
        leaq    -56(%rbp), %rsi
        movq    %r15, %rdx
        movabsq $"setindex!", %rax
        callq   *%rax
Source line: 83
        cmpq    %r15, -264(%rbp)
        jne     L480
Source line: 7
L916:
        movq    %rbx, -136(%rbp)
        movq    -240(%rbp), %rax
        movabsq $jl_tls_states, %rcx
        movq    %rax, (%rcx)
        movq    %rbx, %rax
        addq    $248, %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)

julia> versioninfo()
Julia Version 0.5.0-dev+3313
Commit 5e01b1a (2016-03-29 15:14 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)



julia> @code_native s(r,x)
        .text
Filename: none
Source line: 0
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        subq    $248, %rsp
        movq    %rsi, -272(%rbp)
        movq    %rdi, -280(%rbp)
        movabsq $140465848796112, %r12  # imm = 0x7FC0C10543D0
        leaq    -128(%rbp), %rbx
        leaq    -104(%rbp), %r14
        movq    $0, -232(%rbp)
        movq    $0, -224(%rbp)
        movq    $0, -216(%rbp)
        movq    $0, -208(%rbp)
        movq    $0, -200(%rbp)
        movq    $0, -192(%rbp)
        movq    $0, -184(%rbp)
        movq    $0, -176(%rbp)
        movq    $0, -168(%rbp)
        movq    $0, -160(%rbp)
        movq    $0, -152(%rbp)
        movq    $0, -144(%rbp)
        movq    $0, -136(%rbp)
        movq    $0, -128(%rbp)
        movq    $0, -120(%rbp)
        movq    $0, -112(%rbp)
        movq    $0, -96(%rbp)
        movq    $36, -248(%rbp)
        movabsq $jl_tls_states, %rcx
        movq    (%rcx), %rax
        movq    %rax, -240(%rbp)
        leaq    -248(%rbp), %rax
        movq    %rax, (%rcx)
        movq    24(%rsi), %r15
Source line: 303
        movq    %r15, -264(%rbp)
        movq    %r12, -104(%rbp)
        movabsq $jl_box_int64, %rax
        movq    %r15, %rdi
        callq   *%rax
        movq    %rax, -96(%rbp)
        leaq    32016(%r12), %rdi
        movabsq $140474466699184, %rax  # imm = 0x7FC2C2B007B0
        movl    $2, %edx
        movq    %r14, %rsi
        callq   *%rax
        movq    %rax, -232(%rbp)
        leaq    33698936(%r12), %rcx
        movq    %rcx, -128(%rbp)
        movq    %r12, -120(%rbp)
        movq    %rax, -112(%rbp)
        movabsq $jl_apply_generic, %rax
        movl    $3, %esi
        movq    %rbx, %rdi
        callq   *%rax
        movabsq $jl_alloc_array_1d, %rcx
        movq    %rax, -224(%rbp)
        movq    (%rax), %rsi
        leaq    885120(%r12), %rdi
        callq   *%rcx
        movq    %rax, -216(%rbp)
        movabsq $"fill!", %rcx
        xorpd   %xmm0, %xmm0
        movq    %rax, %rdi
        callq   *%rcx
        movq    %rax, %rbx
        movq    %rbx, -208(%rbp)
Source line: 83
        cmpq    $1, %r15
        jl      L916
        xorl    %r15d, %r15d
Source line: 5
        movabsq $mapreduce, %r13
        nopw    %cs:(%rax,%rax)
L480:
        xorl    %ecx, %ecx
Source line: 131
        addq    $1, %r15
        cmpq    $1, %r15
        jl      L502
        cmpq    -264(%rbp), %r15
        setle   %cl
L502:
        xorl    %eax, %eax
        andb    $1, %cl
        movb    %cl, -250(%rbp)
        movb    -250(%rbp), %cl
        andb    $1, %cl
        je      L526
        movb    $1, %al
Source line: 132
L526:
        andb    $1, %al
        movb    %al, -249(%rbp)
        movb    -249(%rbp), %al
        andb    $1, %al
        jne     L612
        movabsq $jl_gc_alloc_2w, %rax
        callq   *%rax
        movq    %rax, -200(%rbp)
        leaq    3568704(%r12), %rcx
        movq    %rcx, -8(%rax)
        movq    %r15, (%rax)
        leaq    41808(%r12), %rcx
        movq    %rcx, 8(%rax)
        movq    -272(%rbp), %rdi
        movq    %rax, %rsi
        movabsq $throw_boundserror, %rax
        callq   *%rax
Source line: 215
L612:
        leaq    1480216(%r12), %rax
        movq    %rax, -128(%rbp)
        movq    -272(%rbp), %rax
        movq    %rax, -120(%rbp)
        movq    %r15, %rdi
Source line: 303
        movabsq $jl_box_int64, %rax
Source line: 215
        callq   *%rax
        movq    %rax, -112(%rbp)
        leaq    41808(%r12), %rax
        movq    %rax, -104(%rbp)
        leaq    22558176(%r12), %rdi
        movl    $4, %edx
        leaq    -128(%rbp), %rsi
        movabsq $_unsafe_getindex, %rax
        callq   *%rax
        movq    %rax, -192(%rbp)
        movq    -280(%rbp), %rdi
        movq    %rax, %rsi
        movabsq $"*", %rax
        callq   *%rax
        movq    %rax, -184(%rbp)
        movq    %rax, %rdi
        movabsq $cis, %rax
        callq   *%rax
        movq    %rax, %r14
        movq    %r14, -176(%rbp)
Source line: 5
        movq    %r14, -168(%rbp)
        leaq    -72(%rbp), %rdi
        movq    %r14, %rsi
        callq   *%r13
        movq    %r14, -160(%rbp)
        movq    %r14, %rdi
        movabsq $conj, %rax
        callq   *%rax
        movq    %rax, -152(%rbp)
        leaq    -88(%rbp), %rdi
Source line: 246
        movq    %rax, %rsi
        callq   *%r13
Source line: 124
        movsd   -72(%rbp), %xmm0        # xmm0 = mem[0],zero
        movsd   -88(%rbp), %xmm1        # xmm1 = mem[0],zero
        movapd  %xmm0, %xmm2
        mulsd   %xmm1, %xmm2
        movsd   -64(%rbp), %xmm3        # xmm3 = mem[0],zero
        movsd   -80(%rbp), %xmm4        # xmm4 = mem[0],zero
        movapd  %xmm3, %xmm5
        mulsd   %xmm4, %xmm5
        subsd   %xmm5, %xmm2
        mulsd   %xmm4, %xmm0
        mulsd   %xmm1, %xmm3
        addsd   %xmm0, %xmm3
        movsd   %xmm2, -56(%rbp)
        movsd   %xmm3, -48(%rbp)
        movq    %rbx, -144(%rbp)
        movq    %rbx, %rdi
        leaq    -56(%rbp), %rsi
        movq    %r15, %rdx
        movabsq $"setindex!", %rax
        callq   *%rax
Source line: 83
        cmpq    %r15, -264(%rbp)
        jne     L480
Source line: 7
L916:
        movq    %rbx, -136(%rbp)
        movq    -240(%rbp), %rax
        movabsq $jl_tls_states, %rcx
        movq    %rax, (%rcx)
        movq    %rbx, %rax
        addq    $248, %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)
