On Monday, April 4, 2016, at 10:36 -0700, Johannes Wagner wrote:
> hey guys,
> attached you'll find text files with the @code_native output for these
> expressions:
> - r * x[1,:]
> - cis(imexp)
> - sum(imexp) * sum(conj(imexp))
> 
> for Julia 0.5.
> 
> The hardware I run on is a Haswell i5 machine, a Haswell i7 machine, and
> an IvyBridge i5 machine. It turned out that on the Haswell i5 machine the
> code also runs fast; only the Haswell i7 machine is the slow one. This
> really drove me nuts. First I thought it was the OS, then the
> architecture, and now it's just i5 vs. i7... Anyway, I don't
> know anything about x86 assembly, but the Julia 0.4.5 code is the same
> on all machines. However, for the dot product, the 0.5 code already has
> 2 different instructions on the i5 vs. the i7 (lines 44 & 47), and
> likewise for the cis call (line 149...). And the IvyBridge i5 code is
> similar to the Haswell i5 code. I also included versioninfo() at the top
> of the file, so you could just look at a vimdiff of the Julia 0.5
> files... Can anyone make sense of this?
I'm definitely not an expert in assembly, but that additional leaq
instruction on line 44, and the additional movq instructions on lines
111, 151 and 152, really look weird.

Could you do the same test with the binary tarballs? If the difference
persists, you should open an issue on GitHub to track this.
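In case it's useful when re-running on the tarballs: you can dump the native code straight to a file on each machine and compare the files with vimdiff afterwards. A sketch (the function f and the file name are just placeholders, not your actual code):

```julia
using InteractiveUtils  # provides code_native on recent Julia; built in on 0.5

# Hypothetical example function standing in for the real kernel
f(r, x) = r * x[1, :]

# Write the native code for this signature to a file, one file per machine,
# so the two files can be compared with vimdiff afterwards.
open("native_i5.txt", "w") do io
    code_native(io, f, (Float64, Matrix{Float64}))
end
```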

BTW, please wrap the first call in a function to ensure it is
specialized for the argument types, i.e.:

f(r, x) = r * x[1,:]
@code_native f(r, x)

Also, please check whether you still see the difference with this code:
g(r, x) = r * x
@code_native g(r, x[1,:])

What are the types of r and x? Could you provide a simple reproducible example 
with dummy values?
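For instance, something like this (a hypothetical sketch; the dummy types and sizes below are guesses, so substitute whatever your real r and x are):

```julia
using InteractiveUtils   # provides @code_native on recent Julia; built in on 0.5

# Hypothetical dummy values -- substitute the actual types of your r and x
x = rand(100, 100)
r = 2.0

f(r, x) = r * x[1, :]    # wrapping in a function specializes it on the argument types
@code_native f(r, x)
```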

> I will still test the binary tarballs. If I remove the cis() call,
> the difference is hard to tell: the loop is ~10 times faster, and
> all machines come in at around 5 ms. For the whole loop with the
> cis() call, the difference is ~50 ms on the i5 vs. ~90 ms on the i7.
> 
> Shall I also post the julia 0.4 code?
If it's identical for all machines, I don't think it's needed.
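To pin down how much of those 50-90 ms come from cis itself, you could also time it in isolation. A rough sketch (dotprods is just a dummy stand-in for your actual data, and the two wrappers are hypothetical names):

```julia
# Dummy stand-in data; replace with your actual dot products.
dotprods = rand(10^6)

cis_version(v) = cis.(v)          # cos(x) + im*sin(x) in one call
exp_version(v) = exp.(im .* v)    # same result, but materializes im .* v first

cis_version(dotprods); exp_version(dotprods)   # warm up the JIT first
@time cis_version(dotprods)
@time exp_version(dotprods)
```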


Regards


> cheers, Johannes
> 
> 
> 
> > On Wednesday, March 30, 2016, at 15:16 -0700, Johannes Wagner wrote:
> > > 
> > > 
> > > > On Wednesday, March 30, 2016, at 04:43 -0700, Johannes Wagner wrote:
> > > > > Sorry for not having expressed myself clearly: I meant that the latest
> > > > > version of Fedora (24 development) works fine. I always used the
> > > > > latest Julia nightly available in the nalimilan copr repo. Right now
> > > > > that is 0.5.0-dev+3292, Commit 9d527c5*, and all machines use
> > > > > LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
> > > > >  
> > > > > peakflops() on all machines (identical hardware) is between ~1.2e11 and 1.5e11.
> > > > >  
> > > > > Fedora 22 & 23 with Julia 0.5 are ~50% slower than 0.4; only on
> > > > > Fedora 24 is Julia 0.5 faster than Julia 0.4.
> > > > Could you try to find a simple code to reproduce the problem? In  
> > > > particular, it would be useful to check whether this comes from  
> > > > OpenBLAS differences or whether it also happens with pure Julia code  
> > > > (typical operations which depend on BLAS are matrix multiplication, as  
> > > > well as most of linear algebra). Normally, 0.4 and 0.5 should use the  
> > > > same BLAS, but who knows...  
> > > Well, that's what I did, and the 3 simple calls inside the loop run at
> > > more or less the same speed; only the whole loop seems slower. See my
> > > code sample from my March 8th answer (the code gets proportionally
> > > faster when exp(im .* dotprods) is replaced by cis(dotprods)).
> > > So I don't know what else I can do...
> > Sorry, somehow I had missed that message. This indeed looks like a code 
> > generation issue in Julia/LLVM. 
> > 
> > > > Can you also confirm that all versioninfo() fields are the same for all
> > > > three machines, both for 0.4 and 0.5? We must envision the possibility
> > > > that the differences actually come from 0.4.
> > > Oh, right! I just noticed that my Fedora 24 machine is an IvyBridge,
> > > which is the fast one:
> > > 
> > > Julia Version 0.5.0-dev+3292 
> > > Commit 9d527c5* (2016-03-28 06:55 UTC) 
> > > Platform Info: 
> > >   System: Linux (x86_64-redhat-linux) 
> > >   CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz 
> > >   WORD_SIZE: 64 
> > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge) 
> > >   LAPACK: libopenblasp.so.0 
> > >   LIBM: libopenlibm 
> > >   LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge) 
> > > 
> > > and the other ones with Fedora 22/23 are Haswell, which are the slow ones:
> > > 
> > > Julia Version 0.5.0-dev+3292 
> > > Commit 9d527c5* (2016-03-28 06:55 UTC) 
> > > Platform Info: 
> > >   System: Linux (x86_64-redhat-linux) 
> > >   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz 
> > >   WORD_SIZE: 64 
> > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell) 
> > >   LAPACK: libopenblasp.so.0 
> > >   LIBM: libopenlibm 
> > >   LLVM: libLLVM-3.7.1 (ORCJIT, haswell) 
> > > 
> > > I just booted a Fedora 23 on the IvyBridge machine, and it's also fast.
> > > 
> > > Now if I use Julia 0.4.5 on both architectures:
> > > 
> > > Julia Version 0.4.5 
> > > Commit 2ac304d* (2016-03-18 00:58 UTC) 
> > > Platform Info: 
> > >   System: Linux (x86_64-redhat-linux) 
> > >   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz 
> > >   WORD_SIZE: 64 
> > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell) 
> > >   LAPACK: libopenblasp.so.0 
> > >   LIBM: libopenlibm 
> > >   LLVM: libLLVM-3.3 
> > > 
> > > and: 
> > > 
> > > Julia Version 0.4.5 
> > > Commit 2ac304d* (2016-03-18 00:58 UTC) 
> > > Platform Info: 
> > >   System: Linux (x86_64-redhat-linux) 
> > >   CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz 
> > >   WORD_SIZE: 64 
> > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge) 
> > >   LAPACK: libopenblasp.so.0 
> > >   LIBM: libopenlibm 
> > >   LLVM: libLLVM-3.3 
> > > 
> > > there is no speed difference apart from the ~10% or so coming from the
> > > faster Haswell machine. So could this be specific to the Haswell
> > > hardware target, introduced with the change from LLVM 3.3 to 3.7.1?
> > > Is there anything else I could provide?
> > This is certainly an interesting finding. Could you paste somewhere the 
> > output of @code_native for your function on Sandybridge vs. Haswell, 
> > for both 0.4 and 0.5? 
> > 
> > It would also be useful to check whether the same difference appears if
> > you use the generic binary tarballs from http://julialang.org/downloads.
> > 
> > Finally, do you get the same result if you remove the call to exp() 
> > from the loop? (This is the only external function, so it shouldn't be 
> > affected by changes in Julia.) 
> > 
> > 
> > Regards 
> > 
> > 
> > > Best, Johannes 
> > > 
> > > >  Regards  
> > > 
> > > 
> > > > > On Wednesday, March 16, 2016, at 09:25 -0700, Johannes Wagner wrote:
> > > > > > Just a little update. I tested some other Fedoras: Fedora 22 with
> > > > > > LLVM 3.8 is also slow with Julia 0.5, whereas a Fedora 24 branch
> > > > > > with LLVM 3.7 is faster on Julia 0.5 compared to Julia 0.4, as it
> > > > > > should be (the speedup in the inner loop parts translates into a
> > > > > > speedup of the whole function).
> > > > > > 
> > > > > > I don't know if anyone cares about that... At least the latest
> > > > > > version seems to work fine; I hope it stays like this into the
> > > > > > final Fedora 24.
> > > > > What's the "latest version"? Built from git source or from the RPM
> > > > > nightlies? With which LLVM version for each?
> > > > > 
> > > > > If from the RPMs, I had switched them to LLVM 3.8 for a few days and
> > > > > went back to 3.7 because of a build failure. So that might explain
> > > > > the difference. You can install the last version which built with
> > > > > LLVM 3.8 manually from here:
> > > > > https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-23-x86_64/00167549-julia/
> > > > > 
> > > > > It would be interesting to compare it with the latest nightly with
> > > > > 3.7.
> > > > >  
> > > > >  
> > > > > Regards   
> > > > >  
> > > > >  
> > > > >  
> > > > > > > hey guys,
> > > > > > > I just experienced something weird. I have some code that runs
> > > > > > > fine on 0.4.3. Then I updated to 0.5-dev to test the new Arrays,
> > > > > > > ran the same code, and noticed it got about ~50% slower. Then I
> > > > > > > downgraded back to 0.4.3 and ran the old code, but the speed
> > > > > > > remained slow. I noticed that while reinstalling 0.4.3,
> > > > > > > openblas-threads didn't get installed along with it. So I
> > > > > > > manually installed it, but no change.
> > > > > > > Does anyone have an idea what could be going on? LLVM on
> > > > > > > Fedora 23 is 3.7.
> > > > > > > 
> > > > > > > Cheers, Johannes
> > > > > > >   
