Before seeing your scripts, I could not imagine applying Julia to my research. Now I have forked your repository on GitHub and followed you. I am interested in your research, and caught a glimpse of one paper by the Ni group. Although it's far from the focus of my own research, I believe that your repositories will make a great difference to my exploration of Julia.
On Tuesday, December 15, 2015 at 10:11:19 PM UTC+8, Yichao Yu wrote:
>
> On Tue, Dec 15, 2015 at 7:42 AM, 博陈 <[email protected] <javascript:>> wrote:
> > The code got a great performance improvement. My original version cost 133 s
> > and about 166 GB of allocations, while your version cost 61 s and about 423 MB!
>
> My version is here[1] (ignore the unused first 12 lines; we'll get to that later).
> It's pretty much what Pablo has already done. The only minor improvements are:
>
> 1. Swap the two buffers directly instead of doing an extra copy (the
>    copy is merged into the first scaling).
> 2. Do not use maxabs2, since that function is allocating (which is
>    probably a bug that should be fixed, or wait for the jb/functions branch ;-p).
> 3. I didn't change Ks to Ks - 1 in my version, since I'm not sure which
>    one you are using for the Matlab version.
>
> Depending on the machine I benchmark it on, I see a 10-20% performance
> improvement. The CPU utilization is ~75-80% on all cores, so that's
> about the maximum performance improvement you could hope for without
> further optimizing FFTW =)
>
> If you are benchmarking the Matlab version with the same Ks, here are a
> few guesses why the Julia version could still be ~20% slower, and also a
> few other improvements that I've tried:
>
> 1. Matlab might be using threads for these operations; Julia currently
>    can't. We are working hard to make that possible (look for the
>    `multi-threading` issue/PR tag on GitHub).
> 2. Matlab might be storing the array in a different format (or
>    otherwise making it easier to use the CPU SIMD instructions).
>    This could be fixed with compiler improvements[4].
>    Another way to solve this is to change the format of the array.
>    This is the way I was using (I'm using a much smaller array, so the
>    time I spend outside FFTW is much more important for me). The code I
>    have can be found here[5]. In short, it uses the StructsOfArrays package
>    to store the real and imaginary parts of the array separately.
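The buffer-swap idea in point 1 above can be sketched as follows. This is a minimal illustration of the technique (not Yichao's actual code; the function and variable names are made up): instead of copying the new state back into the old buffer each iteration, just swap which binding points at which buffer.

```julia
# Sketch: reuse two pre-allocated buffers by swapping references
# instead of doing an element-wise copy each iteration.
function halve_steps(a::Vector{Float64}, b::Vector{Float64}, nsteps::Int)
    for _ in 1:nsteps
        # Write the new state into `b`, computed from `a`...
        for i in eachindex(a)
            b[i] = 0.5 * a[i]
        end
        # ...then swap the bindings: no element-wise copy is performed.
        a, b = b, a
    end
    return a  # `a` now refers to the buffer holding the latest state
end

result = halve_steps(ones(4), zeros(4), 3)
```

The swap is O(1) regardless of the array size, whereas a `copy!` would touch every element once more per iteration.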
> For reasons that can be explained later, this helps the compiler make
> use of the SIMD instructions, and I could get a 10x speedup in the
> part outside FFTW. One problem with this is that FFTW doesn't have very
> good support for this representation[6], so you would need to either
> copy between the two representations when doing the FFT, or change
> the storage a little bit (see @simonster's comment in this issue) to
> make it possible to run FFTW on it directly (which is what the first
> 10 lines of the code in [1] are for).
>
> To answer a few of your other questions that haven't been addressed already:
>
> > When you said "temporary arrays", did you mean that I should use fewer
> > global variables? Yes, I can write arrays such as x, px only in the
> > functions Potential and Ko, but that will make the code less
> > understandable.
>
> No. I'm talking about things like `a .*= b` currently being `a = a .* b`,
> so it has to compute `a .* b`, which needs to allocate an array.
>
> > When starting the code, I used Ks=2^11, but when I finally plotted the ϕ
> > function, I got an error, and it seems that I misunderstood the linspace
> > function (I wrote linspace(-pm,pm,Ks) at first), so I changed it to 2^11+1.
> > It's a bad choice, and now I realize it. When using Ks=2^11-1, it only takes
> > 81 seconds for the code to run, which is slightly faster than the Matlab
> > version.
>
> I think this is mentioned in the doc somewhere, but I think it's not
> really emphasized enough for people who need it, and I can't find it
> now.... (I also learnt this the hard way....)
>
> > I have already declared the variables ϕ and Pl outside the loop, and in
> > my opinion, they are pre-allocated and "should" be manipulated in place. To
> > be honest, maybe I don't really understand the words "in place" and
> > "pre-allocated".
>
> See `a .*= b` above; it's not in place, unfortunately. We have an issue
> for this[2].
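The in-place loop the thread keeps recommending in place of `a .*= b` can be made concrete with a short sketch. (In current Julia, dot-fusion makes `a .*= b` itself non-allocating, but the explicit loop below is what the 2015-era advice amounts to and is equivalent either way.)

```julia
# Multiply `a` by `b` element-wise, reusing a's memory: no temporary
# array is allocated, unlike the old `a = a .* b` lowering.
function inplace_mul!(a::AbstractArray, b::AbstractArray)
    size(a) == size(b) || throw(DimensionMismatch("a and b must match"))
    for i in eachindex(a, b)
        a[i] *= b[i]
    end
    return a
end

a = [1.0, 2.0, 3.0]
b = [10.0, 10.0, 10.0]
inplace_mul!(a, b)
```

After the call, `a` holds the products and no intermediate array was created; this is exactly the pattern `for i in eachindex(ϕ); ϕ[i] *= ko[i]; end` used in Pablo's version below.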
> I don't find it to be too much of an issue, since writing the
> loop isn't too much harder and it is limited to a single operation
> anyway.
>
> > But I don't understand why the assignment ϕ .*= ko2 is unnecessary.
> > Are you saying that I should rewrite it as scale(ϕ, ko2)? I'll have a try.
>
> This assignment is necessary. What I meant is the Pl * ϕ one. I
> agree the new DFT API is a little bit unintuitive for in-place
> operations. Fortunately, having the assignment only has a tiny
> performance hit, so it doesn't matter too much.
>
> [1] https://github.com/yuyichao/explore/blob/e32d7c19b7663526cc8a6a4697b337ad9132c16a/julia/complex_array/improved.jl
> [2] https://github.com/JuliaLang/julia/issues/249
> [3] https://github.com/JuliaLang/julia/pull/12087#issuecomment-120127518
> [4] https://github.com/JuliaLang/julia/issues/11899
> [5] https://github.com/nacs-lab/yyc-data/blob/512a4e3f9d353274c24c353f401121afa81ec981/calculations/single-atom/test/heating-tests/two-level-1d.jl
> [6] https://github.com/JuliaLang/julia/issues/12398
>
> > On Tuesday, December 15, 2015 at 7:34:55 PM UTC+8, Pablo Zubieta wrote:
> >>
> >> Stefan said that your code is well typed (type stable). Have you tried
> >> Yichao's suggestions?
> >>
> >> Here is the function ground() taking them into account:
> >>
> >> function ground()
> >>     R = 320.
> >>     Ks = 2^11 - 1
> >>     x = linspace(-R, R, Ks+1)
> >>     dx = 2R/Ks
> >>     dt = 0.1
> >>     pm = 2π/2dx
> >>     px = circshift( linspace(-pm, pm, Ks+1), round(Int64, Ks/2) )
> >>     FFTW.set_num_threads(8)
> >>
> >>     V = Float64[Potential(ix, iy) for ix in x, iy in x]
> >>     ko = Float64[e^(-dt/4 * Ko(ipx, ipy)) for ipx in px, ipy in px]
> >>     ko2 = ko.^2
> >>     vo = Float64[e^(-dt * Potential(ix, iy)) for ix in x, iy in x]
> >>     ϕ = Array(Complex128, (Ks+1, Ks+1))
> >>     Pl = plan_fft!(ϕ; flags=FFTW.MEASURE)
> >>     for i in eachindex(ϕ); ϕ[i] = V[i]; end  # Conversion is automatic here
> >>
> >>     invnormphi = 1. / (sqrt(sumabs2(ϕ))*dx)
> >>     scale!(ϕ, invnormphi)
> >>
> >>     Pl * ϕ  # No need to assign to ϕ
> >>     for i in eachindex(ϕ); ϕ[i] *= ko[i]; end
> >>     Pl \ ϕ
> >>
> >>     ϕ′ = similar(ϕ)
> >>
> >>     Δϕ = 1.
> >>     nstep = 0
> >>     println("start loop")
> >>     while Δϕ > 1.e-15
> >>         copy!(ϕ′, ϕ)
> >>         for i in eachindex(ϕ); ϕ[i] *= vo[i]; end
> >>
> >>         invnormphi = 1. / (sqrt(sumabs2(ϕ))*dx)
> >>         scale!(ϕ, invnormphi)
> >>
> >>         Pl * ϕ
> >>         for i in eachindex(ϕ); ϕ[i] *= ko2[i]; end
> >>         Pl \ ϕ
> >>         # if we check Δϕ every step, 35 s is needed.
> >>         if nstep > 500
> >>             Δϕ = maxabs(ϕ - ϕ′)
> >>         end
> >>         nstep += 1
> >>         if mod(nstep, 200) == 0
> >>             print(nstep, " ")
> >>             println("Δϕ=", Δϕ)
> >>         end
> >>     end
> >>     for i in eachindex(ϕ); ϕ[i] *= vo[i]; end
> >>
> >>     Pl * ϕ
> >>     for i in eachindex(ϕ); ϕ[i] *= ko[i]; end
> >>     Pl \ ϕ
> >>
> >>     invnormphi = 1. / (sqrt(sumabs2(ϕ))*dx)
> >>     scale!(ϕ, invnormphi)
> >> end
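The linspace off-by-one discussed above is just interval counting: n sample points span n - 1 intervals, so `linspace(-pm, pm, Ks)` has spacing 2pm/(Ks-1), not 2pm/Ks. A minimal sketch (written against current Julia, where `range(...; length=n)` plays the role the old `linspace` did):

```julia
# 5 points over [-1, 1] give 4 intervals of width 0.5 each;
# asking for Ks points instead of Ks+1 changes the grid spacing.
xs = range(-1.0, stop = 1.0, length = 5)
@assert step(xs) == 0.5
@assert collect(xs) == [-1.0, -0.5, 0.0, 0.5, 1.0]

# Choosing Ks = 2^11 - 1 makes the array length Ks + 1 = 2^11, a power
# of two -- the sizes FFTW transforms fastest, which is why the Ks=2^11-1
# run above beat the Ks=2^11+1 one.
Ks = 2^11 - 1
@assert Ks + 1 == 2^11
```

FFTW is fastest for lengths that factor into small primes, with pure powers of two the best case; a length of 2^11 + 1 = 2049 = 3 × 683 hits a much slower code path.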
