Before seeing your scripts, I could not imagine applying Julia to my research. Now I have forked your repository on GitHub and followed you. I am interested in your research, and caught a glimpse of one paper by the Ni group. Although it's far from the focus of my own research, I believe that your repositories will make a great difference to my exploration of Julia.
On Tuesday, December 15, 2015 at 10:11:19 PM UTC+8, Yichao Yu wrote:
>
> On Tue, Dec 15, 2015 at 7:42 AM, 博陈 <[email protected] <javascript:>> wrote:
> > The code got a great performance improvement. My original version cost 133 s
> > and about 166 GB of allocations, while your version cost 61 s and about 423 MB!
>
> My version is here[1] (ignore the unused first 12 lines; we'll get to that later).
> It's pretty much what Pablo has already done. The only minor improvements are:
>
> 1. Swap the two buffers directly instead of doing an extra copy (the
>    copy is merged into the first scaling).
> 2. Do not use maxabs2, since that function is allocating (which is
>    probably a bug that should be fixed, or wait for the jb/functions branch ;-p).
> 3. I didn't change Ks to Ks - 1 in my version, since I'm not sure which
>    one you are using for the Matlab version.
>
> Depending on the machine I benchmark it on, I see a 10-20% performance
> improvement. The CPU utilization is ~75-80% on all cores, so that's
> about the maximum performance improvement you could hope for without
> further optimizing FFTW =)
>
> If you are benchmarking the Matlab version with the same Ks, here are a
> few guesses why the Julia version could still be ~20% slower, and also a
> few other improvements that I've tried:
>
> 1. Matlab might be using threads for these operations; Julia currently
>    can't. We are working hard to make that possible (look for the
>    `multi-threading` issue/PR tag on GitHub).
> 2. Matlab might be storing the array in a different format (or
>    otherwise making it easier to use the CPU SIMD instructions).
>    This could be fixed with compiler improvements[4].
>    Another way to solve this is to change the format of the array.
>    This is the way I was using (I'm using a much smaller array, so the
>    time I spend outside FFTW is much more important for me). The code I
>    have can be found here[5]. In short, it uses the StructsOfArrays package
>    to store the real and imaginary parts of the array separately.
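The buffer-swap idea in point 1 above can be sketched as follows. This is a minimal illustration of the technique (not Yichao's actual code; the function and variable names are made up): instead of copying the new state back into the old buffer each iteration, just swap which binding points at which buffer.

```julia
# Sketch: reuse two pre-allocated buffers by swapping references
# instead of doing an element-wise copy each iteration.
function halve_steps(a::Vector{Float64}, b::Vector{Float64}, nsteps::Int)
    for _ in 1:nsteps
        # Write the new state into `b`, computed from `a`...
        for i in eachindex(a)
            b[i] = 0.5 * a[i]
        end
        # ...then swap the bindings: no element-wise copy is performed.
        a, b = b, a
    end
    return a  # `a` now refers to the buffer holding the latest state
end

result = halve_steps(ones(4), zeros(4), 3)
```

The swap is O(1) regardless of the array size, whereas a `copy!` would touch every element once more per iteration.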
> For reasons that can be explained later, this helps the compiler make
> use of the SIMD instructions, and I could get a 10x speedup in the
> part outside FFTW. One problem with this is that FFTW doesn't have very
> good support for this representation[6], so you would need to either
> copy between the two representations when doing the FFT, or change
> the storage a little bit (see @simonster's comment in this issue) to
> make it possible to run FFTW on it directly (which is what the first
> 10 lines of the code in [1] are for).
>
> To answer a few of your other questions that haven't been addressed already:
>
> > When you said "temporary arrays", did you mean that I should use fewer
> > global variables? Yes, I can write arrays such as x, px only in the
> > functions Potential and Ko, but that will make the code less
> > understandable.
>
> No. I'm talking about things like `a .*= b` currently being `a = a .* b`,
> so it has to compute `a .* b`, which needs to allocate an array.
>
> > When starting the code, I used Ks=2^11, but when I finally plotted the ϕ
> > function, I got an error, and it seems that I misunderstood the linspace
> > function (I wrote linspace(-pm,pm,Ks) at first), so I changed it to 2^11+1.
> > It's a bad choice, and now I realize it. When using Ks=2^11-1, it only takes
> > 81 seconds for the code to run, which is slightly faster than the Matlab
> > version.
>
> I think this is mentioned in the doc somewhere, but I think it's not
> really emphasized enough for people who need it, and I can't find it
> now.... (I also learnt this the hard way....)
>
> > I have already declared the variables ϕ and Pl outside the loop, and in
> > my opinion, they are pre-allocated and "should" be manipulated in place. To
> > be honest, maybe I don't really understand the words "in place" and
> > "pre-allocated".
>
> See `a .*= b` above; it's not in place, unfortunately. We have an issue
> for this[2].
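The in-place loop the thread keeps recommending in place of `a .*= b` can be made concrete with a short sketch. (In current Julia, dot-fusion makes `a .*= b` itself non-allocating, but the explicit loop below is what the 2015-era advice amounts to and is equivalent either way.)

```julia
# Multiply `a` by `b` element-wise, reusing a's memory: no temporary
# array is allocated, unlike the old `a = a .* b` lowering.
function inplace_mul!(a::AbstractArray, b::AbstractArray)
    size(a) == size(b) || throw(DimensionMismatch("a and b must match"))
    for i in eachindex(a, b)
        a[i] *= b[i]
    end
    return a
end

a = [1.0, 2.0, 3.0]
b = [10.0, 10.0, 10.0]
inplace_mul!(a, b)
```

After the call, `a` holds the products and no intermediate array was created; this is exactly the pattern `for i in eachindex(ϕ); ϕ[i] *= ko[i]; end` used in Pablo's version below.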
> I don't find it to be too much of an issue, since writing the
> loop isn't too much harder and it is limited to a single operation
> anyway.
>
> > But I don't understand why the assignment ϕ .*= ko2 is unnecessary.
> > Are you saying that I should rewrite it as scale(ϕ, ko2)? I'll have a try.
>
> This assignment is necessary. What I meant is the Pl * ϕ one. I
> agree the new DFT API is a little bit unintuitive for in-place
> operations. Fortunately, having the assignment only has a tiny
> performance hit, so it doesn't matter too much.
>
> [1] https://github.com/yuyichao/explore/blob/e32d7c19b7663526cc8a6a4697b337ad9132c16a/julia/complex_array/improved.jl
> [2] https://github.com/JuliaLang/julia/issues/249
> [3] https://github.com/JuliaLang/julia/pull/12087#issuecomment-120127518
> [4] https://github.com/JuliaLang/julia/issues/11899
> [5] https://github.com/nacs-lab/yyc-data/blob/512a4e3f9d353274c24c353f401121afa81ec981/calculations/single-atom/test/heating-tests/two-level-1d.jl
> [6] https://github.com/JuliaLang/julia/issues/12398
>
> > On Tuesday, December 15, 2015 at 7:34:55 PM UTC+8, Pablo Zubieta wrote:
> >>
> >> Stefan said that your code is well typed (type stable). Have you tried
> >> Yichao's suggestions?
> >>
> >> Here is the function ground() taking them into account:
> >>
> >> function ground()
> >>     R = 320.
> >>     Ks = 2^11 - 1
> >>     x = linspace(-R, R, Ks+1)
> >>     dx = 2R/Ks
> >>     dt = 0.1
> >>     pm = 2π/2dx
> >>     px = circshift( linspace(-pm, pm, Ks+1), round(Int64, Ks/2) )
> >>     FFTW.set_num_threads(8)
> >>
> >>     V = Float64[Potential(ix, iy) for ix in x, iy in x]
> >>     ko = Float64[e^(-dt/4 * Ko(ipx, ipy)) for ipx in px, ipy in px]
> >>     ko2 = ko.^2
> >>     vo = Float64[e^(-dt * Potential(ix, iy)) for ix in x, iy in x]
> >>     ϕ = Array(Complex128, (Ks+1, Ks+1))
> >>     Pl = plan_fft!(ϕ; flags=FFTW.MEASURE)
> >>     for i in eachindex(ϕ); ϕ[i] = V[i]; end  # Conversion is automatic here
> >>
> >>     invnormphi = 1. / (sqrt(sumabs2(ϕ))*dx)
> >>     scale!(ϕ, invnormphi)
> >>
> >>     Pl * ϕ  # No need to assign to ϕ
> >>     for i in eachindex(ϕ); ϕ[i] *= ko[i]; end
> >>     Pl \ ϕ
> >>
> >>     ϕ′ = similar(ϕ)
> >>
> >>     Δϕ = 1.
> >>     nstep = 0
> >>     println("start loop")
> >>     while Δϕ > 1.e-15
> >>         copy!(ϕ′, ϕ)
> >>         for i in eachindex(ϕ); ϕ[i] *= vo[i]; end
> >>
> >>         invnormphi = 1. / (sqrt(sumabs2(ϕ))*dx)
> >>         scale!(ϕ, invnormphi)
> >>
> >>         Pl * ϕ
> >>         for i in eachindex(ϕ); ϕ[i] *= ko2[i]; end
> >>         Pl \ ϕ
> >>         # if we check Δϕ every step, 35 s is needed.
> >>         if nstep > 500
> >>             Δϕ = maxabs(ϕ - ϕ′)
> >>         end
> >>         nstep += 1
> >>         if mod(nstep, 200) == 0
> >>             print(nstep, " ")
> >>             println("Δϕ=", Δϕ)
> >>         end
> >>     end
> >>     for i in eachindex(ϕ); ϕ[i] *= vo[i]; end
> >>
> >>     Pl * ϕ
> >>     for i in eachindex(ϕ); ϕ[i] *= ko[i]; end
> >>     Pl \ ϕ
> >>
> >>     invnormphi = 1. / (sqrt(sumabs2(ϕ))*dx)
> >>     scale!(ϕ, invnormphi)
> >> end
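The linspace off-by-one discussed above is just interval counting: n sample points span n - 1 intervals, so `linspace(-pm, pm, Ks)` has spacing 2pm/(Ks-1), not 2pm/Ks. A minimal sketch (written against current Julia, where `range(...; length=n)` plays the role the old `linspace` did):

```julia
# 5 points over [-1, 1] give 4 intervals of width 0.5 each;
# asking for Ks points instead of Ks+1 changes the grid spacing.
xs = range(-1.0, stop = 1.0, length = 5)
@assert step(xs) == 0.5
@assert collect(xs) == [-1.0, -0.5, 0.0, 0.5, 1.0]

# Choosing Ks = 2^11 - 1 makes the array length Ks + 1 = 2^11, a power
# of two -- the sizes FFTW transforms fastest, which is why the Ks=2^11-1
# run above beat the Ks=2^11+1 one.
Ks = 2^11 - 1
@assert Ks + 1 == 2^11
```

FFTW is fastest for lengths that factor into small primes, with pure powers of two the best case; a length of 2^11 + 1 = 2049 = 3 × 683 hits a much slower code path.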
