Re: [julia-users] Re: A Julia code slower than Matlab, can it be faster?

Yichao Yu Tue, 15 Dec 2015 06:11:33 -0800

On Tue, Dec 15, 2015 at 7:42 AM, 博陈 <[email protected]> wrote:
> The code get great performance improvement. My original version cost 133 s
> and about 166 G memories, while your version cost 61 s and about 423M
> memories!
>
>


My version is here[1] (ignore the unused first 12 lines, get to that later).
It's pretty much what Pablo has already done. Only minor improvement is that

1. Swap the two buffers directly instead of doing a extra copy (the
copy is merged into the first scaling)
2. Do not use maxabs2 since that function is allocating (which is
probably a bug that should be fixed, or wait for jb/function branch
;-p)
3. I didn't change Ks to Ks - 1 in my version since I'm not sure which
one are you using for the matlab version.

Depending on the machine I benchmark it on, I see a 10-20% performance
improvement. The CPU utilization is ~75-80% on all cores so that's
about the maximum performance improvement you could hope for without
further optimize fftw =)

If you are benchmarking the matlab version with the same Ks, here's a
few guess why the julia version could still be ~20% slower and also a
few other improvements that I've tried

1. Matlab might be using threads for these operations Julia currently
can't. We are working hard to make that possible (look for the
`multi-threading` issue/pr tag on github)
2. Matlab might be storing the array in a different format (or
otherwise makes it easier to use the CPU SIMD instructions)

    This could be fixed with compiler improvements[4].
    Another way to solve this is to change the format of the Array.
This is the way I was using (I'm using a much smaller array so the
time I spend outside fftw is much more important for me.) The code I
has can be find here[5]. In short, it uses the StructsOfArrays package
to store the real and imaginary part of the array separately. For
reasons that can be explained later, this helps the compiler to make
use of the SIMD instructions and I could get a 10x speed up in the
part outside fftw. One problem with this is that FFTW doesn't have too
good support for this representation[6] so you would need to either
copy it between the two representations when doing the FFT or change
the storage a little bit (see @simonster's comment in this issue) to
make it possible to run FFTW on it directly (which is what the first
10lines of the code in [1] is)


To answer a few of your other questions that haven't be address already:

> When you said "temporary arrays", did you mean that I should use less global 
> variables? Yes, I can write arrays such as x, px only in the functions 
> Potential and Ko, but that will make the code less understandable.

No. I'm talking about things like `a .* = b` is currently `a = a .* b`
and so it has to compute `a .* b` which needs to allocate an array.

> When starting the code, I use Ks=2^11, but when I finally plot the \phi 
> function, I get an error, and it seems that I misunderstood the linspace 
> function (I wrote linspace(-pm,pm,Ks) at first), so I changed it to 2^11+1. 
> It's a bad choice, and now I realize it. When use Ks=2^11-1, it only takes 81 
> seconds for the code to run, which is slightly faster than the Matlab version.

I think this is mentioned in the doc somewhere but I think it's not
really emphasized enough for people who needs it and I can't find it
now.... (I also learnt this the hard way....)

> I have already stated the variables \phi and Pl outside the loop, and in my 
> opinion, they are pre-allocated and "should" be manipulated in place. To be 
> honest, maybe I don't really understand the words "in place" and 
> "pre-allocated".

See `a .*= b` above, it's not inplace, unfortunately. We have an issue
for this[2]. I don't find it to be too much an issue since writing the
loop isn't too much harder and it is limitted to a single operation
anyway.

> But I don't understand why the assignment \phi .*=ko2 is unnecessary.  Are 
> you saying that I should rewrite them as scale(\phi, ko2)? I'll have a try.

This assignment is necessary. What I meant is the Pl * \phi one. I
agree the new dft api is a little bit unintuitive for in place
operation. Fortunately, having the assignment only has a tiny
performance hit so it dosen't matter too much.

[1] 
https://github.com/yuyichao/explore/blob/e32d7c19b7663526cc8a6a4697b337ad9132c16a/julia/complex_array/improved.jl
[2] https://github.com/JuliaLang/julia/issues/249
[3] https://github.com/JuliaLang/julia/pull/12087#issuecomment-120127518
[4] https://github.com/JuliaLang/julia/issues/11899
[5] 
https://github.com/nacs-lab/yyc-data/blob/512a4e3f9d353274c24c353f401121afa81ec981/calculations/single-atom/test/heating-tests/two-level-1d.jl
[6] https://github.com/JuliaLang/julia/issues/12398

>
> 在 2015年12月15日星期二 UTC+8下午7:34:55，Pablo Zubieta写道：
>>
>> Stefan said that your code is well type (type stable). Have you tried
>> Yichao's suggestions?
>>
>>
>> Here is the function ground() taking them into account:
>>
>> function ground()
>>     R = 320.
>>     Ks = 2^11 - 1
>>     x = linspace(-R, R, Ks+1)
>>     dx = 2R/Ks
>>     dt = 0.1
>>     pm = 2π/2dx
>>     px = circshift( linspace(-pm, pm, Ks+1),  round(Int64, Ks/2) )
>>     FFTW.set_num_threads(8)
>>
>>
>>     V = Float64[Potential(ix, iy)  for ix in x, iy in x]
>>     ko = Float64[e^(-dt/4* (Ko(ipx,ipy))) for ipx in px, ipy in px]
>>     ko2 = ko.^2
>>     vo = Float64[e^(-dt* (Potential(ix,iy))) for ix in x, iy in x]
>>     ϕ = Array(Complex128,(Ks+1,Ks+1))
>>     Pl = plan_fft!(ϕ; flags=FFTW.MEASURE)
>>     for i in eachindex(ϕ); ϕ[i] = V[i]; end # Conversion is automatic here
>>
>>
>>     invnormphi = 1. / (sqrt(sumabs2(ϕ))*dx)
>>     scale!(ϕ, invnormphi)
>>
>>
>>     Pl * ϕ # No need to assign to ϕ
>>     for i in eachindex(ϕ); ϕ[i] *= ko[i]; end
>>     Pl \ ϕ
>>
>>     ϕ′ = similar(ϕ)
>>
>>     Δϕ = 1.
>>     nstep = 0
>>     println("start loop")
>>     while Δϕ > 1.e-15
>>         copy!(ϕ′, ϕ)
>>         for i in eachindex(ϕ), ϕ[i] *= vo[i]; end
>>
>>         invnormphi = 1. / (sqrt(sumabs2(ϕ))*dx)
>>         scale!(ϕ, invnormphi)
>>
>>
>>         Pl * ϕ
>>         for i in eachindex(ϕ); ϕ[i] *= ko2[i]; end
>>         Pl \ ϕ
>> # if check  Δϕ for every step, 35s is needed.
>>         if nstep>500
>>             Δϕ = maxabs(ϕ-ϕ′)
>>         end
>>         nstep += 1
>>         if mod(nstep,200)==0
>>             print(nstep,"  ")
>>             println("Δϕ=",Δϕ)
>>         end
>>
>>     end
>>     for i in eachindex(ϕ); ϕ[i] *= vo[i]; end
>>
>>     Pl * ϕ
>>     for i in eachindex(ϕ); ϕ[i] *= ko[i]; end
>>     Pl \ ϕ
>>
>>
>>     invnormphi = 1. / (sqrt(sumabs2(ϕ))*dx)
>>     scale!(ϕ, invnormphi)
>> end
>>
>>
>>
>

Re: [julia-users] Re: A Julia code slower than Matlab, can it be faster?

Reply via email to