Intel compilers won't help, because your Julia code is being compiled by LLVM.

It's still hard to tell what's up from what you've shown us. When you run 
@time, does it allocate any memory? (You still have global variables in there, 
but maybe you made them const.)
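For reference, here's a minimal illustration (hypothetical arrays, not your code) of why a non-const global shows up as allocations:

```julia
# Hypothetical arrays, just to show the effect: accessing a non-const
# global is type-unstable, so every loop iteration boxes intermediate values.
nonconst_data = zeros(10_000)     # plain global: type unknown to the compiler
const const_data = zeros(10_000)  # const global: type is fixed

function sum_nonconst()
    s = 0.0
    for x in nonconst_data        # x is inferred as Any here
        s += x
    end
    return s
end

function sum_const()
    s = 0.0
    for x in const_data           # x is inferred as Float64 here
        s += x
    end
    return s
end

sum_nonconst(); sum_const()       # warm up so compilation isn't measured
nonconst_bytes = @allocated sum_nonconst()
const_bytes    = @allocated sum_const()
# const_bytes is 0; nonconst_bytes is large (boxing on every iteration)
```

If @time on your real loop reports nontrivial allocations, that's the first thing to chase down.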

But you can save yourself two extra passes through the arrays (and the cache 
misses they cause) by putting
    T[i-1,j-1,k-1] += RHS[i-1,j-1,k-1]
inside the first loop and discarding the second loop (except for cleaning up 
the edges). Fortran may be doing this automatically for you? 
http://en.wikipedia.org/wiki/Polytope_model
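To make the idea concrete, here's a sketch of the fused loop (with small illustrative sizes and constants, not your program's). The point is that by the time the loop reaches (i,j,k), no later stencil evaluation reads T[i-1,j-1,k-1], so its update can be applied in the same pass:

```julia
# Illustrative sizes/constants; substitute your own.
const NX = 16; const NY = 16; const NZ = 16
const dt = 0.1; const A = 1.0
const dx2 = 1.0; const dy2 = 1.0; const dz2 = 1.0
const T   = rand(NX, NY, NZ)
const RHS = zeros(NX, NY, NZ)

function fused_step!()
    @inbounds for k = 2:NZ-1, j = 2:NY-1, i = 2:NX-1
        RHS[i,j,k] = dt*A*( (T[i-1,j,k]-2*T[i,j,k]+T[i+1,j,k])/dx2 +
                            (T[i,j-1,k]-2*T[i,j,k]+T[i,j+1,k])/dy2 +
                            (T[i,j,k-1]-2*T[i,j,k]+T[i,j,k+1])/dz2 )
        if i > 2 && j > 2 && k > 2
            # (i-1,j-1,k-1) is fully behind the stencil front: every
            # point that reads it has already been computed.
            T[i-1,j-1,k-1] += RHS[i-1,j-1,k-1]
        end
    end
    # Edge cleanup: interior points the shifted update never reached.
    @inbounds for k = 2:NZ-1, j = 2:NY-1, i = 2:NX-1
        if i == NX-1 || j == NY-1 || k == NZ-1
            T[i,j,k] += RHS[i,j,k]
        end
    end
end
```

This produces the same result as the two separate loops, but touches each cache line once per time step instead of twice for the bulk of the domain.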

--Tim

On Tuesday, April 28, 2015 07:14:40 AM Ángel de Vicente wrote:
> Hi Tim,
> 
> On Tuesday, April 28, 2015 at 2:53:45 PM UTC+1, Tim Holy wrote:
> > Before deciding that the compiler is the answer...profile. Where is the
> > bottleneck?
> 
> well, the code now runs quite fast (double the time it takes for my Fortran
> version), after following the suggestions made in this thread. Basically
> there is only one function in the code, so the bottleneck has to be there
> :-), but I'm not sure I can do anything else to improve its performance.
> 
> The relevant part of the code is:
> 
> const T = zeros(Float64,NX,NY,NZ)
> const RHS = zeros(Float64,NX,NY,NZ)
> 
> [...]
> 
> function main_loop()
>     for n = 0:NT-1
>         @inbounds for k=2:NZ-1, j=2:NY-1, i=2:NX-1
>             RHS[i,j,k] = dt*A*( (T[i-1,j,k]-2*T[i,j,k]+T[i+1,j,k])/dx2 +
>                                 (T[i,j-1,k]-2*T[i,j,k]+T[i,j+1,k])/dy2 +
>                                 (T[i,j,k-1]-2*T[i,j,k]+T[i,j,k+1])/dz2 )
>         end
> 
>         @inbounds for k=2:NZ-1, j=2:NY-1, i=2:NX-1
>             T[i,j,k] = T[i,j,k] + RHS[i,j,k]
>         end
>     end
> end
> 
> Trying to get Julia compiled with the Intel compilers was just to see if I
> could squeeze a bit more performance out of it, but certainly I would also
> appreciate any suggestions on how to speed up my existing Julia code.
> 
> Thanks,
> Ángel de Vicente
