This makes sense. Thank you! It seems that changing the order of the nested loops also helps, because Julia stores arrays in column-major order (note that const T = zeros(Float64,NX,NY,NZ)):
for k=2:NZ-1, j=2:NY-1, i=2:NX-1 : 2 sec
for i=2:NX-1, j=2:NY-1, k=2:NZ-1 : 3 sec

On Sunday, April 26, 2015 at 6:28:38 AM UTC-4, Kristoffer Carlsson wrote:
>
> If you don't have the const you pay a cost every time you *access* the
> variable (you need to unbox it).
>
> Now, with the original slicing there are not that many *accesses* to T and
> RHS and the cost is thus dominated by the overhead of the slicing operation.
>
> When you are using a triple nested for loop you are accessing the variables
> T and RHS much more frequently, so this time the const in front of T
> and RHS is much more important, since you have to pay the unboxing "fee"
> every loop iteration.
>
> Hope that made sense.
>
> // Kristoffer
>
> On Sunday, April 26, 2015 at 3:33:26 AM UTC+2, Pooya wrote:
>>
>> Ah! I am also a newbie, and am trying to get a sense for efficiency
>> improvement in Julia, and how it works, so I tried a few combinations
>> of the suggestions here, and this doesn't make sense to me. Putting const in
>> front of T and RHS in the original code does not change running time much,
>> but it is critical for the performance of the following nested loops. With
>> nested loops, without const, running time is much more than the original
>> code. The following are the exact run times on my computer (for 100 time
>> steps NT = 100):
>>
>> original code: 21 sec
>> original code with const in front of T and RHS: 20 sec
>> nested loops w/out const in front of T and RHS: 200 sec
>> nested loops with const in front of T and RHS: 3 sec
>>
>> Also, @inbounds does not have much effect in any of the above cases, just
>> a little! Any thoughts are appreciated!
>>
>>
>> On Saturday, April 25, 2015 at 4:45:45 PM UTC-4, Kristoffer Carlsson
>> wrote:
>>>
>>> It is definitely the slicing that is killing performance. Right now,
>>> slicing is expensive since you copy a whole new array for it.
>>>
>>> Putting const in front of T and RHS and using a loop like this (maybe
>>> some mistake but the principle is what is important) makes the code 10x
>>> faster:
>>>
>>> @inbounds for i=2:NX-1, j=2:NY-1, k=2:NZ-1
>>>     RHS[i,j,k] = dt*A*( (T[i-1,j,k]-2*T[i,j,k]+T[i+1,j,k])/dx2 +
>>>                         (T[i,j-1,k]-2*T[i,j,k]+T[i,j+1,k])/dy2 +
>>>                         (T[i,j,k-1]-2*T[i,j,k]+T[i,j,k+1])/dz2 )
>>>
>>>     T[i,j,k] = T[i,j,k] + RHS[i,j,k]
>>> end
>>>
>>>
>>> On Saturday, April 25, 2015 at 7:57:01 PM UTC+2, Johan Sigfrids wrote:
>>>>
>>>> I think it is all the slicing that is killing the performance. Maybe
>>>> something like arrayviews or the new sub stuff on 0.4 would help.
>>>> Alternatively devectorizing into a bunch of nested loops.
>>>>
>>>> On Saturday, April 25, 2015 at 8:42:09 PM UTC+3, Stefan Karpinski wrote:
>>>>>
>>>>> Stick const in front of T and RHS.
>>>>>
>>>>> On Sat, Apr 25, 2015 at 11:32 AM, Tim Holy wrote:
>>>>>
>>>>>> Did you read through
>>>>>> http://docs.julialang.org/en/release-0.3/manual/performance-tips/?
>>>>>> You should memorize :-) the sections up through the Tools section;
>>>>>> the rest you can consult as you discover you need them.
>>>>>>
>>>>>> --Tim
>>>>>>
>>>>>> On Saturday, April 25, 2015 01:03:38 AM Ángel de Vicente wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > a complete Julia newbie here... I spent a couple of days learning
>>>>>> > the syntax and main aspects of Julia, and since I heard many good
>>>>>> > things about it, I decided to try a little program to see how it
>>>>>> > compares against the other ones I regularly use: Fortran and Python.
>>>>>> >
>>>>>> > I wrote a minimal program to solve the 3D heat equation in a cube of
>>>>>> > 100x100x100 points in the three languages and the time it takes to
>>>>>> > run in each one is:
>>>>>> >
>>>>>> > Fortran: ~7s
>>>>>> > Python: ~33s
>>>>>> > Julia: ~80s
>>>>>> >
>>>>>> > The code runs for 1000 iterations, and I'm being nice to Julia,
>>>>>> > since the programs in Fortran and Python write 100 HDF5 files with
>>>>>> > the complete 100^3 data (every 10 iterations).
>>>>>> >
>>>>>> > I attach the code (and you can also get it at:
>>>>>> > http://pastebin.com/y5HnbWQ1)
>>>>>> >
>>>>>> > Am I doing something obviously wrong? Any suggestions on how to
>>>>>> > improve its speed?
>>>>>> >
>>>>>> > Thanks a lot,
>>>>>> > Ángel de Vicente
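For anyone who wants to try the loop-order comparison from the top of this message, here is a minimal sketch, not the code from the pastebin. The function names, the 64^3 grid, and the coefficient values are illustrative assumptions, and c stands in for dt*A. Unlike the loop quoted above, the sketch applies the T update in a separate pass, so both loop orders give exactly the same result. Passing T and RHS as function arguments also sidesteps the const question, because the arrays are then no longer accessed as globals inside the loops.

```julia
# Sketch only: function names, grid size, and coefficients are assumptions.

# Fill RHS on the interior points with i (the first index) innermost,
# so memory is walked contiguously in Julia's column-major layout.
function rhs_kji!(RHS, T, c, dx2, dy2, dz2)
    NX, NY, NZ = size(T)
    @inbounds for k in 2:NZ-1, j in 2:NY-1, i in 2:NX-1
        RHS[i,j,k] = c * ((T[i-1,j,k] - 2*T[i,j,k] + T[i+1,j,k]) / dx2 +
                          (T[i,j-1,k] - 2*T[i,j,k] + T[i,j+1,k]) / dy2 +
                          (T[i,j,k-1] - 2*T[i,j,k] + T[i,j,k+1]) / dz2)
    end
    return RHS
end

# Same computation with k (the last index) innermost, so consecutive
# iterations jump through memory with a large stride.
function rhs_ijk!(RHS, T, c, dx2, dy2, dz2)
    NX, NY, NZ = size(T)
    @inbounds for i in 2:NX-1, j in 2:NY-1, k in 2:NZ-1
        RHS[i,j,k] = c * ((T[i-1,j,k] - 2*T[i,j,k] + T[i+1,j,k]) / dx2 +
                          (T[i,j-1,k] - 2*T[i,j,k] + T[i,j+1,k]) / dy2 +
                          (T[i,j,k-1] - 2*T[i,j,k] + T[i,j,k+1]) / dz2)
    end
    return RHS
end

# Add RHS to T in a second pass, so the result does not depend on the
# order in which the interior points were visited.
function apply!(T, RHS)
    NX, NY, NZ = size(T)
    @inbounds for k in 2:NZ-1, j in 2:NY-1, i in 2:NX-1
        T[i,j,k] += RHS[i,j,k]
    end
    return T
end

N = 64                                   # illustrative grid size
T = rand(N, N, N)
RHS = zeros(N, N, N)
c, dx2, dy2, dz2 = 1e-3, 1.0, 1.0, 1.0   # c plays the role of dt*A

rhs_kji!(RHS, T, c, dx2, dy2, dz2)       # warm-up call, so compilation is not timed
@time rhs_kji!(RHS, T, c, dx2, dy2, dz2)
rhs_ijk!(RHS, T, c, dx2, dy2, dz2)       # warm-up call
@time rhs_ijk!(RHS, T, c, dx2, dy2, dz2)
apply!(T, RHS)
```

The size of the gap between the two timings will depend on the machine and on the grid size, so treat the 2 sec vs. 3 sec figures above as one data point rather than a rule.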
