The code is really short:

N = 2000000

function doit1(N)
    Fe = zeros(3,1); Ns = zeros(3,1)+1.0; f = -6.0; Jac = 1.0
    for i = 1:N
        Fe += Ns * (f * Jac)
    end
    Fe
end

function doit2(N)
    Fe = zeros(3,1); Ns = zeros(3,1)+1.0; f = -6.0; Jac = 1.0
    for i = 1:N
        Fe += Ns .* ([f] * Jac)
    end
    Fe
end

function doit3(N)
    Fe = zeros(3,1); Ns = zeros(3,1)+1.0; f = -6.0; Jac = 1.0
    fs = [1.0]
    for i = 1:N
        fs = ([f] * Jac)
        Fe += Ns .* fs
    end
    Fe
end

function doit4(N)
    Fe = zeros(3,1); Ns = zeros(3,1)+1.0; f = -6.0; Jac = 1.0
    fs = [1.0]
    for i = 1:N
        fs = [f]; fs *= Jac
        Fe += Ns .* fs
    end
    Fe
end

@time doit1(N)
@time doit2(N)
@time doit3(N)
@time doit4(N)

Here are the measurements on my machine:

elapsed time: 0.461173619 seconds (384129944 bytes allocated, 44.45% gc time)
elapsed time: 6.905249901 seconds (1904151036 bytes allocated, 12.84% gc time)
elapsed time: 6.862259146 seconds (1904166072 bytes allocated, 13.23% gc time)
elapsed time: 6.842678542 seconds (1904166256 bytes allocated, 12.81% gc time)

I'll be grateful for any pointers on how to structure the code so that the array operations do not incur this horrendous penalty.

Petr

On Wednesday, December 10, 2014 7:06:43 PM UTC-8, Tim Holy wrote:
>
> Can you post short but complete code examples in a gist?
> https://gist.github.com/
> That would make it easier to follow than chasing code examples across long
> email chains.
>
> --Tim
>
> On Wednesday, December 10, 2014 02:35:38 PM Petr Krysl wrote:
> > I don't know if this is correct, but here is a guess:
> >
> > Option 3 still requires a temp array (to hold the result of the
> > parenthesized expression fs = ([f] * Jac * w[j])), and option 4
> > eliminates that temp. The cost of the temp over the 2 million loops is
> > ~200MB and 0.6 sec CPU time. So WHY is the difference between 1 and 2
> > so HUUUGE?
> >
> > I think this calls for someone who wrote the compiler. Guys?
> >
> > Thanks a bunch,
> >
> > P
> >
> > On Wednesday, December 10, 2014 12:51:26 PM UTC-8, Petr Krysl wrote:
> > > Actually, option (4) was also tested:
> > > # 16.333766821 seconds (3008899660 bytes
> > > fs[1] = f; fs *= (Jac * w[j]);
> > > Fe += Ns[j] .* fs;
> > >
> > > So, allocation of memory was reduced somewhat; runtime not so much.
> > >
> > > On Wednesday, December 10, 2014 12:45:20 PM UTC-8, Petr Krysl wrote:
> > >> Well, a temporary array was also on my mind. However, I believe
> > >> things are a little more complicated.
> > >>
> > >> Here is the code with three timed options. As you can see, the first
> > >> two options are the fast one (multiplication by a scalar) and the
> > >> slow one (multiplication by a one-by-one matrix). In the third option
> > >> I tried to avoid the creation of an ad hoc temporary by allocating a
> > >> variable outside of the loop. The effect, unfortunately, is nil.
> > >>
> > >> fs = [0.0]  # Used only for option (3)
> > >> # Now loop over all fes in the block
> > >> for i = 1:size(conns,1)
> > >>     ...
> > >>     for j = 1:npts
> > >>         ...
> > >>         # Option (1): 7.193767019 seconds (1648850568 bytes
> > >>         # Fe += Ns[j] * (f * Jac * w[j]);
> > >>
> > >>         # Option (2): 17.301214583 seconds (3244458368 bytes
> > >>         # Fe += Ns[j] .* ([f] * Jac * w[j]);
> > >>
> > >>         # Option (3): 16.943314075 seconds (3232879120
> > >>         fs = ([f] * Jac * w[j]); Fe += Ns[j] .* fs;
> > >>     end
> > >>     ...
> > >> end
> > >>
> > >> What do you think? Why is the code still getting hit with a big
> > >> performance/memory penalty?
> > >>
> > >> Thanks,
> > >>
> > >> Petr
> > >>
> > >> On Monday, December 8, 2014 2:03:02 PM UTC-8, Valentin Churavy wrote:
> > >>> I would think that when f is a 1x1 matrix Julia is allocating a new
> > >>> 1x1 matrix to store the result. If it is a scalar that allocation
> > >>> can be skipped. When this part of the code is now in a hot loop it
> > >>> might happen that you allocate millions of very small short-lived
> > >>> objects, and that taxes the GC quite a lot.
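For reference, here is one way the update in the hot loop can be restructured so that it allocates nothing per iteration, following the diagnosis above: keep f * Jac as a plain scalar (never wrap it in a 1x1 array) and update Fe element by element in place instead of via Fe += ..., which builds a fresh array each time. This is a sketch, not code from the thread; the function name doit_noalloc is my own, and it uses plain vectors rather than the 3x1 matrices of the original.

```julia
function doit_noalloc(N)
    Fe = zeros(3)          # preallocated result, reused across iterations
    Ns = ones(3)
    f = -6.0
    Jac = 1.0
    for i = 1:N
        s = f * Jac        # plain scalar: no 1x1 matrix is created here
        for k = 1:3
            Fe[k] += Ns[k] * s   # in-place update: no temporary vector,
        end                      # unlike Fe += Ns * s, which allocates
    end
    Fe
end
```

The same idea applies to the finite element assembly loop quoted above: compute the scalar weight once per quadrature point and accumulate into the preallocated Fe with indexed, in-place updates, so the garbage collector never sees the millions of short-lived temporaries that options (2)-(4) produce.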