Ah, thanks for that! I did put the gc() inside the @time intentionally, so that each test would include the time to clean up after itself.
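[Editor's note: a minimal sketch of the measurement pattern being described, for readers following along. `work!`, the buffer size, and the trial count are made up for illustration, and it is written for current Julia, where the 2014-era `gc()` is now spelled `GC.gc()`.]

```julia
# Preallocate the buffer once, outside the timed region.
buf = zeros(Float64, 1000)

# Mutate the buffer in place; no per-call allocation.
function work!(buf)
    for i in eachindex(buf)
        buf[i] += 1.0
    end
    return buf
end

# Putting GC.gc() inside the @time block makes each test pay for its
# own cleanup, so the reported time includes garbage collection.
@time begin
    for _ in 1:10_000
        work!(buf)
    end
    GC.gc()
end
```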
I still find it a bit tricky to reason about what can and can't be optimized, but I'm getting better at writing performance tests, so that if I make changes later that cause a bunch of new allocation, it should get flagged when I run my unit tests.

peace,
s

On Wed, Jun 25, 2014 at 3:53 PM, Simon Kornblith <si...@simonster.com> wrote:

> There shouldn't be any allocation in the last 3 calls. There is only
> allocation because, at the top of buftest(), you define:
>
> _ = with_alloc(cont)
> _ = without_alloc(cont)
> _ = without_alloc_with_temp(cont)
> _ = using_parameter(cont, buf)
>
> This doesn't actually introduce uncertainty about the types, but it does
> deoptimize things a bit: it causes Julia to box _ when you use it as an
> iteration variable later in buftest(), which is the source of the
> allocation. Simply deleting the LHS here removes the allocation in your
> tests of those three functions, although it doesn't make much of a
> difference in performance.
>
> Also note that you put gc() within the @time block, which may not be what
> you want. With the change above, there isn't really a point to calling gc()
> at all, since only with_alloc allocates.
>
> Simon
>
> On Wednesday, June 25, 2014 3:24:38 PM UTC-4, Spencer Russell wrote:
>
>> Mystery solved. In #3 I was missing the indexing `[i]`, so I was adding a
>> constant to the whole array instead of just incrementing that item each
>> time.
>>
>> My final results were (note I'm doing way more trials now to offset the
>> GC time):
>>
>> ```
>> Doing 100000 Iterations of each...
>> -------
>> Timing with allocation each call
>> elapsed time: 0.474479476 seconds (417591824 bytes allocated, 42.90% gc time)
>> Timing without allocation each call
>> elapsed time: 0.35568662 seconds (1591824 bytes allocated, 2.88% gc time)
>> Timing without allocation using a temp buffer each call
>> elapsed time: 0.18984845 seconds (1591824 bytes allocated, 5.64% gc time)
>> Timing passing array as a parameter
>> elapsed time: 0.173606393 seconds (1591824 bytes allocated, 6.03% gc time)
>> ```
>>
>> So the really good news is that assigning the object field to a temp
>> variable before my tight loop gives about a 2x speed increase, without
>> modifying my method's API.
>>
>> Thanks all for the pointers!
>>
>> peace,
>> s
>>
>> On Wed, Jun 25, 2014 at 2:57 PM, Spencer Russell <s...@mit.edu> wrote:
>>
>>> Hi Thomas,
>>>
>>> Thanks for the pointer towards @time and the GC info.
>>>
>>> I also just realized I broke a golden performance rule in my test, which
>>> was referencing variables in global scope.
>>>
>>> Putting the whole test inside a function gives more reasonable results,
>>> in the sense that #2 and #4 do the exact same amount of allocation, and #2
>>> is a bit faster than #1, but not as fast as #4.
>>>
>>> ```
>>> Timing with allocation each call
>>> elapsed time: 0.043889071 seconds (41751824 bytes allocated)
>>> Timing without allocation each call
>>> elapsed time: 0.026565517 seconds (151824 bytes allocated)
>>> Timing without allocation using a temp buffer each call
>>> elapsed time: 29.461950105 seconds (42762391824 bytes allocated, 59.40% gc time)
>>> Timing passing array as a parameter
>>> elapsed time: 0.01580412 seconds (151824 bytes allocated)
>>> ```
>>>
>>> I'm still a bit surprised that #2 is that much slower than #4, since it
>>> seems like it's just another pointer dereference, and that #3 isn't a fix
>>> for that.
>>> peace,
>>> s
>>>
>>> On Wed, Jun 25, 2014 at 2:39 PM, Tomas Lycken <tomas....@gmail.com> wrote:
>>>
>>>> If you measure time using the @time macro instead of with tic()/toc(),
>>>> you also get information about memory allocation and garbage collection.
>>>> Doing that, I find
>>>>
>>>> Timing with allocation each call
>>>> elapsed time: 0.004325641 seconds (4167824 bytes allocated)
>>>> Timing without allocation each call
>>>> elapsed time: 0.53675596 seconds (98399824 bytes allocated, 7.60% gc time)
>>>> Timing without allocation using a temp buffer each call
>>>> elapsed time: 2.165323004 seconds (4309087824 bytes allocated, 54.22% gc time)
>>>> Timing passing array as a parameter
>>>> elapsed time: 0.001356721 seconds (7824 bytes allocated)
>>>>
>>>> so you can see that the third method is terribly memory-inefficient, both
>>>> allocating and garbage-collecting way more than any other method. The last
>>>> method is much faster since it barely allocates any new memory.
>>>>
>>>> // T
>>>>
>>>> On Wednesday, June 25, 2014 7:57:18 PM UTC+2, Spencer Russell wrote:
>>>>>
>>>>> I'm having some trouble understanding some performance issues. I put
>>>>> together a minimal example here:
>>>>>
>>>>> https://gist.github.com/ssfrr/8934c14d8d2703a3d203
>>>>>
>>>>> I had some methods that were allocating arrays on each call, which I
>>>>> figured wasn't very efficient.
>>>>>
>>>>> My first attempt to improve things was to allocate an array in my main
>>>>> container type on initialization, and then share that between function
>>>>> calls.
>>>>>
>>>>> Surprisingly (to me), this slowed things down by about 60x.
>>>>>
>>>>> I wondered if maybe this was because of the extra dereference to get
>>>>> the array (though the slowdown seemed too dramatic for that), so I saved
>>>>> the reference to the array in a temp variable before my tight loop.
>>>>>
>>>>> This slowed things down by an additional 7x (more surprises!).
>>>>> Passing the array as a parameter directly to each function invocation
>>>>> was by far the fastest, and was about 2x faster than my original that
>>>>> allocated each time. This approach complicates my interface somewhat,
>>>>> though, as now the caller needs to know how many work buffers the
>>>>> function might need, instead of baking that information into the type.
>>>>> I could probably solve this with a wrapper function, but I'd like to
>>>>> understand what's going on and whether there's some sort of
>>>>> type-inference thing I should clean up.
>>>>>
>>>>> Specifically, my questions are:
>>>>>
>>>>> 1. Why is accessing the array as a parameter so much faster than
>>>>> accessing the array through an object passed as a parameter? As far as
>>>>> I can tell the same type information is there.
>>>>> 2. Why does it slow things down so much to store the reference to
>>>>> the array in the beginning of the function and then access that in the
>>>>> tight loop?
>>>>>
>>>>> peace,
>>>>> s
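[Editor's note: for readers without the gist handy, the four strategies compared in the thread can be sketched roughly like this. The function names follow Simon's quoted snippet, but `Container` and the loop bodies are illustrative guesses at the gist's contents, written for current Julia (the 2014-era `type` keyword is now `struct`).]

```julia
struct Container
    buf::Vector{Float64}   # work buffer allocated once at construction
end

# 1. Allocate a fresh array on every call.
function with_alloc(cont::Container)
    tmp = zeros(Float64, length(cont.buf))
    for i in eachindex(tmp)
        tmp[i] += 1.0
    end
    return tmp
end

# 2. Reuse the array stored in the container's field.
function without_alloc(cont::Container)
    for i in eachindex(cont.buf)
        cont.buf[i] += 1.0
    end
    return cont.buf
end

# 3. Hoist the field into a local before the tight loop. (This is the
#    corrected form; the thread's original #3 was missing the `[i]`.)
function without_alloc_with_temp(cont::Container)
    tmp = cont.buf
    for i in eachindex(tmp)
        tmp[i] += 1.0
    end
    return tmp
end

# 4. Pass the work buffer as an explicit parameter.
function using_parameter(cont::Container, buf::Vector{Float64})
    for i in eachindex(buf)
        buf[i] += 1.0
    end
    return buf
end
```

As Simon's reply explains, calling these as `_ = with_alloc(cont)` and then reusing `_` as an iteration variable elsewhere in the same function forces `_` to be boxed; simply calling them without binding the result avoids that spurious allocation.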