I'm pretty sure I know why that is. When TBB runs a parallel loop it uses all 8 of my virtual cores (4 cores × 2 threads).
However, there is lots of overhead in other places. The actual computation (the monadic negation function in my test) took about a quarter to a fifth (140 ms per iteration) of the CPU time of the entire operation. The rest was spent on copying the array over and over again. After eliminating a lot of that, the actual computational loop was still just a half of the total time spent. The rest is spent outside the loop. The "temp" optimisation can't eliminate all copying, so what we can do to improve the parallelisability (is that a word?) of this operation would be to combine the loop that duplicates the array with the operation that manipulates it. This reduces overhead since there would only be one loop to parallelise instead of two. Regards, Elias On 15 Mar 2014 00:19, "David Lamkins" <da...@lamkins.net> wrote: > This is interesting. The parallel speedup on your machine using TBB is in > the same ballpark as on my machine using OpenMP, and they're both > delivering less than a 2:1 speedup. > > I informally ran some experiments two nights ago to try to characterize > the behavior. On my machine, with OpenMP #pragmas on the scalar loops, the > ratio of single-threaded to multi-threaded runtimes held stubbornly at > about 0.7 regardless of the size of the problem. I tried integer and float > data, addition and power, with ravels up to 100 million elements. (My > smallest test set was a million elements; I still need to try smaller sets > to see whether I can find a knee where the thread setup overhead dominates > and results in a runtime ratio greater than 1.) > > I'm not sure what this means, yet. I'd hoped to see some further > improvement as the ravel size increased, despite the internal > inefficiencies. TBH, I didn't find and annotate the copy loop(s); that > might have a lot to do with my results. (I did find and annotate another > loop in check_value(), though. Maybe parallelizing that will improve your > results.) I'm hoping that the poor showing so far isn't a result of memory > bandwidth limitations. > > I hope to spend some more time on this over the weekend. > > > P.S.: I will note that the nice part about using OpenMP is that there's no > hand-coding necessary. All you do is add #pragmas to your program; the > compiler takes care of the rewrites. > > > ---------- Forwarded message ---------- > >> From: "Elias Mårtenson" <loke...@gmail.com> >> To: "bug-apl@gnu.org" <bug-apl@gnu.org> >> Cc: >> Date: Fri, 14 Mar 2014 22:22:15 +0800 >> Subject: [Bug-apl] Performance optimisations: Results >> Hello guys, >> >> I've spent some time experimenting with various performance optimisations >> and I would like to share my latest results with you: >> >> I've run a lot of tests using Callgrind, which is part of the >> Valgrind<http://valgrind.org/>tool (documentation >> here <http://valgrind.org/docs/manual/cl-manual.html>). In doing so, >> I've concluded that a disproportionate amount of time is spent copying >> values (this can be parallelised; more about that below). >> >> I set out to see how much faster I could make simple test program that >> applied a monadic scalar function. Here is my test program: >> >> ∇Z←testinv;tmp >> src←10000 4000⍴÷⍳100 >> 'Starting' >> tmp←{--------⍵} time src >> Z←1 >> ∇ >> >> >> This program calls my time operator which simply shows the amount of >> time it took to execute the operation. This is of course needed for >> benchmarking. For completeness, here is the implementation of time: >> >> ∇Z←L (OP time) R;start;end >> start←⎕AI >> →(0≠⎕NC 'L')/twoargs >> Z←OP R >> →finish >> twoargs: >> Z←L OP R >> finish: >> end←⎕AI >> 'Time:',((end[3]+end[2]×1E6) - (start[3]+start[2]×1E6))÷1E6 >> ∇ >> >> >> The unmodified version of GNU APL runs this in *5037.00* milliseconds on >> my machine. >> >> I then set out to minimise the amount of cloning of values, taking >> advantage of the existing temp functionality. Once I had done this, the >> execution time was reduced to *2577.00* ms. >> >> I then used the Threading Building >> Blocks<https://www.threadingbuildingblocks.org/>library to parallelise two >> operations: The clone operation and the monadic >> SkalarFunction::eval_skalar_B(). After this, on my 4-core machine, the >> runtime was reduced to *1430.00* ms. >> >> Threading Building Blocks is available from the application repositories >> of at least Arch Linux and Ubuntu, and I'm sure it's available elsewhere >> too. To test in on OSX I had to download it separately. >> >> To summarise: >> >> - Standard: 5037.00 >> - Reduced cloning: 2577.00 >> - Parallel: 1430.00 >> >> I have attached the patch, but it's definitely not something that should >> be applied blindly. I have hacked around is several parts of the code, some >> of which I can't say I understand fully, so see it as a proof-of-concept, >> nothing else. >> >> Note that the code that implements the parallelism using TBB is pretty >> ugly, and the code ends up being duplicated in the parallel and >> non-parallel version. This can, of course, be encapsulated much nicer if >> one wants to make this generic. >> >> Another thing, TBB is incredibly efficient, especially on Intel CPU's. >> I'd be very interested to see how OpenMP performs on this same code. >> >> Regards, >> Elias >> >> > -- > "The secret to creativity is knowing how to hide your sources." > Albert Einstein > > > http://soundcloud.com/davidlamkins > http://reverbnation.com/lamkins > http://reverbnation.com/lcw > http://lamkins-guitar.com/ > http://lamkins.net/ > http://successful-lisp.com/ >