This is interesting. The parallel speedup on your machine using TBB is in the same ballpark as on my machine using OpenMP, and they're both delivering less than a 2:1 speedup.
I informally ran some experiments two nights ago to try to characterize the behavior. On my machine, with OpenMP #pragmas on the scalar loops, the ratio of multi-threaded to single-threaded runtimes held stubbornly at about 0.7 regardless of the size of the problem. I tried integer and float data, addition and power, with ravels up to 100 million elements. (My smallest test set was a million elements; I still need to try smaller sets to see whether I can find a knee where the thread-setup overhead dominates and pushes the runtime ratio above 1.)

I'm not sure what this means yet. I'd hoped to see some further improvement as the ravel size increased, despite the internal inefficiencies. TBH, I didn't find and annotate the copy loop(s); that might have a lot to do with my results. (I did find and annotate another loop in check_value(), though. Maybe parallelizing that will improve your results.) I'm hoping that the poor showing so far isn't a result of memory bandwidth limitations. I hope to spend some more time on this over the weekend.

P.S.: I will note that the nice part about using OpenMP is that there's no hand-coding necessary. All you do is add #pragmas to your program; the compiler takes care of the rewrites.

---------- Forwarded message ----------
> From: "Elias Mårtenson" <loke...@gmail.com>
> To: "bug-apl@gnu.org" <bug-apl@gnu.org>
> Cc:
> Date: Fri, 14 Mar 2014 22:22:15 +0800
> Subject: [Bug-apl] Performance optimisations: Results
>
> Hello guys,
>
> I've spent some time experimenting with various performance optimisations
> and I would like to share my latest results with you:
>
> I've run a lot of tests using Callgrind, which is part of the Valgrind
> tool (http://valgrind.org/; documentation at
> http://valgrind.org/docs/manual/cl-manual.html). In doing so, I've
> concluded that a disproportionate amount of time is spent copying values
> (this can be parallelised; more about that below).
> I set out to see how much faster I could make a simple test program that
> applied a monadic scalar function. Here is my test program:
>
> ∇Z←testinv;tmp
>   src←10000 4000⍴÷⍳100
>   'Starting'
>   tmp←{--------⍵} time src
>   Z←1
> ∇
>
> This program calls my time operator, which simply shows the amount of time
> it took to execute the operation. This is of course needed for
> benchmarking. For completeness, here is the implementation of time:
>
> ∇Z←L (OP time) R;start;end
>   start←⎕AI
>   →(0≠⎕NC 'L')/twoargs
>   Z←OP R
>   →finish
> twoargs:
>   Z←L OP R
> finish:
>   end←⎕AI
>   'Time:',((end[3]+end[2]×1E6) - (start[3]+start[2]×1E6))÷1E6
> ∇
>
> The unmodified version of GNU APL runs this in *5037.00* milliseconds on
> my machine.
>
> I then set out to minimise the amount of cloning of values, taking
> advantage of the existing temp functionality. Once I had done this, the
> execution time was reduced to *2577.00* ms.
>
> I then used the Threading Building Blocks library
> (https://www.threadingbuildingblocks.org/) to parallelise two
> operations: the clone operation and the monadic
> SkalarFunction::eval_skalar_B(). After this, on my 4-core machine, the
> runtime was reduced to *1430.00* ms.
>
> Threading Building Blocks is available from the application repositories
> of at least Arch Linux and Ubuntu, and I'm sure it's available elsewhere
> too. To test it on OSX I had to download it separately.
>
> To summarise:
>
> - Standard: 5037.00
> - Reduced cloning: 2577.00
> - Parallel: 1430.00
>
> I have attached the patch, but it's definitely not something that should
> be applied blindly. I have hacked around in several parts of the code, some
> of which I can't say I understand fully, so see it as a proof-of-concept,
> nothing else.
>
> Note that the code that implements the parallelism using TBB is pretty
> ugly, and the code ends up being duplicated in the parallel and
> non-parallel versions. This can, of course, be encapsulated much more
> nicely if one wants to make this generic.
> Another thing: TBB is incredibly efficient, especially on Intel CPUs. I'd
> be very interested to see how OpenMP performs on this same code.
>
> Regards,
> Elias

--
"The secret to creativity is knowing how to hide your sources." Albert Einstein
http://soundcloud.com/davidlamkins
http://reverbnation.com/lamkins
http://reverbnation.com/lcw
http://lamkins-guitar.com/
http://lamkins.net/
http://successful-lisp.com/