Hi David,
sounds like something is wrong. OpenMP states that its parallelisation
overhead is about 4000 cycles, which is much less than your
measurements. Maybe there is some unintended synchronisation between
the threads?
Could you send me the latest patch?
/// Juergen
On 03/14/2014 05:18 PM, David Lamkins wrote:
This is interesting. The parallel speedup on your machine using TBB is
in the same ballpark as on my machine using OpenMP, and they're both
delivering less than a 2:1 speedup.
I informally ran some experiments two nights ago to try to
characterize the behavior. On my machine, with OpenMP #pragmas on the
scalar loops, the ratio of multi-threaded to single-threaded runtimes
held stubbornly at about 0.7 (roughly a 1.4:1 speedup) regardless of
the size of the problem. I tried integer and float data, addition and
power, with ravels of up to 100 million elements. (My smallest test
set was a million elements; I still need to try smaller sets to see
whether I can find a knee where the thread setup overhead dominates
and results in a runtime ratio greater than 1.)
I'm not sure what this means, yet. I'd hoped to see some further
improvement as the ravel size increased, despite the internal
inefficiencies. TBH, I didn't find and annotate the copy loop(s); that
might have a lot to do with my results. (I did find and annotate
another loop in check_value(), though. Maybe parallelizing that will
improve your results.) I'm hoping that the poor showing so far isn't a
result of memory bandwidth limitations.
I hope to spend some more time on this over the weekend.
P.S.: The nice part about using OpenMP is that there's no hand-coding
necessary. All you do is add #pragmas to your program; the compiler
takes care of the rewrites. A minimal sketch of what that looks like
follows below.
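For illustration only (this is not the actual GNU APL scalar loop; the
function name and the raw double arrays are stand-ins for the real
Cell machinery), the whole change amounts to one pragma:
#include <cstddef>
// Hypothetical stand-in for a scalar-function loop over a ravel.
// Compile with -fopenmp; without that flag the pragma is ignored and
// the loop runs serially, which makes the annotation low-risk.
void add_ravels(const double * a, const double * b, double * z,
                std::size_t len)
{
   #pragma omp parallel for
   for (long i = 0; i < (long)len; ++i)
       z[i] = a[i] + b[i];
}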
---------- Forwarded message ----------
From: "Elias Mårtenson" <loke...@gmail.com <mailto:loke...@gmail.com>>
To: "bug-apl@gnu.org <mailto:bug-apl@gnu.org>" <bug-apl@gnu.org
<mailto:bug-apl@gnu.org>>
Cc:
Date: Fri, 14 Mar 2014 22:22:15 +0800
Subject: [Bug-apl] Performance optimisations: Results
Hello guys,
I've spent some time experimenting with various performance
optimisations and I would like to share my latest results with you:
I've run a lot of tests using Callgrind, which is part of the
Valgrind <http://valgrind.org/> tool suite (documentation here:
<http://valgrind.org/docs/manual/cl-manual.html>). In doing so,
I've concluded that a disproportionate amount of time is spent
copying values (this can be parallelised; more about that below).
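In case anyone wants to repeat this kind of measurement, Callgrind's
client-request macros can limit data collection to the region of
interest; a minimal sketch (the driver function here is made up, not
GNU APL code), run under valgrind --tool=callgrind --collect-atstart=no:
#include <valgrind/callgrind.h>
void run_benchmark()            // hypothetical driver, not part of GNU APL
{
    CALLGRIND_TOGGLE_COLLECT;   // start collecting costs here
    // ... execute the APL expression under test ...
    CALLGRIND_TOGGLE_COLLECT;   // stop collecting
    CALLGRIND_DUMP_STATS;       // write counts for callgrind_annotate
}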
I set out to see how much faster I could make a simple test program
that applies a monadic scalar function. Here is my test program:
∇Z←testinv;tmp
src←10000 4000⍴÷⍳100
'Starting'
tmp←{--------⍵} time src
Z←1
∇
This program calls my time operator, which simply shows how long the
operation took to execute; this is of course needed for benchmarking.
For completeness, here is the implementation of time:
∇Z←L (OP time) R;start;end
start←⎕AI
→(0≠⎕NC 'L')/twoargs
Z←OP R
→finish
twoargs:
Z←L OP R
finish:
end←⎕AI
'Time:',((end[3]+end[2]×1E6) - (start[3]+start[2]×1E6))÷1E6
∇
The unmodified version of GNU APL runs this in
*5037.00* milliseconds on my machine.
I then set out to minimise the amount of cloning of values, taking
advantage of the existing temp functionality. Once I had done
this, the execution time was reduced to *2577.00* ms.
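The idea, roughly, is to mutate a value in place when it is known to
be a temporary instead of deep-copying it first. The sketch below uses
invented names (a plain Array with a temp flag standing in for GNU
APL's reference-counted values) just to show the pattern:
#include <memory>
#include <vector>
// Invented illustration only; GNU APL's Value/Cell classes and its
// notion of "temp" values are more involved than this.
struct Array
{
    std::vector<double> ravel;
    bool is_temp = false;       // true if no one else will read this value
};
std::shared_ptr<Array> negate(const std::shared_ptr<Array> & arg)
{
    // Reuse a temporary in place; only clone when the value is shared.
    std::shared_ptr<Array> result =
        arg->is_temp ? arg : std::make_shared<Array>(*arg);
    for (double & x : result->ravel)
        x = -x;
    result->is_temp = true;     // the result itself is a temporary
    return result;
}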
I then used the Threading Building Blocks
<https://www.threadingbuildingblocks.org/> library to parallelise
two operations: The clone operation and the monadic
SkalarFunction::eval_skalar_B(). After this, on my 4-core machine,
the runtime was reduced to *1430.00* ms.
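For reference, the TBB pattern involved is roughly the following; this
is a simplified sketch, not the actual patch, and the plain double
vector is a stand-in for the real ravel of Cells:
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <cstddef>
#include <vector>
// Simplified stand-in for parallelising a monadic scalar function:
// TBB splits the index range into blocks and runs the lambda on
// worker threads, one block at a time.
void negate_parallel(std::vector<double> & ravel)
{
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, ravel.size()),
        [&](const tbb::blocked_range<std::size_t> & r)
        {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                ravel[i] = -ravel[i];
        });
}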
Threading Building Blocks is available from the application
repositories of at least Arch Linux and Ubuntu, and I'm sure it's
available elsewhere too. To test it on OS X I had to download it
separately.
To summarise:
* Standard: 5037.00 ms
* Reduced cloning: 2577.00 ms
* Parallel: 1430.00 ms
I have attached the patch, but it's definitely not something that
should be applied blindly. I have hacked around in several parts of
the code, some of which I can't say I fully understand, so see it as
a proof of concept, nothing else.
Note that the code that implements the parallelism using TBB is
pretty ugly, and it ends up being duplicated between the parallel and
non-parallel versions. This can, of course, be encapsulated much more
cleanly if one wants to make it generic.
Another thing: TBB is incredibly efficient, especially on Intel CPUs.
I'd be very interested to see how OpenMP performs on the same code.
Regards,
Elias
--
"The secret to creativity is knowing how to hide your sources."
Albert Einstein
http://soundcloud.com/davidlamkins
http://reverbnation.com/lamkins
http://reverbnation.com/lcw
http://lamkins-guitar.com/
http://lamkins.net/
http://successful-lisp.com/