Hi Elias,

there are a number of other functions that have side effects: many ⎕ functions/variables, all user-defined functions, and in particular every other place that creates a value (unless new is atomic; the Value constructor currently is not). Some of these side effects are internal (e.g. ⎕RL, ⎕EA). I believe a parallel EACH with user-defined functions, or even with ⎕ functions, is either trivial or close to impossible.
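
For what it is worth, here is a minimal C++ sketch (hypothetical code, not taken from GNU APL; the names below are made up) of why the distinction matters: a per-element operation that only touches its own cells parallelises trivially, while a function that updates shared interpreter state (⎕RL, a workspace variable, a non-atomic Value constructor) makes the iterations depend on each other.

#include <cstdint>
#include <vector>

struct Cell { int64_t value; };

// Side-effect free: every iteration touches only Z[c], A[c] and B[c],
// so the iterations are independent and splitting them over threads is safe.
void parallel_add(std::vector<Cell> & Z,
                  const std::vector<Cell> & A,
                  const std::vector<Cell> & B)
{
#pragma omp parallel for
   for (long c = 0; c < (long)Z.size(); ++c)
       Z[c].value = A[c].value + B[c].value;
}

// Not side-effect free: this stand-in for a user-defined function (or for a
// ⎕ function, or for a non-atomic Value constructor) updates shared state,
// so two threads running it at the same time race on shared_state.
static int64_t shared_state = 0;

int64_t user_defined(int64_t x)
{
   shared_state += x;        // the side effect that breaks a parallel EACH
   return shared_state;
}

A parallel EACH would have to know that the applied function is of the first kind before it may split the work.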

BTW, you mentioned earlier that you were able to cut execution time in half by using the tmp flag. I would be interested to know in which places you did that.

/// Jürgen


On 04/02/2014 03:25 PM, Elias Mårtenson wrote:
Thanks. I'm at home now, but I'll run the tests tomorrow at the office.

I would say, though, that being able to run, say, the EACH operator on a user function (or lambda) would provide some tremendous opportunities for parallelisation.

Would it be safe to say that as long as the function being called does not use the assignment operator or the EXECUTE function, it should be safe for parallelisation?

Regards,
Elias


On 2 April 2014 20:12, Juergen Sauermann <juergen.sauerm...@t-online.de> wrote:

    Hi,

    the output is meant to be gnuplotted. You either copy-and-paste the data lines
    into some file, or run apl > file (in that case you have to type blindly and
    remove the non-data lines with an editor).

    The first 256 data lines give the cycle counter of the CPU, taken at the top of
    the loop body before the (n × 4096)th iteration, n being the first column.
    Looking at the result:

       0, 168
       1, 344610
       2, 673064
       3, 994497

    and at the code:

    int64_t T0 = cycle_counter();                 // cycle counter before the loop

    for (c = 0; c < count; c++)
        {
          if ((c & 0x0FFF) == 0)   Tn[c >> 12] = cycle_counter();   // one timestamp every 4096 iterations
          const Cell * cell_A = &A->get_ravel(c);
          ...
        }

    int64_t TX = cycle_counter();                 // cycle counter after the loop


    we see that the loop begins at 0 cycles (actually at T0, but T0 is subtracted
    from Tn when printed, so that time 0 is virtually the start of the loop).

    At cycle 168 we are at the first line of the loop.
    At cycle 344610 we have performed 4096 iterations of the loop,
    At cycle 673064 we have performed 4096 more iterations of the loop,
    At cycle 994497 we have performed 4096 more iterations of the loop,
    ...

    The last value is the cycle counter after the loop (so the joining of the
    threads is included).
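
    In case you want to reproduce this on your machines: cycle_counter() is
    presumably just a thin wrapper around the CPU's time-stamp counter. A minimal
    x86_64 sketch is shown below (an assumption about the helper, not necessarily
    the code in the attached files; on SPARC/Solaris you would need the platform's
    own tick register instead):

    #include <cstdint>

    // minimal sketch of a cycle_counter() for x86_64 using RDTSC
    static inline int64_t cycle_counter()
    {
       uint32_t lo, hi;
       __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
       return (int64_t(hi) << 32) | lo;
    }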

    In file parallel, the first half of the data is supposedly the timestamps
    written by one thread and the other half the timestamps written by the other
    thread. On an 8-core machine this should look like:

      /|  /|  /|  /|  /|  /| /|  /
     / | / | / | / | / | / | / | /
    /  |/  |/  |/  |/  |/  |/  |/


    The interesting times are:

    - T0 (showing roughly the start-up overhead of the loop), and
    - the difference between the last two values compared with the average
      difference between two values (showing the joining overhead of the loop), and
    - the last value (the total execution time of the loop).
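
    If eyeballing the gnuplot curve is inconvenient, a throwaway program along the
    following lines could pull those three numbers out of a saved data file
    (a hypothetical helper, not part of the patch; it assumes the non-data lines
    have already been removed):

    #include <cstdio>
    #include <vector>

    // read "n, cycles" lines and print the start-up overhead, a rough estimate
    // of the joining overhead, and the total time
    int main(int argc, char ** argv)
    {
       FILE * f = fopen(argc > 1 ? argv[1] : "parallel", "r");
       if (f == 0)   { perror("fopen");   return 1; }

       std::vector<long long> t;
       long long n, cycles;
       while (fscanf(f, " %lld , %lld", &n, &cycles) == 2)   t.push_back(cycles);
       fclose(f);
       if (t.size() < 3)   return 1;

       const size_t last = t.size() - 1;
       const double avg = double(t[last - 1] - t[0]) / (last - 1);   // average step between timestamps
       printf("start-up (first timestamp): %lld cycles\n", t[0]);
       printf("joining  (last step - avg): %.0f cycles\n", double(t[last] - t[last - 1]) - avg);
       printf("total    (last timestamp):  %lld cycles\n", t[last]);
       return 0;
    }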

    Comparing files sequential and parallel we see that the same loop costs
    81834662 cycles when run on one core and 43046192 cycles when run on two
    cores. This is 2128861 cycles away from speedup 2.

    The difference between two values is around 322500 (for 4096 iterations),
    which is about 79 cycles per iteration. Thus the break-even point where
    parallel is faster is at vector length 26947 (the figure is that high because
    integer addition is about the fastest operation on a CPU). The real code had
    a call to

    expand_pointers(cell_Z, cell_A, cell_B, fun);

    instead of:

    (cell_B->*fun)(cell_Z, cell_A);

    so that the break-even point will go down a little.
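
    Roughly, the break-even length is the 2128861 cycles of overhead divided by
    the ~79 cycles per element; spelled out as code, using only the figures quoted
    above:

    #include <cstdio>

    // back-of-the-envelope check of the numbers quoted above
    int main()
    {
       const long long seq_cycles = 81834662;                    // one core,  1024*1024 additions
       const long long par_cycles = 43046192;                    // two cores, same loop
       const long long overhead   = par_cycles - seq_cycles/2;   // cycles away from speedup 2
       const double    per_elem   = 79.0;                        // about 322500 / 4096 cycles per addition

       printf("overhead:   %lld cycles\n", overhead);            // 2128861
       printf("break-even: about %lld elements\n", (long long)(overhead / per_elem));
       return 0;
    }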

    /// Jürgen



    On 04/02/2014 12:43 PM, Elias Mårtenson wrote:
    Thanks,

    Now I have an OpenMP enabled build on Solaris, and I'm ready to
    test. How am I supposed to interpret the output from this command?

    Regards,
    Elias


    On 2 April 2014 01:27, Juergen Sauermann <juergen.sauerm...@t-online.de> wrote:

        Hi Elias,

        I have attached the changed files. Note that this is very quick-and-dirty:
        no automake integration, fixed variable size, etc. And the process hangs
        after the loop (no idea why; "killall apl" helps).

        In apl do:

        A←(1024×1024)⍴2 ◊ A+A

        /// Jürgen





        On 04/01/2014 06:44 PM, Elias Mårtenson wrote:


            Hello Jürgen,

            How can I reproduce your tests? My workstation has 4
            cores (×2 threads) and at work I have up to 32-core
            machines to test on. (I might have some 128 or 256 core
            machines too).

            Regards,
            Elias





