Thanks, that's interesting indeed. What about the idea of coalescing multiple functions so that each thread can stream multiple operations in a row without synchronising? To me, it would seem to be hugely beneficial if the expression -1+2+X could stream the three operations (two additions, one negation) when generating the output. Would such a feature require much re-architecting of the application?
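To make that concrete, here is a rough C++ sketch of what I mean (the names are mine and have nothing to do with the actual GNU APL classes): instead of one pass, one intermediate vector and one fork/join per scalar function, every thread would run the fused per-element expression over its slice of X and synchronise only once at the end.

#include <cstddef>
#include <vector>

// today (conceptually): one pass and one fork/join per scalar function
std::vector<double> unfused(const std::vector<double> &X)
{
    std::vector<double> t1(X.size()), t2(X.size()), Z(X.size());
    for (size_t i = 0; i < X.size(); ++i)   t1[i] = 2.0 + X[i];    // 2+X
    for (size_t i = 0; i < X.size(); ++i)   t2[i] = 1.0 + t1[i];   // 1+2+X
    for (size_t i = 0; i < X.size(); ++i)   Z[i]  = -t2[i];        // -1+2+X
    return Z;
}

// coalesced: a single pass that streams all three operations, so the
// threads only need to fork once and join once for the whole expression
std::vector<double> fused(const std::vector<double> &X)
{
    std::vector<double> Z(X.size());
    for (size_t i = 0; i < X.size(); ++i)
        Z[i] = -(1.0 + (2.0 + X[i]));
    return Z;
}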
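Also, just to check that I read your last mail correctly: the three-state scheme you describe there (blocked on semaphore / busy-waiting on a userspace flag / running) I picture roughly like the sketch below. The names, the POSIX semaphore and the C++11 atomic are my own invention, not the actual implementation.

#include <atomic>
#include <semaphore.h>

enum class Cmd
{
    SLEEP,   // go back to state 1: block on the semaphore
    WAIT,    // state 2: spin on the flag, ready to start almost instantly
    WORK,    // state 3: perform this worker's share of the current job
    EXIT
};

struct Worker
{
    sem_t            sem;                  // state 1 blocks here (sem_init'd by the master)
    std::atomic<Cmd> cmd { Cmd::SLEEP };   // states 2 and 3 are driven by this flag
    // ... plus a description of this worker's slice of the current job
};

void do_my_share(Worker &w) { (void)w; }   // placeholder for the real work

void worker_main(Worker &w)
{
    for (;;)
    {
        sem_wait(&w.sem);       // state 1: cheap while the interpreter waits
                                // for user input, but expensive to wake up
        for (;;)
        {
            const Cmd c = w.cmd.load(std::memory_order_acquire);
            if (c == Cmd::WAIT)   continue;   // state 2: busy-wait, so the
                                              // 2 -> 3 transition is fast
            if (c == Cmd::SLEEP)  break;      // back to the semaphore
            if (c == Cmd::EXIT)   return;

            do_my_share(w);                          // state 3: running
            w.cmd.store(Cmd::WAIT,                   // tell the master we are
                        std::memory_order_release);  // done, then spin again
        }
    }
}

// The master would fork a job by storing Cmd::WORK into every worker's cmd
// and join by spinning until they are all back at Cmd::WAIT; only when the
// interpreter is about to block on user input would it store Cmd::SLEEP
// (the slow 2 -> 1 transition) so that idle spinning does not burn CPU.

If that is about right, forking a job costs little more than one atomic store per worker, which is presumably why the break-even point moves to much shorter vectors.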
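And for the self-tuning part, I imagine the *]PSTAT* cycle counts feeding a simple break-even test along these lines (the constant and the names are made up, just to illustrate the decision):

#include <cstddef>

struct ScalarFunStats
{
    double cycles_per_element;   // measured, e.g. by ]PSTAT / ScalarBenchmark
};

// measured cost of one fork + join of the worker threads, in cycles
// (20000 is a made-up number for illustration)
constexpr double FORK_JOIN_CYCLES = 20000.0;

inline bool worth_parallelising(const ScalarFunStats &fun, size_t ravel_len,
                                unsigned core_count)
{
    const double sequential = fun.cycles_per_element * ravel_len;
    const double parallel   = FORK_JOIN_CYCLES + sequential / core_count;
    return parallel < sequential;
}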
Regards,
Elias

On 22 August 2014 21:46, Juergen Sauermann <juergen.sauerm...@t-online.de> wrote:

> Hi Elias,
>
> I am working on it.
>
> As a preparation I have created a new command *]PSTAT* that shows how many
> CPU cycles the different scalar functions take. You can run the new
> workspace *ScalarBenchmark_1.apl* to see the results (*SVN 444*).
>
> These numbers are very important to determine when to switch from
> sequential to parallel execution. The idea is to feed these numbers back
> into the interpreter so that machines can tune themselves.
>
> The other thing is the lesson learned from your benchmark results. As far
> as I can see, semaphores are far too slow for syncing the threads. The
> picture that is currently evolving in my head is this: instead of the
> threads having 2 states (blocked on semaphore / running), there should be
> 3 states:
>
> 1. blocked on semaphore,
> 2. busy-waiting on some flag in userspace, and
> 3. running (performing parallel computations)
>
> The transition between states 1 and 2 is somewhat slow, but it is only
> done when the interpreter is blocked on input from the user. The
> transition between 2 and 3 is much more lightweight, so that the
> break-even point between sequential and parallel execution occurs at much
> shorter vector sizes.
>
> Since this involves some interaction with *Input.cc*, I wasn't sure
> whether I should first throw out *libreadline* (in order to simplify
> *Input.cc*) or do the parallel stuff first.
>
> Another lesson from the benchmark was that OMP is always slower than the
> hand-crafted method, so I guess it is out of scope now.
>
> My long-term plan for the next 1 or 2 releases is this:
>
> 1. remove libreadline
> 2. parallel execution of the scalar functions
> 3. replace liblapack
>
> /// Jürgen
>
>
> On 08/22/2014 12:22 PM, Elias Mårtenson wrote:
>
> Have the results of this been integrated into the interpreter?
>
>
> On 1 August 2014 21:57, Juergen Sauermann <juergen.sauerm...@t-online.de> wrote:
>
>> Hi Elias,
>>
>> yes - actually a lot. I haven't looked through all the files, but I have
>> looked at 80, 60, and small core counts.
>>
>> The good news is that all results look plausible now. There are some
>> variations in the data, of course, but the trend is clear:
>>
>> The total time for OMP (the rightmost value in the plot, i.e. x ==
>> corecount + 10) is consistently about twice the total time for a
>> hand-crafted fork/sync. The benchmark was made in such a way that it only
>> shows the fork/join times. Column N ≤ corecount shows the time when the
>> N'th core started execution of its task.
>>
>> I have attached a plot of the 80-core result (4 hand-crafted runs in red
>> and 4 OMP runs in green), and the script that created the plots using
>> gnuplot.
>>
>> /// Jürgen
>>
>>
>> On 08/01/2014 03:16 PM, Elias Mårtenson wrote:
>>
>> Were you able to deduce anything from the test results?
>>
>> On 11 May 2014 23:02, "Juergen Sauermann" <juergen.sauerm...@t-online.de> wrote:
>>
>>> Hi Elias,
>>>
>>> thanks, already interesting. If you could loop around the core count:
>>>
>>> *for ((i=1; $i<=80; ++i)); do*
>>> *    ./Parallel $i*
>>> *    ./Parallel_OMP $i*
>>> *done*
>>>
>>> then I could understand the data better. Also, I am not sure whether
>>> something is wrong with the benchmark program.
>>> On my new 4-core with OMP I get fluctuations from:
>>>
>>> eedjsa@server65 ~/apl-1.3/tools $ ./Parallel_OMP 4
>>> Pass 0: 4 cores/threads, 8229949 cycles total
>>> Pass 1: 4 cores/threads, 8262 cycles total
>>> Pass 2: 4 cores/threads, 4035 cycles total
>>> Pass 3: 4 cores/threads, 4126 cycles total
>>> Pass 4: 4 cores/threads, 4179 cycles total
>>>
>>> to:
>>>
>>> eedjsa@server65 ~/apl-1.3/tools $ ./Parallel_OMP 4
>>> Pass 0: 4 cores/threads, 11368032 cycles total
>>> Pass 1: 4 cores/threads, 4042228 cycles total
>>> Pass 2: 4 cores/threads, 7251419 cycles total
>>> Pass 3: 4 cores/threads, 3846 cycles total
>>> Pass 4: 4 cores/threads, 2725 cycles total
>>>
>>> The fluctuations with the manual parallel-for are smaller:
>>>
>>> Pass 0: 4 cores/threads, 87225 cycles total
>>> Pass 1: 4 cores/threads, 245046 cycles total
>>> Pass 2: 4 cores/threads, 84632 cycles total
>>> Pass 3: 4 cores/threads, 63619 cycles total
>>> Pass 4: 4 cores/threads, 93437 cycles total
>>>
>>> but still considerable. The picture so far suggests that OMP fluctuates
>>> much more (in the start-up + sync time) than the manual version, with the
>>> highest OMP start-up above the manual one and the lowest far below. One
>>> change on my TODO list is to use futexes instead of mutexes (like OMP
>>> does); probably not an issue under Solaris, since futexes are
>>> Linux-specific.
>>>
>>> /// Jürgen
>>>
>>>
>>> On 05/11/2014 04:23 AM, Elias Mårtenson wrote:
>>>
>>> Here are the files that I promised earlier.
>>>
>>> Regards,
>>> Elias