Thanks, that's interesting indeed. What about the idea of coalescing multiple functions so that each thread can stream multiple operations in a row without synchronising? To me, it would seem to be hugely beneficial if the expression -1+2+X could stream the three operations (two additions, one negation) when generating the output. Would such a feature require much re-architecting of the application?
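To make that concrete, here is a rough C++ sketch of what I mean (the names are mine and have nothing to do with the actual GNU APL classes): instead of one pass, one intermediate vector and one fork/join per scalar function, every thread would run the fused per-element expression over its slice of X and synchronise only once at the end.

#include <cstddef>
#include <vector>

// today (conceptually): one pass and one fork/join per scalar function
std::vector<double> unfused(const std::vector<double> &X)
{
    std::vector<double> t1(X.size()), t2(X.size()), Z(X.size());
    for (size_t i = 0; i < X.size(); ++i)   t1[i] = 2.0 + X[i];    // 2+X
    for (size_t i = 0; i < X.size(); ++i)   t2[i] = 1.0 + t1[i];   // 1+2+X
    for (size_t i = 0; i < X.size(); ++i)   Z[i]  = -t2[i];        // -1+2+X
    return Z;
}

// coalesced: a single pass that streams all three operations, so the
// threads only need to fork once and join once for the whole expression
std::vector<double> fused(const std::vector<double> &X)
{
    std::vector<double> Z(X.size());
    for (size_t i = 0; i < X.size(); ++i)
        Z[i] = -(1.0 + (2.0 + X[i]));
    return Z;
}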
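Also, just to check that I read your last mail correctly: the three-state scheme you describe there (blocked on semaphore / busy-waiting on a userspace flag / running) I picture roughly like the sketch below. The names, the POSIX semaphore and the C++11 atomic are my own invention, not the actual implementation.

#include <atomic>
#include <semaphore.h>

enum class Cmd
{
    SLEEP,   // go back to state 1: block on the semaphore
    WAIT,    // state 2: spin on the flag, ready to start almost instantly
    WORK,    // state 3: perform this worker's share of the current job
    EXIT
};

struct Worker
{
    sem_t            sem;                  // state 1 blocks here (sem_init'd by the master)
    std::atomic<Cmd> cmd { Cmd::SLEEP };   // states 2 and 3 are driven by this flag
    // ... plus a description of this worker's slice of the current job
};

void do_my_share(Worker &w) { (void)w; }   // placeholder for the real work

void worker_main(Worker &w)
{
    for (;;)
    {
        sem_wait(&w.sem);       // state 1: cheap while the interpreter waits
                                // for user input, but expensive to wake up
        for (;;)
        {
            const Cmd c = w.cmd.load(std::memory_order_acquire);
            if (c == Cmd::WAIT)   continue;   // state 2: busy-wait, so the
                                              // 2 -> 3 transition is fast
            if (c == Cmd::SLEEP)  break;      // back to the semaphore
            if (c == Cmd::EXIT)   return;

            do_my_share(w);                          // state 3: running
            w.cmd.store(Cmd::WAIT,                   // tell the master we are
                        std::memory_order_release);  // done, then spin again
        }
    }
}

// The master would fork a job by storing Cmd::WORK into every worker's cmd
// and join by spinning until they are all back at Cmd::WAIT; only when the
// interpreter is about to block on user input would it store Cmd::SLEEP
// (the slow 2 -> 1 transition) so that idle spinning does not burn CPU.

If that is about right, forking a job costs little more than one atomic store per worker, which is presumably why the break-even point moves to much shorter vectors.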
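And for the self-tuning part, I imagine the *]PSTAT* cycle counts feeding a simple break-even test along these lines (the constant and the names are made up, just to illustrate the decision):

#include <cstddef>

struct ScalarFunStats
{
    double cycles_per_element;   // measured, e.g. by ]PSTAT / ScalarBenchmark
};

// measured cost of one fork + join of the worker threads, in cycles
// (20000 is a made-up number for illustration)
constexpr double FORK_JOIN_CYCLES = 20000.0;

inline bool worth_parallelising(const ScalarFunStats &fun, size_t ravel_len,
                                unsigned core_count)
{
    const double sequential = fun.cycles_per_element * ravel_len;
    const double parallel   = FORK_JOIN_CYCLES + sequential / core_count;
    return parallel < sequential;
}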
Regards,
Elias

On 22 August 2014 21:46, Juergen Sauermann <juergen.sauerm...@t-online.de> wrote:

> Hi Elias,
>
> I am working on it.
>
> As a preparation I have created a new command *]PSTAT* that shows how many
> CPU cycles the different scalar functions take. You can run the new
> workspace *ScalarBenchmark_1.apl* to see the results (*SVN 444*).
>
> These numbers are very important to determine when to switch from
> sequential to parallel execution. The idea is to feed these numbers back
> into the interpreter so that machines can tune themselves.
>
> The other thing is the lesson learned from your benchmark results. As far
> as I can see, semaphores are far too slow for syncing the threads. The
> picture that is currently evolving in my head is this: instead of the
> threads having 2 states (blocked on semaphore / running), there should be
> 3 states:
>
> 1. blocked on semaphore,
> 2. busy-waiting on some flag in userspace, and
> 3. running (performing parallel computations)
>
> The transition between states 1 and 2 is somewhat slow, but it is only
> done when the interpreter is blocked on input from the user. The
> transition between 2 and 3 is much more lightweight, so that the
> break-even point between sequential and parallel execution occurs at much
> shorter vector sizes.
>
> Since this involves some interaction with *Input.cc*, I wasn't sure
> whether I should first throw out *libreadline* (in order to simplify
> *Input.cc*) or do the parallel stuff first.
>
> Another lesson from the benchmark was that OMP is always slower than the
> hand-crafted method, so I guess it is out of scope now.
>
> My long-term plan for the next 1 or 2 releases is this:
>
> 1. remove libreadline
> 2. parallel execution of the scalar functions
> 3. replace liblapack
>
> /// Jürgen
>
>
> On 08/22/2014 12:22 PM, Elias Mårtenson wrote:
>
> Have the results of this been integrated into the interpreter?
>
>
> On 1 August 2014 21:57, Juergen Sauermann <juergen.sauerm...@t-online.de> wrote:
>
>> Hi Elias,
>>
>> yes - actually a lot. I haven't looked through all the files, but I have
>> looked at 80, 60, and small core counts.
>>
>> The good news is that all results look plausible now. There are some
>> variations in the data, of course, but the trend is clear:
>>
>> The total time for OMP (the rightmost value in the plot, i.e. x ==
>> corecount + 10) is consistently about twice the total time for a
>> hand-crafted fork/sync. The benchmark was made in such a way that it only
>> shows the fork/join times. Column N ≤ corecount shows the time when the
>> N'th core started execution of its task.
>>
>> I have attached a plot of the 80-core result (4 hand-crafted runs in red
>> and 4 OMP runs in green), and the script that created the plots using
>> gnuplot.
>>
>> /// Jürgen
>>
>>
>> On 08/01/2014 03:16 PM, Elias Mårtenson wrote:
>>
>> Were you able to deduce anything from the test results?
>>
>> On 11 May 2014 23:02, "Juergen Sauermann" <juergen.sauerm...@t-online.de> wrote:
>>
>>> Hi Elias,
>>>
>>> thanks, already interesting. If you could loop around the core count:
>>>
>>> *for ((i=1; $i<=80; ++i)); do*
>>> *    ./Parallel $i*
>>> *    ./Parallel_OMP $i*
>>> *done*
>>>
>>> then I could understand the data better. Also, I am not sure whether
>>> something is wrong with the benchmark program.
>>> On my new 4-core with OMP I get fluctuations from:
>>>
>>> eedjsa@server65 ~/apl-1.3/tools $ ./Parallel_OMP 4
>>> Pass 0: 4 cores/threads, 8229949 cycles total
>>> Pass 1: 4 cores/threads, 8262 cycles total
>>> Pass 2: 4 cores/threads, 4035 cycles total
>>> Pass 3: 4 cores/threads, 4126 cycles total
>>> Pass 4: 4 cores/threads, 4179 cycles total
>>>
>>> to:
>>>
>>> eedjsa@server65 ~/apl-1.3/tools $ ./Parallel_OMP 4
>>> Pass 0: 4 cores/threads, 11368032 cycles total
>>> Pass 1: 4 cores/threads, 4042228 cycles total
>>> Pass 2: 4 cores/threads, 7251419 cycles total
>>> Pass 3: 4 cores/threads, 3846 cycles total
>>> Pass 4: 4 cores/threads, 2725 cycles total
>>>
>>> The fluctuations with the manual parallel-for are smaller:
>>>
>>> Pass 0: 4 cores/threads, 87225 cycles total
>>> Pass 1: 4 cores/threads, 245046 cycles total
>>> Pass 2: 4 cores/threads, 84632 cycles total
>>> Pass 3: 4 cores/threads, 63619 cycles total
>>> Pass 4: 4 cores/threads, 93437 cycles total
>>>
>>> but still considerable. The picture so far suggests that OMP fluctuates
>>> much more (in the start-up + sync time) than the manual version, with the
>>> highest OMP start-up above the manual one and the lowest far below. One
>>> change on my TODO list is to use futexes instead of mutexes (like OMP
>>> does); probably not an issue under Solaris, since futexes are
>>> Linux-specific.
>>>
>>> /// Jürgen
>>>
>>>
>>> On 05/11/2014 04:23 AM, Elias Mårtenson wrote:
>>>
>>> Here are the files that I promised earlier.
>>>
>>> Regards,
>>> Elias