Oh and one more thing: Have you given any thought to my comments re. the coalescing of certain functions to reduce thread dispatch effort? (also, add some more functions to the no-copy optimisation?)
Regards,
Elias

On 11 March 2014 23:22, Elias Mårtenson <[email protected]> wrote:

> I agree. I just wanted to point out that without a runtime option,
> delivering binary versions will be hard, forcing the package maintainers to
> choose a default that will surely be wrong for the majority of users.
>
> That said, being able to choose a compile-time value is good too.
>
> Regards,
> Elias
>
>
> On 11 March 2014 23:20, Juergen Sauermann <[email protected]> wrote:
>
>> Hi,
>>
>> we could do it similar to the LOG macro where you can choose between
>> more efficient compile-time settings and less efficient run-time settings.
>>
>> It is important that we do these things properly from the outset to avoid
>> too many changes later on.
>>
>> /// Jürgen
>>
>>
>> On 03/11/2014 04:10 PM, Elias Mårtenson wrote:
>>
>> May I suggest that being able to choose the number of cores at runtime
>> should actually be the default. Remember that most Linux distributions will
>> not compile the source on the local machine and instead distribute
>> binaries.
>>
>> Having some #ifdefs would be good, and having a runtime user-selected (or
>> automatically chosen, based on cores) number of threads as the default is
>> important for this reason.
>>
>> Regards,
>> Elias
>>
>>
>> On 11 March 2014 23:07, Juergen Sauermann <[email protected]> wrote:
>>
>>> Hi David,
>>>
>>> looks good! Some comments, though.
>>>
>>> 1. You could adapt src/testcases/Performance.pt with some longer
>>> scalar functions in order to get some performance figures. You can start
>>> it like this:
>>>
>>> ./apl -T testcases/Performance.pt
>>>
>>> 2. I believe we should not bother the user with specifying
>>> parallelization parameters in ⎕SYL.
>>> I would rather ./configure CORES=n with n=1 meaning no parallel
>>> execution, CORES=auto being the number of cores on the build machine,
>>> and explicit numbers n>1 meaning that n cores shall be used.
>>> This would generate slightly faster code than computing array bounds
>>> at runtime. It's a bit more hassle for the user, but may pay off soon.
>>>
>>> 3. Yes, GNU APL throws many exceptions (almost every APL error is thrown
>>> from somewhere), and I was expecting that we have to catch them on the
>>> throwing processor. Not too difficult if we do it on the top level.
>>>
>>> 4. It would be good to understand how the OpenMP loops work. I could
>>> imagine one of two strategies:
>>>
>>> - in loop(j, MAX) thread j executes iteration j, j+CORES, ...
>>> - thread j executes iterations j*MAX/CORES ... (j+1)*MAX/CORES
>>>
>>> The first strategy interleaves the data and is more intuitive,
>>> while the second uses blocks of data and is more cache-friendly and
>>> therefore probably gives better performance.
>>>
>>> 5. Not sure if your earlier comment on letting the scheduler decide is
>>> correct. I have been doing pthread programming in the past and I have
>>> seen cases where the scheduler fooled itself and led to cases where the
>>> same problem took more than double the capacity compared to explicit
>>> affinity on a 4-core CPU. I would expect that APL generates very
>>> fine-grained and short-lived pieces of execution and the scheduler may
>>> not be optimized for that. I guess we have to try that out.
>>>
>>> /// Jürgen
>>>
>>>
>>> On 03/11/2014 08:02 AM, David B. Lamkins wrote:
>>>
>>>> Juergen's suggestion prompted me to attempt an implementation using
>>>> OpenMP rather than the by-hand coding that I had been anticipating.
>>>> Attached is a quick-and-dirty patch to enable GNU APL to be built with
>>>> OpenMP support.
>>>>
>>>> ./configure --with-openmp
>>>>
>>>> There are many rough edges, both in the Makefile and the code.
>>>>
>>>> --with-openmp would ideally check to see whether the compiler supports
>>>> OpenMP. It may be necessary to check the compiler version, as different
>>>> compilers support different versions of OpenMP. Also, I've assumed
>>>> compilation on/for Linux despite the fact that GNU APL and OpenMP should
>>>> be buildable with the right Windows compiler.
>>>>
>>>> As one might expect, OpenMP requires that any throw from a worker thread
>>>> must be caught by the same thread. I'm almost certain that this
>>>> restriction could be violated by GNU APL code as currently written.
>>>>
>>>> The good news, though, is that the changes are benign; in the absence of
>>>> --with-openmp, GNU APL's behavior is unchanged.
>>>>
>>>> With OpenMP support, ⎕syl is extended to access some of OpenMP's
>>>> parameters.
>>>>
>>>> I've done only trivial testing at this point; just enough to verify that
>>>> compiling OpenMP support doesn't obviously break GNU APL.
>>>>
>>>> I haven't confirmed that the OpenMP #pragmas on the key loops in
>>>> SkalarFunction.cc have any effect on execution time or processor core
>>>> utilization. I hope to do more testing later this week.
>>>>
>>>> Best wishes,
>>>> David
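
Point 4 in Jürgen's message above describes two ways of handing loop iterations to threads. The following is a minimal sketch of just the index arithmetic, in plain C++ with no OpenMP dependency. The function names are hypothetical and are not taken from GNU APL or from the patch. In OpenMP terms, the interleaved layout is what schedule(static, 1) produces, and the blocked layout is close to what schedule(static) without a chunk size produces, although the exact block boundaries chosen by an OpenMP runtime may differ from the formula used here.

```cpp
#include <cstddef>
#include <vector>

// Strategy 1 (interleaved): thread j runs iterations j, j+cores, j+2*cores, ...
// Neighbouring elements end up on different threads.
std::vector<std::size_t> interleaved_iterations(std::size_t j,
                                                std::size_t cores,
                                                std::size_t max)
{
    std::vector<std::size_t> out;
    for (std::size_t i = j; i < max; i += cores)
        out.push_back(i);
    return out;
}

// Strategy 2 (blocked): thread j runs the contiguous half-open range
// [j*max/cores, (j+1)*max/cores), so each thread walks one block of memory.
std::vector<std::size_t> blocked_iterations(std::size_t j,
                                            std::size_t cores,
                                            std::size_t max)
{
    const std::size_t begin = j * max / cores;
    const std::size_t end   = (j + 1) * max / cores;
    std::vector<std::size_t> out;
    for (std::size_t i = begin; i < end; ++i)
        out.push_back(i);
    return out;
}
```

With max = 10 and cores = 4, thread 1 gets iterations 1, 5, 9 under the interleaved layout and 2, 3, 4 under the blocked one. Because the block bounds use integer division, the blocks differ in size by at most one element when max is not a multiple of cores, and together they still cover every iteration exactly once.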

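
Both David and Jürgen note above that an exception thrown inside an OpenMP worker has to be caught by the thread that threw it. One common way to satisfy that rule is to catch inside the loop body, store the exception in a std::exception_ptr, and let the caller rethrow it after the parallel region has ended. The sketch below illustrates that pattern only; checked_reciprocal and reciprocal_all are hypothetical names, and nothing here is taken from GNU APL's error handling. Built without OpenMP support, the pragmas are ignored and the loop simply runs serially.

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a scalar function that can raise an APL error.
inline double checked_reciprocal(double x)
{
    if (x == 0.0)
        throw std::domain_error("DOMAIN ERROR");
    return 1.0 / x;
}

// Applies checked_reciprocal to every element of `in`.
// Returns a null exception_ptr on success, otherwise one captured exception
// (whichever worker recorded its error first, not necessarily the lowest index).
std::exception_ptr reciprocal_all(const std::vector<double>& in,
                                  std::vector<double>& out)
{
    std::exception_ptr captured;
    out.assign(in.size(), 0.0);
    const long n = static_cast<long>(in.size());

    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        try {
            out[i] = checked_reciprocal(in[i]);
        } catch (...) {
            // Caught on the throwing thread, so nothing escapes the region.
            #pragma omp critical
            {
                if (!captured)
                    captured = std::current_exception();
            }
        }
    }
    return captured;
}
```

The caller checks the returned pointer once the loop has finished and, if it is set, calls std::rethrow_exception on it. That matches the "catch on the top level" idea from point 3, with the difference that the top level here is the body of each iteration rather than the interpreter's main loop.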