Hi Richard,
Karl: 1) thanks for the summary, and 2) please let me know when you
think your section of the manual is in a state in which I can look at it
and make some contributions. One thing I'd like to write about (if you
haven't already covered it) is some basic info about controlling MPI
process placement/pinning and the (sometimes surprisingly large) effects
it can have on performance. This is getting a lot more complicated as
systems add more NUMA domains and hardware threads. When I was at Intel
I encountered a ton of performance problems that were mostly due to bad
process placement (which, fortunately, meant they were actually easy to
fix!).
Sure, thanks. Expect to receive a bunch of text by tomorrow evening.
Regarding process placement: I was running `make streams` to collect
data for the manual chapter. It turned out that the first N/2 processes
were all placed on the first socket, so adding process N/2+1 (the first
one to land on the second socket) gave a significant boost in overall
achieved bandwidth. This should
make for a nice illustration of the subject in the manual. Also, it will
benefit users if we get the processor mapping right in `make streams`,
so that the output is more in line with our recommendation of "just a
few processes per node to saturate memory bandwidth".
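
To make the effect concrete, here is a rough toy model of it in Python. It is only a sketch and not what `make streams` actually does: the two-socket, 8-cores-per-socket layout, the bandwidth numbers, and the helper names `place_ranks` and `aggregate_bandwidth` are all made up for illustration. It assumes each socket delivers bandwidth proportional to the number of processes on it until it saturates.

```python
# Toy model of how MPI rank placement affects aggregate memory bandwidth.
# All hardware numbers are hypothetical; nothing here is measured data.

def place_ranks(nranks, sockets, cores_per_socket, policy):
    """Return the socket index assigned to each rank under a placement policy."""
    if policy == "packed":
        # Fill socket 0 completely before using socket 1, and so on.
        return [min(r // cores_per_socket, sockets - 1) for r in range(nranks)]
    if policy == "scatter":
        # Distribute ranks round-robin across the sockets.
        return [r % sockets for r in range(nranks)]
    raise ValueError(f"unknown policy: {policy}")


def aggregate_bandwidth(placement, sockets, per_core_bw, socket_peak_bw):
    """Sum per-socket bandwidth, where each socket saturates at socket_peak_bw."""
    total = 0.0
    for s in range(sockets):
        ranks_on_socket = placement.count(s)
        total += min(ranks_on_socket * per_core_bw, socket_peak_bw)
    return total


if __name__ == "__main__":
    SOCKETS, CORES_PER_SOCKET = 2, 8          # assumed node layout
    PER_CORE_BW, SOCKET_PEAK_BW = 10.0, 40.0  # GB/s, invented values

    print("ranks  packed  scatter")
    for n in range(1, SOCKETS * CORES_PER_SOCKET + 1):
        packed = aggregate_bandwidth(
            place_ranks(n, SOCKETS, CORES_PER_SOCKET, "packed"),
            SOCKETS, PER_CORE_BW, SOCKET_PEAK_BW)
        scatter = aggregate_bandwidth(
            place_ranks(n, SOCKETS, CORES_PER_SOCKET, "scatter"),
            SOCKETS, PER_CORE_BW, SOCKET_PEAK_BW)
        print(f"{n:5d}  {packed:6.1f}  {scatter:7.1f}")
```

In this model the packed placement stays flat once the first socket saturates and only rises again when a process lands on the second socket, while the scattered placement reaches the node's full bandwidth with far fewer processes.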
Best regards,
Karli