Hi Richard,

Karl: 1) thanks for the summary, and 2) please let me know when you think your section of the manual is in a state in which I can look at it and make some contributions. One thing I'd like to write about (if you haven't already covered it) is some basic info about controlling MPI process placement/pinning and the (sometimes surprisingly large) effects it can have on performance. This is getting a lot more complicated as systems add more NUMA domains and hardware threads. When I was at Intel I encountered a ton of performance problems that were mostly due to bad process placement (which, fortunately, meant they were actually easy to fix!).

Sure, thanks. Expect to receive a bunch of text by tomorrow evening.

Regarding process placement: I was running `make streams` to collect data for the manual chapter. It turned out that the first N/2 processes were indeed placed on the first socket only, so the (N/2+1)-th process, the first one to land on the second socket, added a significant boost in achieved overall bandwidth. This should make for a nice illustration of the subject in the manual. Also, it will benefit users if we get the process mapping right in `make streams`, so that the output is more in line with our recommendation of "just a few processes per node to saturate memory bandwidth".
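
To make the effect concrete, here is a rough toy model in Python. It is only a sketch under stated assumptions: a two-socket node, each process drawing a fixed bandwidth until its socket saturates, and made-up numbers (4 cores per socket, 10 units per process, 25 units per socket). The function name `aggregate_bandwidth` and the placement labels "compact" and "scatter" are hypothetical; none of this is measured `make streams` output.

```python
def aggregate_bandwidth(nprocs, placement, cores_per_socket=4,
                        per_proc_bw=10, socket_bw=25):
    """Toy model of achieved memory bandwidth on a two-socket node.

    All default numbers are illustrative assumptions, not measurements.
    Each socket delivers min(processes_on_socket * per_proc_bw, socket_bw).
    """
    if nprocs < 0 or nprocs > 2 * cores_per_socket:
        raise ValueError("nprocs must fit on the two sockets")

    if placement == "compact":
        # Fill socket 0 completely before using socket 1.
        on_socket = [min(nprocs, cores_per_socket),
                     max(0, nprocs - cores_per_socket)]
    elif placement == "scatter":
        # Alternate processes between the two sockets.
        on_socket = [(nprocs + 1) // 2, nprocs // 2]
    else:
        raise ValueError("placement must be 'compact' or 'scatter'")

    return sum(min(n * per_proc_bw, socket_bw) for n in on_socket)


if __name__ == "__main__":
    # Print the modeled bandwidth for both placements side by side.
    for n in range(1, 9):
        print(n,
              aggregate_bandwidth(n, "compact"),
              aggregate_bandwidth(n, "scatter"))
```

With these assumed numbers, the compact placement is stuck at the single-socket limit for 3 and 4 processes and only climbs again once the fifth process lands on the second socket, whereas the scattered placement reaches the two-socket limit with 6 processes.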

Best regards,
Karli
