Hi Dave,

thanks for the concise summary and for getting a handle on the VecNorm scaling issue.

> My conclusion from the results presented above is that this is a NUMA
> issue because the scaling is very good on Vulcan where the nodes do
> not have NUMA issues.

This is a reasonable conclusion. Am I correct that you used one MPI rank per node for all the figures?


> I've also performed a set of strong scaling runs on a NUMA machine using
> 8 threads but without setting thread affinities.  These runs scale pretty
> well but are initially about a factor of 2 slower than the case where
> thread affinities are set.  Plots of these runs are shown in the last
> attached plot; see the curves marked "aff_yes" and "aff_no".  On this set
> of plots, you can also see that the two-node result is about the same with
> or without affinities set.  Since it appears from the results of using the
> diagnostic printf above that the thread affinities are being properly set
> and recognized by the OS, it seems that this final problem is caused by the
> data residing in a different NUMA domain than the threads mapped to the
> second socket's cores when there are two or more nodes.

The factor of two is a strong indicator of a NUMA hiccup, yes.
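
To illustrate what I suspect is happening (a hypothetical sketch, not the actual threadcomm code path): Linux places each memory page on the NUMA node of the thread that first writes it, so if the vector data is faulted in serially, every page ends up next to the first socket and the threads pinned to the second socket read remote memory in the reduction. Roughly:

/* First-touch sketch (illustration only).  With the serial initialization
 * below, all pages of x are placed on the NUMA node of the master thread;
 * threads bound to the other socket then stream remote memory in the
 * norm-like reduction.  Replacing the init loop with an "omp parallel for"
 * using the same static schedule as the reduction should place each
 * thread's pages in its own domain instead. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const size_t n = (size_t)1 << 25;            /* placeholder size, ~256 MB */
  double      *x = malloc(n * sizeof(double));
  double       sum = 0.0;

  for (size_t i = 0; i < n; ++i) x[i] = 1.0;   /* serial first touch */

  #pragma omp parallel for reduction(+:sum) schedule(static)
  for (size_t i = 0; i < n; ++i) sum += x[i] * x[i];

  printf("sum = %g\n", sum);
  free(x);
  return 0;
}

Compiled with -fopenmp and run with the same affinity settings as your tests, the serial-init version should show roughly the remote-access penalty we are talking about.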

> For the single node case, it would seem that the data is properly distributed
> in memory so that it resides in the same NUMA domain as the core to which
> its thread is bound.  But for the multiple node case, it would seem that the
> data for threads bound to cores in the second socket actually resides in
> memory attached to the first socket.  That the performance result is different
> for a single node and multiple nodes would suggest that a different path
> through the source code is taken for multiple nodes than for a single node.

Hmm, apparently this requires more debugging, then.
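
One way to check this directly, assuming the nodes run Linux and libnuma is available, would be to ask the kernel where a few pages of the vector array actually live right before the kernel runs (print_page_nodes is just a helper name I made up):

/* Debugging sketch: with nodes == NULL, move_pages() only queries and
 * writes the NUMA node of each page (or a negative errno) into status[].
 * Link with -lnuma.  Called from each thread on the slice of the array
 * it operates on, this would show whether the pages used by the second
 * socket's threads really sit on node 0 once more than one node is used. */
#include <numaif.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static void print_page_nodes(const char *tag, void *buf)
{
  long  pagesize = sysconf(_SC_PAGESIZE);
  char *base = (char *)((uintptr_t)buf & ~(uintptr_t)(pagesize - 1)); /* page-align */
  void *pages[4];
  int   status[4];

  for (int i = 0; i < 4; ++i)                  /* sample the first 4 pages */
    pages[i] = base + (size_t)i * pagesize;

  if (move_pages(0, 4, pages, NULL, status, 0) == 0)
    for (int i = 0; i < 4; ++i)
      printf("%s: page %d -> NUMA node %d\n", tag, i, status[i]);
}

That would tell us whether it really is a placement problem or something else entirely.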


> These are my conclusions based on the testing and debugging that I have
> done so far.  I've also verified in isolated cases that threadcomm with
> OpenMP has the same scaling issues.  Do these conclusions seem reasonable?
> Or are there other possible scenarios that could explain my test data?
>
> It would be nice to get this problem fixed so that the threadcomm package
> would be more useful.

Definitely. Since you probably have an isolated test case and hardware at hand: Do you happen to know whether the same scaling issue shows up with VecDot() and/or VecTDot()? They are supposed to run through the same reductions, so this should give us a hint on whether the problem is VecNorm-specific or applies to reductions in general.
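
Something along these lines is what I have in mind (untested sketch; the vector size is just a placeholder, and the actual numbers would come from running it under -log_summary with the same thread and affinity settings as before):

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x, y;
  PetscReal      nrm;
  PetscScalar    dot, tdot;
  PetscInt       i, n = 10000000;             /* placeholder global size */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
  ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n, &x);CHKERRQ(ierr);
  ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  ierr = VecSet(y, 2.0);CHKERRQ(ierr);

  /* Repeat so the logging gets useful numbers; the VecAXPY() modifies x
   * each time so that VecNorm() cannot simply return its cached value. */
  for (i = 0; i < 100; ++i) {
    ierr = VecAXPY(x, 1.0, y);CHKERRQ(ierr);
    ierr = VecNorm(x, NORM_2, &nrm);CHKERRQ(ierr); /* the kernel that scales badly */
    ierr = VecDot(x, y, &dot);CHKERRQ(ierr);       /* supposedly the same reduction */
    ierr = VecTDot(x, y, &tdot);CHKERRQ(ierr);
  }

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

If VecDot() and VecTDot() show the same behavior in the log, we at least know it is the reduction machinery in general and not something peculiar to VecNorm().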

Thanks and best regards,
Karli
