Hi Dave,

thanks for the concise summary and for getting a handle on the VecNorm scaling issue.

> My conclusion from the results presented above is that this is a NUMA
> issue because the scaling is very good on Vulcan where the nodes do
> not have NUMA issues.

This is a reasonable conclusion. Am I correct that you used one MPI rank per node for all the figures?


> I've also performed a set of strong scaling runs on a NUMA machine using
> 8 threads but without setting thread affinities.  These runs scale pretty
> well but are initially about a factor of 2 slower than the case where
> thread affinities are set.  Plots of these runs are shown in the last
> attached plot; see the curves marked "aff_yes" and "aff_no".  On this set
> of plots, you can also see that the two-node result is about the same with
> or without affinities set.  Since it appears from the results of using the
> diagnostic printf above that the thread affinities are being properly set
> and recognized by the OS, it seems that this final problem is caused by the
> data residing in a different NUMA domain than the threads mapped to the
> second socket's cores when there are two or more nodes.

The factor of two is a strong indicator of a NUMA hiccup, yes.
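
To illustrate what I suspect is happening (a hypothetical sketch, not the actual threadcomm code path): Linux places each memory page on the NUMA node of the thread that first writes it, so if the vector data is faulted in serially, every page ends up next to the first socket and the threads pinned to the second socket read remote memory in the reduction. Roughly:

/* First-touch sketch (illustration only).  With the serial initialization
 * below, all pages of x are placed on the NUMA node of the master thread;
 * threads bound to the other socket then stream remote memory in the
 * norm-like reduction.  Replacing the init loop with an "omp parallel for"
 * using the same static schedule as the reduction should place each
 * thread's pages in its own domain instead. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const size_t n = (size_t)1 << 25;            /* placeholder size, ~256 MB */
  double      *x = malloc(n * sizeof(double));
  double       sum = 0.0;

  for (size_t i = 0; i < n; ++i) x[i] = 1.0;   /* serial first touch */

  #pragma omp parallel for reduction(+:sum) schedule(static)
  for (size_t i = 0; i < n; ++i) sum += x[i] * x[i];

  printf("sum = %g\n", sum);
  free(x);
  return 0;
}

Compiled with -fopenmp and run with the same affinity settings as your tests, the serial-init version should show roughly the remote-access penalty we are talking about.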

> For the single node case, it would seem that the data is properly distributed
> in memory so that it resides in the same NUMA domain as the core to which
> its thread is bound.  But for the multiple node case, it would seem that the
> data for threads bound to cores in the second socket actually resides in
> memory attached to the first socket.  That the performance result is different
> for a single node and multiple nodes would suggest that a different path
> through the source code is taken for multiple nodes than for a single node.

Hmm, apparently this requires more debugging, then.
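
One way to check this directly, assuming the nodes run Linux and libnuma is available, would be to ask the kernel where a few pages of the vector array actually live right before the kernel runs (print_page_nodes is just a helper name I made up):

/* Debugging sketch: with nodes == NULL, move_pages() only queries and
 * writes the NUMA node of each page (or a negative errno) into status[].
 * Link with -lnuma.  Called from each thread on the slice of the array
 * it operates on, this would show whether the pages used by the second
 * socket's threads really sit on node 0 once more than one node is used. */
#include <numaif.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static void print_page_nodes(const char *tag, void *buf)
{
  long  pagesize = sysconf(_SC_PAGESIZE);
  char *base = (char *)((uintptr_t)buf & ~(uintptr_t)(pagesize - 1)); /* page-align */
  void *pages[4];
  int   status[4];

  for (int i = 0; i < 4; ++i)                  /* sample the first 4 pages */
    pages[i] = base + (size_t)i * pagesize;

  if (move_pages(0, 4, pages, NULL, status, 0) == 0)
    for (int i = 0; i < 4; ++i)
      printf("%s: page %d -> NUMA node %d\n", tag, i, status[i]);
}

That would tell us whether it really is a placement problem or something else entirely.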


> These are my conclusions based on the testing and debugging that I have
> done so far.  I've also verified in isolated cases that threadcomm with
> OpenMP has the same scaling issues.  Do these conclusions seem reasonable?
> Or are there other possible scenarios that could explain my test data?
>
> It would be nice to get this problem fixed so that the threadcomm package
> would be more useful.

Definitely. Since you probably have an isolated test case and hardware at hand: Do you happen to know whether the same scaling issue shows up with VecDot() and/or VecTDot()? They are supposed to run through the same reductions, so this should give us a hint on whether the problem is VecNorm-specific or applies to reductions in general.
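
Something along these lines is what I have in mind (untested sketch; the vector size is just a placeholder, and the actual numbers would come from running it under -log_summary with the same thread and affinity settings as before):

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x, y;
  PetscReal      nrm;
  PetscScalar    dot, tdot;
  PetscInt       i, n = 10000000;             /* placeholder global size */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
  ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n, &x);CHKERRQ(ierr);
  ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  ierr = VecSet(y, 2.0);CHKERRQ(ierr);

  /* Repeat so the logging gets useful numbers; the VecAXPY() modifies x
   * each time so that VecNorm() cannot simply return its cached value. */
  for (i = 0; i < 100; ++i) {
    ierr = VecAXPY(x, 1.0, y);CHKERRQ(ierr);
    ierr = VecNorm(x, NORM_2, &nrm);CHKERRQ(ierr); /* the kernel that scales badly */
    ierr = VecDot(x, y, &dot);CHKERRQ(ierr);       /* supposedly the same reduction */
    ierr = VecTDot(x, y, &tdot);CHKERRQ(ierr);
  }

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

If VecDot() and VecTDot() show the same behavior in the log, we at least know it is the reduction machinery in general and not something peculiar to VecNorm().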

Thanks and best regards,
Karli
