Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() switches to a
binomial algorithm for job sizes > 60. I changed the threshold to 100
and my NP64 jobs run fine. Now to try and understand what about
ompi_coll_tuned_gather_intra_binomial() is causing these connect
delays...
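For anyone following along, here is a minimal Python sketch of the peer
pattern a generic binomial-tree gather produces, compared with a linear
gather. It is an illustration under assumptions, not the Open MPI code:
the function names are made up for this note, and the tree that
ompi_coll_tuned_gather_intra_binomial() actually builds may order its
children differently. What it shows is that a linear gather only ever
pairs each rank with the root, while a binomial gather makes non-root
ranks exchange data with each other, so it needs connections between
pairs that the linear pattern never touches.

```python
def binomial_parent(rank, size, root=0):
    """Parent of `rank` in a generic binomial tree rooted at `root` (None for the root)."""
    vrank = (rank - root) % size          # rank relative to the root
    if vrank == 0:
        return None
    parent_v = vrank & (vrank - 1)        # clear the lowest set bit
    return (parent_v + root) % size


def binomial_children(rank, size, root=0):
    """Children of `rank` in the same generic binomial tree."""
    vrank = (rank - root) % size
    children = []
    mask = 1
    while mask < size:
        if vrank & mask:                  # reached this rank's own lowest set bit
            break
        child_v = vrank | mask
        if child_v < size:                # skip children that fall outside the job
            children.append((child_v + root) % size)
        mask <<= 1
    return children


def binomial_peers(rank, size, root=0):
    """Every rank that `rank` sends to or receives from in the binomial gather."""
    peers = binomial_children(rank, size, root)
    parent = binomial_parent(rank, size, root)
    if parent is not None:
        peers.append(parent)
    return sorted(peers)


def linear_peers(rank, size, root=0):
    """Peers in a linear gather: the root talks to everyone, everyone else only to the root."""
    if rank == root:
        return [r for r in range(size) if r != root]
    return [root]


if __name__ == "__main__":
    size = 64
    for rank in (0, 32, 48, 63):
        print(rank,
              "linear peer count:", len(linear_peers(rank, size)),
              "binomial peers:", binomial_peers(rank, size))
```

Both patterns have size - 1 sender/receiver pairs in total; the
difference is which pairs they are.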
On 9/16/2010 1:01 PM, Steve Wise wrote:
Oops. One key typo here: This is the IMB-MPI1 gather test, not
barrier. :(
On 9/16/2010 12:05 PM, Steve Wise wrote:
Hi,
I'm debugging a performance problem with running IMB-MPI1/barrier in
an NP64 cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1
from the OFED-1.5.1 distribution. The BTL is openib/iWARP via
Chelsio's T3 RNIC. In short, an NP60 and smaller run completes in a
timely manner as expected, but NP61 and larger runs come to a crawl
at the 8KB IO size and take ~5-10min to complete. It does complete
though. It behaves this way even if I run on > 8 nodes so there are
available cores, i.e. an NP64 run on a 16-node cluster still behaves
the same way even though there are only 4 ranks on each node. So it's
apparently not a thread starvation issue due to lack of cores. When
in the stalled state, I see on the order of 100 or so established
iWARP connections on each node. And the connection count increases
VERY slowly and sporadically (at its peak there are around 800
connections for an NP64 gather operation). In comparison, when I run
the <= NP60 runs, the connections quickly ramp up to the expected
amount. I added hooks in the openib BTL to track the time it takes
to set up each connection. In all runs, both <= NP60 and > NP60, the
average connection setup time is around 200 ms, and the max setup
time seen is never much above this value. That tells me that it's not
individual connection setup that is the issue. I then added
printfs/fflushes in librdmacm to visually see when a connection is
attempted and when it is accepted. When I run with these printfs, I
see the connections get set up quickly and evenly in the <= NP60
case. Initially when the job is started, I see a small flurry of
connections getting set up, then the run begins and at around 1KB IO
size I see a 2nd large flurry of connection setups. Then the test
continues and completes. With the > NP60 case, this second round of
connection setups is very sporadic and slow. Very slow! I'll see
little bursts of ~10-20 connections set up, then long random pauses.
The net is that full connection setup for the job takes 5-10min.
During this time the ranks are basically spinning idle awaiting the
connections to get set up. So I'm concluding that something above the
BTL layer isn't issuing the endpoint connect requests in a timely
manner.
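A quick sanity check on those numbers, as a Python sketch. The helper
names are made up for this note, and treating the ~800 peak figure as
the number of connections one node has to set up is an assumption. Even
if every one of those connections were established strictly one after
another at ~200 ms each, the total would still fall short of the
observed 5-10min, which fits the idea that most of the time goes to
waiting for connect requests to be issued rather than to the setups
themselves.

```python
def serialized_setup_seconds(num_connections, per_connection_ms):
    """Upper bound on setup time if connections were made strictly one after another."""
    return num_connections * per_connection_ms / 1000.0


def unexplained_fraction(observed_seconds, busy_seconds):
    """Fraction of the observed wall time not covered by `busy_seconds` of setup work."""
    return max(0.0, 1.0 - busy_seconds / observed_seconds)


if __name__ == "__main__":
    # Figures quoted in the message; reading ~800 as one node's connection
    # count is an assumption made only for this estimate.
    busy = serialized_setup_seconds(800, 200)
    for observed_min in (5, 10):
        frac = unexplained_fraction(observed_min * 60, busy)
        print(f"{observed_min} min observed: {busy:.0f} s of serialized setup, "
              f"{frac:.0%} of the wall time unaccounted for")
```

Real setups overlap across ranks, so the serialized figure overstates
the time spent on setup and the unaccounted share is, if anything,
larger.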
Attached are 3 padb dumps during the stall. Anybody see anything
interesting in these?
Any ideas how I can further debug this? Once I get above the openib
BTL layer my eyes glaze over and I get lost quickly. :) I would
greatly appreciate any ideas from the OpenMPI experts!
Thanks in advance,
Steve.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel