Re: [OMPI devel] NP64 _gather_ problem

Steve Wise Thu, 16 Sep 2010 17:05:10 -0400

Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() changes its algorithm for job sizes > 60 to some binomial method. I changed the threshold to 100 and my NP64 jobs run fine. Now to try and understand what about ompi_coll_tuned_gather_intra_binomial() is causing these connect delays...
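For anyone following along, here is a rough sketch of what a size-based switch from a linear to a binomial gather means for who talks to whom. This is not the Open MPI source: pick_gather_algo(), binomial_parent() and binomial_peer_count() are hypothetical names, the threshold is passed in as a parameter, and the tree layout (parent = virtual rank with its lowest set bit cleared) is just one common binomial-tree convention that may differ from what ompi_coll_tuned_gather_intra_binomial() actually builds.

```c
#include <assert.h>

/* Hypothetical illustration only; not taken from the Open MPI tuned component. */
enum gather_algo { GATHER_LINEAR, GATHER_BINOMIAL };

/* Size-based decision: communicators larger than the threshold go binomial. */
enum gather_algo pick_gather_algo(int comm_size, int large_comm_threshold)
{
    return (comm_size > large_comm_threshold) ? GATHER_BINOMIAL : GATHER_LINEAR;
}

/* Parent of `rank` in a binomial tree rooted at `root`, or -1 for the root.
 * Ranks are shifted so the root becomes virtual rank 0; the parent is the
 * virtual rank with its lowest set bit cleared. */
int binomial_parent(int rank, int root, int comm_size)
{
    assert(comm_size > 0);
    assert(rank >= 0 && rank < comm_size);
    assert(root >= 0 && root < comm_size);

    int vrank = (rank - root + comm_size) % comm_size;
    if (vrank == 0) {
        return -1;
    }
    int vparent = vrank & (vrank - 1);
    return (vparent + root) % comm_size;
}

/* Number of distinct peers `rank` exchanges data with for one root:
 * its parent (if any) plus all of its children. */
int binomial_peer_count(int rank, int root, int comm_size)
{
    int peers = (binomial_parent(rank, root, comm_size) >= 0) ? 1 : 0;
    for (int r = 0; r < comm_size; r++) {
        if (r != rank && binomial_parent(r, root, comm_size) == rank) {
            peers++;
        }
    }
    return peers;
}
```

With a linear gather every non-root rank only needs a connection to the root. With a tree like the one sketched here, non-root ranks also talk to each other, so the set of endpoints that must be connected is different from the linear case.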


On 9/16/2010 1:01 PM, Steve Wise wrote:

Oops. One key typo here: This is the IMB-MPI1 gather test, not barrier. :(
On 9/16/2010 12:05 PM, Steve Wise wrote:
Hi,
I'm debugging a performance problem with running IMB-MPI1/barrier in an NP64 cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1 from the OFED-1.5.1 distribution. The BTL is openib/iWARP via Chelsio's T3 RNIC. In short, a NP60 and smaller run completes in a timely manner as expected, but NP61 and larger runs come to a crawl at the 8KB IO size and take ~5-10min to complete. It does complete though. It behaves this way even if I run on > 8 nodes so there are available cores. I.e., a NP64 on a 16 node cluster still behaves the same way even though there are only 4 ranks on each node. So it's apparently not a thread starvation issue due to lack of cores.
When in the stalled state, I see on the order of 100 or so established iwarp connections on each node. And the connection count increases VERY slowly and sporadically (at its peak there are around 800 connections for a NP64 gather operation). In comparison, when I run the <= NP60 runs, the connections quickly ramp up to the expected amount.
I added hooks in the openib BTL to track the time it takes to set up each connection. In all runs, both <= NP60 and > NP60, the average connection setup time is around 200ms. And the max setup time seen is never much above this value. That tells me that it's not individual connection setup that is the issue.
I then added printfs/fflushes in librdmacm to visually see when a connection is attempted and when it is accepted. When I run with these printfs, I see the connections get set up quickly and evenly in the <= NP60 case. Initially when the job is started, I see a small flurry of connections getting set up, then the run begins and at around 1KB IO size I see a 2nd large flurry of connection setups. Then the test continues and completes. With the > NP60 case, this second round of connection setups is very sporadic and slow. Very slow! I'll see little bursts of ~10-20 connections set up, then long random pauses. The net is that full connection setup for the job takes 5-10min. During this time the ranks are basically spinning idle awaiting the connections to get set up. So I'm concluding that something above the BTL layer isn't issuing the endpoint connect requests in a timely manner.
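To make the timing measurement above concrete, here is a minimal sketch of the kind of per-connection bookkeeping such hooks could do. These are not the actual hooks added to the openib BTL: struct conn_stats, now_ns(), conn_stats_record() and conn_stats_avg_ms() are hypothetical names, and the sketch assumes a POSIX system with clock_gettime().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical accumulator for connection setup times (not Open MPI code). */
struct conn_stats {
    uint64_t count;     /* connections completed */
    uint64_t total_ns;  /* sum of all setup times */
    uint64_t max_ns;    /* slowest single setup */
};

/* Monotonic timestamp in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Record one connection: start_ns taken when the connect is issued,
 * end_ns taken when the connection is established. */
void conn_stats_record(struct conn_stats *s, uint64_t start_ns, uint64_t end_ns)
{
    uint64_t elapsed = end_ns - start_ns;
    s->count++;
    s->total_ns += elapsed;
    if (elapsed > s->max_ns) {
        s->max_ns = elapsed;
    }
}

/* Average setup time in milliseconds, or 0.0 if nothing was recorded. */
double conn_stats_avg_ms(const struct conn_stats *s)
{
    if (s->count == 0) {
        return 0.0;
    }
    return (double)s->total_ns / (double)s->count / 1e6;
}
```

A counter like this only measures each setup from request to completion. It says nothing about how long the job waited before the request was issued, which is why a healthy average and max are consistent with the whole job still taking minutes to connect.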
Attached are 3 padb dumps during the stall. Anybody see anything interesting in these?
Any ideas how I can further debug this? Once I get above the openib BTL layer my eyes glaze over and I get lost quickly. :) I would greatly appreciate any ideas from the OpenMPI experts!
Thanks in advance,

Steve.


_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel