Re: [OMPI devel] NP64 _gather_ problem

Terry Dontje Fri, 17 Sep 2010 13:14:16 -0400

Right, by default all connections are set up on the fly. So when an MPI_Send is executed to a process that there is no connection to yet, a dance happens between the sender and the receiver. Why this happens with np > 60 may have to do with how many connections are being set up at the same time, or with the destination of a connection request not being in the MPI library at that moment.
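
As a rough illustration of the on-the-fly behaviour described above, here is a toy Python model. It is not Open MPI code: the class name LazyEndpoints and its fields are made up for this sketch, and the "handshake" is just a counter. It only shows that the first send to a peer pays for connection setup and later sends reuse the connection.

```python
class LazyEndpoints:
    """Toy model of on-demand connection setup (hypothetical, not Open MPI)."""

    def __init__(self, rank):
        self.rank = rank
        self.connected = set()   # peers with an established connection
        self.handshakes = 0      # number of connection setups performed

    def send(self, peer, payload):
        # The first message to a peer triggers the connection handshake;
        # later messages to the same peer reuse the existing connection.
        if peer not in self.connected:
            self.handshakes += 1
            self.connected.add(peer)
        return (self.rank, peer, payload)


ep = LazyEndpoints(rank=0)
for peer in (1, 2, 1, 3, 2):
    ep.send(peer, b"data")

# Five sends, but only three distinct peers, so three handshakes.
print(ep.handshakes, sorted(ep.connected))
```

In this model, a collective that suddenly talks to many new peers would trigger many handshakes at once, which is the situation being speculated about here.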

It would be interesting to figure out when in the timeline of the job such requests are being delayed. You can get such a timeline by using a tool like Solaris Studio collector/analyzer (which actually has a Linux version).


--td

Steve Wise wrote:

Yes it does. With mpi_preconnect_mpi set to 1, NP64 doesn't stall. So it's not the algorithm in and of itself, but rather some interplay between the algorithm and connection setup, I guess.
On 9/17/2010 5:24 AM, Terry Dontje wrote:
Does setting the MCA parameter mpi_preconnect_mpi to 1 help at all? This might help determine whether it is the actual connection setup between processes that is out of sync, as opposed to something in the actual gather algorithm.
--td

Steve Wise wrote:
Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() changes its algorithm for job sizes > 60 to some binomial method. I changed the threshold to 100 and my NP64 jobs run fine. Now to try and understand what about ompi_coll_tuned_gather_intra_binomial() is causing these connect delays...
On 9/16/2010 1:01 PM, Steve Wise wrote:
Oops. One key typo here: This is the IMB-MPI1 gather test, not barrier. :(
On 9/16/2010 12:05 PM, Steve Wise wrote:
 Hi,
I'm debugging a performance problem with running IMB-MPI1/barrier in an NP64 cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1 from the OFED-1.5.1 distribution. The BTL is openib/iWARP via Chelsio's T3 RNIC. In short, an NP60 or smaller run completes in a timely manner as expected, but NP61 and larger runs slow to a crawl at the 8KB IO size and take ~5-10min to complete. It does complete though. It behaves this way even if I run on > 8 nodes so there are available cores. I.e., an NP64 run on a 16 node cluster still behaves the same way even though there are only 4 ranks on each node. So it's apparently not a thread starvation issue due to lack of cores.
When in the stalled state, I see on the order of 100 or so established iWARP connections on each node. And the connection count increases VERY slowly and sporadically (at its peak there are around 800 connections for an NP64 gather operation). In comparison, when I run the <= NP60 runs, the connections quickly ramp up to the expected amount.
I added hooks in the openib BTL to track the time it takes to set up each connection. In all runs, both <= NP60 and > NP60, the average connection setup time is around 200ms. And the max setup time seen is never much above this value. That tells me that it's not individual connection setup that is the issue.
I then added printfs/fflushes in librdmacm to visually see when a connection is attempted and when it is accepted. When I run with these printfs, I see the connections get set up quickly and evenly in the <= NP60 case. Initially when the job is started, I see a small flurry of connections getting set up, then the run begins and at around 1KB IO size I see a 2nd large flurry of connection setups. Then the test continues and completes. With the > NP60 case, this second round of connection setups is very sporadic and slow. Very slow! I'll see little bursts of ~10-20 connections set up, then long random pauses. The net is that full connection setup for the job takes 5-10min.
During this time the ranks are basically spinning idle awaiting the connections to get set up. So I'm concluding that something above the BTL layer isn't issuing the endpoint connect requests in a timely manner.
Attached are 3 padb dumps during the stall. Anybody see anything interesting in these?
Any ideas how I can further debug this? Once I get above the openib BTL layer my eyes glaze over and I get lost quickly. :) I would greatly appreciate any ideas from the OpenMPI experts!
Thanks in advance,

Steve.


_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle * - Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email [email protected] <mailto:[email protected]>

