Just an update for folks: the connection setup latency was a bug in my
iw_cxgb3 RDMA driver. It wasn't turning off RX coalescing for the iWARP
connections, which added 100-200 ms of latency, since iWARP connection
setup uses TCP streaming-mode messages to negotiate iWARP connection
mode. Once I fixed this, the gather operation for > NP60 behaves much
better...
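(For anyone who wants to check whether coalescing is in play on their
own setup: at the plain Ethernet level you can inspect and adjust the
NIC's coalescing settings with ethtool, roughly like this:

    ethtool -c eth2                            # show current coalescing settings
    ethtool -C eth2 rx-usecs 0 rx-frames 1     # effectively disable RX coalescing

where eth2 is just a placeholder interface name, and whether a given
driver honors these knobs varies. The actual fix here was inside the
iw_cxgb3 driver itself, for the iWARP connections specifically.)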
Thanks Terry for helping.
Steve.
On 09/17/2010 03:46 PM, Steve Wise wrote:
I'll look into Solaris Studio. I think the connections are somehow
getting single-threaded or funneled by the gather algorithm. And since
each one takes ~160 ms to set up, and there are ~3600 connections
getting set up, we end up with a 7-minute run time. Now, 160 ms seems
way too high for setting up even an iWARP connection, which has some
streaming-mode TCP exchanges as part of connection setup; I would think
it should be around a few hundred _usecs_. So I'm pursuing the connect
latency too.
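(Back-of-the-envelope: ~3600 connections x ~160 ms is roughly 575
seconds, or about 9.5 minutes, if the setups were fully serialized, so
a ~7-minute run time is consistent with connection setup being almost
completely serialized rather than overlapped.)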
Thanks,
Steve.
On 9/17/2010 12:13 PM, Terry Dontje wrote:
Right, by default all connections are handled on the fly. So when an
MPI_Send is executed to a process that there is not yet a connection to,
a connection dance happens between the sender and the receiver. Why this
shows up at np > 60 may have to do with how many connections are being
set up at the same time, or with the destination of a connection request
not currently being inside the MPI library (and therefore not
progressing the handshake).
It would be interesting to figure out where in the timeline of the job
such requests are being delayed. You can get such a timeline with a tool
like the Solaris Studio collector/analyzer (which actually has a Linux
version).
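For reference, the basic collector workflow is roughly the following;
the exact MPI-aware invocation varies by Studio version, so treat the
mpirun wrapping below as an assumption to double-check against the
Studio documentation:

    collect mpirun -np 64 ./IMB-MPI1 gather    # record an experiment (test.1.er)
    analyzer test.1.er                         # browse the timeline in the GUI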
--td
Steve Wise wrote:
Yes it does. With mpi_preconnect_mpi set to 1, NP64 doesn't stall. So
it's not the algorithm in and of itself, but rather, I guess, some
interplay between the algorithm and connection setup.
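(For reference, a setting like this is typically passed on the mpirun
command line, something like:

    mpirun --mca mpi_preconnect_mpi 1 -np 64 -hostfile hosts ./IMB-MPI1 gather

where the hostfile name and IMB arguments are placeholders rather than
the exact command line used here.)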
On 9/17/2010 5:24 AM, Terry Dontje wrote:
Does setting the MCA parameter mpi_preconnect_mpi to 1 help at all?
This might help determine whether it is the actual connection setup
between processes being out of sync, as opposed to something in the
gather algorithm itself.
--td
Steve Wise wrote:
Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() changes its
algorithm to a binomial method for job sizes > 60. I changed that
threshold to 100 and my NP64 jobs run fine. Now to try to understand
what it is about ompi_coll_tuned_gather_intra_binomial() that causes
these connect delays...
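A possibly easier way to experiment with this than editing the source is
the tuned component's dynamic rules, e.g. forcing the linear gather for
all job sizes:

    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_gather_algorithm 1 ...

(The algorithm numbering -- 1 = basic linear, 2 = binomial, 3 = linear
with synchronization -- is my reading of the tuned coll code and is
worth double-checking with ompi_info.)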
On 9/16/2010 1:01 PM, Steve Wise wrote:
Oops. One key typo here: This is the IMB-MPI1 gather test, not
barrier. :(
On 9/16/2010 12:05 PM, Steve Wise wrote:
Hi,
I'm debugging a performance problem running IMB-MPI1/barrier on an NP64
cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1 from the
OFED-1.5.1 distribution. The BTL is openib/iWARP via Chelsio's T3 RNIC.

In short, NP60 and smaller runs complete in a timely manner as expected,
but NP61 and larger runs slow to a crawl at the 8 KB IO size and take
~5-10 minutes to complete. They do complete, though. The behavior is the
same even if I run on more than 8 nodes so that there are spare cores;
i.e., an NP64 run on a 16-node cluster still behaves the same way even
though there are only 4 ranks on each node. So it's apparently not a
thread starvation issue due to lack of cores.

In the stalled state, I see on the order of 100 or so established iWARP
connections on each node, and the connection count increases VERY slowly
and sporadically (at its peak there are around 800 connections for an
NP64 gather operation). In comparison, for the <= NP60 runs the
connections quickly ramp up to the expected count.

I added hooks in the openib BTL to track the time it takes to set up
each connection. In all runs, both <= NP60 and > NP60, the average
connection setup time is around 200 ms, and the maximum setup time seen
is never much above that value. That tells me it's not individual
connection setup that is the issue.

I then added printfs/fflushes in librdmacm to see when a connection is
attempted and when it is accepted. With these printfs, I see the
connections get set up quickly and evenly in the <= NP60 case:
initially, when the job starts, there is a small flurry of connection
setups; then the run begins, and at around the 1 KB IO size there is a
second large flurry of connection setups; then the test continues and
completes. In the > NP60 case, this second round of connection setups is
very sporadic and slow. Very slow! I see little bursts of ~10-20
connection setups, then long random pauses. The net is that full
connection setup for the job takes 5-10 minutes, during which the ranks
are basically spinning idle waiting for the connections to be set up. So
I'm concluding that something above the BTL layer isn't issuing the
endpoint connect requests in a timely manner.
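For what it's worth, the librdmacm-side instrumentation is nothing
fancy; it amounts to roughly the pattern below (a sketch, not the actual
patch -- the wrapper name and log format are made up):

    #include <stdio.h>
    #include <time.h>
    #include <rdma/rdma_cma.h>

    /* Sketch: time a single rdma_connect() call and log it immediately. */
    static int timed_rdma_connect(struct rdma_cm_id *id,
                                  struct rdma_conn_param *param)
    {
            struct timespec t0, t1;
            int ret;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            ret = rdma_connect(id, param);      /* existing librdmacm call */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            fprintf(stderr, "connect id=%p ret=%d took %ld usec\n",
                    (void *)id, ret,
                    (t1.tv_sec - t0.tv_sec) * 1000000L +
                    (t1.tv_nsec - t0.tv_nsec) / 1000L);
            fflush(stderr);
            return ret;
    }

Note that rdma_connect() only posts the connect request; the connection
isn't up until the RDMA_CM_EVENT_ESTABLISHED event arrives, so the
interesting timestamp pair is really the connect/accept call versus that
event on the CM event channel.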
Attached are 3 padb dumps during the stall. Anybody see
anything interesting in these?
Any ideas on how I can debug this further? Once I get above the openib
BTL layer my eyes glaze over and I get lost quickly. :) I would greatly
appreciate any ideas from the Open MPI experts!
Thanks in advance,
Steve.
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel