Just an update for folks: the connection setup latency was a bug in my
iw_cxgb3 RDMA driver. It wasn't turning off RX coalescing for the iWARP
connections, which added 100-200 ms of latency, since iWARP connection
setup uses TCP streaming-mode messages to negotiate iWARP connection
mode. Once I fixed this, the gather operation for > NP60 behaves much
better...
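(For anyone who wants to check whether coalescing is in play on their
own setup: at the plain Ethernet level you can inspect and adjust the
NIC's coalescing settings with ethtool, roughly like this:

    ethtool -c eth2                            # show current coalescing settings
    ethtool -C eth2 rx-usecs 0 rx-frames 1     # effectively disable RX coalescing

where eth2 is just a placeholder interface name, and whether a given
driver honors these knobs varies. The actual fix here was inside the
iw_cxgb3 driver itself, for the iWARP connections specifically.)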
Thanks Terry for helping.
Steve.
On 09/17/2010 03:46 PM, Steve Wise wrote:
I'll look into Solaris Studio. I think the connections are somehow
getting single-threaded or funneled by the gather algorithm. And since
each one takes ~160 ms to set up, and there are ~3600 connections
getting set up, we end up with a 7-minute run time. Now, 160 ms seems
way too high for setting up even an iWARP connection, which has some
streaming-mode TCP exchanges as part of connection setup; I would think
it should be around a few hundred _usecs_. So I'm pursuing the connect
latency too.
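(Back-of-the-envelope: ~3600 connections x ~160 ms is roughly 575
seconds, or about 9.5 minutes, if the setups were fully serialized, so
a ~7-minute run time is consistent with connection setup being almost
completely serialized rather than overlapped.)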
Thanks,
Steve.
On 9/17/2010 12:13 PM, Terry Dontje wrote:
Right, by default all connections are handled on the fly. So when an
MPI_Send is executed to a process that there is not yet a connection to,
a connection dance happens between the sender and the receiver. Why this
shows up at np > 60 may have to do with how many connections are being
set up at the same time, or with the destination of a connection request
not currently being inside the MPI library (and therefore not
progressing the handshake).
It would be interesting to figure out where in the timeline of the job
such requests are being delayed. You can get such a timeline with a tool
like the Solaris Studio collector/analyzer (which actually has a Linux
version).
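For reference, the basic collector workflow is roughly the following;
the exact MPI-aware invocation varies by Studio version, so treat the
mpirun wrapping below as an assumption to double-check against the
Studio documentation:

    collect mpirun -np 64 ./IMB-MPI1 gather    # record an experiment (test.1.er)
    analyzer test.1.er                         # browse the timeline in the GUI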
--td
Steve Wise wrote:
Yes it does. With mpi_preconnect_mpi set to 1, NP64 doesn't stall. So
it's not the algorithm in and of itself, but rather, I guess, some
interplay between the algorithm and connection setup.
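(For reference, a setting like this is typically passed on the mpirun
command line, something like:

    mpirun --mca mpi_preconnect_mpi 1 -np 64 -hostfile hosts ./IMB-MPI1 gather

where the hostfile name and IMB arguments are placeholders rather than
the exact command line used here.)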
On 9/17/2010 5:24 AM, Terry Dontje wrote:
Does setting the MCA parameter mpi_preconnect_mpi to 1 help at all?
This might help determine whether it is the actual connection setup
between processes being out of sync, as opposed to something in the
gather algorithm itself.
--td
Steve Wise wrote:
Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() changes its
algorithm to a binomial method for job sizes > 60. I changed that
threshold to 100 and my NP64 jobs run fine. Now to try to understand
what it is about ompi_coll_tuned_gather_intra_binomial() that causes
these connect delays...
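A possibly easier way to experiment with this than editing the source is
the tuned component's dynamic rules, e.g. forcing the linear gather for
all job sizes:

    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_gather_algorithm 1 ...

(The algorithm numbering -- 1 = basic linear, 2 = binomial, 3 = linear
with synchronization -- is my reading of the tuned coll code and is
worth double-checking with ompi_info.)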
On 9/16/2010 1:01 PM, Steve Wise wrote:
Oops. One key typo here: This is the IMB-MPI1 gather test, not
barrier. :(
On 9/16/2010 12:05 PM, Steve Wise wrote:
Hi,
I'm debugging a performance problem running IMB-MPI1/barrier on an NP64
cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1 from the
OFED-1.5.1 distribution. The BTL is openib/iWARP via Chelsio's T3 RNIC.

In short, NP60 and smaller runs complete in a timely manner as expected,
but NP61 and larger runs slow to a crawl at the 8 KB IO size and take
~5-10 minutes to complete. They do complete, though. The behavior is the
same even if I run on more than 8 nodes so that there are spare cores;
i.e., an NP64 run on a 16-node cluster still behaves the same way even
though there are only 4 ranks on each node. So it's apparently not a
thread starvation issue due to lack of cores.

In the stalled state, I see on the order of 100 or so established iWARP
connections on each node, and the connection count increases VERY slowly
and sporadically (at its peak there are around 800 connections for an
NP64 gather operation). In comparison, for the <= NP60 runs the
connections quickly ramp up to the expected count.

I added hooks in the openib BTL to track the time it takes to set up
each connection. In all runs, both <= NP60 and > NP60, the average
connection setup time is around 200 ms, and the maximum setup time seen
is never much above that value. That tells me it's not individual
connection setup that is the issue.

I then added printfs/fflushes in librdmacm to see when a connection is
attempted and when it is accepted. With these printfs, I see the
connections get set up quickly and evenly in the <= NP60 case:
initially, when the job starts, there is a small flurry of connection
setups; then the run begins, and at around the 1 KB IO size there is a
second large flurry of connection setups; then the test continues and
completes. In the > NP60 case, this second round of connection setups is
very sporadic and slow. Very slow! I see little bursts of ~10-20
connection setups, then long random pauses. The net is that full
connection setup for the job takes 5-10 minutes, during which the ranks
are basically spinning idle waiting for the connections to be set up. So
I'm concluding that something above the BTL layer isn't issuing the
endpoint connect requests in a timely manner.
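For what it's worth, the librdmacm-side instrumentation is nothing
fancy; it amounts to roughly the pattern below (a sketch, not the actual
patch -- the wrapper name and log format are made up):

    #include <stdio.h>
    #include <time.h>
    #include <rdma/rdma_cma.h>

    /* Sketch: time a single rdma_connect() call and log it immediately. */
    static int timed_rdma_connect(struct rdma_cm_id *id,
                                  struct rdma_conn_param *param)
    {
            struct timespec t0, t1;
            int ret;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            ret = rdma_connect(id, param);      /* existing librdmacm call */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            fprintf(stderr, "connect id=%p ret=%d took %ld usec\n",
                    (void *)id, ret,
                    (t1.tv_sec - t0.tv_sec) * 1000000L +
                    (t1.tv_nsec - t0.tv_nsec) / 1000L);
            fflush(stderr);
            return ret;
    }

Note that rdma_connect() only posts the connect request; the connection
isn't up until the RDMA_CM_EVENT_ESTABLISHED event arrives, so the
interesting timestamp pair is really the connect/accept call versus that
event on the CM event channel.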
Attached are 3 padb dumps during the stall. Anybody see
anything interesting in these?
Any ideas on how I can debug this further? Once I get above the openib
BTL layer my eyes glaze over and I get lost quickly. :) I would greatly
appreciate any ideas from the Open MPI experts!
Thanks in advance,
Steve.
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel