Yes it does. With mpi_preconnect_mpi set to 1, the NP64 run doesn't stall. So it's
not the algorithm in and of itself, but rather some interplay
between the algorithm and connection setup, I guess.
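(For the archives: I set it on the mpirun command line, e.g. something like

    mpirun -np 64 --mca btl openib,self --mca mpi_preconnect_mpi 1 ./IMB-MPI1 Gather

where the -np and BTL options are just my setup, so adjust to taste. As I
understand it, with preconnect enabled all connections are established during
MPI_Init instead of lazily on first use, which is presumably why the stall
disappears.)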
On 9/17/2010 5:24 AM, Terry Dontje wrote:
Does setting the MCA parameter mpi_preconnect_mpi to 1 help at all? This
might help determine whether it is the actual connection setup
between processes that is out of sync, as opposed to something in
the gather algorithm itself.
--td
Steve Wise wrote:
Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() switches its
algorithm to a binomial method for job sizes > 60. I changed the
threshold to 100 and my NP64 jobs run fine. Now to try to
understand what about ompi_coll_tuned_gather_intra_binomial() is
causing these connect delays...
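For anyone following along, here is a rough sketch of why the binomial variant
changes the connection pattern. It uses a generic binomial tree rooted at rank
0, which is not necessarily the exact tree the tuned module builds, so treat it
as illustrative only. In a linear gather every non-root rank talks only to the
root, while in a binomial gather each rank also exchanges data with a parent
and children in the tree, so a bunch of peer pairs that have never communicated
before all need connections at roughly the same time:

    /* Illustrative only: parent/children of each rank in a generic
     * binomial gather tree rooted at 0 (not necessarily the exact
     * tree ompi_coll_tuned_gather_intra_binomial() builds). */
    #include <stdio.h>

    int main(void)
    {
        const int np = 64;
        for (int rank = 0; rank < np; rank++) {
            /* Parent: clear the lowest set bit of the rank. */
            int parent = (rank == 0) ? -1 : (rank & (rank - 1));

            printf("rank %2d -> parent %2d, children:", rank, parent);
            /* Children: rank + 2^k for each 2^k below the rank's
             * lowest set bit (rank 0 gets every power of two < np). */
            for (int mask = 1;
                 (rank == 0 || mask < (rank & -rank)) && rank + mask < np;
                 mask <<= 1)
                printf(" %d", rank + mask);
            printf("\n");
        }
        return 0;
    }

If I remember right, you can also steer the decision without recompiling via
the tuned module's dynamic rules (coll_tuned_use_dynamic_rules=1 plus
coll_tuned_gather_algorithm), which might be a less invasive knob than
patching the threshold.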
On 9/16/2010 1:01 PM, Steve Wise wrote:
Oops. One key typo here: This is the IMB-MPI1 gather test, not
barrier. :(
On 9/16/2010 12:05 PM, Steve Wise wrote:
Hi,
I'm debugging a performance problem with running IMB-MPI1/barrier on
an NP64 cluster (8 nodes, 8 cores each). I'm using openmpi-1.4.1
from the OFED-1.5.1 distribution. The BTL is openib/iWARP via
Chelsio's T3 RNIC. In short, NP60 and smaller runs complete in a
timely manner as expected, but NP61 and larger runs slow to a
crawl at the 8KB IO size and take ~5-10 minutes to complete. They do
complete, though.
The behavior is the same even if I run on more than 8 nodes so that
there are spare cores; i.e., an NP64 run on a 16-node cluster still
behaves the same way even though there are only 4 ranks on each
node. So it's apparently not a thread-starvation issue due to lack
of cores.
When in the stalled state, I see on the order of 100 or
so established iWARP connections on each node, and the connection
count increases VERY slowly and sporadically (at its peak there are
around 800 connections for an NP64 gather operation). In
comparison, when I run the <= NP60 jobs, the connections quickly
ramp up to the expected count.
I added hooks in the openib BTL to
track the time it takes to set up each connection. In all runs,
both <= NP60 and > NP60, the average connection setup time is
around 200ms, and the maximum setup time seen is never much above this
value. That tells me that individual connection setup is not
the issue.
I then added printfs/fflushes in librdmacm so I can
see when a connection is attempted and when it is
accepted. With these printfs, I see the connections get
set up quickly and evenly in the <= NP60 case: initially, when the
job is started, I see a small flurry of connections getting set up,
then the run begins, and at around the 1KB IO size I see a second large
flurry of connection setups. Then the test continues and
completes. In the > NP60 case, this second round of connection
setups is very sporadic and slow. Very slow! I see little
bursts of ~10-20 connection setups, then long random pauses. The
net is that full connection setup for the job takes 5-10 minutes, and
during this time the ranks are basically spinning idle, waiting for the
connections to be set up.
So I'm concluding that something above
the BTL layer isn't issuing the endpoint connect requests in a
timely manner.
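In case it helps anyone reproduce this outside of IMB, here is a stripped-down
gather loop along the lines of what runs at the point of the stall. The 8KB
size, the iteration count, and the use of MPI_CHAR are my own illustrative
choices, not IMB's actual parameters:

    /* Minimal MPI_Gather loop to mimic the pattern where the stall
     * shows up. Sizes/counts are illustrative, not IMB's settings. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int msg_size = 8192;   /* the IO size where >NP60 runs stall */
        const int iters = 100;

        MPI_Init(&argc, &argv);

        int rank, np;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        char *sendbuf = calloc(msg_size, 1);
        char *recvbuf = (rank == 0) ? malloc((size_t)msg_size * np) : NULL;

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Gather(sendbuf, msg_size, MPI_CHAR,
                       recvbuf, msg_size, MPI_CHAR, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("np=%d size=%d: %.3f sec for %d gathers\n",
                   np, msg_size, t1 - t0, iters);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }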
Attached are 3 padb dumps during the stall. Anybody see anything
interesting in these?
Any ideas on how I can debug this further? Once I get above the
openib BTL layer my eyes glaze over and I get lost quickly. :) I
would greatly appreciate any ideas from the Open MPI experts!
Thanks in advance,
Steve.
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email: terry.don...@oracle.com
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel