Hi!

While running the IMB 3.1 with OpenMPI 1.2.7 over MX 1.2.7, I see most collective operations hang when testing with 16K buffers on 128 nodes, one MPI rank per node, with the default settings (launch command sketched below):

works:  PingPong, PingPing, Sendrecv, Exchange, Allreduce, Reduce,
        Reduce_scatter, Allgather, Bcast, Barrier

hangs after 8K: Allgatherv, Gather, Gatherv, Scatter, Scatterv,
                Alltoall, Alltoallv

("hangs after 8K" means that the results for 8K are printed, but those for 16K are not - I've allowed them several hours after the 8K results have been printed before killing the jobs; the processes continue to use CPU time, but no progress seems to be made). I've only recently been able to run the IMB on such large number of nodes, some lower level issues prevented me from running them before.

The IMB finishes successfully under the same conditions when run on 64 nodes. With Allgatherv I've found that the breaking point is somewhere around 90 nodes: it works with 88 nodes and hangs with 90 nodes.

The above tests were performed with the default settings. When I specify '--mca mtl mx --mca pml cm', the IMB finishes successfully on 128 nodes; with MPICH-MX, the IMB also finishes successfully on 128 nodes. However, I consider it a serious problem if the default OpenMPI settings lead to hangs.
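
In other words, something along these lines finishes on 128 nodes (same placeholder hostfile as above), while the same command with the default settings (no --mca options) hangs in the collectives listed above:

    mpirun -np 128 --hostfile hosts --mca pml cm --mca mtl mx ./IMB-MPI1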

Is this a known (but undocumented) behaviour? Do other sites with a similar setup observe these hangs? Can someone suggest what to do to avoid them, or at least a way to debug this?
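
For what it's worth, the most I've thought of trying is to attach gdb to one of the stuck ranks and grab a backtrace, along these lines (assuming the benchmark binary is called IMB-MPI1):

    # on a node running one of the stuck ranks
    gdb -p <pid of IMB-MPI1>
    (gdb) bt
    (gdb) detach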

Thanks in advance!

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
