Ashley Pittman wrote:
On Tue, 2009-09-08 at 15:00 +0200, Thomas Ropars wrote:
Hi,

I'm working on r21949 of the trunk.

When I run this simple program, which calls MPI_Comm_dup twice, on a single node with 4 processes, the processes occasionally hang in the second dup.

I can't reproduce this. How often does it fail? I've run it in a loop
hundreds of times here and not had one hang.
It happens once every 4 or 5 runs. And it also happens if the processes are on different nodes.

Here is the output I get from padb -axt:

main() at ?:?
 PMPI_Comm_dup() at pcomm_dup.c:62
   ompi_comm_dup() at communicator/comm.c:661
     -----------------
     [0,2] (2 processes)
     -----------------
     ompi_comm_nextcid() at communicator/comm_cid.c:264
       ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
         ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
           ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
             ompi_request_default_wait_all() at request/req_wait.c:262
               opal_condition_wait() at ../opal/threads/condition.h:99
     -----------------
     [1,3] (2 processes)
     -----------------
     ompi_comm_nextcid() at communicator/comm_cid.c:245
       ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
         ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
           ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
             ompi_request_default_wait_all() at request/req_wait.c:262
               opal_condition_wait() at ../opal/threads/condition.h:99

Thomas
Off-topic, I know, but this is exactly the type of problem that padb is
designed to help with. If you could get it to hang and then run "padb
-axt" in another window on the same node and send along the output, I'm
sure it would be of help.

Ashley,

