Ashley Pittman wrote:
On Tue, 2009-09-08 at 15:00 +0200, Thomas Ropars wrote:
Hi,
I'm working on r21949 of the trunk.
When I run this simple program, which calls MPI_Comm_dup twice, on a
single node with 4 processes, the processes occasionally hang in the 2nd dup.
I can't reproduce this. How often does it fail? I've run it in a loop
hundreds of times here and not had one hang.
It happens once every 4 or 5 runs. And it also happens if the processes
are on different nodes.
Here is the output I get from padb -axt:
main() at ?:?
PMPI_Comm_dup() at pcomm_dup.c:62
ompi_comm_dup() at communicator/comm.c:661
-----------------
[0,2] (2 processes)
-----------------
ompi_comm_nextcid() at communicator/comm_cid.c:264
ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
ompi_request_default_wait_all() at request/req_wait.c:262
opal_condition_wait() at ../opal/threads/condition.h:99
-----------------
[1,3] (2 processes)
-----------------
ompi_comm_nextcid() at communicator/comm_cid.c:245
ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
ompi_request_default_wait_all() at request/req_wait.c:262
opal_condition_wait() at ../opal/threads/condition.h:99
Thomas
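
The two per-rank stacks in the trace above look almost identical, so it can help to compare them mechanically. The following is a minimal sketch and not part of padb: the helper names parse_stack and diverging_frames are made up for illustration, and the frame text is copied from the trace with wrapped lines rejoined. It only assumes that each frame has the form "function() at file:line".

```python
import re

# One frame per line, e.g. "ompi_comm_nextcid() at communicator/comm_cid.c:264".
# Unknown locations appear as "?:?" in the trace, so "?" is accepted as a line.
FRAME = re.compile(r"^(?P<func>\w+)\(\) at (?P<file>\S+):(?P<line>\d+|\?)$")


def parse_stack(text):
    """Return a list of (function, file, line) tuples, one per frame."""
    frames = []
    for raw in text.strip().splitlines():
        match = FRAME.match(raw.strip())
        if match:
            frames.append((match["func"], match["file"], match["line"]))
    return frames


def diverging_frames(stack_a, stack_b):
    """Return (position, frame_a, frame_b) for each position that differs."""
    return [
        (index, frame_a, frame_b)
        for index, (frame_a, frame_b) in enumerate(zip(stack_a, stack_b))
        if frame_a != frame_b
    ]


# Frames reported for ranks [0,2], copied from the trace.
RANKS_0_2 = """
ompi_comm_nextcid() at communicator/comm_cid.c:264
ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
ompi_request_default_wait_all() at request/req_wait.c:262
opal_condition_wait() at ../opal/threads/condition.h:99
"""

# Frames reported for ranks [1,3], copied from the trace.
RANKS_1_3 = """
ompi_comm_nextcid() at communicator/comm_cid.c:245
ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
ompi_request_default_wait_all() at request/req_wait.c:262
opal_condition_wait() at ../opal/threads/condition.h:99
"""

if __name__ == "__main__":
    differences = diverging_frames(parse_stack(RANKS_0_2), parse_stack(RANKS_1_3))
    for index, left, right in differences:
        print(f"frame {index}: {left[0]} at {left[1]}:{left[2]} vs {right[1]}:{right[2]}")
```

For this trace the only difference is the first frame: ranks [0,2] are at comm_cid.c:264 and ranks [1,3] are at comm_cid.c:245, so both groups are waiting in an allreduce reached from different call sites inside ompi_comm_nextcid(). The comparison shows where the stacks diverge; it does not by itself explain why.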
Off-topic, I know, but this is exactly the type of problem that padb is
designed to help with. If you could get it to hang and then run "padb
-axt" in another window on the same node and send along the output, I'm
sure it would be of help.
Ashley,