Hi,

Any news about that bug?

Thomas

Edgar Gabriel wrote:
So I can confirm that I can reproduce the hang; we (George, Rainer and I) have looked into it and are continuing to dig.

I hate to say it, but it looks to us as if messages were 'lost' (the sender clearly called send, but the data is not in any of the queues on the receiver side), which seems to be consistent with two other bug reports currently being discussed on the mailing list. I could reproduce the hang with both the sm and tcp BTLs, so it's probably not a BTL issue but something higher up.
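
For reference, pinning the transport explicitly looks roughly like this (a sketch assuming the standard sm, tcp and self BTL components are built, reusing the ./mytest example from later in this thread):

mpirun --mca btl sm,self -np 4 ./mytest
mpirun --mca btl tcp,self -np 4 ./mytest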

Thanks
Edgar

Thomas Ropars wrote:
Edgar Gabriel wrote:
Two short questions: do you have any Open MPI MCA parameters set in a file or at runtime?
No
And second, is there any difference if you disable the hierarch coll module (which also does some additional communication of its own)? E.g.:

mpirun --mca coll ^hierarch -np 4 ./mytest
No, there is no difference.

I don't know if it helps, but I first had the problem when running bt.A.4 and sp.A.4 of the NAS Parallel Benchmarks (version 3.3).

Thomas

Thanks
Edgar

Thomas Ropars wrote:
Ashley Pittman wrote:
On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:

Thank you.  I think you missed the top three lines of the output but
that doesn't matter.

main() at ?:?
  PMPI_Comm_dup() at pcomm_dup.c:62
    ompi_comm_dup() at communicator/comm.c:661
      -----------------
      [0,2] (2 processes)
      -----------------
      ompi_comm_nextcid() at communicator/comm_cid.c:264
        ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
          ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
            ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
              ompi_request_default_wait_all() at request/req_wait.c:262
                opal_condition_wait() at ../opal/threads/condition.h:99
      -----------------
      [1,3] (2 processes)
      -----------------
      ompi_comm_nextcid() at communicator/comm_cid.c:245
        ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
          ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
            ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
              ompi_request_default_wait_all() at request/req_wait.c:262
                opal_condition_wait() at ../opal/threads/condition.h:99

Lines 264 and 245 of comm_cid.c are both inside a for loop which calls
allreduce() twice per iteration until a certain condition is met. As such
it's hard to tell from this trace whether processes [0,2] are "ahead"
or [1,3] are "behind".  Either way you look at it, however, the
allreduce() should not deadlock like that, so from the trace it's as
likely to be a bug in the allreduce as it is in ompi_comm_nextcid().
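
To illustrate the shape of the loop being described (a simplified, self-contained sketch only, not the actual ompi_comm_nextcid() code; the toy "candidate" logic is made up):

/* cid_loop_sketch.c -- illustrative only, not the Open MPI source.
 * Each iteration performs two allreduce operations, and the loop
 * repeats until all processes agree on a value. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, round = 0, done = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    while (!done) {
        /* first allreduce of the iteration: agree on a common candidate */
        int candidate = rank + round, agreed;
        MPI_Allreduce(&candidate, &agreed, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

        /* second allreduce: check that every process accepts the candidate */
        int ok = (agreed >= candidate), all_ok;
        MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);

        done = all_ok;
        round++;
    }
    if (rank == 0) printf("agreed after %d round(s)\n", round);
    MPI_Finalize();
    return 0;
}

In the trace above, the two pairs of processes are blocked at different call sites of this loop (line 264 vs line 245), i.e. apparently in different iterations.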

I assume all four processes are actually in the same call to comm_dup;
re-compiling your program with -g and re-running padb would confirm this,
as it would show the line numbers.
Yes, they are all in the second call to comm_dup.
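
For context, a minimal test along these lines (hypothetical -- the actual test program is not included in this thread) would be:

/* hypothetical reproducer: two successive communicator duplications,
 * run on 4 processes; the hang is reported in the second dup */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm dup1, dup2;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &dup1);   /* first dup completes */
    MPI_Comm_dup(MPI_COMM_WORLD, &dup2);   /* second dup: where the hang is seen */
    MPI_Comm_free(&dup2);
    MPI_Comm_free(&dup1);
    MPI_Finalize();
    return 0;
}

Launched e.g. with "mpirun -np 4 ./mytest" as above.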

Thomas