How did you configure OMPI? What launch mechanism are you using - ssh?
On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:

> I don't think so, and I'm not doing it explicitly, at least. How do I know?
>
> Sylvain
>
> On Tue, 17 Nov 2009, Ralph Castain wrote:
>
>> We routinely launch across thousands of nodes without a problem... I have
>> never seen it stick in this fashion.
>>
>> Did you build and/or are you using ORTE threaded by any chance? If so, that
>> definitely won't work.
>>
>> On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
>>
>>> Hi all,
>>>
>>> We are currently experiencing problems at launch on the 1.5 branch on a
>>> relatively large number of nodes (at least 80). Some processes are not
>>> spawned and the orted processes are deadlocked.
>>>
>>> When MPI processes call MPI_Init before send_relay is complete, the
>>> send_relay function and the daemon_collective function end up in a nice
>>> interlock.
>>>
>>> Here is the scenario:
>>>
>>> > send_relay
>>>   performs the send tree:
>>>   > orte_rml_oob_send_buffer
>>>     > orte_rml_oob_send
>>>       > opal_wait_condition
>>>         Waiting on completion of the send, thus calling opal_progress()
>>>         > opal_progress()
>>>           But since a collective request arrived from the network, entered:
>>>           > daemon_collective
>>>             However, daemon_collective is waiting for the job to be
>>>             initialized (waiting on jobdat->launch_msg_processed) before
>>>             continuing, thus calling:
>>>             > opal_progress()
>>>
>>> At this point, the send may complete, but since we will never get back to
>>> orte_rml_oob_send, we will never perform the launch (i.e. never set
>>> jobdat->launch_msg_processed to 1).
>>>
>>> I may try to solve the bug myself (this is quite a top-priority problem for
>>> me), but perhaps people who are more familiar with orted than I am can
>>> propose a nice and clean solution...
>>>
>>> For those who like real (and complete) gdb stacks, here they are:
>>>
>>> #0  0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
>>> #1  0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, tv=0x7fff0d977880) at poll.c:167
>>> #2  0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at event.c:823
>>> #3  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>> #4  0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>> #5  0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at grpcomm_bad_module.c:696
>>> #6  0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at grpcomm_bad_module.c:901
>>> #7  0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>> #8  0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>> #9  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>> #10 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>> #11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at grpcomm_bad_module.c:696
>>> #12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at grpcomm_bad_module.c:901
>>> #13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>> #14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>> #15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>> #16 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>> #17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at grpcomm_bad_module.c:696
>>> #18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at grpcomm_bad_module.c:901
>>> #19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>> #20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>> #21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>> #22 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>> #23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at ../../../../opal/threads/condition.h:99
>>> #24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
>>> #25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
>>> #26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at orted/orted_comm.c:127
>>> #27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
>>> #28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>> #29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at event.c:839
>>> #30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
>>> #31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
>>> #32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at orted/orted_main.c:769
>>> #33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62
>>>
>>> Thanks in advance,
>>> Sylvain
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
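[Editorial note] For readers who want the interlock boiled down, here is a minimal, self-contained sketch of the call structure described in the quoted scenario. All names (relay_send, handle_collective, progress, jobdat_ready, pending_collective) are hypothetical stand-ins, not ORTE code, and the toy deliberately bails out after a few iterations where the real orted would spin in opal_progress() forever.

```c
/* Toy model of the send_relay / daemon_collective interlock -- not ORTE code. */
#include <stdio.h>

static int jobdat_ready = 0;        /* stands in for jobdat->launch_msg_processed */
static int pending_collective = 1;  /* a collective request arrived "early"       */

static void progress(void);

/* Plays the daemon_collective role: it cannot proceed until the job data is
 * marked ready, so it spins on progress() -- but only the frame it interrupted
 * (relay_send below) can ever set that flag.                                  */
static void handle_collective(void)
{
    int spins = 0;
    pending_collective = 0;
    while (!jobdat_ready) {
        if (++spins > 3) {          /* bail out so this toy terminates;        */
            printf("interlock: the collective handler waits on a flag that\n"
                   "only the interrupted relay frame can set\n");
            return;                 /* the real daemon keeps spinning here     */
        }
        progress();
    }
}

/* Plays the opal_progress role: draining pending events may re-enter the
 * collective handler from inside a blocking wait.                            */
static void progress(void)
{
    if (pending_collective)
        handle_collective();
}

/* Plays the send_relay role: blocks in progress() waiting for the send to
 * complete, and only afterwards would mark the job data as processed.        */
static void relay_send(void)
{
    progress();                     /* "wait" for send completion              */
    jobdat_ready = 1;               /* in the real scenario control never gets
                                       back here; the toy only does because the
                                       handler above gave up                   */
}

int main(void)
{
    relay_send();
    return 0;
}
```

The design point the sketch illustrates: because the nested handler waits on state that only its own caller's caller can update, recursively entering the progress engine from a blocking wait turns an ordinary "wait for completion" into a deadlock once a collective message arrives before the launch message has been processed.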