We routinely launch across thousands of nodes without a problem...I have never seen it stick in this fashion.
Did you build and/or are you using ORTE threaded, by any chance? If so, that definitely won't work.

On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:

> Hi all,
>
> We are currently experiencing problems at launch on the 1.5 branch on a
> relatively large number of nodes (at least 80). Some processes are not
> spawned and the orted processes are deadlocked.
>
> When MPI processes call MPI_Init before send_relay is complete, the
> send_relay function and the daemon_collective function do a nice
> interlock. Here is the scenario:
>
> send_relay performs the tree-based send:
>   orte_rml_oob_send_buffer
>   orte_rml_oob_send
>   opal_condition_wait
> Waiting on send completion, it calls opal_progress().
>
> opal_progress()
> Since a collective request arrived from the network, it enters:
>
> daemon_collective
> However, daemon_collective waits for the job to be initialized (it waits
> on jobdat->launch_msg_processed) before continuing, and thus calls:
>
> opal_progress()
>
> At this point the send may complete, but since we never return to
> orte_rml_oob_send, we never perform the launch (i.e., never set
> jobdat->launch_msg_processed to 1).
>
> I may try to solve the bug (this is quite a top-priority problem for me),
> but maybe people who are more familiar with orted than I am can propose a
> nice and clean solution ...
> For those who like real (and complete) gdb stacks, here they are:
>
> #0  0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
> #1  0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, tv=0x7fff0d977880) at poll.c:167
> #2  0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at event.c:823
> #3  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
> #4  0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
> #5  0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at grpcomm_bad_module.c:696
> #6  0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at grpcomm_bad_module.c:901
> #7  0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
> #8  0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
> #9  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
> #10 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
> #11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at grpcomm_bad_module.c:696
> #12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at grpcomm_bad_module.c:901
> #13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
> #14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
> #15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
> #16 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
> #17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at grpcomm_bad_module.c:696
> #18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at grpcomm_bad_module.c:901
> #19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
> #20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
> #21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
> #22 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
> #23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at ../../../../opal/threads/condition.h:99
> #24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
> #25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
> #26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at orted/orted_comm.c:127
> #27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
> #28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
> #29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at event.c:839
> #30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
> #31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
> #32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at orted/orted_main.c:769
> #33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62
>
> Thanks in advance,
> Sylvain
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel