We routinely launch across thousands of nodes without a problem... I have never 
seen it hang in this fashion.

Did you by any chance build ORTE threaded, or are you running it that way? If 
so, that definitely won't work.

On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:

> Hi all,
> 
> We are currently experiencing problems at launch with the 1.5 branch on a 
> relatively large number of nodes (at least 80). Some processes are not 
> spawned, and the orted processes are deadlocked.
> 
> When MPI processes call MPI_Init before send_relay is complete, the 
> send_relay function and the daemon_collective function interlock nicely:
> 
> Here is the scenario:
> 
>   send_relay
>     (performs the send tree)
>     orte_rml_oob_send_buffer
>       orte_rml_oob_send
>         opal_condition_wait
>           (waits for the send to complete, thus calling opal_progress())
>           opal_progress()
>             (a collective request arrived from the network, so we enter)
>             daemon_collective
>               (waits for the job to be initialized, i.e. for
>               jobdat->launch_msg_processed to be set, before continuing,
>               thus calling)
>               opal_progress()
> 
> At this point the send may well complete, but since we never return to 
> orte_rml_oob_send, we never perform the launch (which is what would set 
> jobdat->launch_msg_processed to 1): the inner opal_progress() loop spins 
> forever.
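> 
> To make the pattern easier to see, here is a minimal stand-alone sketch of 
> the interlock (the tiny progress engine and all names here are made up for 
> illustration, this is of course not the real ORTE code):
> 
>   /* interlock.c : single-threaded toy model of the deadlock.
>    * Build with "gcc interlock.c -o interlock"; it spins forever,
>    * which is precisely the point. */
>   #include <stdbool.h>
>   #include <stdio.h>
> 
>   static bool send_complete        = false; /* set by the progress engine  */
>   static bool launch_msg_processed = false; /* set only after send_relay() */
>   static int  pending_collectives  = 1;     /* one collective already queued */
> 
>   static void daemon_collective(void);
> 
>   /* Toy opal_progress(): dispatch pending events, then mark the send done. */
>   static void progress(void)
>   {
>       while (pending_collectives > 0) {
>           pending_collectives--;
>           daemon_collective();  /* the handler runs *inside* progress */
>       }
>       send_complete = true;     /* the send itself completes just fine */
>   }
> 
>   /* Toy daemon_collective(): cannot continue until the launch message is
>    * processed, but the code that processes it sits below us on the stack
>    * and will never run again. */
>   static void daemon_collective(void)
>   {
>       while (!launch_msg_processed) {
>           progress();           /* recursive progress: never unwinds */
>       }
>   }
> 
>   /* Toy send_relay(): stands in for the opal_condition_wait() loop
>    * inside orte_rml_oob_send(). */
>   static void send_relay(void)
>   {
>       while (!send_complete) {
>           progress();  /* enters daemon_collective() and never returns,
>                           even once send_complete has been set */
>       }
>       launch_msg_processed = true;  /* never reached */
>   }
> 
>   int main(void)
>   {
>       send_relay();             /* spins forever inside daemon_collective */
>       printf("unreachable\n");
>       return 0;
>   }
> 
> Note that send_complete does get set in this sketch; the problem is that the 
> only frame that tests it never regains control.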
> 
> I may try to fix this bug myself (it is quite a high-priority problem for 
> me), but perhaps people who are more familiar with orted than I am can 
> propose a nice, clean solution...
> 
> For those who like real (and complete) gdb stacks, here is one:
> #0  0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
> #1  0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, 
> tv=0x7fff0d977880) at poll.c:167
> #2  0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at 
> event.c:823
> #3  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
> #4  0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
> #5  0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) 
> at grpcomm_bad_module.c:696
> #6  0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at 
> grpcomm_bad_module.c:901
> #7  0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
> #8  0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
> event.c:839
> #9  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
> #10 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
> #11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) 
> at grpcomm_bad_module.c:696
> #12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at 
> grpcomm_bad_module.c:901
> #13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
> #14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
> event.c:839
> #15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
> #16 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
> #17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) 
> at grpcomm_bad_module.c:696
> #18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at 
> grpcomm_bad_module.c:901
> #19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
> #20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
> event.c:839
> #21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
> #22 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
> #23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at 
> ../../../../opal/threads/condition.h:99
> #24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, 
> iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
> #25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, 
> buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
> #26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at 
> orted/orted_comm.c:127
> #27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, 
> data=0x965fc0) at orted/orted_comm.c:308
> #28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
> #29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at 
> event.c:839
> #30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
> #31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
> #32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at 
> orted/orted_main.c:769
> #33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62
> 
> Thanks in advance,
> Sylvain

