On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:

> Thank you Ralph for this precious help.
>
> I set up a quick-and-dirty patch basically postponing process_msg (and
> hence daemon_collective) until the launch is done. In process_msg, I
> therefore requeue a process_msg handler and return.
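(In code terms, the deferral described above has roughly the following shape.
This is a minimal sketch only; the actual patch is quoted further down in this
message, and ORTE_MESSAGE_EVENT and the jobdat/mev fields are taken from it.)

    /* Sketch: at the top of process_msg(), if the local launch message has
     * not been fully processed yet, put this message event back on the
     * event queue and return, so the collective is handled on a later pass
     * of the event loop instead of blocking inside opal_progress(). */
    if (NULL == jobdat || 1 != jobdat->launch_msg_processed) {
        ORTE_MESSAGE_EVENT(&mev->sender, mev->buffer, mev->tag, process_msg);
        return;
    }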
That is basically the idea I proposed, just done in a slightly different place.

> In this "all-must-be-non-blocking-and-done-through-opal_progress"
> algorithm, I don't think that blocking calls like the one in
> daemon_collective should be allowed. This also applies to the blocking one
> in send_relay. [Well, actually, one is okay; two may lead to interlocking.]

Well, that would be problematic - you will find "progressed_wait" used
repeatedly in the code. Removing them all would take a -lot- of effort and a
major rewrite. I'm not yet convinced it is required. There may be something
strange in how you are set up, or in your cluster - like I said, this is the
first report of a problem we have had, and people with much bigger slurm
clusters have been running this code every day for over a year.

> If you have time to do a nicer patch, it would be great and I would be
> happy to test it. Otherwise, I will try to implement your idea properly
> next week (with my limited knowledge of orted).

Either way is fine - I'll see if I can get to it.

Thanks
Ralph

> For the record, here is the patch I'm currently testing at large scale:
>
> diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
> --- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
> +++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
> @@ -687,14 +687,6 @@
>          opal_list_append(&orte_local_jobdata, &jobdat->super);
>      }
>
> -    /* it may be possible to get here prior to having actually finished processing our
> -     * local launch msg due to the race condition between different nodes and when
> -     * they start their individual procs. Hence, we have to first ensure that we
> -     * -have- finished processing the launch msg, or else we won't know whether
> -     * or not to wait before sending this on
> -     */
> -    ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
> -
>      /* unpack the collective type */
>      n = 1;
>      if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, &n, ORTE_GRPCOMM_COLL_T))) {
> @@ -894,6 +886,28 @@
>
>      proc = &mev->sender;
>      buf = mev->buffer;
> +
> +    jobdat = NULL;
> +    for (item = opal_list_get_first(&orte_local_jobdata);
> +         item != opal_list_get_end(&orte_local_jobdata);
> +         item = opal_list_get_next(item)) {
> +        jobdat = (orte_odls_job_t*)item;
> +
> +        /* is this the specified job? */
> +        if (jobdat->jobid == proc->jobid) {
> +            break;
> +        }
> +    }
> +    if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
> +        /* it may be possible to get here prior to having actually finished processing our
> +         * local launch msg due to the race condition between different nodes and when
> +         * they start their individual procs. Hence, we have to first ensure that we
> +         * -have- finished processing the launch msg. Requeue this event until it is done.
> +         */
> +        int tag = mev->tag;
> +        ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
> +        return;
> +    }
>
>      /* is the sender a local proc, or a daemon relaying the collective? */
>      if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {
>
> Sylvain
>
> On Thu, 19 Nov 2009, Ralph Castain wrote:
>
>> Very strange. As I said, we routinely launch jobs spanning several hundred
>> nodes without problem. You can see the platform files for that setup in
>> contrib/platform/lanl/tlcc
>>
>> That said, it is always possible you are hitting some kind of race
>> condition we don't hit.
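(For context on the "wait" discussed just below: a progressed wait of the kind
the patch above removes boils down to roughly the following pattern. This is a
sketch of the idiom only, not the actual ORTE_PROGRESSED_WAIT definition.)

    /* General shape of a progressed wait: spin on a flag while driving the
     * event loop so the awaited completion can actually happen. Whatever
     * opal_progress() dispatches while we sit here runs on top of this
     * stack frame -- if that handler performs its own progressed wait (as
     * daemon_collective does on jobdat->launch_msg_processed), the waits
     * nest and the outer one may never get the chance to return. */
    #define PROGRESSED_WAIT_SKETCH(flag)    \
        do {                                \
            while (!(flag)) {               \
                opal_progress();            \
            }                               \
        } while (0)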
>> In looking at the code, one possibility would be to make all the
>> communications flow through the daemon cmd processor in
>> orte/orted_comm.c. This is the way it used to work until I reorganized
>> the code a year ago for other reasons that never materialized.
>>
>> Unfortunately, the daemon collective has to wait until the local launch
>> cmd has been completely processed so it can know whether or not to wait
>> for contributions from local procs before sending along the collective
>> message, so this kinda limits our options.
>>
>> About the only other thing you could do would be to not send the relay at
>> all until -after- processing the local launch cmd. You can then remove
>> the "wait" in the daemon collective as you will know how many local procs
>> are involved, if any.
>>
>> I used to do it that way and it guarantees it will work. The negative is
>> that we lose some launch speed, as the next nodes in the tree don't get
>> the launch message until this node finishes launching all its procs.
>>
>> The way around that, of course, would be to:
>>
>> 1. process the launch message, thus extracting the number of any local
>> procs and setting up all data structures...but do -not- launch the procs
>> at this time (as this is what takes all the time)
>>
>> 2. send the relay - the daemon collective can now proceed without a
>> "wait" in it
>>
>> 3. now launch the local procs
>>
>> It would be a fairly simple reorganization of the code in the
>> orte/mca/odls area. I can do it this weekend if you like, or you can do
>> it - either way is fine, but if you do it, please contribute it back to
>> the trunk.
>>
>> Ralph
>>
>>
>> On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:
>>
>>> I would say I use the default settings, i.e. I don't set anything
>>> "special" at configure.
>>>
>>> I'm launching my processes with SLURM (salloc + mpirun).
>>>
>>> Sylvain
>>>
>>> On Wed, 18 Nov 2009, Ralph Castain wrote:
>>>
>>>> How did you configure OMPI?
>>>>
>>>> What launch mechanism are you using - ssh?
>>>>
>>>> On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
>>>>
>>>>> I don't think so, and I'm not doing it explicitly at least. How do I
>>>>> know?
>>>>>
>>>>> Sylvain
>>>>>
>>>>> On Tue, 17 Nov 2009, Ralph Castain wrote:
>>>>>
>>>>>> We routinely launch across thousands of nodes without a problem...I
>>>>>> have never seen it stick in this fashion.
>>>>>>
>>>>>> Did you build and/or are you using ORTE threaded by any chance? If
>>>>>> so, that definitely won't work.
>>>>>>
>>>>>> On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We are currently experiencing problems at launch on the 1.5 branch
>>>>>>> on a relatively large number of nodes (at least 80). Some processes
>>>>>>> are not spawned and the orted processes are deadlocked.
>>>>>>>
>>>>>>> When MPI processes call MPI_Init before send_relay is complete, the
>>>>>>> send_relay function and the daemon_collective function are doing a
>>>>>>> nice interlock.
>>>>>>>
>>>>>>> Here is the scenario:
>>>>>>>
>>>>>>>   send_relay
>>>>>>>       performs the tree-routed send:
>>>>>>>   -> orte_rml_oob_send_buffer
>>>>>>>   -> orte_rml_oob_send
>>>>>>>   -> opal_condition_wait
>>>>>>>       waits on completion of the send, thus calling:
>>>>>>>   -> opal_progress()
>>>>>>>       but since a collective request arrived from the network, enters:
>>>>>>>   -> daemon_collective
>>>>>>>       which waits for the job to be initialized (wait on
>>>>>>>       jobdat->launch_msg_processed) before continuing, thus calling:
>>>>>>>   -> opal_progress()
>>>>>>>
>>>>>>> At this time, the send may complete, but since we will never go back
>>>>>>> to orte_rml_oob_send, we will never perform the launch (setting
>>>>>>> jobdat->launch_msg_processed to 1).
>>>>>>>
>>>>>>> I may try to solve the bug (this is quite a top-priority problem for
>>>>>>> me), but maybe people who are more familiar with orted than I am
>>>>>>> could propose a nice and clean solution...
>>>>>>>
>>>>>>> For those who like real (and complete) gdb stacks, here they are:
>>>>>>>
>>>>>>> #0  0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
>>>>>>> #1  0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, tv=0x7fff0d977880) at poll.c:167
>>>>>>> #2  0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at event.c:823
>>>>>>> #3  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>> #4  0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>> #5  0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at grpcomm_bad_module.c:696
>>>>>>> #6  0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at grpcomm_bad_module.c:901
>>>>>>> #7  0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>> #8  0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>> #9  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>> #10 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>> #11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at grpcomm_bad_module.c:696
>>>>>>> #12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at grpcomm_bad_module.c:901
>>>>>>> #13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>> #14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>> #15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>> #16 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>> #17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at grpcomm_bad_module.c:696
>>>>>>> #18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at grpcomm_bad_module.c:901
>>>>>>> #19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>> #20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>> #21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>> #22 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>> #23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at ../../../../opal/threads/condition.h:99
>>>>>>> #24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
>>>>>>> #25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
>>>>>>> #26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at orted/orted_comm.c:127
>>>>>>> #27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
>>>>>>> #28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>> #29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at event.c:839
>>>>>>> #30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
>>>>>>> #31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
>>>>>>> #32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at orted/orted_main.c:769
>>>>>>> #33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Sylvain
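For reference, the reordering outlined earlier in the thread (process the
launch message first, then send the relay, and only then launch the local
procs) would have roughly the following shape. This is a sketch only:
setup_jobdat_from_launch_msg() and launch_local_procs() are hypothetical
placeholder names, not the actual odls/orted API.

    /* Sketch of the reordered launch path in the daemon (placeholder
     * names, not the real ORTE functions). The point is the ordering: by
     * the time the relay goes out, jobdat->launch_msg_processed is already
     * set, so daemon_collective no longer needs a blocking progressed
     * wait. */
    static void process_launch_cmd(opal_buffer_t *buf)
    {
        orte_odls_job_t *jobdat;

        /* 1. parse the launch message: record the number of local procs
         *    and set up all data structures, but do NOT fork the children
         *    yet (that is the part that takes all the time) */
        jobdat = setup_jobdat_from_launch_msg(buf);   /* hypothetical */
        jobdat->launch_msg_processed = 1;

        /* 2. relay the launch message down the tree right away; daemons
         *    further down are no longer held up by our local launches,
         *    and the daemon collective can proceed without a wait */
        send_relay(buf);

        /* 3. only now start the local procs */
        launch_local_procs(jobdat);                   /* hypothetical */
    }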