BTW: does this reproduce on the trunk and/or 1.3.4 as well? I'm wondering because we know the 1.5 branch is skewed relative to the trunk. Could well be a bug sitting over there.
On Nov 20, 2009, at 7:06 AM, Ralph Castain wrote:

> Thanks! I'll give it a try.
>
> My tests are all conducted with fast launches (just running slurm on large clusters) and using an mpi hello world that calls mpi_init as its first instruction. I'll see if adding the delay causes it to misbehave.
>
>
> On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
>
>> Hi Ralph,
>>
>> Thanks for your efforts. I will look at our configuration and see how it may differ from yours.
>>
>> Here is a patch which helps reproduce the bug even with a small number of nodes.
>>
>> diff -r b622b9e8f1ac orte/orted/orted_comm.c
>> --- a/orte/orted/orted_comm.c  Wed Nov 18 09:27:55 2009 +0100
>> +++ b/orte/orted/orted_comm.c  Fri Nov 20 14:47:39 2009 +0100
>> @@ -126,6 +126,13 @@
>>              ORTE_ERROR_LOG(ret);
>>              goto CLEANUP;
>>          }
>> +        { /* Add delay to reproduce bug */
>> +            char * str = getenv("ORTE_RELAY_DELAY");
>> +            int sec = str ? atoi(str) : 0;
>> +            if (sec) {
>> +                sleep(sec);
>> +            }
>> +        }
>>      }
>>
>>  CLEANUP:
>>
>> Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.
>>
>> During our experiments, the bug disappeared when we added a delay before calling MPI_Init. So configurations where processes are launched slowly, or take some time before calling MPI_Init, should be immune to this bug.
>>
>> We usually reproduce the bug with one ppn (faster to spawn).
>>
>> Sylvain
>>
>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>>
>>> Hi Sylvain
>>>
>>> I've spent several hours trying to replicate the behavior you described on clusters of up to a couple of hundred nodes (all running slurm), without success. I'm becoming increasingly convinced that this is a configuration issue as opposed to a code issue.
>>>
>>> I have enclosed the platform file I use below. Could you compare it to your configuration? I'm wondering if there is something critical about the config that may be causing the problem (perhaps we have a problem in our default configuration).
>>>
>>> Also, is there anything else you can tell us about your configuration? How many ppn does it take to trigger it, or do you get the behavior every time you launch over a certain number of nodes?
>>>
>>> Meantime, I will look into this further. I am going to introduce a "slow down" param that will force the situation you encountered - i.e., it will ensure that the relay is still being sent when the daemon receives the first collective input. We can then use that to try to force replication of the behavior you are encountering.
>>>
>>> Thanks
>>> Ralph
>>>
>>> enable_dlopen=no
>>> enable_pty_support=no
>>> with_blcr=no
>>> with_openib=yes
>>> with_memory_manager=no
>>> enable_mem_debug=yes
>>> enable_mem_profile=no
>>> enable_debug_symbols=yes
>>> enable_binaries=yes
>>> with_devel_headers=yes
>>> enable_heterogeneous=no
>>> enable_picky=yes
>>> enable_debug=yes
>>> enable_shared=yes
>>> enable_static=yes
>>> with_slurm=yes
>>> enable_contrib_no_build=libnbc,vt
>>> enable_visibility=yes
>>> enable_memchecker=no
>>> enable_ipv6=no
>>> enable_mpi_f77=no
>>> enable_mpi_f90=no
>>> enable_mpi_cxx=no
>>> enable_mpi_cxx_seek=no
>>> enable_mca_no_build=pml-dr,pml-crcp2,crcp
>>> enable_io_romio=no
>>>
>>> On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:
>>>
>>>>
>>>> On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:
>>>>
>>>>> Thank you Ralph for this precious help.
>>>>>
>>>>> I set up a quick-and-dirty patch that basically postpones process_msg (hence daemon_collective) until the launch is done.
>>>>> In process_msg, I therefore requeue a process_msg handler and return.
>>>>
>>>> That is basically the idea I proposed, just done in a slightly different place.
>>>>
>>>>>
>>>>> In this "all-must-be-non-blocking-and-done-through-opal_progress" algorithm, I don't think that blocking calls like the one in daemon_collective should be allowed. This also applies to the blocking one in send_relay. [Well, actually, one is okay; two may lead to interlocking.]
>>>>
>>>> Well, that would be problematic - you will find "progressed_wait" used repeatedly in the code. Removing them all would take a -lot- of effort and a major rewrite. I'm not yet convinced it is required. There may be something strange in how you are set up, or in your cluster - like I said, this is the first report of a problem we have had, and people with much bigger slurm clusters have been running this code every day for over a year.
>>>>
>>>>>
>>>>> If you have time to do a nicer patch, that would be great and I would be happy to test it. Otherwise, I will try to implement your idea properly next week (with my limited knowledge of orted).
>>>>
>>>> Either way is fine - I'll see if I can get to it.
>>>>
>>>> Thanks
>>>> Ralph
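(For readers not familiar with the ORTE internals under discussion: a "progressed wait" is simply a loop that spins the progress engine until some flag is flipped by another message handler. The sketch below only illustrates that pattern - the real ORTE_PROGRESSED_WAIT macro in the ORTE runtime has its own signature and logging, and the helper name here is invented.)

    #include "opal/runtime/opal_progress.h"

    /* Illustrative sketch of a progressed wait - not the real macro. */
    static void progressed_wait_sketch(volatile int *flag)
    {
        while (0 == *flag) {
            /* opal_progress() may dispatch incoming messages, so other handlers
             * (e.g. daemon_collective) can run from inside this loop. */
            opal_progress();
        }
    }

The hazard is the re-entrancy: if a handler dispatched from inside this loop performs its own progressed wait on a flag that only the outer frame will ever set, the inner loop can never finish and the outer frame never regains control.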
>>>>>
>>>>> For the record, here is the patch I'm currently testing at large scale:
>>>>>
>>>>> diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
>>>>> --- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c  Mon Nov 09 13:29:16 2009 +0100
>>>>> +++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c  Wed Nov 18 09:27:55 2009 +0100
>>>>> @@ -687,14 +687,6 @@
>>>>>          opal_list_append(&orte_local_jobdata, &jobdat->super);
>>>>>      }
>>>>>
>>>>> -    /* it may be possible to get here prior to having actually finished processing our
>>>>> -     * local launch msg due to the race condition between different nodes and when
>>>>> -     * they start their individual procs. Hence, we have to first ensure that we
>>>>> -     * -have- finished processing the launch msg, or else we won't know whether
>>>>> -     * or not to wait before sending this on
>>>>> -     */
>>>>> -    ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
>>>>> -
>>>>>      /* unpack the collective type */
>>>>>      n = 1;
>>>>>      if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, &n, ORTE_GRPCOMM_COLL_T))) {
>>>>> @@ -894,6 +886,28 @@
>>>>>
>>>>>      proc = &mev->sender;
>>>>>      buf = mev->buffer;
>>>>> +
>>>>> +    jobdat = NULL;
>>>>> +    for (item = opal_list_get_first(&orte_local_jobdata);
>>>>> +         item != opal_list_get_end(&orte_local_jobdata);
>>>>> +         item = opal_list_get_next(item)) {
>>>>> +        jobdat = (orte_odls_job_t*)item;
>>>>> +
>>>>> +        /* is this the specified job? */
>>>>> +        if (jobdat->jobid == proc->jobid) {
>>>>> +            break;
>>>>> +        }
>>>>> +    }
>>>>> +    if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
>>>>> +        /* it may be possible to get here prior to having actually finished processing our
>>>>> +         * local launch msg due to the race condition between different nodes and when
>>>>> +         * they start their individual procs. Hence, we have to first ensure that we
>>>>> +         * -have- finished processing the launch msg. Requeue this event until it is done.
>>>>> +         */
>>>>> +        int tag = mev->tag;
>>>>> +        ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
>>>>> +        return;
>>>>> +    }
>>>>>
>>>>>      /* is the sender a local proc, or a daemon relaying the collective? */
>>>>>      if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {
>>>>>
>>>>> Sylvain
>>>>>
>>>>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>>>>>
>>>>>> Very strange. As I said, we routinely launch jobs spanning several hundred nodes without a problem. You can see the platform files for that setup in contrib/platform/lanl/tlcc
>>>>>>
>>>>>> That said, it is always possible you are hitting some kind of race condition we don't hit. In looking at the code, one possibility would be to make all the communications flow through the daemon cmd processor in orte/orted_comm.c. This is the way it used to work until I reorganized the code a year ago for other reasons that never materialized.
>>>>>>
>>>>>> Unfortunately, the daemon collective has to wait until the local launch cmd has been completely processed so it can know whether or not to wait for contributions from local procs before sending along the collective message, so this kinda limits our options.
>>>>>>
>>>>>> About the only other thing you could do would be to not send the relay at all until -after- processing the local launch cmd. You can then remove the "wait" in the daemon collective, as you will know how many local procs are involved, if any.
>>>>>>
>>>>>> I used to do it that way, and it guarantees it will work. The negative is that we lose some launch speed, as the next nodes in the tree don't get the launch message until this node finishes launching all its procs.
>>>>>>
>>>>>> The way around that, of course, would be to:
>>>>>>
>>>>>> 1. process the launch message, thus extracting the number of any local procs and setting up all data structures... but do -not- launch the procs at this time (as this is what takes all the time)
>>>>>>
>>>>>> 2. send the relay - the daemon collective can now proceed without a "wait" in it
>>>>>>
>>>>>> 3. now launch the local procs
>>>>>>
>>>>>> It would be a fairly simple reorganization of the code in the orte/mca/odls area. I can do it this weekend if you like, or you can do it - either way is fine, but if you do it, please contribute it back to the trunk.
>>>>>>
>>>>>> Ralph
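To make the proposed reordering concrete, here is a rough sketch of the flow Ralph describes above. The helper names are invented placeholders (step 2 corresponds to what send_relay() in orte/orted/orted_comm.c does today), so treat this as pseudocode rather than an actual patch against the odls code:

    /* Sketch of the proposed ordering; the names below are placeholders. */
    int setup_local_jobdata(void *launch_msg);  /* parse msg, count local procs, no fork/exec */
    int relay_to_children(void *launch_msg);    /* forward the launch msg down the routing tree */
    int launch_local_procs(void);               /* actually fork/exec the local MPI procs */

    int launch_cmd_sketch(void *launch_msg)
    {
        int rc;

        /* 1. Process the launch message: set up all data structures and record
         *    how many local procs there are, but do -not- start them yet. */
        if (0 != (rc = setup_local_jobdata(launch_msg))) {
            return rc;
        }

        /* 2. Send the relay.  The daemon collective no longer needs to wait on
         *    launch_msg_processed, because the local proc count is already known. */
        if (0 != (rc = relay_to_children(launch_msg))) {
            return rc;
        }

        /* 3. Only now launch the local procs - the slow part. */
        return launch_local_procs();
    }

The key point is that step 2 can complete (so the next daemons in the tree can start working) before the expensive step 3 begins, without daemon_collective ever having to block.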
>>>>>>
>>>>>> On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:
>>>>>>
>>>>>>> I would say I use the default settings, i.e. I don't set anything "special" at configure.
>>>>>>>
>>>>>>> I'm launching my processes with SLURM (salloc + mpirun).
>>>>>>>
>>>>>>> Sylvain
>>>>>>>
>>>>>>> On Wed, 18 Nov 2009, Ralph Castain wrote:
>>>>>>>
>>>>>>>> How did you configure OMPI?
>>>>>>>>
>>>>>>>> What launch mechanism are you using - ssh?
>>>>>>>>
>>>>>>>> On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
>>>>>>>>
>>>>>>>>> I don't think so, and I'm not doing it explicitly at least. How do I know?
>>>>>>>>>
>>>>>>>>> Sylvain
>>>>>>>>>
>>>>>>>>> On Tue, 17 Nov 2009, Ralph Castain wrote:
>>>>>>>>>
>>>>>>>>>> We routinely launch across thousands of nodes without a problem... I have never seen it stick in this fashion.
>>>>>>>>>>
>>>>>>>>>> Did you build and/or are you using ORTE threaded by any chance? If so, that definitely won't work.
>>>>>>>>>>
>>>>>>>>>> On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> We are currently experiencing problems at launch on the 1.5 branch on a relatively large number of nodes (at least 80). Some processes are not spawned and the orted processes are deadlocked.
>>>>>>>>>>>
>>>>>>>>>>> When MPI processes call MPI_Init before send_relay is complete, the send_relay and daemon_collective functions interlock.
>>>>>>>>>>>
>>>>>>>>>>> Here is the scenario:
>>>>>>>>>>>> send_relay
>>>>>>>>>>> performs the tree send:
>>>>>>>>>>>> orte_rml_oob_send_buffer
>>>>>>>>>>>> orte_rml_oob_send
>>>>>>>>>>>> opal_wait_condition
>>>>>>>>>>> which waits for the send to complete and therefore calls:
>>>>>>>>>>>> opal_progress()
>>>>>>>>>>> But since a collective request has arrived from the network, we enter:
>>>>>>>>>>>> daemon_collective
>>>>>>>>>>> However, daemon_collective waits for the job to be initialized (a wait on jobdat->launch_msg_processed) before continuing, and thus calls:
>>>>>>>>>>>> opal_progress()
>>>>>>>>>>>
>>>>>>>>>>> At this point the send may well complete, but since we never return to orte_rml_oob_send, we never perform the launch (and so never set jobdat->launch_msg_processed to 1).
>>>>>>>>>>>
>>>>>>>>>>> I may try to solve the bug myself (this is quite a high-priority problem for me), but people who are more familiar with orted than I am may be able to propose a nicer and cleaner solution.
>>>>>>>>>>>
>>>>>>>>>>> For those who like real (and complete) gdb stacks, here they are:
>>>>>>>>>>> #0  0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
>>>>>>>>>>> #1  0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, tv=0x7fff0d977880) at poll.c:167
>>>>>>>>>>> #2  0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at event.c:823
>>>>>>>>>>> #3  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>> #4  0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>> #5  0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at grpcomm_bad_module.c:696
>>>>>>>>>>> #6  0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at grpcomm_bad_module.c:901
>>>>>>>>>>> #7  0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>> #8  0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>>> #9  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>> #10 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>> #11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at grpcomm_bad_module.c:696
>>>>>>>>>>> #12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at grpcomm_bad_module.c:901
>>>>>>>>>>> #13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>> #14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>>> #15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>> #16 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>> #17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at grpcomm_bad_module.c:696
>>>>>>>>>>> #18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at grpcomm_bad_module.c:901
>>>>>>>>>>> #19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>> #20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>>> #21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>> #22 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>> #23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at ../../../../opal/threads/condition.h:99
>>>>>>>>>>> #24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
>>>>>>>>>>> #25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
>>>>>>>>>>> #26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at orted/orted_comm.c:127
>>>>>>>>>>> #27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
>>>>>>>>>>> #28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>> #29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at event.c:839
>>>>>>>>>>> #30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
>>>>>>>>>>> #31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
>>>>>>>>>>> #32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at orted/orted_main.c:769
>>>>>>>>>>> #33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>> Sylvain
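The interlock visible in the stack above - opal_progress() re-entered from inside a blocking send, with each nested daemon_collective again waiting on a flag that only the outermost frame can eventually cause to be set - can be reduced to the following toy program. It uses plain globals and stand-in functions (progress_once() plays the role of opal_progress(), collective_handler() that of daemon_collective()), so it only mimics the control flow; none of it is Open MPI code. A spin cap makes the demo terminate instead of hanging:

    #include <stdio.h>

    static int send_complete        = 0;  /* set when the relay send finishes       */
    static int launch_msg_processed = 0;  /* set after the launch msg is processed  */
    static int collective_pending   = 1;  /* a collective arrived from the network  */
    static int spins                = 0;  /* safety valve so the demo terminates    */

    static void progress_once(void);

    /* Like daemon_collective(): must wait for the launch msg to be processed
     * before it can decide whether to wait for local contributions. */
    static void collective_handler(void)
    {
        while (!launch_msg_processed && spins < 1000000) {
            spins++;
            progress_once();          /* nested progress, as in frames #5/#11/#17 */
        }
    }

    /* Like opal_progress(): completes the pending send and dispatches handlers. */
    static void progress_once(void)
    {
        send_complete = 1;            /* the send does complete...                 */
        if (collective_pending) {
            collective_pending = 0;
            collective_handler();     /* ...but we never unwind back to the sender */
        }
    }

    int main(void)
    {
        /* Like send_relay -> orte_rml_oob_send -> opal_condition_wait:
         * spin progress until the send completes. */
        while (!send_complete && spins < 1000000) {
            spins++;
            progress_once();
        }
        /* Only after the "send" returns would the launch msg get processed. */
        launch_msg_processed = 1;

        if (spins >= 1000000) {
            printf("interlocked: collective_handler never saw launch_msg_processed\n");
        } else {
            printf("completed after %d progress calls\n", spins);
        }
        return 0;
    }

Running it prints the "interlocked" line: the flag the nested handler waits for can only be set after the outer send loop returns, and the outer send loop cannot return until the nested handler does.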