Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

Ralph Castain Thu, 26 Nov 2009 15:19:36 -0500

Just to clarify something: I have been testing with the trunk, NOT the 1.5 
branch. I haven't even bothered to look at that code since it was branched.


From what little I have heard plus what I (and others) have done since the 
branch, I strongly suspect a complete ORTE refresh will be required on that 
branch prior to any release. So I wouldn't personally spend a lot of time 
chasing a problem on that branch.

See if you can replicate it on the trunk - if you can, please let me know as I 
am unable to do so.

HTH
Ralph

On Nov 26, 2009, at 12:28 PM, Ralph Castain wrote:

> Hi Sylvain
> 
> Well, I hate to tell you this, but I cannot reproduce the "bug" even with 
> this code in ORTE no matter what value of ORTE_RELAY_DELAY I use. The system 
> runs really slow as I increase the delay, but it completes the job just fine. 
> I ran jobs across 16 nodes on a slurm machine, 1-4 ppn, a "hello world" app 
> that calls MPI_Init immediately upon execution.
> 
> So I have to conclude this is a problem in your setup/config. Are you sure 
> you didn't --enable-progress-threads?? That is the only way I can recreate 
> this behavior.
> 
> I plan to modify the relay/message processing method anyway to clean it up. 
> But there doesn't appear to be anything wrong with the current code.
> Ralph
> 
> On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
> 
>> Hi Ralph,
>> 
>> Thanks for your efforts. I will look at our configuration and see how it may 
>> differ from yours.
>> 
>> Here is a patch which helps reproduce the bug even with a small number of 
>> nodes.
>> 
>> diff -r b622b9e8f1ac orte/orted/orted_comm.c
>> --- a/orte/orted/orted_comm.c   Wed Nov 18 09:27:55 2009 +0100
>> +++ b/orte/orted/orted_comm.c   Fri Nov 20 14:47:39 2009 +0100
>> @@ -126,6 +126,13 @@
>>            ORTE_ERROR_LOG(ret);
>>            goto CLEANUP;
>>        }
>> +        { /* Add delay to reproduce bug */
>> +            char * str = getenv("ORTE_RELAY_DELAY");
>> +            int sec = str ? atoi(str) : 0;
>> +            if (sec) {
>> +                sleep(sec);
>> +            }
>> +        }
>>    }
>> 
>> CLEANUP:
>> 
>> Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.
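
The patch above reads the delay with atoi(), which quietly turns a typo such as "1s" into a delay of 1 and garbage into 0. A slightly stricter reader could look like the sketch below. This is only an illustration, not part of the patch: parse_delay_seconds and relay_delay_from_env are hypothetical helper names, and only ORTE_RELAY_DELAY comes from the thread.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a non-negative number of seconds.
 * Anything that is not a clean decimal integer yields 0 (no delay). */
int parse_delay_seconds(const char *value)
{
    if (value == NULL || *value == '\0') {
        return 0;
    }

    errno = 0;
    char *end = NULL;
    long sec = strtol(value, &end, 10);
    assert(end != NULL);  /* strtol always sets end for a non-NULL endptr */

    /* Reject overflow, trailing junk, negatives and values beyond int. */
    if (errno != 0 || *end != '\0' || sec < 0 || sec > INT_MAX) {
        return 0;
    }
    return (int)sec;
}

/* Hypothetical wrapper around the environment variable used in the patch. */
int relay_delay_from_env(void)
{
    return parse_delay_seconds(getenv("ORTE_RELAY_DELAY"));
}
```

The caller would then do the same thing as the patch: sleep() for the returned number of seconds when it is non-zero.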
>> 
>> During our experiments, the bug disappeared when we added a delay before 
>> calling MPI_Init. So, configurations where processes are launched slowly or 
>> take some time before MPI_Init should be immune to this bug.
>> 
>> We usually reproduce the bug with one ppn (faster to spawn).
>> 
>> Sylvain
>> 
>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>> 
>>> Hi Sylvain
>>> 
>>> I've spent several hours trying to replicate the behavior you described on 
>>> clusters up to a couple of hundred nodes (all running slurm), without 
>>> success. I'm becoming increasingly convinced that this is a configuration 
>>> issue as opposed to a code issue.
>>> 
>>> I have enclosed the platform file I use below. Could you compare it to your 
>>> configuration? I'm wondering if there is something critical about the 
>>> config that may be causing the problem (perhaps we have a problem in our 
>>> default configuration).
>>> 
>>> Also, is there anything else you can tell us about your configuration? How 
>>> many ppn triggers it, or do you always get the behavior every time you 
>>> launch over a certain number of nodes?
>>> 
>>> Meantime, I will look into this further. I am going to introduce a "slow 
>>> down" param that will force the situation you encountered - i.e., will 
>>> ensure that the relay is still being sent when the daemon receives the 
>>> first collective input. We can then use that to try and force replication 
>>> of the behavior you are encountering.
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> enable_dlopen=no
>>> enable_pty_support=no
>>> with_blcr=no
>>> with_openib=yes
>>> with_memory_manager=no
>>> enable_mem_debug=yes
>>> enable_mem_profile=no
>>> enable_debug_symbols=yes
>>> enable_binaries=yes
>>> with_devel_headers=yes
>>> enable_heterogeneous=no
>>> enable_picky=yes
>>> enable_debug=yes
>>> enable_shared=yes
>>> enable_static=yes
>>> with_slurm=yes
>>> enable_contrib_no_build=libnbc,vt
>>> enable_visibility=yes
>>> enable_memchecker=no
>>> enable_ipv6=no
>>> enable_mpi_f77=no
>>> enable_mpi_f90=no
>>> enable_mpi_cxx=no
>>> enable_mpi_cxx_seek=no
>>> enable_mca_no_build=pml-dr,pml-crcp2,crcp
>>> enable_io_romio=no
>>> 
>>> On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:
>>> 
>>>> 
>>>> On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:
>>>> 
>>>>> Thank you Ralph for this precious help.
>>>>> 
>>>>> I set up a quick-and-dirty patch basically postponing process_msg (hence 
>>>>> daemon_collective) until the launch is done. In process_msg, I therefore 
>>>>> requeue a process_msg handler and return.
>>>> 
>>>> That is basically the idea I proposed, just done in a slightly different 
>>>> place
>>>> 
>>>>> 
>>>>> In this "all-must-be-non-blocking-and-done-through-opal_progress" 
>>>>> algorithm, I don't think that blocking calls like the one in 
>>>>> daemon_collective should be allowed. This also applies to the blocking 
>>>>> one in send_relay. [Well, actually, one is okay, 2 may lead to 
>>>>> interlocking.]
>>>> 
>>>> Well, that would be problematic - you will find "progressed_wait" used 
>>>> repeatedly in the code. Removing them all would take a -lot- of effort and 
>>>> a major rewrite. I'm not yet convinced it is required. There may be 
>>>> something strange in how you are set up, or your cluster - like I said, 
>>>> this is the first report of a problem we have had, and people with much 
>>>> bigger slurm clusters have been running this code every day for over a 
>>>> year.
>>>> 
>>>>> 
>>>>> If you have time to do a nicer patch, it would be great and I would be 
>>>>> happy to test it. Otherwise, I will try to implement your idea properly 
>>>>> next week (with my limited knowledge of orted).
>>>> 
>>>> Either way is fine - I'll see if I can get to it.
>>>> 
>>>> Thanks
>>>> Ralph
>>>> 
>>>>> 
>>>>> For the record, here is the patch I'm currently testing at large scale :
>>>>> 
>>>>> diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
>>>>> --- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
>>>>> +++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
>>>>> @@ -687,14 +687,6 @@
>>>>>      opal_list_append(&orte_local_jobdata, &jobdat->super);
>>>>>  }
>>>>> 
>>>>> -    /* it may be possible to get here prior to having actually finished processing our
>>>>> -     * local launch msg due to the race condition between different nodes and when
>>>>> -     * they start their individual procs. Hence, we have to first ensure that we
>>>>> -     * -have- finished processing the launch msg, or else we won't know whether
>>>>> -     * or not to wait before sending this on
>>>>> -     */
>>>>> -    ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
>>>>> -
>>>>>  /* unpack the collective type */
>>>>>  n = 1;
>>>>>  if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, &n, ORTE_GRPCOMM_COLL_T))) {
>>>>> @@ -894,6 +886,28 @@
>>>>> 
>>>>>  proc = &mev->sender;
>>>>>  buf = mev->buffer;
>>>>> +
>>>>> +    jobdat = NULL;
>>>>> +    for (item = opal_list_get_first(&orte_local_jobdata);
>>>>> +         item != opal_list_get_end(&orte_local_jobdata);
>>>>> +         item = opal_list_get_next(item)) {
>>>>> +        jobdat = (orte_odls_job_t*)item;
>>>>> +
>>>>> +        /* is this the specified job? */
>>>>> +        if (jobdat->jobid == proc->jobid) {
>>>>> +            break;
>>>>> +        }
>>>>> +    }
>>>>> +    if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
>>>>> +        /* it may be possible to get here prior to having actually finished processing our
>>>>> +         * local launch msg due to the race condition between different nodes and when
>>>>> +         * they start their individual procs. Hence, we have to first ensure that we
>>>>> +         * -have- finished processing the launch msg. Requeue this event until it is done.
>>>>> +         */
>>>>> +        int tag = mev->tag;
>>>>> +        ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
>>>>> +        return;
>>>>> +    }
>>>>> 
>>>>>  /* is the sender a local proc, or a daemon relaying the collective? */
>>>>>  if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {
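
The requeue idea in this patch can be tried out in isolation with a toy event loop. The sketch below is not ORTE code: the queue, post(), progress() and the flags are all made up for illustration, and the handler names are borrowed from the thread only to keep the mapping obvious. It shows that a handler which puts itself back on the queue, instead of waiting inside the progress loop, lets the blocked send finish first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy single-threaded event queue; every name here is a stand-in. */
typedef void (*handler_fn)(void);

#define QUEUE_CAP 16
static handler_fn queue[QUEUE_CAP];
static size_t q_head, q_tail;

static bool send_done;            /* relay send has completed         */
static bool launch_msg_processed; /* launch message fully handled     */
static bool collective_done;      /* collective was finally processed */
static int  requeues;             /* how often process_msg deferred   */

static void post(handler_fn fn)
{
    assert(q_tail < QUEUE_CAP);
    queue[q_tail++] = fn;
}

/* Run one pending event; false means nothing was pending. */
static bool progress(void)
{
    if (q_head == q_tail) {
        return false;
    }
    handler_fn fn = queue[q_head++];
    fn();
    return true;
}

static void send_complete(void) { send_done = true; }

/* Deferred handler: never waits, just puts itself back on the queue. */
static void process_msg(void)
{
    if (!launch_msg_processed) {
        requeues++;
        post(process_msg);
        return;
    }
    collective_done = true;
}

static void send_relay(void)
{
    post(send_complete);
    while (!send_done) {          /* blocking wait driven by progress() */
        if (!progress()) {
            return;               /* nothing left to run: give up */
        }
    }
    launch_msg_processed = true;
}

/* Returns the number of requeues, or -1 if the collective never ran. */
int run_with_requeue(void)
{
    q_head = q_tail = 0;
    send_done = launch_msg_processed = collective_done = false;
    requeues = 0;

    post(send_relay);
    post(process_msg);   /* collective arrives before the relay is done */
    while (!collective_done && progress()) { }
    return collective_done ? requeues : -1;
}
```

One thing the toy makes visible: while the flag is unset and nothing else is pending, the requeued handler is simply called again and again, so this approach trades the interlock for polling.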
>>>>> 
>>>>> Sylvain
>>>>> 
>>>>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>>>>> 
>>>>>> Very strange. As I said, we routinely launch jobs spanning several 
>>>>>> hundred nodes without problem. You can see the platform files for that 
>>>>>> setup in contrib/platform/lanl/tlcc
>>>>>> 
>>>>>> That said, it is always possible you are hitting some kind of race 
>>>>>> condition we don't hit. In looking at the code, one possibility would be 
>>>>>> to make all the communications flow through the daemon cmd processor in 
>>>>>> orte/orted/orted_comm.c. This is the way it used to work until I reorganized 
>>>>>> the code a year ago for other reasons that never materialized.
>>>>>> 
>>>>>> Unfortunately, the daemon collective has to wait until the local launch 
>>>>>> cmd has been completely processed so it can know whether or not to wait 
>>>>>> for contributions from local procs before sending along the collective 
>>>>>> message, so this kinda limits our options.
>>>>>> 
>>>>>> About the only other thing you could do would be to not send the relay 
>>>>>> at all until -after- processing the local launch cmd. You can then 
>>>>>> remove the "wait" in the daemon collective as you will know how many 
>>>>>> local procs are involved, if any.
>>>>>> 
>>>>>> I used to do it that way and it guarantees it will work. The negative is 
>>>>>> that we lose some launch speed as the next nodes in the tree don't get 
>>>>>> the launch message until this node finishes launching all its procs.
>>>>>> 
>>>>>> The way around that, of course, would be to:
>>>>>> 
>>>>>> 1.  process the launch message, thus extracting the number of any local 
>>>>>> procs and setting up all data structures...but do -not- launch the procs 
>>>>>> at this time (as this is what takes all the time)
>>>>>> 
>>>>>> 2. send the relay - the daemon collective can now proceed without a 
>>>>>> "wait" in it
>>>>>> 
>>>>>> 3. now launch the local procs
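
The three steps above can be sketched as a small ordering model. This is a sketch under assumptions, not the odls code: struct job_state, the step functions and handle_launch_cmd are hypothetical names. The point is only that once the launch message is parsed before the relay goes out, the number of local procs is already known when a collective can first arrive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum step { STEP_PARSE, STEP_RELAY, STEP_LAUNCH };

/* Hypothetical per-job bookkeeping for the sketch. */
struct job_state {
    bool      launch_msg_processed; /* step 1 finished             */
    int       num_local_procs;      /* known after step 1          */
    int       procs_started;        /* filled in by step 3         */
    enum step trace[3];             /* order the steps ran in      */
    size_t    ntrace;
};

static void record(struct job_state *job, enum step s)
{
    assert(job->ntrace < 3);
    job->trace[job->ntrace++] = s;
}

/* Step 1: extract the local proc count and set up data, launch nothing. */
static void parse_launch_msg(struct job_state *job, int num_local_procs)
{
    job->num_local_procs = num_local_procs;
    job->launch_msg_processed = true;
    record(job, STEP_PARSE);
}

/* Step 2: relay to the next daemons. A collective that shows up while
 * this is in flight already finds the job fully described. */
static void send_relay(struct job_state *job)
{
    assert(job->launch_msg_processed);
    record(job, STEP_RELAY);
}

/* Step 3: the slow part, starting the local procs. */
static void launch_local_procs(struct job_state *job)
{
    job->procs_started = job->num_local_procs;
    record(job, STEP_LAUNCH);
}

/* The collective no longer has to wait: it only checks the flag. */
bool collective_can_proceed(const struct job_state *job)
{
    return job->launch_msg_processed;
}

void handle_launch_cmd(struct job_state *job, int num_local_procs)
{
    parse_launch_msg(job, num_local_procs);
    send_relay(job);
    launch_local_procs(job);
}
```

Calling handle_launch_cmd() on a zero-initialized struct job_state leaves the trace as parse, relay, launch, which is the order proposed in the list.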
>>>>>> 
>>>>>> It would be a fairly simple reorganization of the code in the 
>>>>>> orte/mca/odls area. I can do it this weekend if you like, or you can do 
>>>>>> it - either way is fine, but if you do it, please contribute it back to 
>>>>>> the trunk.
>>>>>> 
>>>>>> Ralph
>>>>>> 
>>>>>> 
>>>>>> On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:
>>>>>> 
>>>>>>> I would say I use the default settings, i.e. I don't set anything 
>>>>>>> "special" at configure.
>>>>>>> 
>>>>>>> I'm launching my processes with SLURM (salloc + mpirun).
>>>>>>> 
>>>>>>> Sylvain
>>>>>>> 
>>>>>>> On Wed, 18 Nov 2009, Ralph Castain wrote:
>>>>>>> 
>>>>>>>> How did you configure OMPI?
>>>>>>>> 
>>>>>>>> What launch mechanism are you using - ssh?
>>>>>>>> 
>>>>>>>> On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
>>>>>>>> 
>>>>>>>>> I don't think so, and I'm not doing it explicitly at least. How do I 
>>>>>>>>> know ?
>>>>>>>>> 
>>>>>>>>> Sylvain
>>>>>>>>> 
>>>>>>>>> On Tue, 17 Nov 2009, Ralph Castain wrote:
>>>>>>>>> 
>>>>>>>>>> We routinely launch across thousands of nodes without a problem...I 
>>>>>>>>>> have never seen it stick in this fashion.
>>>>>>>>>> 
>>>>>>>>>> Did you build and/or are using ORTE threaded by any chance? If so, 
>>>>>>>>>> that definitely won't work.
>>>>>>>>>> 
>>>>>>>>>> On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> We are currently experiencing problems at launch on the 1.5 branch 
>>>>>>>>>>> on a relatively large number of nodes (at least 80). Some processes 
>>>>>>>>>>> are not spawned and orted processes are deadlocked.
>>>>>>>>>>> 
>>>>>>>>>>> When MPI processes are calling MPI_Init before send_relay is 
>>>>>>>>>>> complete, the send_relay function and the daemon_collective 
>>>>>>>>>>> function are doing a nice interlock :
>>>>>>>>>>> 
>>>>>>>>>>> Here is the scenario :
>>>>>>>>>>>> send_relay
>>>>>>>>>>> performs the send tree :
>>>>>>>>>>>> orte_rml_oob_send_buffer
>>>>>>>>>>>> orte_rml_oob_send
>>>>>>>>>>>> opal_condition_wait
>>>>>>>>>>> Waiting on completion from send thus calling opal_progress()
>>>>>>>>>>>> opal_progress()
>>>>>>>>>>> But since a collective request arrived from the network, entered :
>>>>>>>>>>>> daemon_collective
>>>>>>>>>>> However, daemon_collective is waiting for the job to be initialized 
>>>>>>>>>>> (wait on jobdat->launch_msg_processed) before continuing, thus 
>>>>>>>>>>> calling :
>>>>>>>>>>>> opal_progress()
>>>>>>>>>>> 
>>>>>>>>>>> At this time, the send may complete, but since we will never go 
>>>>>>>>>>> back to orte_rml_oob_send, we will never perform the launch 
>>>>>>>>>>> (setting jobdat->launch_msg_processed to 1).
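
The interlock in this scenario can be reproduced with a toy single-threaded event loop. The sketch below is a model only, not ORTE code: the queue, post(), progress() and progressed_wait() are invented for illustration, and the handler names are reused from the scenario purely to keep the mapping readable. The inner wait needs a flag that only the outer frame can set, and the outer frame cannot resume until the inner wait returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy single-threaded event queue; every name here is a stand-in. */
typedef void (*handler_fn)(void);

#define QUEUE_CAP 8
static handler_fn queue[QUEUE_CAP];
static size_t q_head, q_tail;

static bool send_done;            /* set when the relay send completes */
static bool launch_msg_processed; /* set once send_relay has returned  */
static bool stuck;                /* set when a wait can never finish  */

static void post(handler_fn fn)
{
    assert(q_tail < QUEUE_CAP);
    queue[q_tail++] = fn;
}

/* Run one pending event, like one pass of a progress engine. */
static bool progress(void)
{
    if (q_head == q_tail) {
        return false;
    }
    handler_fn fn = queue[q_head++];
    fn();
    return true;
}

/* Spin on progress() until *flag is set. Real code would spin forever;
 * the toy gives up and records the hang once no events are left. */
static void progressed_wait(const bool *flag)
{
    while (!*flag) {
        if (!progress()) {
            stuck = true;
            return;
        }
    }
}

static void send_complete(void) { send_done = true; }

static void daemon_collective(void)
{
    /* Needs the launch to be processed, but the code that sets the flag
     * sits below this frame on the call stack. */
    progressed_wait(&launch_msg_processed);
}

static void send_relay(void)
{
    post(send_complete);
    progressed_wait(&send_done);  /* may run daemon_collective inside */
    launch_msg_processed = true;
}

/* Returns true when the nested waits hang. */
bool launch_deadlocks(bool collective_arrives_early)
{
    q_head = q_tail = 0;
    send_done = launch_msg_processed = stuck = false;

    post(send_relay);
    if (collective_arrives_early) {
        post(daemon_collective);
    }
    while (progress()) { }

    if (!collective_arrives_early) {
        post(daemon_collective);
        while (progress()) { }
    }
    return stuck;
}
```

In the toy, launch_deadlocks(true) reports the hang and launch_deadlocks(false) does not, which matches the observation elsewhere in the thread that a delay before MPI_Init makes the problem go away.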
>>>>>>>>>>> 
>>>>>>>>>>> I may try to solve the bug (this is quite a top priority problem 
>>>>>>>>>>> for me), but maybe people who are more familiar with orted than I 
>>>>>>>>>>> am may propose a nice and clean solution ...
>>>>>>>>>>> 
>>>>>>>>>>> For those who like real (and complete) gdb stacks, here they are :
>>>>>>>>>>> #0  0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
>>>>>>>>>>> #1  0x00007fd0de5d861a in poll_dispatch (base=0x930230, 
>>>>>>>>>>> arg=0x91a4b0, tv=0x7fff0d977880) at poll.c:167
>>>>>>>>>>> #2  0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, 
>>>>>>>>>>> flags=1) at event.c:823
>>>>>>>>>>> #3  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>> #4  0x00007fd0de5aeb6d in opal_progress () at 
>>>>>>>>>>> runtime/opal_progress.c:189
>>>>>>>>>>> #5  0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, 
>>>>>>>>>>> data=0x97b010) at grpcomm_bad_module.c:696
>>>>>>>>>>> #6  0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, 
>>>>>>>>>>> data=0x97af20) at grpcomm_bad_module.c:901
>>>>>>>>>>> #7  0x00007fd0de5d5334 in event_process_active (base=0x930230) at 
>>>>>>>>>>> event.c:667
>>>>>>>>>>> #8  0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, 
>>>>>>>>>>> flags=1) at event.c:839
>>>>>>>>>>> #9  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>> #10 0x00007fd0de5aeb6d in opal_progress () at 
>>>>>>>>>>> runtime/opal_progress.c:189
>>>>>>>>>>> #11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, 
>>>>>>>>>>> data=0x9676e0) at grpcomm_bad_module.c:696
>>>>>>>>>>> #12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, 
>>>>>>>>>>> data=0x9796d0) at grpcomm_bad_module.c:901
>>>>>>>>>>> #13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at 
>>>>>>>>>>> event.c:667
>>>>>>>>>>> #14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, 
>>>>>>>>>>> flags=1) at event.c:839
>>>>>>>>>>> #15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>> #16 0x00007fd0de5aeb6d in opal_progress () at 
>>>>>>>>>>> runtime/opal_progress.c:189
>>>>>>>>>>> #17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, 
>>>>>>>>>>> data=0x97b4e0) at grpcomm_bad_module.c:696
>>>>>>>>>>> #18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, 
>>>>>>>>>>> data=0x97b3f0) at grpcomm_bad_module.c:901
>>>>>>>>>>> #19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at 
>>>>>>>>>>> event.c:667
>>>>>>>>>>> #20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, 
>>>>>>>>>>> flags=1) at event.c:839
>>>>>>>>>>> #21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>> #22 0x00007fd0de5aeb6d in opal_progress () at 
>>>>>>>>>>> runtime/opal_progress.c:189
>>>>>>>>>>> #23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, 
>>>>>>>>>>> m=0x97b1a8) at ../../../../opal/threads/condition.h:99
>>>>>>>>>>> #24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, 
>>>>>>>>>>> iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
>>>>>>>>>>> #25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer 
>>>>>>>>>>> (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at 
>>>>>>>>>>> rml_oob_send.c:270
>>>>>>>>>>> #26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at 
>>>>>>>>>>> orted/orted_comm.c:127
>>>>>>>>>>> #27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, 
>>>>>>>>>>> opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
>>>>>>>>>>> #28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at 
>>>>>>>>>>> event.c:667
>>>>>>>>>>> #29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, 
>>>>>>>>>>> flags=0) at event.c:839
>>>>>>>>>>> #30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
>>>>>>>>>>> #31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
>>>>>>>>>>> #32 0x00007fd0de86e339 in orte_daemon (argc=19, 
>>>>>>>>>>> argv=0x7fff0d979ca8) at orted/orted_main.c:769
>>>>>>>>>>> #33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at 
>>>>>>>>>>> orted.c:62
>>>>>>>>>>> 
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>> Sylvain
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>> [email protected]
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
