Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

Ralph Castain via devel Tue, 12 Nov 2019 15:08:54 -0800

George beat me to the response - I agree entirely with his statement. Let's not 
go down a dead end here.

Personally, I have never been entirely comfortable with the claim that the PMIx 
modification was the solution to the problem being discussed here. We have 
never seen a report of an application hanging in that spot outside of a 
debugger. Not one report. Yet that code has been "in the wild" now for several 
years.

What I suspect is actually happening is that the debugger is interfering with 
the OMPI internals that are involved in a way that creates a potential loss of 
the release event. The modified timing of the PMIx update biases that race 
sufficiently to make it happen "virtually never", which only means that it 
doesn't trigger when you run it a few times in quick succession. I don't know 
how to further debug it, nor am I particularly motivated to do so as the 
PMIx-based tools work within (not alongside) the release mechanism and are 
unlikely to evince the same behavior.

For now, it appears 4.0.2 is "good enough".
Ralph

On Nov 12, 2019, at 3:01 PM, George Bosilca via devel <[email protected]> wrote:

As indicated by this discussion, the proper usage of volatile is certainly 
misunderstood.

However,  the usage of the volatile we are doing in this particular instance is 
correct and valid even in multi-threaded cases. We are using it for a 
__single__ trigger, __one way__ synchronization similar to point 2 in the link 
you posted, aka a variable modified in another context that is used __once__.

Here are some well documented usage scenarios, with a way better explanation 
than mine: [1] and [2].

  George.

[1] https://barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword (check 
Multithreaded Applications)
[2] https://www.geeksforgeeks.org/understanding-volatile-qualifier-in-c/ (check 
2)

On Tue, Nov 12, 2019 at 4:57 PM Austen W Lauria via devel <[email protected]> wrote:
I agree that the use of volatile is insufficient if we want to adhere to proper 
multi-threaded programming standards:

"Note that volatile variables are not suitable for communication between 
threads; they do not offer atomicity, synchronization, or memory ordering. A 
read from a volatile variable that is modified by another thread without 
synchronization or concurrent modification from two unsynchronized threads is 
undefined behavior due to a data race."

https://en.cppreference.com/w/c/language/volatile

With proper synchronization, the volatile isn't needed at all for 
multi-threaded programming.

While for this issue the problem is not the use of volatile, it's just a 
ticking time bomb either way. That said I don't know how important the MPIR 
path is here since I understand it is being deprecated.


From: Larry Baker via devel <[email protected]>
To: Open MPI Developers <[email protected]>
Cc: Larry Baker <[email protected]>, devel <[email protected]>
Date: 11/12/2019 04:38 PM
Subject: Re: [OMPI devel] [EXTERNAL] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView
Sent by: "devel" <[email protected]>

--------------------------------

"allowing us to weakly synchronize two threads" concerns me if the 
synchronization is important or must be reliable. I do not understand how 
volatile alone provides reliable synchronization without a mechanism to order 
visible changes to memory. If the flag(s) in question are supposed to indicate 
some state has changed in this weakly synchronized behavior, without proper 
memory barriers, there is no guarantee that memory changes will be viewed by 
the two threads in the same order they were issued. It is quite possible that 
the updated state that is flagged as being "good" or "done" or whatever will 
not yet be visible across multiple cores, even though the updated flag 
indicator may have become visible. Only if the flag itself is the data can this 
work, it seems to me. If it is a flag that something has been completed, 
volatile is not sufficient to guarantee the corresponding changes in state will 
be visible. I have had such experience from code that used volatile as a proxy 
for memory barriers. I was told "it has never been a problem". Rare events can, 
and do, occur. In my case, it did after over 3 years running the code without 
interruption. I doubt anyone had ever run the code for such a long sample 
interval. We found out because we missed recording an important earthquake a 
week after the race condition was tripped. Murphy's law triumphs again. :)
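
The ordering concern above can be sketched with C11 atomics. This is only an illustration under stated assumptions: the names payload, ready, producer and publish_and_wait are hypothetical and do not come from Open MPI. The release store on the flag and the acquire load in the waiter are what make the earlier write to payload visible once the flag is seen; a plain volatile flag gives no such guarantee.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names for illustration; none of these exist in Open MPI. */
static int payload;        /* the "state" that the flag announces */
static atomic_bool ready;  /* the "done" flag */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;  /* plain write to the state */
    /* release: every write above becomes visible before the flag does */
    atomic_store_explicit(&ready, true, memory_order_release);
    return NULL;
}

/* Returns the payload seen by the waiting thread, or -1 if the
 * producer thread could not be created. */
int publish_and_wait(void)
{
    pthread_t tid;

    payload = 0;
    atomic_store(&ready, false);
    if (pthread_create(&tid, NULL, producer, NULL) != 0) {
        return -1;
    }
    /* acquire: once the flag reads true, the payload write is visible too */
    while (!atomic_load_explicit(&ready, memory_order_acquire)) {
        /* spin; a real wait loop would yield or make progress here */
    }
    int seen = payload;

    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
    return seen;
}
```

Calling publish_and_wait() from a test program (linked with -pthread) should return the value written by the producer. If the flag itself were the only data, the ordering on the store and load would matter much less, which is the distinction drawn above.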

Larry Baker
US Geological Survey
650-329-5608
[email protected]

On 12 Nov 2019, at 1:05:31 PM, George Bosilca via devel <[email protected]> wrote:

If the issue was some kind of memory consistency problem between threads, then 
printing that variable in the context of the debugger would show the value of 
debugger_event_active being false.

volatile is not a memory barrier, it simply forces a load for each access of 
the data, allowing us to weakly synchronize two threads, as long as we do not 
expect the synchronization to be immediate.
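
For comparison, here is a minimal sketch of the same one-shot, one-way trigger written with a C11 atomic_bool instead of volatile. It is not Open MPI code: event_active, release_thread and lazy_wait_for_release are hypothetical names, and the 100 microsecond pause only mirrors the usleep(100) in the lazy-wait macro quoted elsewhere in this thread. Relaxed ordering is used because, in this sketch, the flag is the only data exchanged.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a flag such as debugger_event_active. */
static atomic_bool event_active;

/* Plays the role of a release callback running on another thread. */
static void *release_thread(void *arg)
{
    (void)arg;
    atomic_store_explicit(&event_active, false, memory_order_relaxed);
    return NULL;
}

/* Polls the flag, sleeping 100 microseconds between checks.
 * Returns the number of polls taken, or -1 if the thread could not start. */
long lazy_wait_for_release(void)
{
    const struct timespec nap = { .tv_sec = 0, .tv_nsec = 100 * 1000 };
    pthread_t tid;
    long polls = 0;

    atomic_store(&event_active, true);
    if (pthread_create(&tid, NULL, release_thread, NULL) != 0) {
        return -1;
    }
    /* The flag is the only shared data, so no ordering with other
     * memory is requested here. */
    while (atomic_load_explicit(&event_active, memory_order_relaxed)) {
        nanosleep(&nap, NULL);
        polls++;
    }

    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
    return polls;
}
```

The waiter may observe the update late, which matches the "not immediate" expectation described above, but the access is no longer a data race in the C11 sense.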

Anyway, good to see that the issue has been solved.

 George.

On Tue, Nov 12, 2019 at 2:25 PM John DelSignore via devel <[email protected]> wrote:

Hi Austen,

Thanks for the reply. What I am seeing is consistent with your thought, in that 
when I see the hang, one or more processes did not have a flag updated. I don't 
understand how the Open MPI code works well enough to say if it is a memory 
barrier problem or not. It almost looks like an event delivery or dropped event 
problem to me.

The place in the MPI_Init() code where the MPI processes hang and the number of 
"hung" processes seems to vary from run to run. In some cases the processes are 
waiting for an event or waiting for a fence (whatever that is).

I did the following run today, which shows that it can hang waiting for an 
event that apparently was not generated or was dropped:

1. Started TV on mpirun: totalview -args mpirun -np 4 ./cpi
2. Ran the mpirun process until it hit the MPIR_Breakpoint() event.
3. TV attached to all four of the MPI processes and left all five processes 
stopped.
4. Continued all of the processes/threads and let them run freely for about 60 
seconds. They should have run to completion in that amount of time.
5. Halted all of the processes. I included an aggregated backtrace of all of 
the processes below.

In this particular run, all four MPI processes were waiting in 
ompi_rte_wait_for_debugger() in rte_orte_module.c at line 196, which is:

        /* let the MPI progress engine run while we wait for debugger release */
        OMPI_WAIT_FOR_COMPLETION(debugger_event_active);

I don't know how that is supposed to work, but I can clearly see that 
debugger_event_active was true in all of the processes, even though TV set 
MPIR_debug_gate to 1:

d1.<> f {2.1 3.1 4.1 5.1} p debugger_event_active
Thread 2.1:
 debugger_event_active = true (1)
Thread 3.1:
 debugger_event_active = true (1)
Thread 4.1:
 debugger_event_active = true (1)
Thread 5.1:
 debugger_event_active = true (1)
d1.<> f {2.1 3.1 4.1 5.1} p MPIR_debug_gate
Thread 2.1:
 MPIR_debug_gate = 0x00000001 (1)
Thread 3.1:
 MPIR_debug_gate = 0x00000001 (1)
Thread 4.1:
 MPIR_debug_gate = 0x00000001 (1)
Thread 5.1:
 MPIR_debug_gate = 0x00000001 (1)
d1.<> 
I think the _release_fn() function in rte_orte_module.c is supposed to set 
debugger_event_active to false, but that apparently did not happen in this 
case. So, AFAICT, the reason debugger_event_active would not be set to false is 
that the event was never delivered, so the _release_fn() function was never 
called. If that's the case, then the lack of a memory barrier is probably a 
moot point, and the problem is likely related to event generation or dropped 
events.
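
One way to separate "the event never arrived" from "the update was not seen" is to bound the wait and use a sequentially consistent atomic for the flag. The sketch below is an assumption-laden illustration, not Open MPI code: release_pending, on_release_event and wait_for_release are hypothetical names. Because the atomic store in the callback is visible to the waiter, a timeout here suggests the callback did not run at all.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical flag and helpers; not Open MPI code. */
static atomic_bool release_pending;

/* Stand-in for the callback that would run when the release event arrives. */
void on_release_event(void)
{
    atomic_store(&release_pending, false);
}

/* Waits up to timeout_ms for the flag to clear.
 * Returns true if released, false if the deadline passed with the flag
 * still set (that is, no release event was delivered in time). */
bool wait_for_release(long timeout_ms)
{
    const struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000 * 1000 };
    struct timespec start, now;

    assert(timeout_ms >= 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    while (atomic_load(&release_pending)) {
        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (now.tv_sec - start.tv_sec) * 1000L
                        + (now.tv_nsec - start.tv_nsec) / 1000000L;
        if (elapsed_ms >= timeout_ms) {
            return false;  /* flag still set: the event never showed up */
        }
        nanosleep(&nap, NULL);
    }
    return true;
}
```

A caller would set release_pending to true before arming the event, then log or abort when wait_for_release() returns false instead of spinning forever.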

Cheers, John D.

FWIW: Here's the aggregated backtrace after the whole job was allowed to run 
freely for about 60 seconds, and then stopped:

d1.<> f g w -g f+l

+/
 +__clone : 5:12[0-3.2-3, p1.2-5]
 |+start_thread
 | +listen_thread : 1:2[p1.3, p1.5]
 | |+__select_nocancel
 | +progress_engine@opal_progress_threads.c#105 : 4:4[0-3.2]
 | |[email protected]#1630
 | | [email protected]#165
 | |  +__poll_nocancel
 | +progress_engine@pmix_progress_threads.c#109 : 4:4[0-3.3]
 | |[email protected]#1630
 | | [email protected]#407
 | |  +__epoll_wait_nocancel
 | +progress_engine : 1:2[p1.2, p1.4]
 |  [email protected]#1630
 |   [email protected]#407 : 1:1[p1.2]
 |   |+__epoll_wait_nocancel
 |   [email protected]#165 : 1:1[p1.4]
 |    +__poll_nocancel
 +_start : 5:5[0-3.1, p1.1]
  +__libc_start_main
   [email protected]#27 : 4:4[0-3.1]
   |[email protected]#67
   | +ompi_mpi_init@ompi_mpi_init.c#890
   |  +ompi_rte_wait_for_debugger@rte_orte_module.c#196
   |   +opal_progress@opal_progress.c#245 : 1:1[0.1]
   |   |+opal_progress_events@opal_progress.c#191
   |   | [email protected]#1630
   |   |  [email protected]#165
   |   |   +__poll_nocancel
   |   +opal_progress@opal_progress.c#247 : 3:3[1-3.1]
   |    +opal_progress_events@opal_progress.c#191
   |     [email protected]#1630
   |      [email protected]#165
   |       +__poll_nocancel
   +orterun : 1:1[p1.1]
    [email protected]#1630
     [email protected]#165
      +__poll_nocancel

d1.<> 

On 11/12/19 9:47 AM, Austen W Lauria via devel wrote:

Could it be that some processes are not seeing the flag get updated? I don't 
think just using a simple while loop with a volatile variable is sufficient in 
all cases in a multi-threaded environment. It's my understanding that the 
volatile keyword just tells the compiler to not optimize or do anything funky 
with it - because it can change at any time. However, this doesn't provide any 
memory barrier - so it's possible that the thread polling on this variable is 
never seeing the update.

Looking at the code - I see:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
do { \
opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
"%s lazy waiting on RTE event at %s:%d", \
OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
__FILE__, __LINE__); \
while ((flg)) { \
opal_progress(); \
usleep(100); \
} \
}while(0);

I think replacing that with:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
do { \
opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
"%s lazy waiting on RTE event at %s:%d", \
OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
__FILE__, __LINE__); \
pthread_mutex_lock(&(lock)); \
while ((flg)) { \
/* releases the lock while waiting for a signal from another thread to wake up */ \
pthread_cond_wait(&(cond), &(lock)); \
} \
pthread_mutex_unlock(&(lock)); \
}while(0);

is much more standard when dealing with threads updating a shared variable, 
and might lead to a more expected result in this case.

On the other end, this would require the thread updating this variable to:

pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

This provides the memory barrier for the thread polling on the flag to see the 
update - something the volatile keyword doesn't do on its own. I think it's 
also much cleaner as it eliminates an arbitrary sleep from the code - which I 
see as a good thing as well.
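
A self-contained sketch of the mutex and condition-variable handshake described above follows. The names lock, cond, flag, updater and wait_on_condition are hypothetical and chosen only for this example; nothing here is taken from the Open MPI sources.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical names for illustration only. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool flag = true;  /* true while the event is still pending */

/* The updating side: change the flag under the lock, then signal. */
static void *updater(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    flag = false;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Returns 0 once the flag has been cleared, or -1 if the updater
 * thread could not be created. */
int wait_on_condition(void)
{
    pthread_t tid;

    pthread_mutex_lock(&lock);
    flag = true;
    pthread_mutex_unlock(&lock);

    if (pthread_create(&tid, NULL, updater, NULL) != 0) {
        return -1;
    }

    pthread_mutex_lock(&lock);
    while (flag) {                        /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);  /* atomically unlocks, sleeps, relocks */
    }
    pthread_mutex_unlock(&lock);

    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
    return 0;
}
```

One design difference is worth keeping in mind: the polling macro calls opal_progress() on every iteration, whereas a thread blocked in pthread_cond_wait() makes no progress calls. The sketch therefore assumes the flag is updated from a separate thread that does not depend on the waiter driving progress.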


From: "Ralph Castain via devel" <[email protected]>
To: "OpenMPI Devel" <[email protected]>
Cc: "Ralph Castain" <[email protected]>
Date: 11/12/2019 09:24 AM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView
Sent by: "devel" <[email protected]>

--------------------------------

> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <[email protected]> wrote:
> 
> John,
> 
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> 
> 
> is a simple loop that periodically checks the (volatile) "active" condition, 
> which is expected to be updated by another thread.
> So if you set your breakpoint too early, and **all** threads are stopped when 
> this breakpoint is hit, you might experience
> what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is set in mpirun/orted 
> too early, and prevents the pmix (or oob/tcp) thread
> from sending the message to all MPI tasks.
> 
> 
> 
> Ralph,
> 
> does the v4.0.x branch still need the oob/tcp progress thread running inside 
> the MPI app?
> or are we missing some commits (since all interactions with mpirun/orted are 
> handled by PMIx, at least in the master branch) ?

IIRC, that progress thread only runs if explicitly asked to do so by MCA param. 
We don't need that code any more as PMIx takes care of it.

> 
> Cheers,
> 
> Gilles
> 
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>> 
>> Sorry to say, but there is no way to really answer your question as the OMPI 
>> community doesn't actively test MPIR support. I haven't seen any reports of 
>> hangs during MPI_Init from any release series, including 4.x. My guess is 
>> that it may have something to do with the debugger interactions as opposed 
>> to being a true race condition.
>> 
>> Ralph
>> 
>> 
>>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> An LLNL TotalView user on a Mac reported that their MPI job was hanging 
>>> inside MPI_Init() when started under the control of TotalView. They were 
>>> using Open MPI 4.0.1, and TotalView was using the MPIR Interface (sorry, we 
>>> don't support the PMIx debugging hooks yet).
>>> 
>>> I was able to reproduce the hang on my own Linux system with my own build 
>>> of Open MPI 4.0.1, which I built with debug symbols. As far as I can tell, 
>>> there is some sort of race inside of Open MPI 4.0.1, because if I placed 
>>> breakpoints at certain points in the Open MPI code, and thus changed the 
>>> timing slightly, that was enough to avoid the hang.
>>> 
>>> When the code hangs, it appears as if one or more MPI processes are 
>>> waiting inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a fence to 
>>> be released. In one of the runs, rank 0 was the only one that was hanging 
>>> there (though I have seen runs where two ranks were hung there).
>>> 
>>> Here's a backtrace of the first thread in the rank 0 process in the case 
>>> where one rank was hung:
>>> 
>>> d1.<> f 10.1 w
>>> >  0 __nanosleep_nocancel PC=0x7ffff74e2efd, FP=0x7fffffffd1e0 [/lib64/libc.so.6]
>>>    1 usleep PC=0x7ffff7513b2f, FP=0x7fffffffd200 [/lib64/libc.so.6]
>>>    2 ompi_mpi_init PC=0x7ffff7a64009, FP=0x7fffffffd350 [/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
>>>    3 PMPI_Init PC=0x7ffff7ab0be4, FP=0x7fffffffd390 [/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
>>>    4 main PC=0x00400c5e, FP=0x7fffffffd550 [/home/jdelsign/cpi.c#27]
>>>    5 __libc_start_main PC=0x7ffff7446b13, FP=0x7fffffffd610 [/lib64/libc.so.6]
>>>    6 _start PC=0x00400b04, FP=0x7fffffffd618 [/amd/home/jdelsign/cpi]
>>> 
>>> Here's the block of code where the thread is hung:
>>> 
>>>     /* if we executed the above fence in the background, then
>>>      * we have to wait here for it to complete. However, there
>>>      * is no reason to do two barriers! */
>>>     if (background_fence) {
>>>         OMPI_LAZY_WAIT_FOR_COMPLETION(active);
>>>     } else if (!ompi_async_mpi_init) {
>>>         /* wait for everyone to reach this point - this is a hard
>>>          * barrier requirement at this time, though we hope to relax
>>>          * it at a later point */
>>>         if (NULL != opal_pmix.fence_nb) {
>>>             active = true;
>>>             OPAL_POST_OBJECT(&active);
>>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
>>>                                 fence_release, (void*)&active))) {
>>>                 error = "opal_pmix.fence_nb() failed";
>>>                 goto error;
>>>             }
>>>             OMPI_LAZY_WAIT_FOR_COMPLETION(active);  <<<<----- STUCK HERE WAITING FOR THE FENCE TO BE RELEASED
>>>         } else {
>>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
>>>                 error = "opal_pmix.fence() failed";
>>>                 goto error;
>>>             }
>>>         }
>>>     }
>>> 
>>> And here is an aggregated backtrace of all of the processes and threads in 
>>> the job:
>>> 
>>> d1.<> f g w -g f+l
>>> +/
>>>  +__clone : 5:12[0-3.2-3, p1.2-5]
>>>  |+start_thread
>>>  | +listen_thread@oob_tcp_listener.c#705 : 1:1[p1.5]
>>>  | |+__select_nocancel
>>>  | +listen_thread@ptl_base_listener.c#214 : 1:1[p1.3]
>>>  | |+__select_nocancel
>>>  | +progress_engine@opal_progress_threads.c#105 : 5:5[0-3.2, p1.4]
>>>  | |[email protected]#1632
>>>  | | [email protected]#167
>>>  | |  +__poll_nocancel
>>>  | +progress_engine@pmix_progress_threads.c#108 : 5:5[0-3.3, p1.2]
>>>  |  [email protected]#1632
>>>  |   [email protected]#409
>>>  |    +__epoll_wait_nocancel
>>>  +_start : 5:5[0-3.1, p1.1]
>>>   +__libc_start_main
>>>    [email protected]#27 : 4:4[0-3.1]
>>>    |[email protected]#67
>>>    | +ompi_mpi_init@ompi_mpi_init.c#890 : 3:3[1-3.1]  <<<<---- THE 3 OTHER MPI PROCS MADE IT PAST FENCE
>>>    | |+ompi_rte_wait_for_debugger@rte_orte_module.c#196
>>>    | | +opal_progress@opal_progress.c#251
>>>    | |  +opal_progress_events@opal_progress.c#191
>>>    | |   [email protected]#1632
>>>    | |    [email protected]#167
>>>    | |     +__poll_nocancel
>>>    | +ompi_mpi_init@ompi_mpi_init.c#904 : 1:1[0.1]  <<<<---- THE THREAD THAT IS STUCK
>>>    |  +usleep
>>>    |   +__nanosleep_nocancel
>>>    [email protected]#14 : 1:1[p1.1]
>>>     [email protected]#200
>>>      [email protected]#1632
>>>       [email protected]#167
>>>        +__poll_nocancel
>>> 
>>> d1.<>
>>> 
>>> I have tested Open MPI 4.0.2 dozens of times, and the hang does not seem to 
>>> happen. My concern is that if the problem is indeed a race, then it's 
>>> /possible/ (but perhaps not likely) that the same race exists in Open MPI 
>>> 4.0.2, but the timing could be slightly different such that it doesn't hang 
>>> using my simple test setup. In other words, maybe I've just been "lucky" 
>>> with my testing of Open MPI 4.0.2 and have failed to provoke the hang yet.
>>> 
>>> My question is: Was this a known problem in Open MPI 4.0.1 that was fixed 
>>> in Open MPI 4.0.2?
>>> 
>>> Thanks, John D.
>>> 
>>> 
>> 
