Just to be clear as well: you cannot use the pthread method you propose because 
you must loop over opal_progress - the "usleep" is in there simply to avoid 
consuming 100% CPU while we wait.
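
A minimal sketch of that constraint - not an Open MPI patch, just an illustration: even with a condition variable, the waiting thread still has to keep calling opal_progress() itself, so the wait has to be a bounded, timed one. opal_progress() is assumed here to be the usual void/no-argument entry point; everything else below is made up for illustration.

#include <pthread.h>
#include <stdbool.h>
#include <time.h>

extern void opal_progress(void);         /* assumed signature of the progress engine */

static pthread_mutex_t lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond   = PTHREAD_COND_INITIALIZER;
static bool            active = true;    /* cleared by the fence callback */

static void lazy_wait_with_progress(void)
{
    struct timespec deadline;

    pthread_mutex_lock(&lock);
    while (active) {
        /* Nobody else will drive progress for us, so do it here. */
        pthread_mutex_unlock(&lock);
        opal_progress();
        pthread_mutex_lock(&lock);

        /* Wait at most ~100us for a signal instead of spinning at 100% CPU. */
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 100 * 1000;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_nsec -= 1000000000L;
            deadline.tv_sec  += 1;
        }
        pthread_cond_timedwait(&cond, &lock, &deadline);
    }
    pthread_mutex_unlock(&lock);
}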


On Nov 12, 2019, at 8:52 AM, George Bosilca via devel <devel@lists.open-mpi.org> wrote:

I don't think there is any need for protection around that variable. It will 
change value only once (in a callback triggered from opal_progress), and the 
volatile guarantees that loads will be issued for every access, so the waiting 
thread will eventually notice the change.
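
For reference, a minimal sketch of the pattern George describes - illustrative, not the actual OMPI source; fence_release matches the callback name used in the code quoted further down, the rest is made up:

#include <stdbool.h>
#include <unistd.h>

extern void opal_progress(void);            /* assumed signature */

static volatile bool active = true;

/* completion callback, fired from inside opal_progress(): the single store */
static void fence_release(void *cbdata)
{
    *(volatile bool *)cbdata = false;
}

static void lazy_wait(void)
{
    while (active) {                        /* volatile => a fresh load every iteration */
        opal_progress();
        usleep(100);                        /* keep CPU usage down while waiting */
    }
}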

 George.


On Tue, Nov 12, 2019 at 9:48 AM Austen W Lauria via devel <devel@lists.open-mpi.org> wrote:
Could it be that some processes are not seeing the flag get updated? I don't 
think a simple while loop over a volatile variable is sufficient in all cases 
in a multi-threaded environment. My understanding is that the volatile keyword 
just tells the compiler not to optimize accesses to the variable, because it 
can change at any time. However, it doesn't provide any memory barrier - so 
it's possible that the thread polling on this variable never sees the update.
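
For illustration only (this is not something proposed in the thread): the ordering/visibility guarantee being asked for here can also be spelled out with C11 atomics instead of volatile; the names below are made up:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool active = true;           /* instead of a plain volatile bool */

/* writer side (the callback): a release store publishes the update */
static void release_flag(atomic_bool *flag)
{
    atomic_store_explicit(flag, false, memory_order_release);
}

/* reader side (the polling thread): an acquire load pairs with that store */
static bool still_waiting(void)
{
    return atomic_load_explicit(&active, memory_order_acquire);
}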

Looking at the code - I see:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)                                  \
    do {                                                                    \
        opal_output_verbose(1, ompi_rte_base_framework.framework_output,    \
                            "%s lazy waiting on RTE event at %s:%d",        \
                            OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),             \
                            __FILE__, __LINE__);                            \
        while ((flg)) {                                                     \
            opal_progress();                                                \
            usleep(100);                                                    \
        }                                                                   \
    } while(0);

I think replacing that with:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock)                      \
    do {                                                                    \
        opal_output_verbose(1, ompi_rte_base_framework.framework_output,    \
                            "%s lazy waiting on RTE event at %s:%d",        \
                            OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),             \
                            __FILE__, __LINE__);                            \
        pthread_mutex_lock(&(lock));                                        \
        while ((flg)) {                                                     \
            /* releases the lock while waiting for a signal from another    \
             * thread to wake up */                                         \
            pthread_cond_wait(&(cond), &(lock));                            \
        }                                                                   \
        pthread_mutex_unlock(&(lock));                                      \
    } while(0)

is much more standard when dealing with threads updating a shared variable - 
and might lead to a more predictable result in this case.

On the other side, this would require the thread updating this variable to:

pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

This provides the memory barrier needed for the thread polling on the flag to 
see the update - something the volatile keyword doesn't do on its own. I think 
it's also much cleaner, as it eliminates an arbitrary sleep from the code - 
which I see as a good thing as well.
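
A self-contained sketch of the wait/signal pattern proposed above, for reference - it builds on its own with -lpthread, but note the reply at the top of this thread: the real code also has to keep looping over opal_progress(), which this toy example deliberately leaves out.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool            flg  = true;

static void *releaser(void *arg)
{
    (void)arg;
    usleep(100 * 1000);            /* pretend the fence takes a while */
    pthread_mutex_lock(&lock);
    flg = false;                   /* update under the lock ...   */
    pthread_cond_signal(&cond);    /* ... then wake the waiter    */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, releaser, NULL);

    pthread_mutex_lock(&lock);
    while (flg) {                  /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    printf("fence released\n");
    return 0;
}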



From: "Ralph Castain via devel" <devel@lists.open-mpi.org 
<mailto:devel@lists.open-mpi.org> >
To: "OpenMPI Devel" <devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org> 
>
Cc: "Ralph Castain" <r...@open-mpi.org <mailto:r...@open-mpi.org> >
Date: 11/12/2019 09:24 AM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView
Sent by: "devel" <devel-boun...@lists.open-mpi.org 
<mailto:devel-boun...@lists.open-mpi.org> >

--------------------------------





> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <devel@lists.open-mpi.org> wrote:
> 
> John,
> 
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> 
> 
> is a simple loop that periodically checks the (volatile) "active" condition,
> which is expected to be updated by another thread.
> So if you set your breakpoint too early, and **all** threads are stopped when
> this breakpoint is hit, you might experience what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is set in mpirun/orted
> too early and prevents the pmix (or oob/tcp) thread from sending the message
> to all MPI tasks.
> 
> 
> 
> Ralph,
> 
> does the v4.0.x branch still need the oob/tcp progress thread running inside 
> the MPI app?
> or are we missing some commits (since all interactions with mpirun/orted are
> handled by PMIx, at least in the master branch)?

IIRC, that progress thread only runs if explicitly asked to do so by an MCA param. 
We don't need that code any more, as PMIx takes care of it.

> 
> Cheers,
> 
> Gilles
> 
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>> 
>> Sorry to say, but there is no way to really answer your question as the OMPI 
>> community doesn't actively test MPIR support. I haven't seen any reports of 
>> hangs during MPI_Init from any release series, including 4.x. My guess is 
>> that it may have something to do with the debugger interactions as opposed 
>> to being a true race condition.
>> 
>> Ralph
>> 
>> 
>>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:
>>> 
>>> Hi,
>>> 
>>> An LLNL TotalView user on a Mac reported that their MPI job was hanging 
>>> inside MPI_Init() when started under the control of TotalView. They were 
>>> using Open MPI 4.0.1, and TotalView was using the MPIR Interface (sorry, we 
>>> don't support the PMIx debugging hooks yet).
>>> 
>>> I was able to reproduce the hang on my own Linux system with my own build 
>>> of Open MPI 4.0.1, which I built with debug symbols. As far as I can tell, 
>>> there is some sort of race inside of Open MPI 4.0.1, because placing 
>>> breakpoints at certain points in the Open MPI code, and thus changing the 
>>> timing slightly, was enough to avoid the hang.
>>> 
>>> When the code hangs, it appeared as if one or more MPI processes were 
>>> waiting inside ompi_mpi_init() at ompi_mpi_init.c#904 for a fence to 
>>> be released. In one of the runs, rank 0 was the only one that was hanging 
>>> there (though I have seen runs where two ranks were hung there).
>>> 
>>> Here's a backtrace of the first thread in the rank 0 process in the case 
>>> where one rank was hung:
>>> 
>>> d1.<> f 10.1 w
>>> >  0 __nanosleep_nocancel PC=0x7ffff74e2efd, FP=0x7fffffffd1e0 [/lib64/libc.so.6]
>>>    1 usleep               PC=0x7ffff7513b2f, FP=0x7fffffffd200 [/lib64/libc.so.6]
>>>    2 ompi_mpi_init        PC=0x7ffff7a64009, FP=0x7fffffffd350 [/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
>>>    3 PMPI_Init            PC=0x7ffff7ab0be4, FP=0x7fffffffd390 [/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
>>>    4 main                 PC=0x00400c5e, FP=0x7fffffffd550 [/home/jdelsign/cpi.c#27]
>>>    5 __libc_start_main    PC=0x7ffff7446b13, FP=0x7fffffffd610 [/lib64/libc.so.6]
>>>    6 _start               PC=0x00400b04, FP=0x7fffffffd618 [/amd/home/jdelsign/cpi]
>>> 
>>> Here's the block of code where the thread is hung:
>>> 
>>>     /* if we executed the above fence in the background, then
>>>      * we have to wait here for it to complete. However, there
>>>      * is no reason to do two barriers! */
>>>     if (background_fence) {
>>>         OMPI_LAZY_WAIT_FOR_COMPLETION(active);
>>>     } else if (!ompi_async_mpi_init) {
>>>         /* wait for everyone to reach this point - this is a hard
>>>          * barrier requirement at this time, though we hope to relax
>>>          * it at a later point */
>>>         if (NULL != opal_pmix.fence_nb) {
>>>             active = true;
>>>             OPAL_POST_OBJECT(&active);
>>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
>>>                                 fence_release, (void*)&active))) {
>>>                 error = "opal_pmix.fence_nb() failed";
>>>                 goto error;
>>>             }
>>>             OMPI_LAZY_WAIT_FOR_COMPLETION(active);  /* <<<<----- STUCK HERE WAITING FOR THE FENCE TO BE RELEASED */
>>>         } else {
>>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
>>>                 error = "opal_pmix.fence() failed";
>>>                 goto error;
>>>             }
>>>         }
>>>     }
>>> 
>>> And here is an aggregated backtrace of all of the processes and threads in 
>>> the job:
>>> 
>>> d1.<> f g w -g f+l
>>> +/
>>>  +__clone : 5:12[0-3.2-3, p1.2-5]
>>>  |+start_thread
>>>  | +listen_thread@oob_tcp_listener.c#705 : 1:1[p1.5]
>>>  | |+__select_nocancel
>>>  | +listen_thread@ptl_base_listener.c#214 : 1:1[p1.3]
>>>  | |+__select_nocancel
>>>  | +progress_engine@opal_progress_threads.c#105 : 5:5[0-3.2, p1.4]
>>>  | |+opal_libevent2022_event_base_loop@event.c#1632
>>>  | | +poll_dispatch@poll.c#167
>>>  | |  +__poll_nocancel
>>>  | +progress_engine@pmix_progress_threads.c#108 : 5:5[0-3.3, p1.2]
>>>  |  +opal_libevent2022_event_base_loop@event.c#1632
>>>  |   +epoll_dispatch@epoll.c#409
>>>  |    +__epoll_wait_nocancel
>>>  +_start : 5:5[0-3.1, p1.1]
>>>   +__libc_start_main
>>>    +main@cpi.c#27 : 4:4[0-3.1]
>>>    |+PMPI_Init@pinit.c#67
>>>    | +ompi_mpi_init@ompi_mpi_init.c#890 : 3:3[1-3.1]  <<<<---- THE 3 OTHER MPI PROCS MADE IT PAST FENCE
>>>    | |+ompi_rte_wait_for_debugger@rte_orte_module.c#196
>>>    | | +opal_progress@opal_progress.c#251
>>>    | |  +opal_progress_events@opal_progress.c#191
>>>    | |   +opal_libevent2022_event_base_loop@event.c#1632
>>>    | |    +poll_dispatch@poll.c#167
>>>    | |     +__poll_nocancel
>>>    | +ompi_mpi_init@ompi_mpi_init.c#904 : 1:1[0.1]  <<<<---- THE THREAD THAT IS STUCK
>>>    |  +usleep
>>>    |   +__nanosleep_nocancel
>>>    +main@main.c#14 : 1:1[p1.1]
>>>     +orterun@orterun.c#200
>>>      +opal_libevent2022_event_base_loop@event.c#1632
>>>       +poll_dispatch@poll.c#167
>>>        +__poll_nocancel
>>> 
>>> d1.<>
>>> 
>>> I have tested Open MPI 4.0.2 dozens of times, and the hang does not seem to 
>>> happen. My concern is that if the problem is indeed a race, then it's 
>>> /possible/ (but perhaps not likely) that the same race exists in Open MPI 
>>> 4.0.2, but the timing could be slightly different such that it doesn't hang 
>>> using my simple test setup. In other words, maybe I've just been "lucky" 
>>> with my testing of Open MPI 4.0.2 and have failed to provoke the hang yet.
>>> 
>>> My question is: Was this a known problem in Open MPI 4.0.1 that was fixed 
>>> in Open MPI 4.0.2?
>>> 
>>> Thanks, John D.
>>> 
>>> 
>> 
