I agree that the use of volatile is insufficient if we want to adhere to proper multi-threaded programming standards:
"Note that volatile variables are not suitable for communication between threads; they do not offer atomicity, synchronization, or memory ordering. A read from a volatile variable that is modified by another thread without synchronization or concurrent modification from two unsynchronized threads is undefined behavior due to a data race." https://en.cppreference.com/w/c/language/volatile With proper synchronization, the volatile isn't needed at all for multi-threaded programming. While for this issue the problem is not the use of volatile, it's just a ticking time bomb either way. That said I don't know how important the MPIR path is here since I understand it is being deprecated. From: Larry Baker via devel <devel@lists.open-mpi.org> To: Open MPI Developers <devel@lists.open-mpi.org> Cc: Larry Baker <ba...@usgs.gov>, devel <devel-boun...@lists.open-mpi.org> Date: 11/12/2019 04:38 PM Subject: Re: [OMPI devel] [EXTERNAL] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView Sent by: "devel" <devel-boun...@lists.open-mpi.org> "allowing us to weakly synchronize two threads" concerns me if the synchronization is important or must be reliable. I do not understand how volatile alone provides reliable synchronization without a mechanism to order visible changes to memory. If the flag(s) in question are suppposed to indicate some state has changed in this weakly synchronized behavior, without proper memory barriers, there is no guarantee that memory changes will be viewed by the two threads in the same order they were issued. It is quite possible that the updated state that is flagged as being "good" or "done" or whatever will not yet be visible across multiple cores, even though the updated flag indicator may have become visible. Only if the flag itself is the data can this work, it seems to me. If it is a flag that something has been completed, volatile is not sufficient to guarantee the corresponding changes in state will be visible. I have had such experience from code that used volatile as a proxy for memory barriers. I was told "it has never been a problem". Rare events can, and do, occur. In my case, it did after over 3 years running the code without interruption. I doubt anyone had ever run the code for such a long sample interval. We found out because we missed recording an important earthquake a week after the race condition was tripped. Murphy's law triumphs again. :) Larry Baker US Geological Survey 650-329-5608 ba...@usgs.gov On 12 Nov 2019, at 1:05:31 PM, George Bosilca via devel < devel@lists.open-mpi.org> wrote: If the issue was some kind of memory consistently between threads, then printing that variable in the context of the debugger would show the value of debugger_event_active being false. volatile is not a memory barrier, it simply forces a load for each access of the data, allowing us to weakly synchronize two threads, as long as we dot expect the synchronization to be immediate. Anyway, good to see that the issue has been solved. George. On Tue, Nov 12, 2019 at 2:25 PM John DelSignore via devel < devel@lists.open-mpi.org> wrote: Hi Austen, Thanks for the reply. What I am seeing is consistent with your thought, in that when I see the hang, one or more processes did not have a flag updated. I don't understand how the Open MPI code works well enough to say if it is a memory barrier problem or not. It almost looks like a event delivery or dropped event problem to me. 
The place in the MPI_Init() code where the MPI processes hang, and the number of "hung" processes, seems to vary from run to run. In some cases the processes are waiting for an event, or waiting for a fence (whatever that is). I did the following run today, which shows that it can hang waiting for an event that apparently was not generated or was dropped:

1. Started TV on mpirun: totalview -args mpirun -np 4 ./cpi
2. Ran the mpirun process until it hit the MPIR_Breakpoint() event.
3. TV attached to all four of the MPI processes and left all five processes stopped.
4. Continued all of the processes/threads and let them run freely for about 60 seconds. They should have run to completion in that amount of time.
5. Halted all of the processes. I included an aggregated backtrace of all of the processes below.

In this particular run, all four MPI processes were waiting in ompi_rte_wait_for_debugger() in rte_orte_module.c at line 196, which is:

    /* let the MPI progress engine run while we wait for debugger release */
    OMPI_WAIT_FOR_COMPLETION(debugger_event_active);

I don't know how that is supposed to work, but I can clearly see that debugger_event_active was true in all of the processes, even though TV set MPIR_debug_gate to 1:

d1.<> f {2.1 3.1 4.1 5.1} p debugger_event_active
Thread 2.1: debugger_event_active = true (1)
Thread 3.1: debugger_event_active = true (1)
Thread 4.1: debugger_event_active = true (1)
Thread 5.1: debugger_event_active = true (1)
d1.<> f {2.1 3.1 4.1 5.1} p MPIR_debug_gate
Thread 2.1: MPIR_debug_gate = 0x00000001 (1)
Thread 3.1: MPIR_debug_gate = 0x00000001 (1)
Thread 4.1: MPIR_debug_gate = 0x00000001 (1)
Thread 5.1: MPIR_debug_gate = 0x00000001 (1)
d1.<>

I think the _release_fn() function in rte_orte_module.c is supposed to set debugger_event_active to false, but that apparently did not happen in this case. So, AFAICT, the reason debugger_event_active would not be set to false is that the event was never delivered, so the _release_fn() function was never called. If that's the case, then the lack of a memory barrier is probably a moot point, and the problem is likely related to event generation or dropped events.

Cheers, John D.
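To make that failure mode concrete, here is a self-contained sketch (hypothetical stand-ins, not the actual Open MPI source) of the handshake described above: the main thread polls a flag that only an event callback ever clears, so if the event is dropped, the loop spins forever regardless of memory ordering:

    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile bool debugger_event_active = true;

    /* stands in for _release_fn() in rte_orte_module.c: it would run when
     * the debugger-release event is delivered, and clear the flag */
    static void release_fn(void)
    {
        debugger_event_active = false;
    }

    /* stands in for opal_progress(): it would poll the event library and
     * invoke release_fn() when the event arrives; doing nothing here
     * models a dropped event */
    static void progress(void)
    {
    }

    int main(void)
    {
        (void)release_fn;   /* in the failing runs, nothing ever calls this */

        /* analogous to OMPI_WAIT_FOR_COMPLETION(debugger_event_active) */
        while (debugger_event_active) {
            progress();
            usleep(100);
        }
        puts("released");   /* never reached when the event is dropped */
        return 0;
    }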
FWIW: Here's the aggregated backtrace after the whole job was allowed to run freely for about 60 seconds and then stopped:

d1.<> f g w -g f+l
+/
+__clone : 5:12[0-3.2-3, p1.2-5]
|+start_thread
| +listen_thread : 1:2[p1.3, p1.5]
| |+__select_nocancel
| +progress_engine@opal_progress_threads.c#105 : 4:4[0-3.2]
| |+opal_libevent2022_event_base_loop@event.c#1630
| | +poll_dispatch@poll.c#165
| |  +__poll_nocancel
| +progress_engine@pmix_progress_threads.c#109 : 4:4[0-3.3]
| |+opal_libevent2022_event_base_loop@event.c#1630
| | +epoll_dispatch@epoll.c#407
| |  +__epoll_wait_nocancel
| +progress_engine : 1:2[p1.2, p1.4]
|  +opal_libevent2022_event_base_loop@event.c#1630
|   +epoll_dispatch@epoll.c#407 : 1:1[p1.2]
|   |+__epoll_wait_nocancel
|   +poll_dispatch@poll.c#165 : 1:1[p1.4]
|    +__poll_nocancel
+_start : 5:5[0-3.1, p1.1]
 +__libc_start_main
  +main@cpi.c#27 : 4:4[0-3.1]
  |+PMPI_Init@pinit.c#67
  | +ompi_mpi_init@ompi_mpi_init.c#890
  |  +ompi_rte_wait_for_debugger@rte_orte_module.c#196
  |   +opal_progress@opal_progress.c#245 : 1:1[0.1]
  |   |+opal_progress_events@opal_progress.c#191
  |   | +opal_libevent2022_event_base_loop@event.c#1630
  |   |  +poll_dispatch@poll.c#165
  |   |   +__poll_nocancel
  |   +opal_progress@opal_progress.c#247 : 3:3[1-3.1]
  |    +opal_progress_events@opal_progress.c#191
  |     +opal_libevent2022_event_base_loop@event.c#1630
  |      +poll_dispatch@poll.c#165
  |       +__poll_nocancel
  +orterun : 1:1[p1.1]
   +opal_libevent2022_event_base_loop@event.c#1630
    +poll_dispatch@poll.c#165
     +__poll_nocancel
d1.<>

On 11/12/19 9:47 AM, Austen W Lauria via devel wrote:

Could it be that some processes are not seeing the flag get updated? I don't think just using a simple while loop with a volatile variable is sufficient in all cases in a multi-threaded environment. It's my understanding that the volatile keyword just tells the compiler not to optimize or do anything funky with it, because it can change at any time. However, this doesn't provide any memory barrier, so it's possible that the thread polling on this variable never sees the update.

Looking at the code, I see:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)                                \
    do {                                                                  \
        opal_output_verbose(1, ompi_rte_base_framework.framework_output,  \
                            "%s lazy waiting on RTE event at %s:%d",      \
                            OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),           \
                            __FILE__, __LINE__);                          \
        while ((flg)) {                                                   \
            opal_progress();                                              \
            usleep(100);                                                  \
        }                                                                 \
    } while(0);

I think replacing that with:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock)                    \
    do {                                                                  \
        opal_output_verbose(1, ompi_rte_base_framework.framework_output,  \
                            "%s lazy waiting on RTE event at %s:%d",      \
                            OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),           \
                            __FILE__, __LINE__);                          \
        pthread_mutex_lock(&lock);                                        \
        while ((flg)) {                                                   \
            /* releases the lock while waiting for a signal from another thread to wake up */ \
            pthread_cond_wait(&cond, &lock);                              \
        }                                                                 \
        pthread_mutex_unlock(&lock);                                      \
    } while(0);

is much more standard when dealing with threads updating a shared variable, and might lead to a more expected result in this case. On the other end, this would require the thread updating this variable to do:

pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

This provides the memory barrier for the thread polling on the flag to see the update, something the volatile keyword doesn't do on its own. I think it's also much cleaner, as it eliminates an arbitrary sleep from the code, which I see as a good thing as well.
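To see the whole proposal in one place, here is a self-contained sketch (hypothetical names, not an actual Open MPI patch) wiring together the waiting side and the releasing side of the mutex/condvar handshake:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool flg = true;                  /* the "event still active" flag */

    /* waiting side: the role OMPI_LAZY_WAIT_FOR_COMPLETION would play */
    static void *waiter(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (flg)                          /* loop guards against spurious wakeups */
            pthread_cond_wait(&cond, &lock); /* sleeps with the lock released */
        pthread_mutex_unlock(&lock);
        puts("released");
        return NULL;
    }

    /* releasing side: the role _release_fn() would play */
    static void *releaser(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);           /* the mutex provides the memory barrier */
        flg = false;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t w, r;
        pthread_create(&w, NULL, waiter, NULL);
        pthread_create(&r, NULL, releaser, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }

Note that this still hangs if the release event is never delivered: the condvar fixes visibility and removes the busy-wait, but it cannot recover a lost event, which matches John's observation above.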
From: "Ralph Castain via devel" <devel@lists.open-mpi.org>
To: "OpenMPI Devel" <devel@lists.open-mpi.org>
Cc: "Ralph Castain" <r...@open-mpi.org>
Date: 11/12/2019 09:24 AM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView
Sent by: "devel" <devel-boun...@lists.open-mpi.org>

> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <devel@lists.open-mpi.org> wrote:
>
> John,
>
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
>
> is a simple loop that periodically checks the (volatile) "active" condition, which is expected to be updated by another thread. So if you set your breakpoint too early, and **all** threads are stopped when this breakpoint is hit, you might experience what looks like a race condition. I guess a similar scenario can occur if the breakpoint is set in mpirun/orted too early and prevents the pmix (or oob/tcp) thread from sending the message to all MPI tasks.
>
> Ralph,
>
> does the v4.0.x branch still need the oob/tcp progress thread running inside the MPI app? Or are we missing some commits (since all interactions with mpirun/orted are handled by PMIx, at least in the master branch)?

IIRC, that progress thread only runs if explicitly asked to do so by MCA param. We don't need that code any more, as PMIx takes care of it.

> Cheers,
>
> Gilles
>
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>>
>> Sorry to say, but there is no way to really answer your question, as the OMPI community doesn't actively test MPIR support. I haven't seen any reports of hangs during MPI_Init from any release series, including 4.x. My guess is that it may have something to do with the debugger interactions, as opposed to being a true race condition.
>>
>> Ralph
>>
>>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:
>>>
>>> Hi,
>>>
>>> An LLNL TotalView user on a Mac reported that their MPI job was hanging inside MPI_Init() when started under the control of TotalView. They were using Open MPI 4.0.1, and TotalView was using the MPIR interface (sorry, we don't support the PMIx debugging hooks yet).
>>>
>>> I was able to reproduce the hang on my own Linux system with my own build of Open MPI 4.0.1, which I built with debug symbols. As far as I can tell, there is some sort of race inside Open MPI 4.0.1, because if I placed breakpoints at certain points in the Open MPI code, and thus changed the timing slightly, that was enough to avoid the hang.
>>>
>>> When the code hangs, it appears as if one or more MPI processes are waiting inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a fence to be released. In one of the runs, rank 0 was the only one that was hanging there (though I have seen runs where two ranks were hung there).
>>>
>>> Here's a backtrace of the first thread in the rank 0 process in the case where one rank was hung:
>>>
>>> d1.<> f 10.1 w
>>> >  0 __nanosleep_nocancel  PC=0x7ffff74e2efd, FP=0x7fffffffd1e0  [/lib64/libc.so.6]
>>>    1 usleep                PC=0x7ffff7513b2f, FP=0x7fffffffd200  [/lib64/libc.so.6]
>>>    2 ompi_mpi_init         PC=0x7ffff7a64009, FP=0x7fffffffd350  [/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
>>>    3 PMPI_Init             PC=0x7ffff7ab0be4, FP=0x7fffffffd390  [/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
>>>    4 main                  PC=0x00400c5e, FP=0x7fffffffd550  [/home/jdelsign/cpi.c#27]
>>>    5 __libc_start_main     PC=0x7ffff7446b13, FP=0x7fffffffd610  [/lib64/libc.so.6]
>>>    6 _start                PC=0x00400b04, FP=0x7fffffffd618  [/amd/home/jdelsign/cpi]
>>>
>>> Here's the block of code where the thread is hung:
>>>
>>> /* if we executed the above fence in the background, then
>>>  * we have to wait here for it to complete. However, there
>>>  * is no reason to do two barriers! */
>>> if (background_fence) {
>>>     OMPI_LAZY_WAIT_FOR_COMPLETION(active);
>>> } else if (!ompi_async_mpi_init) {
>>>     /* wait for everyone to reach this point - this is a hard
>>>      * barrier requirement at this time, though we hope to relax
>>>      * it at a later point */
>>>     if (NULL != opal_pmix.fence_nb) {
>>>         active = true;
>>>         OPAL_POST_OBJECT(&active);
>>>         if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
>>>                                        fence_release, (void*)&active))) {
>>>             error = "opal_pmix.fence_nb() failed";
>>>             goto error;
>>>         }
>>>         OMPI_LAZY_WAIT_FOR_COMPLETION(active);   <<<<----- STUCK HERE WAITING FOR THE FENCE TO BE RELEASED
>>>     } else {
>>>         if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
>>>             error = "opal_pmix.fence() failed";
>>>             goto error;
>>>         }
>>>     }
>>> }
>>>
>>> And here is an aggregated backtrace of all of the processes and threads in the job:
>>>
>>> d1.<> f g w -g f+l
>>> +/
>>> +__clone : 5:12[0-3.2-3, p1.2-5]
>>> |+start_thread
>>> | +listen_thread@oob_tcp_listener.c#705 : 1:1[p1.5]
>>> | |+__select_nocancel
>>> | +listen_thread@ptl_base_listener.c#214 : 1:1[p1.3]
>>> | |+__select_nocancel
>>> | +progress_engine@opal_progress_threads.c#105 : 5:5[0-3.2, p1.4]
>>> | |+opal_libevent2022_event_base_loop@event.c#1632
>>> | | +poll_dispatch@poll.c#167
>>> | |  +__poll_nocancel
>>> | +progress_engine@pmix_progress_threads.c#108 : 5:5[0-3.3, p1.2]
>>> |  +opal_libevent2022_event_base_loop@event.c#1632
>>> |   +epoll_dispatch@epoll.c#409
>>> |    +__epoll_wait_nocancel
>>> +_start : 5:5[0-3.1, p1.1]
>>>  +__libc_start_main
>>>   +main@cpi.c#27 : 4:4[0-3.1]
>>>   |+PMPI_Init@pinit.c#67
>>>   | +ompi_mpi_init@ompi_mpi_init.c#890 : 3:3[1-3.1]   <<<<---- THE 3 OTHER MPI PROCS MADE IT PAST FENCE
>>>   | |+ompi_rte_wait_for_debugger@rte_orte_module.c#196
>>>   | | +opal_progress@opal_progress.c#251
>>>   | |  +opal_progress_events@opal_progress.c#191
>>>   | |   +opal_libevent2022_event_base_loop@event.c#1632
>>>   | |    +poll_dispatch@poll.c#167
>>>   | |     +__poll_nocancel
>>>   | +ompi_mpi_init@ompi_mpi_init.c#904 : 1:1[0.1]   <<<<---- THE THREAD THAT IS STUCK
>>>   |  +usleep
>>>   |   +__nanosleep_nocancel
>>>   +main@main.c#14 : 1:1[p1.1]
>>>    +orterun@orterun.c#200
>>>     +opal_libevent2022_event_base_loop@event.c#1632
>>>      +poll_dispatch@poll.c#167
>>>       +__poll_nocancel
>>>
>>> d1.<>
>>>
>>> I have tested Open MPI 4.0.2 dozens of times, and the hang does not seem to happen. My concern is that if the problem is indeed a race, then it's /possible/ (but perhaps not likely) that the same race exists in Open MPI 4.0.2, but the timing is slightly different, such that it doesn't hang using my simple test setup. In other words, maybe I've just been "lucky" with my testing of Open MPI 4.0.2 and have failed to provoke the hang yet.
>>>
>>> My question is: Was this a known problem in Open MPI 4.0.1 that was fixed in Open MPI 4.0.2?
>>>
>>> Thanks, John D.