Yes, that was an omission on my part.

Regarding volatile being sufficient - I don't think that is the case in all
situations. It might work under most conditions, but it can lead to the
"it works on my machine..." type of bugs. In particular, volatile by itself
doesn't guarantee that the waiting thread will ever see the updated value,
since it provides no memory ordering.
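
To illustrate what I mean (this is just a sketch with made-up names, not
the actual OMPI code): with C11 atomics and release/acquire ordering the
visibility is well-defined, e.g.:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool example_active = true;  /* stands in for "active" */

/* Writer side (e.g. the fence callback): publish the change. */
static void example_release(void)
{
    atomic_store_explicit(&example_active, false, memory_order_release);
}

/* Waiting side: the acquire load pairs with the release store, so once
 * the store is observed, everything written before it is visible too. */
static void example_wait(void)
{
    while (atomic_load_explicit(&example_active, memory_order_acquire)) {
        /* progress/sleep here, as in the existing macro */
    }
}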

This is easy enough to check - can a print be added, or a breakpoint set, on
the line where the flag is updated to confirm that it is in fact being set
for all ranks under the user's conditions?



From:   "Ralph Castain via devel" <devel@lists.open-mpi.org>
To:     "OpenMPI Devel" <devel@lists.open-mpi.org>
Cc:     "Ralph Castain" <r...@open-mpi.org>
Date:   11/12/2019 01:28 PM
Subject:        [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging
            inside MPI_Init() when debugged with TotalView
Sent by:        "devel" <devel-boun...@lists.open-mpi.org>



Just to be clear as well: you cannot use the pthread method you propose
because you must loop over opal_progress - the "usleep" is in there simply
to avoid consuming 100% cpu while we wait.
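
For example (just a sketch with a made-up macro name, not the actual OMPI
macro), the loop structure could stay exactly as it is and only the read of
the flag would change if someone wanted explicit ordering:

/* assumes "flg" is declared _Atomic (e.g. atomic_bool) and <stdatomic.h>
 * is included; purely illustrative */
#define EXAMPLE_LAZY_WAIT_FOR_COMPLETION(flg)                        \
    do {                                                             \
        while (atomic_load_explicit(&(flg), memory_order_acquire)) { \
            opal_progress();  /* keep driving progress */            \
            usleep(100);      /* avoid spinning at 100% cpu */       \
        }                                                            \
    } while(0)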


      On Nov 12, 2019, at 8:52 AM, George Bosilca via devel <
      devel@lists.open-mpi.org> wrote:

      I don't think there is a need for any protection around that variable.
      It will change value only once (in a callback triggered from
      opal_progress), and the volatile guarantees that loads will be issued
      for every access, so the waiting thread will eventually notice the
      change.

       George.


      On Tue, Nov 12, 2019 at 9:48 AM Austen W Lauria via devel <
      devel@lists.open-mpi.org> wrote:
        Could it be that some processes are not seeing the flag get
        updated? I don't think just using a simple while loop with a
        volatile variable is sufficient in all cases in a multi-threaded
        environment. It's my understanding that the volatile keyword just
        tells the compiler not to optimize accesses to the variable or do
        anything funky with them, because it can change at any time.
        However, this doesn't provide any memory barrier - so it's possible
        that the thread polling on this variable never sees the update.

        Looking at the code - I see:

        #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
        do { \
            opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
                                "%s lazy waiting on RTE event at %s:%d", \
                                OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
                                __FILE__, __LINE__); \
            while ((flg)) { \
                opal_progress(); \
                usleep(100); \
            } \
        } while(0);

        I think replacing that with:

        #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
        do { \
            opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
                                "%s lazy waiting on RTE event at %s:%d", \
                                OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
                                __FILE__, __LINE__); \
            pthread_mutex_lock(&lock); \
            while (flg) { \
                /* releases the lock while waiting for a signal from another thread to wake up */ \
                pthread_cond_wait(&cond, &lock); \
            } \
            pthread_mutex_unlock(&lock); \
        } while(0);

        This is much more standard when dealing with threads updating a shared
        variable, and might lead to a more predictable result in this case.

        On the other side, this would require the thread updating the
        variable to do:

        pthread_mutex_lock(&lock);
        flg = new_val;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);

        This provides the memory barrier for the thread polling on the flag
        to see the update - something the volatile keyword doesn't do on
        its own. I think it's also much cleaner as it eliminates an
        arbitrary sleep from the code - which I see as a good thing as
        well.
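
        As a self-contained illustration of that pattern (made-up names,
        nothing to do with the OMPI sources - just the waiter and the
        signaler side by side), something like:

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <unistd.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
        static bool active = true; /* protected by "lock" */

        /* Plays the role of the callback that releases the fence. */
        static void *releaser(void *arg)
        {
            (void)arg;
            sleep(1); /* simulate the fence completing later */
            pthread_mutex_lock(&lock);
            active = false; /* update the shared flag... */
            pthread_cond_signal(&cond); /* ...and wake the waiter */
            pthread_mutex_unlock(&lock);
            return NULL;
        }

        int main(void)
        {
            pthread_t tid;
            pthread_create(&tid, NULL, releaser, NULL);

            /* The lazy wait: block until the flag is cleared. */
            pthread_mutex_lock(&lock);
            while (active) {
                pthread_cond_wait(&cond, &lock); /* releases lock while blocked */
            }
            pthread_mutex_unlock(&lock);

            pthread_join(tid, NULL);
            printf("fence released\n");
            return 0;
        }

        (Compile with "cc -pthread example.c" to try it.)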



        From: "Ralph Castain via devel" <devel@lists.open-mpi.org>
        To: "OpenMPI Devel" <devel@lists.open-mpi.org>
        Cc: "Ralph Castain" <r...@open-mpi.org>
        Date: 11/12/2019 09:24 AM
        Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is
        hanging inside MPI_Init() when debugged with TotalView
        Sent by: "devel" <devel-boun...@lists.open-mpi.org>





        > On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <
        devel@lists.open-mpi.org> wrote:
        >
        > John,
        >
        > OMPI_LAZY_WAIT_FOR_COMPLETION(active)
        >
        >
        > is a simple loop that periodically checks the (volatile) "active"
        condition, which is expected to be updated by another thread.
        > So if you set your breakpoint too early, and **all** threads are
        stopped when this breakpoint is hit, you might experience
        > what looks like a race condition.
        > I guess a similar scenario can occur if the breakpoint is set in
        mpirun/orted too early, and prevents the pmix (or oob/tcp) thread
        > from sending the message to all MPI tasks.
        >
        >
        >
        > Ralph,
        >
        > does the v4.0.x branch still need the oob/tcp progress thread
        running inside the MPI app?
        > or are we missing some commits (since all interactions with
        mpirun/orted are handled by PMIx, at least in the master branch) ?

        IIRC, that progress thread only runs if explicitly asked to do so
        by MCA param. We don't need that code any more as PMIx takes care
        of it.

        >
        > Cheers,
        >
        > Gilles
        >
        > On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
        >> Hi John
        >>
        >> Sorry to say, but there is no way to really answer your question
        as the OMPI community doesn't actively test MPIR support. I haven't
        seen any reports of hangs during MPI_Init from any release series,
        including 4.x. My guess is that it may have something to do with
        the debugger interactions as opposed to being a true race
        condition.
        >>
        >> Ralph
        >>
        >>
        >>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel <
        devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>> wrote:
        >>>
        >>> Hi,
        >>>
        >>> An LLNL TotalView user on a Mac reported that their MPI job was
        hanging inside MPI_Init() when started under the control of
        TotalView. They were using Open MPI 4.0.1, and TotalView was using
        the MPIR Interface (sorry, we don't support the PMIx debugging
        hooks yet).
        >>>
        >>> I was able to reproduce the hang on my own Linux system with my
        own build of Open MPI 4.0.1, which I built with debug symbols. As
        far as I can tell, there is some sort of race inside of Open MPI
        4.0.1, because if I placed breakpoints at certain points in the
        Open MPI code, and thus changed the timing slightly, that was enough
        to avoid the hang.
        >>>
        >>> When the code hangs, it appeared as if one or more MPI
        processes were waiting inside ompi_mpi_init() at line
        ompi_mpi_init.c#904 for a fence to be released. In one of the runs,
        rank 0 was the only one that was hanging there (though I have seen
        runs where two ranks were hung there).
        >>>
        >>> Here's a backtrace of the first thread in the rank 0 process in
        the case where one rank was hung:
        >>>
        >>> d1.<> f 10.1 w
        >>> >  0 __nanosleep_nocancel PC=0x7ffff74e2efd, FP=0x7fffffffd1e0 [/lib64/libc.so.6]
        >>>    1 usleep PC=0x7ffff7513b2f, FP=0x7fffffffd200 [/lib64/libc.so.6]
        >>>    2 ompi_mpi_init PC=0x7ffff7a64009, FP=0x7fffffffd350 [/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
        >>>    3 PMPI_Init PC=0x7ffff7ab0be4, FP=0x7fffffffd390 [/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
        >>>    4 main             PC=0x00400c5e, FP=0x7fffffffd550 [/home/jdelsign/cpi.c#27]
        >>>    5 __libc_start_main PC=0x7ffff7446b13, FP=0x7fffffffd610 [/lib64/libc.so.6]
        >>>    6 _start           PC=0x00400b04, FP=0x7fffffffd618 [/amd/home/jdelsign/cpi]
        >>>
        >>> Here's the block of code where the thread is hung:
        >>>
        >>>     /* if we executed the above fence in the background, then
        >>>      * we have to wait here for it to complete. However, there
        >>>      * is no reason to do two barriers! */
        >>>     if (background_fence) {
        >>>         OMPI_LAZY_WAIT_FOR_COMPLETION(active);
        >>>     } else if (!ompi_async_mpi_init) {
        >>>         /* wait for everyone to reach this point - this is a hard
        >>>          * barrier requirement at this time, though we hope to relax
        >>>          * it at a later point */
        >>>         if (NULL != opal_pmix.fence_nb) {
        >>>             active = true;
        >>>             OPAL_POST_OBJECT(&active);
        >>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
        >>>                                        fence_release, (void*)&active))) {
        >>>                 error = "opal_pmix.fence_nb() failed";
        >>>                 goto error;
        >>>             }
        >>>             OMPI_LAZY_WAIT_FOR_COMPLETION(active);  *<<<<----- STUCK HERE WAITING FOR THE FENCE TO BE RELEASED*
        >>>         } else {
        >>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
        >>>                 error = "opal_pmix.fence() failed";
        >>>                 goto error;
        >>>             }
        >>>         }
        >>>     }
        >>>
        >>> And here is an aggregated backtrace of all of the processes and
        threads in the job:
        >>>
        >>> d1.<> f g w -g f+l
        >>> +/
        >>>  +__clone : 5:12[0-3.2-3, p1.2-5]
        >>>  |+start_thread
        >>>  | +listen_thread@oob_tcp_listener.c#705 : 1:1[p1.5]
        >>>  | |+__select_nocancel
        >>>  | +listen_thread@ptl_base_listener.c#214 : 1:1[p1.3]
        >>>  | |+__select_nocancel
        >>>  | +progress_engine@opal_progress_threads.c#105 : 5:5[0-3.2, p1.4]
        >>>  | |+opal_libevent2022_event_base_loop@event.c#1632
        >>>  | | +poll_dispatch@poll.c#167
        >>>  | |  +__poll_nocancel
        >>>  | +progress_engine@pmix_progress_threads.c#108 : 5:5[0-3.3, p1.2]
        >>>  |  +opal_libevent2022_event_base_loop@event.c#1632
        >>>  |   +epoll_dispatch@epoll.c#409
        >>>  |    +__epoll_wait_nocancel
        >>>  +_start : 5:5[0-3.1, p1.1]
        >>>   +__libc_start_main
        >>>    +main@cpi.c#27 : 4:4[0-3.1]
        >>>    |+PMPI_Init@pinit.c#67
        >>>    | +*ompi_mpi_init@ompi_mpi_init.c#890 : 3:3[1-3.1]*  <<<<---- THE 3 OTHER MPI PROCS MADE IT PAST FENCE
        >>>    | |+ompi_rte_wait_for_debugger@rte_orte_module.c#196
        >>>    | | +opal_progress@opal_progress.c#251
        >>>    | |  +opal_progress_events@opal_progress.c#191
        >>>    | |   +opal_libevent2022_event_base_loop@event.c#1632
        >>>    | |    +poll_dispatch@poll.c#167
        >>>    | |     +__poll_nocancel
        >>>    | +*ompi_mpi_init@ompi_mpi_init.c#904 : 1:1[0.1]*  <<<<---- THE THREAD THAT IS STUCK
        >>>    |  +usleep
        >>>    |   +__nanosleep_nocancel
        >>>    +main@main.c#14 : 1:1[p1.1]
        >>>     +orterun@orterun.c#200
        >>>      +opal_libevent2022_event_base_loop@event.c#1632
        >>>       +poll_dispatch@poll.c#167
        >>>        +__poll_nocancel
        >>>
        >>> d1.<>
        >>>
        >>> I have tested Open MPI 4.0.2 dozens of times, and the hang does
        not seem to happen. My concern is that if the problem is indeed a
        race, then it's /possible/ (but perhaps not likely) that the same
        race exists in Open MPI 4.0.2, but the timing could be slightly
        different such that it doesn't hang using my simple test setup. In
        other words, maybe I've just been "lucky" with my testing of Open
        MPI 4.0.2 and have failed to provoke the hang yet.
        >>>
        >>> My question is: Was this a known problem in Open MPI 4.0.1 that
        was fixed in Open MPI 4.0.2?
        >>>
        >>> Thanks, John D.
        >>>
        >>>
        >>
