> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <devel@lists.open-mpi.org> wrote:
>
> John,
>
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
>
> is a simple loop that periodically checks the (volatile) "active" condition, which is expected to be updated by another thread.
> So if you set your breakpoint too early, and **all** threads are stopped when this breakpoint is hit, you might experience what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is set in mpirun/orted too early and prevents the pmix (or oob/tcp) thread from sending the message to all MPI tasks.
>
> Ralph,
>
> does the v4.0.x branch still need the oob/tcp progress thread running inside the MPI app?
> or are we missing some commits (since all interactions with mpirun/orted are handled by PMIx, at least in the master branch)?

IIRC, that progress thread only runs if explicitly asked to do so by an MCA param. We don't need that code any more, as PMIx takes care of it.
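For anyone skimming the thread: the macro Gilles describes boils down to polling a volatile flag, with a short sleep between checks, until some other thread clears it. The sketch below is illustrative only (the names, the macro body, and the 100-microsecond interval are assumptions, not the actual Open MPI source), but it matches his description and the usleep/__nanosleep frames in John's backtrace further down:

    #include <stdbool.h>
    #include <unistd.h>

    /* Set to true before posting the fence; expected to be cleared by a
     * callback that runs on a PMIx/progress thread once the fence completes. */
    static volatile bool active;

    /* Rough shape of a "lazy wait": poll the flag, sleeping briefly between
     * checks.  If a breakpoint stops *all* threads before the callback thread
     * gets a chance to clear the flag, this loop never exits and the process
     * looks hung, which is the scenario Gilles describes. */
    #define LAZY_WAIT_FOR_COMPLETION(flag)  \
        do {                                \
            while ((flag)) {                \
                usleep(100);                \
            }                               \
        } while (0)

Under normal execution the flag is eventually cleared and the loop falls through; in the hang John reports below, that release never arrives for rank 0.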
> Cheers,
>
> Gilles
>
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>>
>> Sorry to say, but there is no way to really answer your question, as the OMPI community doesn't actively test MPIR support. I haven't seen any reports of hangs during MPI_Init from any release series, including 4.x. My guess is that it may have something to do with the debugger interactions, as opposed to being a true race condition.
>>
>> Ralph
>>
>>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:
>>>
>>> Hi,
>>>
>>> An LLNL TotalView user on a Mac reported that their MPI job was hanging inside MPI_Init() when started under the control of TotalView. They were using Open MPI 4.0.1, and TotalView was using the MPIR interface (sorry, we don't support the PMIx debugging hooks yet).
>>>
>>> I was able to reproduce the hang on my own Linux system with my own build of Open MPI 4.0.1, which I built with debug symbols. As far as I can tell, there is some sort of race inside of Open MPI 4.0.1, because if I placed breakpoints at certain points in the Open MPI code, and thus changed the timing slightly, that was enough to avoid the hang.
>>>
>>> When the code hangs, it appears as if one or more MPI processes are waiting inside ompi_mpi_init() at ompi_mpi_init.c#904 for a fence to be released. In one of the runs, rank 0 was the only one that was hanging there (though I have seen runs where two ranks were hung there).
>>>
>>> Here's a backtrace of the first thread in the rank 0 process in the case where one rank was hung:
>>>
>>> d1.<> f 10.1 w
>>> >  0 __nanosleep_nocancel  PC=0x7ffff74e2efd, FP=0x7fffffffd1e0  [/lib64/libc.so.6]
>>>    1 usleep                PC=0x7ffff7513b2f, FP=0x7fffffffd200  [/lib64/libc.so.6]
>>>    2 ompi_mpi_init         PC=0x7ffff7a64009, FP=0x7fffffffd350  [/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
>>>    3 PMPI_Init             PC=0x7ffff7ab0be4, FP=0x7fffffffd390  [/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
>>>    4 main                  PC=0x00400c5e, FP=0x7fffffffd550  [/home/jdelsign/cpi.c#27]
>>>    5 __libc_start_main     PC=0x7ffff7446b13, FP=0x7fffffffd610  [/lib64/libc.so.6]
>>>    6 _start                PC=0x00400b04, FP=0x7fffffffd618  [/amd/home/jdelsign/cpi]
>>>
>>> Here's the block of code where the thread is hung:
>>>
>>>     /* if we executed the above fence in the background, then
>>>      * we have to wait here for it to complete. However, there
>>>      * is no reason to do two barriers!
>>>      */
>>>     if (background_fence) {
>>>         OMPI_LAZY_WAIT_FOR_COMPLETION(active);
>>>     } else if (!ompi_async_mpi_init) {
>>>         /* wait for everyone to reach this point - this is a hard
>>>          * barrier requirement at this time, though we hope to relax
>>>          * it at a later point */
>>>         if (NULL != opal_pmix.fence_nb) {
>>>             active = true;
>>>             OPAL_POST_OBJECT(&active);
>>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
>>>                                fence_release, (void*)&active))) {
>>>                 error = "opal_pmix.fence_nb() failed";
>>>                 goto error;
>>>             }
>>>             OMPI_LAZY_WAIT_FOR_COMPLETION(active);  <<<<----- STUCK HERE WAITING FOR THE FENCE TO BE RELEASED
>>>         } else {
>>>             if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
>>>                 error = "opal_pmix.fence() failed";
>>>                 goto error;
>>>             }
>>>         }
>>>     }
>>>
>>> And here is an aggregated backtrace of all of the processes and threads in the job:
>>>
>>> d1.<> f g w -g f+l
>>> +/
>>>  +__clone : 5:12[0-3.2-3, p1.2-5]
>>>  |+start_thread
>>>  | +listen_thread@oob_tcp_listener.c#705 : 1:1[p1.5]
>>>  | |+__select_nocancel
>>>  | +listen_thread@ptl_base_listener.c#214 : 1:1[p1.3]
>>>  | |+__select_nocancel
>>>  | +progress_engine@opal_progress_threads.c#105 : 5:5[0-3.2, p1.4]
>>>  | |+opal_libevent2022_event_base_loop@event.c#1632
>>>  | | +poll_dispatch@poll.c#167
>>>  | |  +__poll_nocancel
>>>  | +progress_engine@pmix_progress_threads.c#108 : 5:5[0-3.3, p1.2]
>>>  |  +opal_libevent2022_event_base_loop@event.c#1632
>>>  |   +epoll_dispatch@epoll.c#409
>>>  |    +__epoll_wait_nocancel
>>>  +_start : 5:5[0-3.1, p1.1]
>>>   +__libc_start_main
>>>    +main@cpi.c#27 : 4:4[0-3.1]
>>>    |+PMPI_Init@pinit.c#67
>>>    | +ompi_mpi_init@ompi_mpi_init.c#890 : 3:3[1-3.1]  <<<<---- THE 3 OTHER MPI PROCS MADE IT PAST FENCE
>>>    | |+ompi_rte_wait_for_debugger@rte_orte_module.c#196
>>>    | | +opal_progress@opal_progress.c#251
>>>    | |  +opal_progress_events@opal_progress.c#191
>>>    | |   +opal_libevent2022_event_base_loop@event.c#1632
>>>    | |    +poll_dispatch@poll.c#167
>>>    | |     +__poll_nocancel
>>>    | +ompi_mpi_init@ompi_mpi_init.c#904 : 1:1[0.1]  <<<<---- THE THREAD THAT IS STUCK
>>>    |  +usleep
>>>    |   +__nanosleep_nocancel
>>>    +main@main.c#14 : 1:1[p1.1]
>>>     +orterun@orterun.c#200
>>>      +opal_libevent2022_event_base_loop@event.c#1632
>>>       +poll_dispatch@poll.c#167
>>>        +__poll_nocancel
>>>
>>> d1.<>
>>>
>>> I have tested Open MPI 4.0.2 dozens of times, and the hang does not seem to happen. My concern is that if the problem is indeed a race, then it's /possible/ (but perhaps not likely) that the same race exists in Open MPI 4.0.2, but the timing could be slightly different such that it doesn't hang using my simple test setup.
>>> In other words, maybe I've just been "lucky" with my testing of Open MPI 4.0.2 and have failed to provoke the hang yet.
>>>
>>> My question is: Was this a known problem in Open MPI 4.0.1 that was fixed in Open MPI 4.0.2?
>>>
>>> Thanks, John D.
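For anyone who wants to poke at this failure mode without a full Open MPI build, the following is a small self-contained toy that mirrors the shape of the block John quotes above. The pthread and the helper names are stand-ins for the PMIx progress thread and opal_pmix.fence_nb(); they are illustrative, not the real API. Commenting out the fence_release() call, or stopping both threads in a debugger before it runs, leaves the main thread spinning in much the same way rank 0 is stuck at ompi_mpi_init.c#904:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile bool active;

    /* Stand-in for fence_release(): clears the flag the main thread polls.
     * In Open MPI this runs once the fence completes. */
    static void fence_release(void *cbdata)
    {
        *(volatile bool *)cbdata = false;
    }

    /* Stand-in for the PMIx progress thread.  In the hang described above,
     * either this thread is stopped by the debugger before it can deliver
     * the release, or mpirun/orted never sends the message at all. */
    static void *progress_thread(void *arg)
    {
        sleep(1);               /* pretend the fence takes a moment */
        fence_release(arg);     /* comment this out to reproduce the "hang" */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        /* Mirrors the quoted block: set the flag, post the non-blocking
         * "fence", then lazy-wait for the callback to clear it. */
        active = true;
        if (pthread_create(&tid, NULL, progress_thread, (void *)&active) != 0) {
            perror("pthread_create");
            return 1;
        }

        while (active) {        /* OMPI_LAZY_WAIT_FOR_COMPLETION(active) */
            usleep(100);
        }

        pthread_join(tid, NULL);
        puts("fence released");
        return 0;
    }

Building it with cc -pthread and attaching a debugger makes it easy to see how stopping every thread at the poll loop, before the worker has delivered the release, is indistinguishable from a hang.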