Hi,

Moving this over to the devel list... I’m not sure if it’s a problem with PMIx or with OMPI’s integration with it. It looks like the wait_cbfunc callback enqueued as part of the PMIX_PTL_SEND_RECV at pmix_client_connect.c:329 is never called, so the main thread is never woken from the PMIX_WAIT_THREAD at pmix_client_connect.c:232. (This is for PMIx v2.1.1.) I haven’t yet worked out why that callback isn’t being called... judging by the output below, the client is expecting a message back from the PMIx server that never arrives.
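To make the hang easier to follow, here is a stripped-down sketch of the blocking pattern involved. This is not the actual PMIx code, just the general shape of it in plain pthreads; the struct, field, and function names here are made up for illustration. The caller posts the request with a completion callback and then parks on a condition variable; the callback, run by the progress thread when the server’s reply for the posted tag is matched, is what is supposed to wake it up. If that reply never arrives, the caller sits in pthread_cond_wait forever, which matches the backtrace further down.

/* Simplified illustration only, not the PMIx implementation. */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            active;   /* true while still waiting for the reply */
    int             status;
} lock_t;

/* Caller side, e.g. the thread blocked inside PMIx_Disconnect:
 * wait until the completion callback clears "active". */
static void wait_for_reply(lock_t *lock)
{
    pthread_mutex_lock(&lock->mutex);
    while (lock->active) {
        pthread_cond_wait(&lock->cond, &lock->mutex);   /* where the hang shows up */
    }
    pthread_mutex_unlock(&lock->mutex);
}

/* Callback side, e.g. wait_cbfunc, run by the progress thread once the
 * server's reply is received and matched to the posted tag. */
static void reply_cbfunc(lock_t *lock, int status)
{
    pthread_mutex_lock(&lock->mutex);
    lock->status = status;
    lock->active = false;
    pthread_cond_signal(&lock->cond);   /* wakes the blocked caller */
    pthread_mutex_unlock(&lock->mutex);
}

If reply_cbfunc (i.e. the real wait_cbfunc) is never invoked, nothing ever signals the condition variable, which is what the trace below appears to show.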
[raijin7:05505] pmix: disconnect called
[raijin7:05505] [../../../../../src/mca/ptl/tcp/ptl_tcp.c:431] post send to server
[raijin7:05505] posting recv on tag 119
[raijin7:05505] QUEIENG MSG TO SERVER OF SIZE 645
[raijin7:05505] 1746468865:0 ptl:base:send_handler SENDING TO PEER 1746468864:0 tag 119 with NON-NULL msg
[raijin7:05505] ptl:base:send_handler SENDING MSG
[raijin7:05493] 1746468864:0 ptl:base:recv:handler called with peer 1746468865:0
[raijin7:05493] ptl:base:recv:handler allocate new recv msg
[raijin7:05493] ptl:base:recv:handler read hdr on socket 27
[raijin7:05493] RECVD MSG FOR TAG 119 SIZE 645
[raijin7:05493] ptl:base:recv:handler allocate data region of size 645
[raijin7:05505] ptl:base:send_handler MSG SENT
[raijin7:05493] 1746468864:0 RECVD COMPLETE MESSAGE FROM SERVER OF 645 BYTES FOR TAG 119 ON PEER SOCKET 27
[raijin7:05493] [../../../../src/mca/ptl/base/ptl_base_sendrecv.c:507] post msg
[raijin7:05493] 1746468864:0 message received 645 bytes for tag 119 on socket 27
[raijin7:05493] checking msg on tag 119 for tag 0
[raijin7:05493] checking msg on tag 119 for tag 4294967295
[raijin7:05505] pmix: disconnect completed
[raijin7:05493] 1746468864:0 EXECUTE CALLBACK for tag 119
[raijin7:05493] SWITCHYARD for 1746468865:0:27
[raijin7:05493] recvd pmix cmd 11 from 1746468865:0
[raijin7:05493] recvd CONNECT from peer 1746468865:0
[raijin7:05493] get_tracker called with 32 procs
[raijin7:05493] 1746468864:0 CALLBACK COMPLETE

Here, 5493 is the mpirun and 5505 is one of the spawned processes. All of the MPI processes (i.e. the original one along with the dynamically launched ones) appear to be waiting on the same pthread_cond_wait as in the backtrace below, while the mpirun is just sitting in its standard event loops (event_base_loop, oob_tcp_listener, opal_progress_threads, ptl_base_listener, and pmix_progress_threads). That said, I’m not sure why get_tracker is reporting 32 procs, since there are only 16 running here (i.e. 1 original + 15 spawned).

Or should I post this over on the PMIx list instead?

Cheers,
Ben

> On 17 May 2018, at 9:59 am, Ben Menadue <ben.mena...@nci.org.au> wrote:
>
> Hi,
>
> I’m trying to debug a user’s program that uses dynamic process management through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect. Each of the processes is in
>
> #0  0x00007ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00007ff7130760d3 in PMIx_Disconnect (procs=0x5b0d600, nprocs=<value optimized out>, info=<value optimized out>, ninfo=0) at ../../src/client/pmix_client_connect.c:232
> #2  0x00007ff7132fa670 in ext2x_disconnect (procs=0x7fff48ed6700) at ext2x_client.c:1432
> #3  0x00007ff71a3b7ce4 in ompi_dpm_disconnect (comm=0x5af6910) at ../../../../../ompi/dpm/dpm.c:596
> #4  0x00007ff71a402ff8 in PMPI_Comm_disconnect (comm=0x5a3c4f8) at pcomm_disconnect.c:67
> #5  0x00007ff71a7466b9 in mpi_comm_disconnect () from /home/900/bjm900/R/x86_64-pc-linux-gnu-library/3.4/Rmpi/libs/Rmpi.so
>
> This is using Open MPI 3.1.0 against an external install of PMIx 2.1.1, but I see exactly the same issue with 3.0.1 using its internal PMIx. It looks similar to issue #4542, but the corresponding patch in PR #4549 doesn’t seem to help (it just hangs in PMIx_Fence instead of PMIx_Disconnect).
>
> Attached is the offending R script; it hangs in the “closeCluster” call. Has anyone seen this issue? I’m not sure what approach to take to debug it, but I have builds of the MPI libraries with --enable-debug available if needed.
>
> Cheers,
> Ben
>
> <Rmpi_test.r>
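P.S. In case it helps anyone try to reproduce this outside of R, below is a rough C analogue of what the Rmpi/doMPI workflow does (spawn workers, then disconnect from them). This is an untested sketch; the "./worker" binary name and the process count are placeholders, and the worker side is only described in the comment.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    int errcodes[15];

    MPI_Init(&argc, &argv);

    /* Spawn 15 children (1 original + 15 spawned = 16 procs, matching the
     * run described above). "./worker" is a placeholder binary that calls
     * MPI_Comm_get_parent() and later MPI_Comm_disconnect() on the parent
     * intercommunicator before MPI_Finalize(). */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 15, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &intercomm, errcodes);

    /* ... exchange work with the children here ... */

    /* The equivalent of Rmpi/doMPI's closeCluster() step: both sides call
     * MPI_Comm_disconnect, which ends up in PMIx_Disconnect underneath. */
    MPI_Comm_disconnect(&intercomm);

    MPI_Finalize();
    return 0;
}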