Yes, running over the sockets provider. I configured libfabric-1.5.3 with the default providers; udp and sockets are the only ones - plus rxm and rxd, but I don't think they apply.
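For completeness: libfabric's stock fi_info utility dumps the providers a build exposes, and the FI_PROVIDER environment variable pins the selection at run time - both standard libfabric knobs, nothing QEMU-specific, e.g.:

  $ fi_info                                             # dump the providers this build exposes
  $ FI_PROVIDER=sockets mpiexec -n 2 ./mpi-hello-world  # force the run onto the sockets provider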
FWIW, I saw the same hang with 1.3.0 and 1.4.2, and I see the same hang with Open MPI and libfabric on QEMU (though I haven't looked into Open MPI in as much detail).

It shouldn't matter, but I'm running QEMU/KVM on an AMD box, so there could be some hidden Intel-ism that's causing the problem. (My latent paranoia is showing...)

Thanks!

John

-----Original Message-----
From: Hefty, Sean [mailto:sean.he...@intel.com]
Sent: Monday, February 05, 2018 2:15 PM
To: Wilkes, John <john.wil...@amd.com>; libfabric-us...@lists.openfabrics.org; ofiwg@lists.openfabrics.org
Subject: RE: libfabric hangs on QEMU/KVM virtual cluster

copying ofiwg mailing list as well

Are you running over the sockets provider? I'm not aware of any issues running over QEMU, but I don't know of anyone who has tested it.

I'll check on the testing with MPICH to see what's been tested and how recently it's been run.

- Sean

> I have a four node cluster of QEMU/KVM virtual machines. I installed
> MPICH-3.2 and ran the mpi-hello-world program with no problem.
>
> I installed libfabric-1.5.3 and ran fabtests-1.5.3:
>
>   $ $PWD/runfabtests.sh -p /nfs/fabtests/bin sockets 192.168.100.201 192.168.100.203
>
> And all tests pass:
>
>   # --------------------------------------------------------------
>   # Total Pass            73
>   # Total Notrun           0
>   # Total Fail             0
>   # Percentage of Pass   100
>   # --------------------------------------------------------------
>
> I rebuilt MPICH after configuring it to use libfabric. I recompiled
> the mpi-hello-world program. When I run mpi-hello-world with
> libfabric, it prints the "hello" message from all four nodes but
> hangs in MPI_Finalize.
>
> I rebuilt libfabric and MPICH with debugging enabled and generated a
> log file when running mpi-hello-world on just two nodes (i.e. using
> "-n 2" instead of "-n 4"). The log file indicates that it is stuck
> "Waiting for 1 close operations", repeating "MPID_nem_ofi_poll" over
> and over until I stop the program with control-C:
>
>   ...
>   <"MPID_nem_ofi_poll"(3e-06)  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
>   >"MPID_nem_ofi_poll"  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
>   >"MPID_nem_ofi_cts_send_callback"  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[188]
>   >"MPID_nem_ofi_handle_packet"  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[167]
>   <"MPID_nem_ofi_handle_packet"(3e-06)  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[175]
>   <"MPID_nem_ofi_cts_send_callback"(9e-06)  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[191]
>   >"MPID_nem_ofi_data_callback"  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[124]
>   <"MPID_nem_ofi_data_callback"(3e-06)  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[173]
>   <"MPID_nem_ofi_poll"(0.00404)  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
>   <MPIDI_CH3I_PROGRESS(0.00796)  src/mpid/ch3/channels/nemesis/src/ch3_progress.c[659]
>   Waiting for 1 close operations  src/mpid/ch3/src/ch3u_handle_connection.c[382]
>   >MPIDI_CH3I_PROGRESS  src/mpid/ch3/channels/nemesis/src/ch3_progress.c[424]
>   >"MPID_nem_ofi_poll"  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
>   <"MPID_nem_ofi_poll"(3e-06)  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
>   ...
>
> I get the same behavior with Open MPI; mpi-hello-world prints the
> "hello" message from all four nodes and hangs. Without libfabric, it
> runs normally.
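>
> For reference, the test program itself is nothing special - a minimal
> sketch along the lines of the textbook MPI hello world (built with
> mpicc; the only calls involved are MPI_Init, MPI_Comm_rank/size,
> MPI_Get_processor_name, and MPI_Finalize):
>
>   /* mpi-hello-world.c
>    * Build: mpicc mpi-hello-world.c -o mpi-hello-world
>    * Run:   mpiexec -n 4 ./mpi-hello-world
>    */
>   #include <stdio.h>
>   #include <mpi.h>
>
>   int main(int argc, char **argv)
>   {
>       int rank, size, len;
>       char name[MPI_MAX_PROCESSOR_NAME];
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
>       MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
>       MPI_Get_processor_name(name, &len);
>       printf("Hello from rank %d of %d on %s\n", rank, size, name);
>
>       MPI_Finalize();  /* all four ranks print, then hang here */
>       return 0;
>   }
>
> On the libfabric side, setting FI_LOG_LEVEL=debug (a standard
> libfabric environment variable) makes the library emit its own trace,
> which may help correlate with the MPICH log above.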
>
> Is there a known issue with libfabric on a QEMU/KVM virtual cluster?
> It seems like this should work.
>
> --
> John Wilkes | AMD Research | john.wil...@amd.com
> <mailto:john.wil...@amd.com> | office: +1 425.586.6412 (x26412)

_______________________________________________
ofiwg mailing list
ofiwg@lists.openfabrics.org
http://lists.openfabrics.org/mailman/listinfo/ofiwg