Hi! I'm trying to get DMTCP working with OpenHPC. Non-MPI programs (even multi-threaded ones) work fine with DMTCP. MPI programs do not, most likely because prun is used to execute them instead of srun.
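For reference, the non-MPI case that works for me is just the plain DMTCP cycle (a sketch; a.out stands in for my multi-threaded test program):

  # launch under DMTCP; a coordinator is started automatically if none is running
  dmtcp_launch ./a.out

  # from another shell: request a checkpoint, then later restart from the image
  # (checkpoint images ckpt_*.dmtcp land in the current directory by default)
  dmtcp_command --checkpoint
  dmtcp_restart ckpt_*.dmtcp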
I'm using:

  git clone https://github.com/dmtcp/dmtcp.git
  cd dmtcp
  ./configure --enable-infiniband-support

PATH and LD_LIBRARY_PATH are exported and set (verified). Additional paranoia settings (shouldn't really be needed):

  export DMTCP_COORD_HOST=c105
  export DMTCP_COORD_PORT=7779
  # for Open MPI, since it claims to use SIGUSR2:
  export DMTCP_SIGCKPT=16

  # OpenHPC uses srun for Slurm reservations and prun to execute inside,
  # but mpirun instead of prun works as well
  srun -w c[105-107] --pty /bin/bash

1) Using mpirun last: MVAPICH2 2.2 w/ gcc 5.4.0 works without checkpointing:

  dmtcp_launch --rm --ib mpirun ./mpi_hello

but with checkpointing it fails:

  dmtcp_launch --rm --ib -i 3 mpirun ./mpi_hello
  Hello from task 0 on c105!
  MASTER: Number of MPI tasks is: 3
  Hello from task 2 on c107!
  Hello from task 1 on c106!
  [40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval;
  REASON='JWARNING(false) failed'
       _dataSockets[i]->socket().sockfd() = 9
       buffer.size() = 129
       WARN_INTERVAL_SEC = 10
  Message: Still draining socket... perhaps remote host is not running under DMTCP?
  [the same warning repeats every 10 seconds]
  etc.

On c105, I see:

  fmuelle 7536 1.6 0.0 68536 6184 pts/0 Sl+ 14:56 0:00 mpirun ./mpi_hello
  fmuelle 7540 0.0 0.0 19968 2832 pts/0 S+ 14:56 0:00 /home/fmuelle/projects/dmtcp/dmtcp/bin/dmtcp_coordinator --quiet --exit-on-last --daemon
  fmuelle 7543 1.0 0.0 67140 5496 ? Ssl 14:56 0:00 dmtcp_srun_helper dmtcp_nocheckpoint /usr/bin/srun -N 3 -n 3 dmtcp_launch --coord-host 10.4.1.106 --coord-port 7779 --ckpt-signal 16 --ckptdir /home/fmuelle --infiniband --batch-queue --explicit-srun /opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin/hydra_pmi_proxy --control-port c105:46791 --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id -1
  fmuelle 7548 3.0 0.0 367644 6264 ? Sl 14:56 0:00 /usr/bin/srun -N 3 -n 3 dmtcp_launch --coord-host 10.4.1.106 --coord-port 7779 --ckpt-signal 16 --ckptdir /home/fmuelle --infiniband --batch-queue --explicit-srun /opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin/hydra_pmi_proxy --control-port c105:46791 --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id -1
  fmuelle 7560 0.0 0.0 29768 632 ? S 14:56 0:00 /usr/bin/srun -N 3 -n 3 dmtcp_launch --coord-host 10.4.1.106 --coord-port 7779 --ckpt-signal 16 --ckptdir /home/fmuelle --infiniband --batch-queue --explicit-srun /opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin/hydra_pmi_proxy --control-port c105:46791 --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id -1
  fmuelle 7576 4.0 0.0 70572 6884 ? Sl 14:56 0:00 /opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin/hydra_pmi_proxy --control-port c105:46791 --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id -1
  fmuelle 7589 69.0 0.0 123256 16592 ? Rsl 14:56 0:00 ./mpi_hello

On c106/c107, I see:

  fmuelle 23078 0.1 0.0 136176 8760 ? Sl 14:56 0:00 /opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin/hydra_pmi_proxy --control-port c105:46791 --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id -1
  fmuelle 23091 0.3 0.0 123272 23292 ? Ssl 14:56 0:00 ./mpi_hello

So mpi_hello is not dmtcp-wrapped. Is that the issue?
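For what it's worth, this is roughly how I check what the coordinator thinks it controls (a sketch; dmtcp_command picks up the coordinator host/port from the DMTCP_COORD_HOST/DMTCP_COORD_PORT exports above), expecting each mpi_hello rank to show up as a connected peer:

  # ask the coordinator on c105:7779 for a status report; the peer count
  # should include the mpi_hello ranks if they are dmtcp-wrapped
  dmtcp_command --status

  # trigger a checkpoint by hand instead of relying on the -i 3 interval
  dmtcp_command --checkpoint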
2) Using mpirun first:

  mpirun /home/fmuelle/projects/dmtcp/dmtcp/bin/dmtcp_launch --ib ~/mpi_hello

  ===================================================================================
  =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
  =   PID 14985 RUNNING AT c107
  =   EXIT CODE: 99
  =   CLEANING UP REMAINING PROCESSES
  =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
  ===================================================================================
  [proxy:0:0@c105] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
  [proxy:0:0@c105] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
  [proxy:0:0@c105] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
  [proxy:0:1@c106] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
  [proxy:0:1@c106] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
  [proxy:0:1@c106] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
  srun: error: c106: task 1: Exited with exit code 7
  srun: error: c105: task 0: Exited with exit code 7
  [mpiexec@c105] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
  [mpiexec@c105] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
  [mpiexec@c105] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
  [mpiexec@c105] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

Open MPI 1.10.4 w/ gcc 5.4.0 does not work either:

1) With IB:

  dmtcp_launch --rm --ib mpirun ./mpi_hello
  [40000] WARNING at socketconnection.cpp:231 in TcpConnection;
  REASON='JWARNING(false) failed'
       type = 2
  Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
  --------------------------------------------------------------------------
  mpirun noticed that process rank 1 with PID 0 on node c106 exited on signal 11 (Segmentation fault).
  --------------------------------------------------------------------------

2) With TCP over Ethernet:

  dmtcp_launch --rm mpirun -mca btl self,tcp ./mpi_hello
  [40000] WARNING at socketconnection.cpp:231 in TcpConnection;
  REASON='JWARNING(false) failed'
       type = 2
  Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
  [c105:40000] [[17213,0],0]->[[17213,0],1] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 15]
  [c105:40000] [[17213,0],0]-[[17213,0],1] mca_oob_tcp_peer_send_handler: unable to send header
  --------------------------------------------------------------------------
  mpirun noticed that process rank 2 with PID 0 on node c107 exited on signal 11 (Segmentation fault).
  --------------------------------------------------------------------------

Any idea?

-- 
Frank Mueller
Department of Computer Science
North Carolina State University
3266 EB2
Raleigh, NC 27695-8206