Hi!

I'm trying to get DMTCP working w/ OpenHPC.
Non-MPI programs work fine (even multi-threaded) w/ DMTCP.
MPI programs do not, most likely because prun is used to execute jobs
instead of srun.

I'm using:

git clone https://github.com/dmtcp/dmtcp.git
cd dmtcp
./configure --enable-infiniband-support

PATH and LD_LIBRARY_PATH are exported and set correctly (verified).
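
For completeness, the rest of the build and environment setup would look
roughly like this (the bin path is from my tree, as also seen in the ps
output below; the lib/dmtcp plugin path is a guess, adjust as needed):

make -j
export PATH=$HOME/projects/dmtcp/dmtcp/bin:$PATH
export LD_LIBRARY_PATH=$HOME/projects/dmtcp/dmtcp/lib/dmtcp:$LD_LIBRARY_PATH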

Additional paranoia settings (shouldn't really be needed):
export DMTCP_COORD_HOST=c105
export DMTCP_COORD_PORT=7779
#for Open MPI, since it claims to use SIGUSR2 (DMTCP's default checkpoint signal):
export DMTCP_SIGCKPT=16
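
(In case it matters: starting the coordinator by hand on c105 to match the
variables above would look roughly like the line below; dmtcp_launch also
auto-starts one with --exit-on-last --daemon, as visible in the ps output
further down.)

dmtcp_coordinator --exit-on-last --daemon --port 7779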

#OpenHPC uses srun for Slurm reservations and prun to execute inside,
#but mpirun instead of prun works as well

srun -w c[105-107] --pty /bin/bash
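
For reference, the prun variant inside the allocation would presumably be
just the obvious substitution:

dmtcp_launch --rm --ib prun ./mpi_hello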

1) using mpirun last:

MVAPICH2 2.2 w/ gcc 5.4.0 works w/o ckpt:
dmtcp_launch --rm --ib mpirun ./mpi_hello

but w/ ckpt it fails:
dmtcp_launch --rm --ib -i 3 mpirun ./mpi_hello
Hello from task 0 on c105!
MASTER: Number of MPI tasks is: 3
Hello from task 2 on c107!
Hello from task 1 on c106!
[40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval;
REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 9
     buffer.size() = 129
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running
under DMTCP?
[40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval;
REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 9
     buffer.size() = 129
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running
under DMTCP?
etc.
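
(For context on what I'm ultimately after: once the -i 3 interval
checkpoints actually get written to the --ckptdir, restarting would be
roughly one of the following, with DMTCP_COORD_HOST/PORT still set as
above.)

dmtcp_restart ckpt_*.dmtcp
# or the restart script that DMTCP generates in the checkpoint directory:
./dmtcp_restart_script.sh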

On c105, I see

fmuelle   7536  1.6  0.0  68536  6184 pts/0  Sl+  14:56  0:00 mpirun ./mpi_hello
fmuelle   7540  0.0  0.0  19968  2832 pts/0  S+   14:56  0:00 /home/fmuelle/projects/dmtcp/dmtcp/bin/dmtcp_coordinator --quiet --exit-on-last --daemon
fmuelle   7543  1.0  0.0  67140  5496 ?      Ssl  14:56  0:00 dmtcp_srun_helper dmtcp_nocheckpoint /usr/bin/srun -N 3 -n 3 dmtcp_launch --coord-host 10.4.1.106 --coord-port 7779 --ckpt-signal 16 --ckptdir /home/fmuelle --infiniband --batch-queue --explicit-srun /opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin/hydra_pmi_proxy --control-port c105:46791 --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id -1
fmuelle   7548  3.0  0.0 367644  6264 ?      Sl   14:56  0:00 /usr/bin/srun -N 3 -n 3 dmtcp_launch --coord-host 10.4.1.106 --coord-port 7779 --ckpt-signal 16 --ckptdir /home/fmuelle --infiniband --batch-queue --explicit-srun /opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin/hydra_pmi_proxy --control-port c105:46791 --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id -1
fmuelle   7560  0.0  0.0  29768   632 ?      S    14:56  0:00 /usr/bin/srun -N 3 -n 3 dmtcp_launch --coord-host 10.4.1.106 --coord-port 7779 --ckpt-signal 16 --ckptdir /home/fmuelle --infiniband --batch-queue --explicit-srun /opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin/hydra_pmi_proxy --control-port c105:46791 --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id -1
fmuelle   7576  4.0  0.0  70572  6884 ?      Sl   14:56  0:00 /opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin/hydra_pmi_proxy --control-port c105:46791 --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id -1
fmuelle   7589 69.0  0.0 123256 16592 ?      Rsl  14:56  0:00 ./mpi_hello


On c106/7, I see
fmuelle  23078  0.1  0.0 136176  8760 ?      Sl   14:56  0:00 /opt/ohpc/pub/mpi/mvapich2-gnu/2.2/bin/hydra_pmi_proxy --control-port c105:46791 --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id -1
fmuelle  23091  0.3  0.0 123272 23292 ?      Ssl  14:56  0:00 ./mpi_hello

So on the remote nodes, mpi_hello is not DMTCP-wrapped. Is that the issue?
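
(To double-check that, something along these lines should show which
processes are actually connected to the coordinator and whether the DMTCP
library is preloaded on the remote nodes; option names may differ slightly
by version.)

dmtcp_command --coord-host c105 --coord-port 7779 --list
ssh c106 'grep -l libdmtcp /proc/*/maps 2>/dev/null'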

2) using mpirun first:
mpirun /home/fmuelle/projects/dmtcp/dmtcp/bin/dmtcp_launch --ib
~/mpi_hello

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 14985 RUNNING AT c107
=   EXIT CODE: 99
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@c105] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
[proxy:0:0@c105] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@c105] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:1@c106] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
[proxy:0:1@c106] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@c106] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
srun: error: c106: task 1: Exited with exit code 7
srun: error: c105: task 0: Exited with exit code 7
[mpiexec@c105] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@c105] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@c105] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@c105] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion


Open MPI 1.10.4 w/ gcc 5.4.0 does not work either:
1) w/ IB:

dmtcp_launch --rm --ib mpirun ./mpi_hello
[40000] WARNING at socketconnection.cpp:231 in TcpConnection;
REASON='JWARNING(false) failed' type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node c106 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

2) w/ TCP over Ethernet:
dmtcp_launch --rm mpirun -mca btl self,tcp ./mpi_hello
[40000] WARNING at socketconnection.cpp:231 in TcpConnection;
REASON='JWARNING(false) failed' type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[c105:40000] [[17213,0],0]->[[17213,0],1] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 15]
[c105:40000] [[17213,0],0]->[[17213,0],1] mca_oob_tcp_peer_send_handler: unable to send header
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node c107 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
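
Side note: if debug logs would help, my understanding is that rebuilding
with DMTCP's debug logging is just:

./configure --enable-infiniband-support --enable-debug && make
# jassert log files should then show up under /tmp/dmtcp-$USER@$HOSTNAME/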

Any ideas?
-- 
 Frank Mueller
 Department of Computer Science
 North Carolina State University
 3266 EB2
 Raleigh, NC 27695-8206