Hi all,
I am trying to get DMTCP to work on a Cray. It works in serial, but it
does not work under MPI. Can you help me resolve the issues?
Thanks,
Burlen
Here is the test program (app.py):
#!/usr/bin/env python
from mpi4py import MPI
import sys
import time

rank = MPI.COMM_WORLD.Get_rank()
i = 0
while True:
    # indent each rank's output by 4 spaces per rank
    p = ' ' * rank * 4
    sys.stderr.write('%s%d:%d\n' % (p, rank, i))
    # rank r waits 0.5*(r+1) seconds between lines
    time.sleep(0.5 * (rank + 1))
    i += 1
Here is the output. Neither Ctrl-C nor Ctrl-Z kills the process or
allows recovery; I had to use the queue system to cancel the job.
nid00011:~/cprs$dmtcp_launch -i 5 srun -n 2 ./app.py
[40000] NOTE at processinfo.cpp:229 in growStack; REASON='bottom-most
page of stack (page with highest address) was
invisible in /proc/self/maps. It is made visible again now.'
[40000] WARNING at dlwrappers.cpp:76 in dlopen; REASON='JWARNING(ret)
failed'
filename =
flag = 0
Message: dlopen failed!
You may also see a message 'ERROR: ld.so:'
from libdl.so.
If this happens only under DMTCP, then consider setting the
environment variable 'DMTCP_DL_PLUGIN' to "0" before
'dmtcp_launch'.
If the problem persists, please write to the DMTCP developers.
[40000] WARNING at socketconnection.cpp:488 in recvPeerInformation;
REASON='JWARNING(false) failed'
_fds[0] = 5
Message: DMTCP detected an "external" connect socket.The socket will be
restored as a dead socket.
0:0
1:0
0:1
0:2
1:1
0:3
1:2
0:4
0:5
[40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval;
REASON='JWARNING(false) failed'
_dataSockets[i]->socket().sockfd() = 13
buffer.size() = 595
WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running
under DMTCP?
[40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval;
REASON='JWARNING(false) failed'
_dataSockets[i]->socket().sockfd() = 13
buffer.size() = 595
WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running
under DMTCP?
.
.
.
Message: we don't yet support checkpointing non-accepted connections...
restore will likely fail.. closing connection
srun: interrupt (one more within 1 sec to abort)
srun: step:10018256.1 tasks 0-1: running
slurmstepd: error: Message length of 2071343164 exceeds maximum of 1024
srun: forcing job termination
srun: got SIGCONT
srun: error: _server_write write failed: Broken pipe
[40000] WARNING at socketconnection.cpp:488 in recvPeerInformation;
REASON='JWARNING(false) failed'
_fds[0] = 15
Message: DMTCP detected an "external" connect socket.The socket will be
restored as a dead socket.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
1:3
0:6
0:7
1:4
0:8
0:9
1:5
0:10
0:11
1:6
0:12
0:13
1:7
0:14
0:15
1:8
0:16
0:17
1:9
0:18
0:19
.
.
.
1:894
0:1788
0:1789
1:895
0:1790
0:1791
1:896
0:1792
0:1793
1:897
0:1794
0:1795
slurmstepd: error: *** STEP 10018256.1 ON nid00011 CANCELLED AT
2018-02-05T14:52:10 ***
srun: error: nid00011: task 0: Killed
srun: Terminating job step 10018256.0
When I restart (in a second job), it fails as follows:
nid00189:~/cprs$./dmtcp_restart_script.sh
[30613] ERROR at coordinatorapi.cpp:441 in startNewCoordinator;
REASON='JASSERT(strcmp(host.c_str(), "localhost") == 0 ||
strcmp(host.c_str(), "127.0.0.1") == 0 ||
jalib::Filesystem::GetCurrentHostname() == host.c_str()) failed'
host = nid00011
jalib::Filesystem::GetCurrentHostname() = nid00189
Message: Won't automatically start coordinator because DMTCP_HOST is set
to a remote host.
dmtcp_restart (30613): Terminating...
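
To make the failed assertion easier to read, here is a minimal Python sketch of the condition it appears to test: the coordinator is only started automatically when the recorded host is "localhost", "127.0.0.1", or the current hostname. This is my restatement of the JASSERT text above, not DMTCP's code; the function name can_autostart_coordinator and its parameters are hypothetical.

```python
import socket


def can_autostart_coordinator(host, current_hostname=None):
    """Restate the condition from the failed JASSERT (illustration only).

    host: the coordinator host recorded at checkpoint time.
    current_hostname: the node running the restart; defaults to this machine.
    """
    if current_hostname is None:
        current_hostname = socket.gethostname()
    return host in ("localhost", "127.0.0.1") or host == current_hostname


# Values taken from the error output above.
print(can_autostart_coordinator("nid00011", "nid00189"))   # False
print(can_autostart_coordinator("localhost", "nid00189"))  # True
```

With the values from the log, the recorded host (nid00011) matches neither a loopback name nor the node the second job landed on (nid00189), which is consistent with the restart refusing to start a coordinator.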