Hi All,

I am trying to get DMTCP to work on a Cray. It works for serial programs, but it fails under MPI. Can you help me resolve the issues?

Thanks
Burlen

Here is the test program (app.py):


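#!/usr/bin/env python
# Each MPI rank prints "<rank>:<iteration>" forever, indented 4 spaces per rank.
from mpi4py import MPI
import sys
import time

rank = MPI.COMM_WORLD.Get_rank()

i = 0
while True:
    p = ' ' * rank * 4                  # per-rank indent
    sys.stderr.write('%s%d:%d\n' % (p, rank, i))
    time.sleep(0.5 * (rank + 1))        # higher ranks print less often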
    i += 1
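
To make the interleaved output easier to read, here is a minimal MPI-free sketch of the line format the test program produces. The helper name format_line is hypothetical and not part of app.py; it only reproduces the padding and the "rank:iteration" layout.

```python
import sys


def format_line(rank, i):
    # Same formatting as app.py: 4 spaces of indent per rank, then "rank:iteration".
    pad = ' ' * rank * 4
    return '%s%d:%d\n' % (pad, rank, i)


# Simulate the first line from each of two ranks, without MPI.
for rank in range(2):
    sys.stdout.write(format_line(rank, 0))
```

Lines from rank 0 start at the left margin and lines from rank 1 are indented by four spaces, so the two ranks can be told apart in the log.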


Here is the output. Neither Ctrl-C nor Ctrl-Z kills the process or allows recovery; I had to use the queue system to cancel the job.

nid00011:~/cprs$dmtcp_launch -i 5 srun -n 2 ./app.py
[40000] NOTE at processinfo.cpp:229 in growStack; REASON='bottom-most page of stack (page with highest address) was
  invisible in /proc/self/maps. It is made visible again now.'
[40000] WARNING at dlwrappers.cpp:76 in dlopen; REASON='JWARNING(ret) failed'
     filename =
     flag = 0
Message: dlopen failed!
You may also see a message 'ERROR: ld.so:'
 from libdl.so.
If this happens only under DMTCP, then consider setting the
environment variable 'DMTCP_DL_PLUGIN' to "0" before
'dmtcp_launch'.
If the problem persists, please write to the DMTCP developers.

[40000] WARNING at socketconnection.cpp:488 in recvPeerInformation; REASON='JWARNING(false) failed'
     _fds[0] = 5
Message: DMTCP detected an "external" connect socket.The socket will be restored as a dead socket.
0:0
    1:0
0:1
0:2
    1:1
0:3
    1:2
0:4
0:5
[40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval; REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 13
     buffer.size() = 595
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running under DMTCP? [40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval; REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 13
     buffer.size() = 595
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running under DMTCP?
.
.
.
Message: we don't yet support checkpointing non-accepted connections... restore will likely fail.. closing connection
srun: interrupt (one more within 1 sec to abort)
srun: step:10018256.1 tasks 0-1: running
slurmstepd: error: Message length of 2071343164 exceeds maximum of 1024
srun: forcing job termination
srun: got SIGCONT
srun: error: _server_write write failed: Broken pipe
[40000] WARNING at socketconnection.cpp:488 in recvPeerInformation; REASON='JWARNING(false) failed'
     _fds[0] = 15
Message: DMTCP detected an "external" connect socket.The socket will be restored as a dead socket.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    1:3
0:6
0:7
    1:4
0:8
0:9
    1:5
0:10
0:11
    1:6
0:12
0:13
    1:7
0:14
0:15
    1:8
0:16
0:17
    1:9
0:18
0:19
.
.
.
    1:894
0:1788
0:1789
    1:895
0:1790
0:1791
    1:896
0:1792
0:1793
    1:897
0:1794
0:1795
slurmstepd: error: *** STEP 10018256.1 ON nid00011 CANCELLED AT 2018-02-05T14:52:10 ***
srun: error: nid00011: task 0: Killed
srun: Terminating job step 10018256.0


When I restart (in a second job), it fails as follows:


nid00189:~/cprs$./dmtcp_restart_script.sh
[30613] ERROR at coordinatorapi.cpp:441 in startNewCoordinator; REASON='JASSERT(strcmp(host.c_str(), "localhost") == 0 || strcmp(host.c_str(), "127.0.0.1") == 0 || jalib::Filesystem::GetCurrentHostname() == host.c_str()) failed'
     host = nid00011
     jalib::Filesystem::GetCurrentHostname() = nid00189
Message: Won't automatically start coordinator because DMTCP_HOST is set to a remote host.
dmtcp_restart (30613): Terminating...
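
Going by the text of the failed assertion alone, the check seems to require that the coordinator host is "localhost", "127.0.0.1", or the current hostname. The sketch below only paraphrases that condition in Python, using the two hostnames from the error. The function name coordinator_is_local is hypothetical, and this is not DMTCP's actual code.

```python
def coordinator_is_local(host, current_hostname):
    # Paraphrase of the condition printed in the JASSERT message:
    # host is "localhost", "127.0.0.1", or equals the current hostname.
    return host in ("localhost", "127.0.0.1") or host == current_hostname


# Values taken from the error output: host = nid00011, current = nid00189.
print(coordinator_is_local("nid00011", "nid00189"))
```

With the first job's node (nid00011) recorded as the host and the restart running on nid00189, the condition is false, which matches the "remote host" message.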


_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum