Hello,
I am trying to get a simple hello world program running through
DMTCP with MPI on the SDSC XSEDE Gordon compute cluster.
DMTCP version: 3.0.0
Open MPI version: 1.6.5
Run script:
export DMTCP_COORD_HOST=$HOSTNAME
export DMTCP_COORD_PORT=7779
dmtcp_coordinator --daemon --exit-on-last
dmtcp_launch --rm --ib mpirun $SCRATCH_PATH/helloworld.py
I can run a coordinator in interactive mode on a node through
*dmtcp_coordinator* with no trouble and launch with the above script. This
checkpoints and restarts fine, but if I run with *--daemon* and
*--exit-on-last* I receive the following errors:
dmtcp_coordinator starting...
Host: gcn-4-25.sdsc.edu (198.202.100.150)
Port: 7779
Checkpoint Interval: 45
Exit on last client: 1
Backgrounding...
[40000] NOTE at socketconnlist.cpp:175 in scanForPreExisting; REASON='found
pre-existing socket... will not be restored'
fd = 12
device = pipe:[8516211]
[40000] WARNING at socketconnection.cpp:193 in TcpConnection;
REASON='JWARNING((domain == AF_INET || domain == AF_UNIX || domain ==
AF_INET6) && (type & 077) == SOCK_STREAM) failed'
domain = 0
type = 0
protocol = 0
[40000] NOTE at socketconnlist.cpp:175 in scanForPreExisting; REASON='found
pre-existing socket... will not be restored'
fd = 17
device = pipe:[8516213]
[40000] WARNING at socketconnection.cpp:193 in TcpConnection;
REASON='JWARNING((domain == AF_INET || domain == AF_UNIX || domain ==
AF_INET6) && (type & 077) == SOCK_STREAM) failed'
domain = 0
type = 0
protocol = 0
[40000] WARNING at socketconnection.cpp:188 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived
connection!
*****
along with the following output (these messages repeat for as many cores as
I am running on):
*****
python2.7 (65000): Terminating...
[67000] ERROR at connectionidentifier.h:96 in assertValid;
REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
sign =
Message: read invalid message, signature mismatch. (External socket?)
python2.7 (67000): Terminating...
[61000] ERROR at connectionidentifier.h:96 in assertValid;
REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
sign =
Message: read invalid message, signature mismatch. (External socket?)
python2.7 (61000): Terminating...
[46000] ERROR at connectionidentifier.h:96 in assertValid;
REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
sign =
Message: read invalid message, signature mismatch. (External socket?)
python2.7 (46000): Terminating...
****
Not sure if it is related, but make check gives the following error:
*bash: line 0: ulimit: virtual memory: cannot modify limit: Operation not
permitted*
Let me know if more info is needed. Sadly, I cannot provide a VM, and I do
not have root access or guest account privileges to offer.