Hi William,

Thanks for reporting this. I don't have access to the Gordon Compute Cluster (through XSEDE), but I can try to reproduce this bug locally if you can share your Python program (or a simplified version of it). Meanwhile, here are a few things you can try to help isolate it:
- Can you try to reproduce this with DMTCP-2.5?
- Can you try to run the coordinator on the login node with `--daemon`
  and `--exit-on-last`? When you submit your job, you can force the
  client(s) to connect to the coordinator that's running on the login
  node by specifying something along the lines of:

      dmtcp_launch -h login-node -p 7779 helloworld.py

  as the command to execute. (Assuming the coordinator is listening on
  the default port, 7779).

Let me know how it goes.

Best,
Rohan

On Wed, Apr 13, 2016 at 03:13:14PM -0700, William Fox wrote:
> Hello,
>
> I am working on getting a simple hello world program running on through
> DMTCP with MPI on the sdsc xsede gordon compute cluster.
>
> DMTCP version: 3.0.0
> Open MPI version: 1.6.5
>
> Run script:
>
> export DMTCP_COORD_HOST=$HOSTNAME
> export DMTCP_COORD_PORT=7779
>
> dmtcp_coordinator --daemon --exit-on-last
>
> dmtcp_launch --rm --ib mpirun $SCRATCH_PATH/helloworld.py
>
> I can run a coordinator in interactive mode on a node through
> *dmtcp_coordinator* with not trouble and launch with the above script.
> This checkpoints and restarts fine, but if I run with *--daemon* and
> *--exit-on-last* I receive the following errors:
>
> dmtcp_coordinator starting...
> Host: gcn-4-25.sdsc.edu (198.202.100.150)
> Port: 7779
> Checkpoint Interval: 45
> Exit on last client: 1
> Backgrounding...
> [40000] NOTE at socketconnlist.cpp:175 in scanForPreExisting;
> REASON='found pre-existing socket... will not be restored'
> fd = 12
> device = pipe:[8516211]
> [40000] WARNING at socketconnection.cpp:193 in TcpConnection;
> REASON='JWARNING((domain == AF_INET || domain == AF_UNIX || domain ==
> AF_INET6) && (type & 077) == SOCK_STREAM) failed'
> domain = 0
> type = 0
> protocol = 0
> [40000] NOTE at socketconnlist.cpp:175 in scanForPreExisting;
> REASON='found pre-existing socket... will not be restored'
> fd = 17
> device = pipe:[8516213]
> [40000] WARNING at socketconnection.cpp:193 in TcpConnection;
> REASON='JWARNING((domain == AF_INET || domain == AF_UNIX || domain ==
> AF_INET6) && (type & 077) == SOCK_STREAM) failed'
> domain = 0
> type = 0
> protocol = 0
> [40000] WARNING at socketconnection.cpp:188 in TcpConnection;
> REASON='JWARNING(false) failed'
> type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived
> connection!
>
> *****
> along with the following as output (they repeat for as many cores as I am
> running on.
> *****
> python2.7 (65000): Terminating...
> [67000] ERROR at connectionidentifier.h:96 in assertValid;
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> sign =
> Message: read invalid message, signature mismatch. (External socket?)
> python2.7 (67000): Terminating...
> [61000] ERROR at connectionidentifier.h:96 in assertValid;
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> sign =
> Message: read invalid message, signature mismatch. (External socket?)
> python2.7 (61000): Terminating...
> [46000] ERROR at connectionidentifier.h:96 in assertValid;
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> sign =
> Message: read invalid message, signature mismatch. (External socket?)
> python2.7 (46000): Terminating...
>
> ****
> Not sure if related, but make check give the following error:
>
> *bash: line 0: ulimit: virtual memory: cannot modify limit: Operation not
> permitted*
> Let me know if more info is needed. Sadly, I cannot provide a VM and do
> not have root access, or guest account privileges to provide.
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
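
In case it helps, the second suggestion above (coordinator on the login node, clients pointed at it with `-h` and `-p`) could be sketched roughly as follows. This is only a dry run that prints the client command instead of executing it. `login-node` is a placeholder for the real login node's hostname, and `build_launch_cmd`, `COORD_HOST`, and `COORD_PORT` are names made up for this illustration, not DMTCP settings.

```shell
#!/bin/sh
# Sketch only. COORD_HOST and COORD_PORT are hypothetical helper variables
# for this script; "login-node" is a placeholder hostname.
COORD_HOST="${COORD_HOST:-login-node}"
COORD_PORT="${COORD_PORT:-7779}"   # 7779 is the default coordinator port

# Hypothetical helper: compose the client command in the form
#   dmtcp_launch -h HOST -p PORT PROGRAM
build_launch_cmd() {
    printf 'dmtcp_launch -h %s -p %s %s\n' "$1" "$2" "$3"
}

# Step 1 (run by hand on the login node, before submitting the job):
#   dmtcp_coordinator --daemon --exit-on-last
#
# Step 2 (inside the job script): print the client command so it can be
# checked before it is actually used as the command to execute.
build_launch_cmd "$COORD_HOST" "$COORD_PORT" helloworld.py
```

Once the printed command looks right for your site, it can replace the `dmtcp_launch` line in the job script. Whether the compute nodes can reach the login node on that port is something only a test on the cluster will tell.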