Hi William,

Please find my comments inline.

On Wed, Apr 27, 2016 at 11:49:19AM -0700, William Fox wrote:
> Gave it a shot and it progresses past the initialization steps, but breaks at
> the first checkpoint (before, it would quit after less than 20 seconds). I

Sorry, I don't understand this part. I thought the error you were seeing
earlier, "Unable to create directory ...", was occurring at checkpoint
time. Over here you seem to be implying that the error was at launch time.

> currently simply have a python script changing the host names in the
> restart script dmtcp uses (dmtcp_restart_script.sh) and that appears to be
> enough to get successful restarts. I changed the node/hosts at about line
> 240 of the restart script that starts with worker_ckpts='
>
> Thanks for getting back to me. I'd love to help you debug this at your
> convenience, but will be using the above script until we can sort these
> errors out or if the checkpoint/restarts are error prone due to my python
> script's incompleteness.
>
> The error with the above header change that you suggested results in the
> following:

Sorry, I don't understand this part either. Are you saying that you
can't checkpoint-restart with the change (STAMPEDE_LUSTRE_FIX) I suggested
previously? If I'm reading it correctly, in the first paragraph you mention
that you can get successful restarts after changing the host names in the
DMTCP restart script.

>
> [40000] ERROR at connectionmessage.h:63 in assertValid;
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> sign =
> Message: read invalid message, signature mismatch. (External socket?)
> python2.7 (40000): Terminating...

This indicates that there's an "external" socket that DMTCP is trying to
drain at checkpoint time. (More background below.) It's possible that the
socket is inherited from a parent process (some resource/process manager)
that forgot to set the close-on-exec flag.
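
One way to look for such an inherited socket is to list the entries under
/proc/<pid>/fd and keep the ones whose link target starts with "socket:".
Here is a minimal sketch of that, assuming Linux and assuming it is run
inside the same job environment that launches your application. The
function name `list_socket_fds` is made up for this illustration and is
not part of DMTCP:

```python
import os


def list_socket_fds(pid="self"):
    """Return sorted (fd, link target) pairs for the sockets held by `pid`.

    Hypothetical helper for illustration only. It relies on the Linux
    /proc filesystem, where each entry in /proc/<pid>/fd is a symlink and
    sockets show up with targets of the form "socket:[inode]".
    """
    fd_dir = "/proc/%s/fd" % pid
    sockets = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink(),
            # e.g. the one listdir() itself used to read the directory.
            continue
        if target.startswith("socket:"):
            sockets.append((int(name), target))
    return sorted(sockets)


if __name__ == "__main__":
    # Print every socket descriptor the current process holds.
    for fd, target in list_socket_fds():
        print("%d %s" % (fd, target))
```

If a trivial script that never opens a socket itself still reports one
here, that descriptor was most likely inherited from whatever launched it.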

Could you verify this by looking at the listing of file descriptors in
/proc/<pid>/fd for a simple Python program on your system? If we can
identify that a socket exists (which the simple user program never
intentionally opened), we can ask DMTCP to "blacklist" it to avoid
draining it at checkpoint time.

There was a similar issue with Python on a different cluster: the Python
process ended up inheriting some file descriptors from `slurmstepd`, which
forgot to set the close-on-exec flag for a file descriptor it had opened
prior to exec-ing the Python interpreter. See the thread "Restart does
not work: /dev/ipmi0 Permission denied"
(https://sourceforge.net/p/dmtcp/mailman/message/34899213/) on the DMTCP
forum for more details.

Background: For checkpointing, DMTCP classifies each socket into one of
two categories: external or internal. A socket is internal when both of
its endpoints are running under DMTCP. In this case, DMTCP will, at
checkpoint time, first quiesce the two processes, and then capture the
in-flight data by "draining" the socket from both ends. On restart, DMTCP
will restore the socket and put the captured data back on the network.

The difficult case is when only one end, for example, the client in a
client-server application, is running under DMTCP. In this case, the
socket that the client uses to talk to the server needs to be marked as
"external", implying that DMTCP will not try to drain the socket at
checkpoint time. (There is a heuristic we use to detect an external socket,
but it does not always work.) On restart, the socket is presented as a
dead socket to the client, and it's the client's responsibility to either
ignore this dead socket if it's unimportant or recover by creating a new
socket if it's important.

>
>
> On Tue, Apr 26, 2016 at 6:51 PM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
>
> > Hi William,
> >
> > Does the directory specified in the error message (`savedFilePath`)
> > exist?
> > If yes, then could you please retry after defining the
> > `STAMPEDE_LUSTRE_FIX` macro in src/util_misc.cpp?
> > Add `#define STAMPEDE_LUSTRE_FIX 1` at the top of the file (perhaps
> > after all the includes). We saw a related issue on Stampede; I wonder
> > if what you are seeing is the same thing.
> >
> > Thanks,
> > Rohan
> >
> > On Tue, Apr 26, 2016 at 11:34:41AM -0700, William Fox wrote:
> > > I am running dmtcp on an HPC system and am hoping to run a distributed
> > > application across several nodes and checkpoint/restart it.
> > >
> > > When I run with the --rm plugin in the torque environment the application
> > > runs into an error (I believe directly after dmtcp_launch --rm):
> > > [40000] ERROR at fileconnection.cpp:619 in preCkpt;
> > > REASON='JASSERT(Util::createDirectoryTree(savedFilePath)) failed'
> > > savedFilePath = /oasis/scratch/<checkpoint_directory>
> > > Message: Unable to create directory in File Path
> > > python2.7 (40000): Terminating...
> > >
> > > If I run without --rm then the host names are not adapted on a restart
> > > and the application fails.
> > >
> > > Running a helloworld counting program that says the host files every 60
> > > seconds via openmpi runs smoothly with --rm and --infiniband plugins on
> > > the same system. I have tried to track down similar errors in the forums
> > > but failed to find instances of distributed systems within the HPC
> > > environment. The checkpoints occur in the same directory for both the
> > > openmpi and distributed application so it is not the folder permissions.
> > >
> > > Any ideas?
> > >
> >
> > > ------------------------------------------------------------------------------
> > > Find and fix application performance issues faster with Applications Manager
> > > Applications Manager provides deep performance insights into multiple tiers of
> > > your business applications. It resolves application problems quickly and
> > > reduces your MTTR. Get your free trial!
> > > https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
> >
> > > _______________________________________________
> > > Dmtcp-forum mailing list
> > > Dmtcp-forum@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum