By default, the coordinator writes the restart script into the current checkpoint directory. The coordinator does this because it is the only process in the system with an overall view of the distributed computation. The error message that you are seeing indicates that the coordinator was unable to write the restart script.
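The failure itself can be reproduced outside DMTCP. The sketch below is not DMTCP code; it only illustrates, under the assumption that the coordinator opens a relative path such as the `first.0/...` one in the log, why the open fails with "No such file or directory" until the directory exists. The helper `try_write_script` and the file name `restart_script.sh` are hypothetical names used for illustration.

```python
import errno
import os
import tempfile


def try_write_script(launch_dir, ckpt_dir, script_name="restart_script.sh"):
    """Try to create script_name inside ckpt_dir, resolved against launch_dir.

    Returns None on success, or the errno value if the open fails.
    (Hypothetical helper; it only mimics opening a file for writing.)
    """
    path = os.path.join(launch_dir, ckpt_dir, script_name)
    try:
        with open(path, "w") as fp:
            fp.write("#!/bin/sh\n")
    except OSError as exc:
        return exc.errno
    return None


if __name__ == "__main__":
    # A temporary directory stands in for the directory where the
    # coordinator was launched.
    with tempfile.TemporaryDirectory() as launch_dir:
        # The checkpoint directory does not exist yet, so the open fails.
        before = try_write_script(launch_dir, "first.0")
        print("before mkdir:", errno.errorcode.get(before, before))

        # Once the directory exists, the same open succeeds.
        os.mkdir(os.path.join(launch_dir, "first.0"))
        after = try_write_script(launch_dir, "first.0")
        print("after mkdir:", "ok" if after is None else after)
```

Opening a file for writing creates the file but never its parent directory, which is why the directory has to exist on the coordinator's node beforehand.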
It looks like some process (perhaps through a DMTCP API call?) changed the checkpoint directory to "first.0"; the coordinator log below reports this as REASON='Updated ckptDir'. Could you verify that the directory first.0 exists relative to the directory from which you launched the coordinator? My guess is that it does not, as the error message ("No such file or directory") indicates. If so, you need to specify a checkpoint directory that exists on the node where the coordinator is running. Alternatively, if you want the restart script to be placed in a particular directory, regardless of the location of the checkpoint directory, you can use the `dmtcp_set_coord_ckpt_dir()` API.

On Tue, May 03, 2016 at 09:01:07PM +0000, Ashutosh Varma wrote:
> Hi DMTCP Forum,
>
> What does this indicate? I don't know why file open failed in the checkpoint
> directory.
>
> %: dmtcp_coordinator
> dmtcp_coordinator starting...
> Host: vgzeburtdc5 (10.15.171.85)
> Port: 7779
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 0
> Type '?' for help.
>
> [24357] NOTE at dmtcp_coordinator.cpp:1199 in updateCheckpointInterval;
> REASON='CheckpointInterval updated (for this computation only)'
> oldInterval = 0
> theCheckpointInterval = 0
> [24357] NOTE at dmtcp_coordinator.cpp:818 in onConnect; REASON='worker
> connected'
> hello_remote.from = 74e5ff102309e6ad-24840-572910b1
> [24357] NOTE at dmtcp_coordinator.cpp:606 in onData; REASON='Updating process
> Information after exec()'
> progname = sim-elab
> msg.from = 74e5ff102309e6ad-40000-572910b2
> client->identity() = 74e5ff102309e6ad-24840-572910b1
> [24357] NOTE at dmtcp_coordinator.cpp:564 in onData; REASON='Updated ckptDir'
> ckptDir = first.0
> [24357] NOTE at dmtcp_coordinator.cpp:1030 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
> s.numPeers = 1
> [24357] NOTE at dmtcp_coordinator.cpp:1032 in startCheckpoint;
> REASON='Incremented computationGeneration'
> compId.computationGeneration() = 1
> [24357] NOTE at dmtcp_coordinator.cpp:392 in updateMinimumState;
> REASON='locking all nodes'
> [24357] NOTE at dmtcp_coordinator.cpp:398 in updateMinimumState;
> REASON='draining all nodes'
> [24357] NOTE at dmtcp_coordinator.cpp:404 in updateMinimumState;
> REASON='checkpointing all nodes'
> [24357] ERROR at restartscript.cpp:357 in writeScript; REASON='JASSERT(fp!=0)
> failed'
> (strerror((*__errno_location ()))) = No such file or directory
> uniqueFilename =
> first.0/dmtcp_restart_script_74e5ff102309e6ad-40000-572910b1.sh
> Message: failed to open file
> dmtcp_coordinator (24357): Terminating...
>
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum