Hi,
I have a shell script that launches a program under DMTCP. On the first run it uses dmtcp_launch; on later runs it uses dmtcp_restart. It lets the program run for about 3 minutes, then uses dmtcp_command to checkpoint it, terminates it with the dmtcp_command quit option, and runs itself again. The purpose of the script is to try converting one long program run into a sequence of short runs. The source code and the script are attached for your reference.

The problem is this: if the program can complete within one or two restarts, I get good results. If it needs more time, then the third time dmtcp_command -c is invoked, the running program crashes with a segmentation fault, and the checkpoint produces only a file named like the restart image ckpt_*.dmtcp but with the extension ".temp". As a result, the script cannot continue. I am puzzled why this happens at the third checkpoint and not the second, since the commands used are exactly the same. I also tried it manually with two screens, and it fails in the same way.

The error message I get is the following:

[23043] ERROR at dmtcpmessagetypes.cpp:56 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
     _magicBits =
Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly?
dmtcp_command (23043): Terminating...
/var/lib/slurmd/job202408/slurm_script: line 121: 22777 Segmentation fault dmtcp_restart -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT ckpt_*.dmtcp > num-16.even

We are using the following version on CentOS 7:

$ dmtcp_command --version
dmtcp_command (DMTCP) 2.5.2
License LGPLv3+: GNU LGPL version 3 or later <http://gnu.org/licenses/lgpl.html>.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it under certain conditions; see COPYING file for details.

Please let me know if you need any more information.

Thank you in advance for your help.

Best,
Xiaoge
README
#!/bin/bash -login
# current working directory should have source code dmtcp1.c

# script name. This script is to be resubmitted multiple times
export JOBSCRIPT="manual.sh"

# start dmtcp_coordinator
dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file port $@ 1>/dev/null 2>&1   # start coordinator
h=`hostname`   # get host name
p=`cat port`
export DMTCP_COORD_HOST=$h
export DMTCP_COORD_PORT=$p

# print out some information
#echo "coordinator is on host $DMTCP_COORD_HOST "
#echo "port number is $DMTCP_COORD_PORT "
#echo " working directory: ${SLURM_SUBMIT_DIR} "
#echo " job script is $SLURM_JOBSCRIPT "

####################### BODY of the JOB ######################
# prepare work environment of the job

# build the program if it does not exist
if [ ! -f count.exe ]
then
    cc count.c -o count.exe
fi

# run the program count.exe.
# To run interactively:
# $ ./count.exe n num.odd 1> num.even
# it will count to number n and generate 2 files:
# num.odd contains all the odd numbers;
# num.even contains all the even numbers.
# To run with DMTCP, use dmtcp commands.
# if first time launch, use "dmtcp_launch"
# otherwise use "dmtcp_restart"

# set checkpoint interval. After dmtcp_launch, this script waits
# for the interval (in seconds), then starts the checkpoint.
export CKPT_WAIT_SEC=$(( 3 * 60 ))

# Launch or restart the execution
if [ ! -f ckpt_*.dmtcp ]   # no ckpt file exists, use dmtcp_launch
then
    # first time run, use dmtcp_launch for the job
    echo " call dmtcp_launch "
    dmtcp_launch -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --rm --ckpt-open-files ./count.exe 1200 num.odd 1> num.even &

    # wait for a checkpoint interval (in seconds) before checkpointing
    sleep $CKPT_WAIT_SEC

    # start checkpointing
    # echo " start dmtcp checkpointing"
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files --bcheckpoint
    # echo " finish dmtcp checkpointing"

    # kill the running job after checkpointing
    # echo " terminate job after checkpoint "
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit
    # echo " terminate job after checkpoint "

    # resubmit the job
    echo "resubmit $JOBSCRIPT "
    ./$JOBSCRIPT
else
    # restart job with checkpoint files
    echo " call dmtcp_restart "
    dmtcp_restart -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT ckpt_*.dmtcp 1> num.even &
    # echo " restarted "

    # wait for a checkpoint interval before checkpointing
    sleep $CKPT_WAIT_SEC

    # clean up the old image
    rm -r ckpt_*.dmtcp ckpt_*_files

    # if program is running, do the checkpoint and resubmit
    if dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT -s 1>/dev/null 2>&1
    then
        # echo " start checkpointing again "
        # clean up old ckpt files before starting new ckpt
        dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files -bc
        # echo " finish checkpointing again "

        # kill the running program
        dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit

        # resubmit this script to slurm
        echo " resumit $JOBSCRIPT "
        ./$JOBSCRIPT
    else
        echo "job finished"
    fi
fi
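
The launch-or-restart test in the script above can also be written so that it does not depend on how many files match ckpt_*.dmtcp, and so that a leftover ".temp" file from an interrupted checkpoint is reported instead of silently ignored. The following is only a minimal sketch under stated assumptions: the function names choose_mode and list_partial are hypothetical helpers, not part of DMTCP, and the pattern ckpt_*.dmtcp.temp is assumed from the file name described in the report.

```shell
#!/bin/bash
# Hypothetical helpers (not part of DMTCP).
# Assumption: complete images match ckpt_*.dmtcp, and an interrupted
# checkpoint leaves a file matching ckpt_*.dmtcp.temp.

# Print "restart" if at least one complete image exists in the
# directory given as $1 (default: current directory), else "launch".
choose_mode() {
    local dir="${1:-.}" f count=0
    for f in "$dir"/ckpt_*.dmtcp; do
        # With no match the glob stays literal, so check the file exists.
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 0 ]; then
        echo "launch"
    else
        echo "restart"
    fi
}

# Print the path of every partial image found in the directory.
list_partial() {
    local dir="${1:-.}" f
    for f in "$dir"/ckpt_*.dmtcp.temp; do
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}
```

In the script this could replace the `[ ! -f ckpt_*.dmtcp ]` test, for example with `if [ "$(choose_mode .)" = "launch" ]`, and `list_partial .` could be called before resubmitting to stop when a checkpoint did not complete. It does not address the segmentation fault itself.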
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
    if (argc <= 1) {
        printf("not enough arguments.\n");
        printf("Usage: ./dmtcp1 n filename \n");
        exit(1);
    }

    FILE *ofp = NULL;
    int n = atoi(argv[1]);

    /* Output file name: default "odd.out", or the second argument if given */
    const char *fname = (argc == 2) ? "odd.out" : argv[2];

    /* fprintf(ofp,"\ncmdline args count=%d", argc); */
    /* First argument is executable name only */
    /* fprintf(ofp, "\nexe name=%s\n", argv[0]); */
    /* Second argument is a output filename */
    /* fprintf(ofp,"\nfilename=%s\n", argv[1]); */

    /* Open file as writable */
    ofp = fopen(fname, "w");
    if (ofp == NULL) {
        printf("Can't open output file %s!\n", fname);
        exit(1);
    }

    /* Each pass writes one odd number to the file and the following
       even number to stdout, then sleeps for one second. */
    int count = 1;
    while (count <= n) {
        fprintf(ofp, " %2d\n ", count++);
        printf(" %2d\n ", count++);
        sleep(1);
    }

    fclose(ofp);
    return 0;
}
longjob.sb
shortjob.sb
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum