Hi,

I have a shell script that launches a program under DMTCP. On the first run 
it uses dmtcp_launch; on every later run it uses dmtcp_restart. It lets the 
program run for about 3 minutes, checkpoints it with dmtcp_command, terminates 
it with the dmtcp_command quit option, and then runs itself again. The purpose 
of the script is to try converting one long program run into a sequence of 
short runs. The source code and the script are attached for your reference.
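
To make the launch-or-restart decision described above concrete, here is a minimal sketch. The function name choose_mode and its directory argument are hypothetical and do not appear in the attached script; the only assumption taken from the script is that checkpoint images are named ckpt_*.dmtcp. The loop is meant to behave sensibly whether zero, one, or several images exist.

```shell
#!/bin/bash
# Hypothetical helper, not taken from the attached script.
# Prints "restart" if at least one ckpt_*.dmtcp image exists in the
# given directory (default: current directory), otherwise "launch".
choose_mode() {
    cm_dir="${1:-.}"
    for cm_img in "$cm_dir"/ckpt_*.dmtcp; do
        # When nothing matches, the pattern stays unexpanded and the
        # -f test fails, so the loop falls through to "launch".
        if [ -f "$cm_img" ]; then
            echo "restart"
            return 0
        fi
    done
    echo "launch"
}
```

A caller could then branch on the printed word, for example with `if [ "$(choose_mode .)" = "launch" ]`, before invoking dmtcp_launch or dmtcp_restart.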


The problem is this: if the program can complete within one or two restarts, 
I get correct results. If it needs more time, then the third time 
dmtcp_command -c is invoked, the running program crashes with a segmentation 
fault, and the checkpoint produces only a file named like the restart image 
ckpt_*.dmtcp but with the extension ".temp". As a result, the script cannot 
continue. I am puzzled as to why this happens at the third checkpoint and not 
the second, since the command used is exactly the same. I also tried it 
manually with two screens, and it fails in the same way. The error message I 
got is the following:


[23043] ERROR at dmtcpmessagetypes.cpp:56 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
     _magicBits =
Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator die uncleanly?
dmtcp_command (23043): Terminating...
/var/lib/slurmd/job202408/slurm_script: line 121: 22777 Segmentation fault      dmtcp_restart -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT ckpt_*.dmtcp > num-16.even
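
Because the failed checkpoint leaves only a ".temp" file behind, the script could check for a finished image before resubmitting itself. The following is a minimal sketch of such a guard. The function name checkpoint_complete is hypothetical, and the assumption that an interrupted checkpoint always leaves a file matching ckpt_*.dmtcp.temp is inferred from the behavior reported here, not from the DMTCP documentation.

```shell
#!/bin/bash
# Hypothetical guard, not part of the attached script.
# Returns 0 only if the directory (default: current directory) holds at
# least one ckpt_*.dmtcp image and no leftover ckpt_*.dmtcp.temp file.
# Assumption: an unfinished checkpoint leaves a ".temp" file behind.
checkpoint_complete() {
    cc_dir="${1:-.}"
    for cc_f in "$cc_dir"/ckpt_*.dmtcp.temp; do
        # A leftover .temp file means the last checkpoint did not finish.
        if [ -e "$cc_f" ]; then
            return 1
        fi
    done
    for cc_f in "$cc_dir"/ckpt_*.dmtcp; do
        # At least one finished image is enough to restart from.
        if [ -f "$cc_f" ]; then
            return 0
        fi
    done
    return 1
}
```

With this in place, the resubmission step could be wrapped as `if checkpoint_complete .; then ./$JOBSCRIPT; fi`, so that a broken checkpoint stops the chain instead of feeding a bad image to dmtcp_restart.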


We are using the following version:

$ dmtcp_command --version
dmtcp_command (DMTCP) 2.5.2
License LGPLv3+: GNU LGPL version 3 or later
    <http://gnu.org/licenses/lgpl.html>.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.


on CentOS7.


Please let me know if you need any more information.

Thank you in advance for your help.


Best,

Xiaoge

Attachment: README
Description: README

#!/bin/bash -login

# current working directory should have source code dmtcp1.c

# script name. This script is to be resubmitted multiple times
export JOBSCRIPT="manual.sh"

# start dmtcp_coordinator
dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file port "$@" 1>/dev/null 2>&1   # start coordinator
h=$(hostname)                                                                           # get host name
p=$(cat port)                                                                           # get coordinator port
export DMTCP_COORD_HOST=$h
export DMTCP_COORD_PORT=$p

# print out some information
#echo "coordinator is on host $DMTCP_COORD_HOST "
#echo "port number is $DMTCP_COORD_PORT "
#echo " working directory: ${SLURM_SUBMIT_DIR} "
#echo " job script is $SLURM_JOBSCRIPT "

####################### BODY of the JOB ######################
# prepare work environment of the job

# build the program if not exist
if [ ! -f count.exe ] 
then
    cc count.c -o count.exe
fi

# run the program count.exe. 
# To run interactively: 
# $ ./count.exe n num.odd 1> num.even 
# it will count to number n and generate 2 files: 
# num.odd contains all the odd number;
# num.even contains all the even number.

# To run with DMTCP, use dmtcp commands.
# if first time launch, use "dmtcp_launch"
# otherwise use "dmtcp_restart"

# set checkpoint interval. After starting the job, this script waits
# for the interval (in seconds), then starts the checkpoint. 
export CKPT_WAIT_SEC=$(( 3 * 60 ))

# Launch or restart the execution
if ! ls ckpt_*.dmtcp >/dev/null 2>&1   # no ckpt file exists, use dmtcp_launch
then
  # first run: launch the job with dmtcp_launch
  echo " call dmtcp_launch "
  dmtcp_launch -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --rm --ckpt-open-files ./count.exe 1200 num.odd 1> num.even &

  # wait for one checkpoint interval (in seconds) before checkpointing
  sleep $CKPT_WAIT_SEC

  # start checkpointing
  # echo " start dmtcp checkpointing"
  dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files --bcheckpoint
  # echo " finish dmtcp checkpointing"

  # kill the running job after checkpointing
  # echo " terminate job after checkpoint "
  dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit
  # echo " terminate job after checkpoint "

  # resubmit the job
  echo "resubmit $JOBSCRIPT "
  ./$JOBSCRIPT

else
  # restart job with checkpoint files
  echo " call dmtcp_restart "
  dmtcp_restart -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT ckpt_*.dmtcp 1> num.even &
  # echo " restarted "

  # wait for a checkpoint interval to start checkpointing
  sleep $CKPT_WAIT_SEC
  # clean up the old image
  rm -r ckpt_*.dmtcp ckpt_*_files

  # if program is running, do the checkpoint and resubmit
  if dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT -s 1>/dev/null 2>&1
  then	 
    # echo " start checkpointing again "
    # take a blocking checkpoint (old ckpt files were removed above)
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files -bc
    # echo " finish checkpointing again "
    # kill the running program
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit
    # resubmit this script to slurm
    echo " resubmit $JOBSCRIPT "
    ./$JOBSCRIPT
  else
    echo "job finished"
  fi
fi

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
  if (argc <= 1) {
    printf("not enough arguments.\n");
    printf("Usage: ./dmtcp1 n filename \n");
    exit(1);
  }

  int n = atoi(argv[1]);

  /* Output file name: second argument if given, otherwise "odd.out" */
  const char *fname = (argc == 2) ? "odd.out" : argv[2];

  /* Open file as writable */
  FILE *ofp = fopen(fname, "w");

  /* fprintf(ofp,"\ncmdline args count=%d", argc); */

  /* First argument is executable name only */
  /* fprintf(ofp, "\nexe name=%s\n", argv[0]); */

  /* Second argument is a output filename */
  /* fprintf(ofp,"\nfilename=%s\n", argv[1]); */

  if (ofp == NULL) {
    /* report the file name that failed to open, not the count argument */
    printf("Can't open output file %s!\n", fname);
    exit(1);
  }

  int count = 1;

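  /* each pass writes one odd value to the file and the next even value to stdout */
  while (count<=n) 
  {
          fprintf(ofp," %2d\n ",count++);   /* odd number -> output file */
          printf(" %2d\n ",count++);        /* even number -> stdout */
          sleep(1);
  }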
  fclose(ofp); 
  return 0;
}

Attachment: longjob.sb
Description: longjob.sb

Attachment: shortjob.sb
Description: shortjob.sb

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
