Hello, Happy Victoria Day from Canada!
I am part of a team working on a few clusters with Compute Canada, and I am trying to write a template for SLURM jobs with DMTCP checkpointing. Everything seems to be going well except for two issues that I can't find covered in the FAQ section or the Compute Canada documentation.

1. ./dmtcp_restart_script.sh is not generated: As the FAQ section mentions, DMTCP should generate a dmtcp_restart_script.sh, which is better for safety and housekeeping. Unfortunately, I can't find the correct options to make it generate that file. I have tried passing --new-coordinator, but it still doesn't work. Additionally, DMTCP only generates a ckpt_...._.dmtcp.temp file instead of a .dmtcp file.

2. Segmentation fault (core dumped): I am not sure whether this is related to the first issue, but when my Python script times out (on purpose), the program raises a segmentation fault.

I have attached my shell script to this email, along with some output from the SLURM jobs I have submitted. Please let me know if there is anything that I am missing.

Best,
David
#!/bin/bash
#SBATCH --account=def-aghuang
#SBATCH --cpus-per-task=1
#SBATCH --time=00:04:00
#SBATCH --mem-per-cpu=2056M
#SBATCH --job-name=lets_try_resub
#SBATCH --output=%x-%j.out

### Script Control------------------------------------------------------------------------------------------------------
# Specifies the maximal amount this job can be resubmitted. Avoid using a huge number.
job_resubmission_limit=10
# Specifies how often DMTCP writes a checkpoint file. The number is in seconds.
check_point_interval=60
# Specifies the path from the folder shell script is in to the python script.
python_file_location="./python_script_template.py"
### End of Script Control-----------------------------------------------------------------------------------------------

### Create a local virtual environment and load all necessary packages
module load python/3.7
module load scipy-stack
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
# pip install --upgrade pip
# pip install --no-index -r requirements.txt

echo "Current working directory: `pwd`"
echo "Starting run at: `date`"

### If this job didn't finish, then restart the job using DMTCP script, otherwise start the job with DMTCP.
if test -e "num_times_script_has_resumed.tmp"; then
    echo "Resuming a previous run"
    ./dmtcp_restart_script.sh -h $(hostname)
else
    echo "Starting a new run"
    echo 0 > num_times_script_has_resumed.tmp
    dmtcp_launch --rm -i ${check_point_interval} python ${python_file_location}
fi

### Resubmit if not all work has been completed yet and we haven't hit the job resubmission limit
script_resumed_times=$(($(< num_times_script_has_resumed.tmp)+1))
if test -e "unique_flag_script_running.tmp"; then
    if (($script_resumed_times <= $job_resubmission_limit)); then
        echo "Resubmitting Job Attempt #$script_resumed_times"
        echo ${script_resumed_times} > num_times_script_has_resumed.tmp
        sbatch ${BASH_SOURCE[0]}
    else
        echo "FAILED: Job Resubmission Limit Reached!
Work Incomplete"
    fi
else
    echo "Work Completed after $script_resumed_times Resubmissions"
    rm num_times_script_has_resumed.tmp
fi

echo "Job finished with exit code $? at: `date`"
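
One way to tell the two situations apart before resubmitting is to look at which files DMTCP left behind. The following is only a minimal sketch, not something taken from the DMTCP documentation: checkpoint_state is a hypothetical helper name, it assumes the checkpoint files land in the directory passed as its argument, and it assumes the ckpt_*.dmtcp / ckpt_*.dmtcp.temp naming described above, with a leftover .temp file taken to mean the checkpoint was never finalized.

```shell
#!/bin/bash
# Hypothetical helper (not part of DMTCP): report what a checkpoint
# directory contains. Prints one of: complete, partial, none.
checkpoint_state() {
    local dir="${1:-.}"
    local f
    # Treat the checkpoint as usable only if the restart script exists
    # together with at least one finished ckpt_*.dmtcp image.
    if [ -e "$dir/dmtcp_restart_script.sh" ]; then
        for f in "$dir"/ckpt_*.dmtcp; do
            if [ -e "$f" ]; then
                echo "complete"
                return 0
            fi
        done
    fi
    # Only ckpt_*.dmtcp.temp images are present (assumed here to mean
    # that a checkpoint was started but not finalized).
    for f in "$dir"/ckpt_*.dmtcp.temp; do
        if [ -e "$f" ]; then
            echo "partial"
            return 0
        fi
    done
    echo "none"
}
```

In the job script, the restart branch could then run ./dmtcp_restart_script.sh only when checkpoint_state reports "complete", and print a clear message instead of resubmitting when it reports "partial" or "none".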
lets_try_resub-14980219.out
lets_try_resub-14980349.out
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum