Hello,

Happy Victoria Day from Canada!

I am part of a team working on a few clusters with Compute Canada and I am
trying to write a template for SLURM jobs with DMTCP checkpointing.
Everything seems to be going well except for two issues that I can't find
covered in the FAQ section or the Compute Canada documentation.

1. ./dmtcp_restart_script.sh not generated: As the FAQ section mentions,
DMTCP should generate a dmtcp_restart_script.sh, which is the safer and
tidier way to restart a job. Unfortunately, I can't find the options that
make it generate that file. I have tried adding --new-coordinator, but the
file is still not generated. Additionally, DMTCP only generates a
ckpt_...._.dmtcp.temp file instead of a .dmtcp file.

2. Segmentation fault (core dumped): I am not sure whether this is related
to the first issue, but when my Python script times out (on purpose), the
program raises a segmentation fault.

I have attached my shell script to this email, along with some output from
the SLURM jobs I have submitted. Please let me know if there is anything
that I am missing.

Best,

David

#!/bin/bash
#SBATCH --account=def-aghuang
#SBATCH --cpus-per-task=1
#SBATCH --time=00:04:00
#SBATCH --mem-per-cpu=2056M
#SBATCH --job-name=lets_try_resub
#SBATCH --output=%x-%j.out

### Script Control------------------------------------------------------------------------------------------------------
# Maximum number of times this job may be resubmitted. Avoid using a huge number.
job_resubmission_limit=10
# How often DMTCP writes a checkpoint file, in seconds.
check_point_interval=60
# Path to the Python script, relative to the folder this shell script is in.
python_file_location="./python_script_template.py"
### End of Script Control-----------------------------------------------------------------------------------------------

### Create a local virtual environment and load all necessary packages
module load python/3.7
module load scipy-stack
virtualenv --no-download "$SLURM_TMPDIR/env"
source "$SLURM_TMPDIR/env/bin/activate"
# pip install --upgrade pip
# pip install --no-index -r requirements.txt

echo "Current working directory: $(pwd)"
echo "Starting run at: $(date)"

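### If a previous run of this job didn't finish, restart it using the DMTCP restart script; otherwise start the job with DMTCP.
if test -e "num_times_script_has_resumed.tmp"; then
    echo "Resuming a previous run"
    ./dmtcp_restart_script.sh -h "$(hostname)"
else
    echo "Starting a new run"
    echo 0 > num_times_script_has_resumed.tmp
    dmtcp_launch --rm -i "${check_point_interval}" python "${python_file_location}"
fi
# The status of the if statement is the status of the last command run in the
# branch taken, i.e. the DMTCP restart or launch. Capture it before it is lost.
dmtcp_exit_code=$?
echo "DMTCP step exited with code ${dmtcp_exit_code}"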

### Resubmit if not all work has been completed yet and we haven't hit the job resubmission limit
script_resumed_times=$(($(< num_times_script_has_resumed.tmp)+1))
if test -e "unique_flag_script_running.tmp"; then
    if ((script_resumed_times <= job_resubmission_limit)); then
        echo "Resubmitting Job Attempt #$script_resumed_times"
        echo "${script_resumed_times}" > num_times_script_has_resumed.tmp
        sbatch "${BASH_SOURCE[0]}"
    else
        echo "FAILED: Job Resubmission Limit Reached! Work Incomplete"
    fi
else
    # script_resumed_times counts the next attempt, so subtract 1 to report
    # the number of resubmissions that actually happened.
    echo "Work Completed after $((script_resumed_times - 1)) Resubmissions"
    rm num_times_script_has_resumed.tmp
fi

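# $? alone would only reflect the command just above this line, so prefer a
# previously captured DMTCP status when one is set.
echo "Job finished with exit code ${dmtcp_exit_code:-$?} at: $(date)"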
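
In bash, $? always holds the status of the most recently executed command,
so the status of the DMTCP step has to be captured right after it runs if
it is to be reported later in the script. A minimal sketch of the
difference, where run_workload is a hypothetical stand-in for the
dmtcp_launch or restart step:

```shell
#!/bin/bash
# Hypothetical stand-in for the checkpointed workload; always fails with 3.
run_workload() { return 3; }

run_workload
workload_status=$?      # captured immediately: status of run_workload

echo "some housekeeping" > /dev/null
late_status=$?          # read too late: this is the status of the echo

echo "workload_status=$workload_status late_status=$late_status"
```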

Attachment: lets_try_resub-14980219.out
Description: Binary data

Attachment: lets_try_resub-14980349.out
Description: Binary data

