Hi all,

I'm having trouble using DMTCP to restart a Gaussian 09 job on my university's HPC cluster. It's a TD-DFT calculation and therefore isn't supported by Gaussian's usual Restart capabilities (I've tried all the usual combinations).

I've made an attempt based on example scripts from the university HPC service. My current issue is the syntax of the executable + args + checkpoint time when calling dmtcp_launch.
Specifically:

runcmd="g09 filename.com filename.out 1000" outputs a file called 1000.inp with exactly the same contents as filename.com, then performs the Gaussian calculation with no checkpointing.

runcmd="g09 < filename.com > filename.out 1000" throws a segmentation violation error after trying to run "g09 1000 < filename.com > filename.out".

I'm trying a few more variations, but if someone could please point me in the right direction, that would be great.
Please see attached for example script and .sh/.err/.out files from latest attempt.
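For illustration, a minimal sketch of how bash handles a word that follows the redirections, which may explain why the second variant ends up running as "g09 1000 < filename.com > filename.out". Here show_args is a hypothetical stand-in for g09 and the temporary files are made up for the demo; DMTCP is not involved.

```shell
#!/bin/bash
# Hypothetical stand-in for g09: reports the arguments it receives.
show_args() { echo "argc=$# argv=$*"; }

# Throwaway input/output files for the demo.
workdir=$(mktemp -d)
echo "dummy input" > "$workdir/in.com"

# Same shape as the runcmd in question: redirections first, interval last.
runcmd="show_args < $workdir/in.com > $workdir/out.log"
tint=1000

# eval re-parses the string; redirections may appear anywhere in a simple
# command, so the trailing 1000 is still an ordinary argument to the program.
eval "$runcmd $tint"

cat "$workdir/out.log"
```

The stand-in sees 1000 as its only argument, so appending the interval after the program's command line hands it to the program rather than to DMTCP.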
Thanks very much,
Charlie

--
Charlie Readman
They/She
NanoDTC c2015
NanoPhotonics Centre, Dept. of Physics
Melville Lab, Dept. of Chemistry
Churchill College
University of Cambridge
#!/bin/bash
#SBATCH -J dmtcp_serial
#SBATCH -A CHANGE-ME
#SBATCH --output=dmtcp_serial_%A.out
#SBATCH --error=dmtcp_serial_%A.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH -p skylake

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load dmtcp/2.6.0-intel-17.0.4

ulimit -s 8192

RESTARTSCRIPT="dmtcp_restart_script.sh"
export DMTCP_QUIET=2

runcmd="./example_serial 5"
tint=30

echo "Start coordinator"
date
eval "dmtcp_coordinator --daemon --coord-logfile dmtcp_log.txt --exit-after-ckpt --exit-on-last -i "$tint" --port-file cport.txt -p 0"
sleep 2
cport=$(<cport.txt)
echo "$cport"
h=`hostname`
echo $h

if [ -f "$RESTARTSCRIPT" ]
then
    echo "Resume the application"
    CMD="dmtcp_restart -p "$cport" -i "$tint" ckpt*.dmtcp"
    echo $CMD
    eval $CMD
else
    echo "Start the application"
    CMD="dmtcp_launch --rm --infiniband --no-gzip -h localhost -p "$cport" "$runcmd
    echo $CMD
    eval $CMD
fi

echo "Stopped program execution"
date
#!/bin/bash
#SBATCH -J IR_775_TD_dmtcp_p0
#SBATCH -A [research_group]-cpu -p [cluster]
#SBATCH -N 1 -n 1 -c 16
#SBATCH -t 02:00:00
#SBATCH -o slurm-IR_775_TD_dmtcp_p0.out
#SBATCH --error=error-IR_775_TD_dmtcp_p0.err
#SBATCH --mail-type=ALL

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load dmtcp/2.6.0-intel-17.0.4
module load gaussian/009e01

ulimit -s 8192

RESTARTSCRIPT="dmtcp_restart_script.sh"
export DMTCP_QUIET=0

runcmd="g09 IR_775_TD_dmtcp_p0.com IR_775_TD_dmtcp_p0.log"
tint=1000

echo "Start coordinator"
date
pwd
eval "dmtcp_coordinator --daemon --coord-logfile dmtcp_log.txt --exit-after-ckpt -i "$tint" --port-file cport_0.txt -p 0 --ckpt-open-files"
sleep 2
h=`hostname`
echo $h

if [ -f "$RESTARTSCRIPT" ]
then
    echo "Resume the application"
    cport=$(<cport_0.txt)
    echo "$cport"
    CMD="dmtcp_restart -p "$cport" -i "$tint" ckpt*.dmtcp"
    echo $CMD
    eval $CMD
else
    echo "Start the application"
    module list
    CMD="dmtcp_launch --rm --infiniband --no-gzip -h localhost --port-file cport_0.txt -p 0 "$runcmd" "$tint
    echo $CMD
    eval $CMD
fi

echo "Stopped program execution"
date
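One possible adjustment to the launch branch, sketched on the assumption that the pattern in the HPC service's example script carries over: the interval is already given to dmtcp_coordinator through -i, so nothing is appended after runcmd, and the coordinator port is passed with -p as the example does. The port value 7779 is a placeholder for whatever cport_0.txt contains. The snippet only builds and prints the command; it does not run DMTCP, and whether this avoids the segmentation violation is untested.

```shell
#!/bin/bash
# Sketch only: assembles the launch command without executing it.

# Placeholder port; in the job script this would be read from cport_0.txt
# after dmtcp_coordinator has started, as the example script does.
cport=7779

# Program and its own arguments only; no checkpoint interval here, because
# the interval is set on dmtcp_coordinator via -i.
runcmd="g09 IR_775_TD_dmtcp_p0.com IR_775_TD_dmtcp_p0.log"

# Same dmtcp_launch flags as the example script, followed by the program.
CMD="dmtcp_launch --rm --infiniband --no-gzip -h localhost -p $cport $runcmd"
echo "$CMD"
```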
Usage: dmtcp_coordinator [OPTIONS] [port]
Coordinates checkpoints between multiple processes.

Options:
  -p, --coord-port PORT_NUM (environment variable DMTCP_COORD_PORT)
      Port to listen on (default: 7779)
  --port-file filename
      File to write listener port number.
      (Useful with '--port 0', which is used to assign a random port)
  --ckptdir (environment variable DMTCP_CHECKPOINT_DIR):
      Directory to store dmtcp_restart_script.sh (default: ./)
  --tmpdir (environment variable DMTCP_TMPDIR):
      Directory to store temporary files (default: env var TMDPIR or /tmp)
  --exit-on-last
      Exit automatically when last client disconnects
  --exit-after-ckpt
      Kill peer processes of computation after first checkpoint is created
  --daemon
      Run silently in the background after detaching from the parent process.
  -i, --interval (environment variable DMTCP_CHECKPOINT_INTERVAL):
      Time in seconds between automatic checkpoints
      (default: 0, disabled)
  --coord-logfile PATH (environment variable DMTCP_COORD_LOG_FILENAME
      Coordinator will dump its logs to the given file
  -q, --quiet
      Skip startup msg; Skip NOTE msgs; if given twice, also skip WARNINGs
  --help:
      Print this message and exit.
  --version:
      Print version information and exit.

COMMANDS:
  type '?<return>' at runtime for list

Report bugs to: dmtcp-forum@lists.sourceforge.net
DMTCP home page: <http://dmtcp.sourceforge.net>

Currently Loaded Modulefiles:
  1) dot                       10) intel/libs/idb/2017.4
  2) slurm                     11) intel/libs/tbb/2017.4
  3) turbovnc/2.0.1            12) intel/libs/ipp/2017.4
  4) vgl/2.5.1/64              13) intel/libs/daal/2017.4
  5) singularity/current       14) intel/bundles/complib/2017.4
  6) rhel7/global              15) cmake/latest
  7) intel/compilers/2017.4    16) rhel7/default-peta4
  8) intel/mkl/2017.4          17) dmtcp/2.6.0-intel-17.0.4
  9) intel/impi/2017.4/intel   18) gaussian/009e01

Error: segmentation violation
   rax 0000000000000000, rbx 0000000000000008, rcx ffffffffffffffff
   rdx 00007f8fb067e640, rsp 00007ffdefb08388, rbp 0000000000009c40
   rsi 000000000000000b, rdi 00000000000147ad, r8  0000000000000000
   r9  00007f8fb253d100, r10 00007ffdefb077a0, r11 0000000000000206
   r12 000000000000000b, r13 00007ffdefb0f9e0, r14 0000000000000000
   r15 0000000000000000
  /lib64/libpthread.so.0(+0xf630) [0x7f8fb0d27630]
  /lib64/libc.so.6(kill+0x7) [0x7f8fb067e647]
  /usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp_pid.so(kill+0x22) [0x7f8fb0f77b12]
  g09() [0x407627]
  g09() [0x4039f5]
  g09() [0x40358d]
  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f8fb066a545]
  g09() [0x403489]
/var/spool/slurm/slurmd/job24247438/slurm_script: line 44: 83885 Aborted    dmtcp_launch --rm --infiniband --no-gzip -h localhost --port-file cport_0.txt -p 0 g09 1000 < IR_775_TD_dmtcp_p0.com > IR_775_TD_dmtcp_p0.log
Start coordinator
Fri 22 May 00:02:47 BST 2020
cpu-e-653
Start the application
dmtcp_launch --rm --infiniband --no-gzip -h localhost --port-file cport_0.txt -p 0 g09 < IR_775_TD_dmtcp_p0.com > IR_775_TD_dmtcp_p0.log 1000
Stopped program execution
Fri 22 May 00:02:52 BST 2020
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum