Hi all,

I'm having trouble using DMTCP to restart a Gaussian 09 job on my university's HPC cluster. It's a TD-DFT calculation, so it isn't supported by Gaussian's usual Restart capabilities (I've tried all the usual combinations).

I've made an attempt based on example scripts from the university HPC service. My current issue is the syntax of the executable + args + checkpoint time when calling dmtcp_launch.

Specifically:

runcmd = "g09 filename.com filename.out 1000" outputs a file called 1000.inp with exactly the same contents as filename.com, then performs the Gaussian calculation with no checkpointing.

runcmd = "g09 < filename.com > filename.out 1000" throws a segmentation violation error after trying to run
"g09 1000 < filename.com > filename.out".
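
The reordering in the second case (the 1000 moving in front of the redirections) is consistent with ordinary shell parsing: the shell removes redirections wherever they appear in the command, and every remaining word is passed to the program as an argument. A minimal sketch of that behaviour, using a hypothetical stand-in function "show_args" in place of g09 and a temporary directory, neither of which is part of the attached scripts:

```shell
#!/bin/bash
# Hypothetical stand-in for g09: reports the arguments and stdin it receives.
show_args() {
    echo "argc=$#"
    for a in "$@"; do
        echo "arg: $a"
    done
    echo "stdin: $(cat)"
}

workdir=$(mktemp -d)
echo "input deck" > "$workdir/filename.com"

runcmd="show_args < $workdir/filename.com > $workdir/filename.out"
tint=1000

# Same pattern as the job script: the interval is appended to the command string.
eval "$runcmd $tint"

# The redirections were applied by the shell; 1000 arrived as an argument.
result=$(cat "$workdir/filename.out")
echo "$result"
rm -rf "$workdir"
```

If the same thing happens in the job script, the trailing 1000 reaches g09 as a command-line argument rather than being read by DMTCP as a checkpoint interval.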

I'm trying a few more variations but if someone could please point me in the right direction, that would be great.
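
One possible variation, sketched as a dry run that only prints the command lines: the attached dmtcp_coordinator usage text lists the checkpoint interval as the coordinator's -i option, so the interval could be left off the launched command altogether. This assumes the trailing 5 in the example script's "./example_serial 5" is an argument of that example program rather than a checkpoint time, which is unconfirmed. The port value 7779 is a placeholder (the default given in the usage text), and whether g09 then checkpoints correctly under DMTCP is not something this sketch can show.

```shell
#!/bin/bash
# Dry run: builds the two command lines and prints them; nothing is executed.
tint=1000                       # checkpoint interval in seconds
cport=7779                      # placeholder; the job script reads the real port from its port file
input="IR_775_TD_dmtcp_p0.com"
output="IR_775_TD_dmtcp_p0.log"

# The interval is passed to the coordinator with -i, as in the attached script.
coord=(dmtcp_coordinator --daemon --exit-after-ckpt -i "$tint" --port-file cport_0.txt -p 0)

# The launched command holds only the program itself; no interval is appended.
launch=(dmtcp_launch --rm --infiniband --no-gzip -h localhost -p "$cport" g09)

coord_line="${coord[*]}"
launch_line="${launch[*]} < $input > $output"

echo "$coord_line"
echo "$launch_line"
```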

Please see attached for example script and .sh/.err/.out files from latest attempt.

Thanks very much,
Charlie

--
Charlie Readman
They/She

NanoDTC c2015
NanoPhotonics Centre, Dept. of Physics
Melville Lab, Dept. of Chemistry
Churchill College
University of Cambridge
#!/bin/bash
#SBATCH -J dmtcp_serial
#SBATCH -A CHANGE-ME
#SBATCH --output=dmtcp_serial_%A.out
#SBATCH --error=dmtcp_serial_%A.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH -p skylake

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load dmtcp/2.6.0-intel-17.0.4
ulimit -s 8192

RESTARTSCRIPT="dmtcp_restart_script.sh"
export DMTCP_QUIET=2

runcmd="./example_serial 5"
tint=30

echo "Start coordinator"
date
eval "dmtcp_coordinator --daemon --coord-logfile dmtcp_log.txt --exit-after-ckpt --exit-on-last -i "$tint" --port-file cport.txt -p 0"
sleep 2
cport=$(<cport.txt)
echo "$cport"
h=`hostname`
echo $h

if [ -f "$RESTARTSCRIPT" ]
then
    echo "Resume the application"
    CMD="dmtcp_restart -p "$cport" -i "$tint" ckpt*.dmtcp"
    echo $CMD
    eval $CMD
else
    echo "Start the application"
    CMD="dmtcp_launch --rm --infiniband --no-gzip -h localhost -p "$cport" "$runcmd
    echo $CMD
    eval $CMD
fi

echo "Stopped program execution"
date
#!/bin/bash
#SBATCH -J IR_775_TD_dmtcp_p0
#SBATCH -A [research_group]-cpu -p [cluster]
#SBATCH -N 1 -n 1 -c 16
#SBATCH -t 02:00:00
#SBATCH -o slurm-IR_775_TD_dmtcp_p0.out 
#SBATCH --error=error-IR_775_TD_dmtcp_p0.err 
#SBATCH --mail-type=ALL

. /etc/profile.d/modules.sh
module purge
module load rhel7/default-peta4
module load dmtcp/2.6.0-intel-17.0.4
module load gaussian/009e01
ulimit -s 8192

RESTARTSCRIPT="dmtcp_restart_script.sh"
export DMTCP_QUIET=0

runcmd="g09 IR_775_TD_dmtcp_p0.com IR_775_TD_dmtcp_p0.log"
tint=1000

echo "Start coordinator"
date
pwd
eval "dmtcp_coordinator --daemon --coord-logfile dmtcp_log.txt --exit-after-ckpt -i "$tint" --port-file cport_0.txt -p 0 --ckpt-open-files"
sleep 2
h=`hostname`
echo $h

if [ -f "$RESTARTSCRIPT" ]
then
    echo "Resume the application"
    cport=$(<cport_0.txt)
    echo "$cport"
    CMD="dmtcp_restart -p "$cport" -i "$tint" ckpt*.dmtcp"
    echo $CMD
    eval $CMD
else
    echo "Start the application"
    module list
    CMD="dmtcp_launch --rm --infiniband --no-gzip -h localhost --port-file cport_0.txt -p 0 "$runcmd" "$tint
    echo $CMD
    eval $CMD
fi

echo "Stopped program execution"
date
Usage: dmtcp_coordinator [OPTIONS] [port]
Coordinates checkpoints between multiple processes.

Options:
  -p, --coord-port PORT_NUM (environment variable DMTCP_COORD_PORT)
      Port to listen on (default: 7779)
  --port-file filename
      File to write listener port number.
      (Useful with '--port 0', which is used to assign a random port)
  --ckptdir (environment variable DMTCP_CHECKPOINT_DIR):
      Directory to store dmtcp_restart_script.sh (default: ./)
  --tmpdir (environment variable DMTCP_TMPDIR):
      Directory to store temporary files (default: env var TMDPIR or /tmp)
  --exit-on-last
      Exit automatically when last client disconnects
  --exit-after-ckpt
      Kill peer processes of computation after first checkpoint is created
  --daemon
      Run silently in the background after detaching from the parent process.
  -i, --interval (environment variable DMTCP_CHECKPOINT_INTERVAL):
      Time in seconds between automatic checkpoints
      (default: 0, disabled)
  --coord-logfile PATH (environment variable DMTCP_COORD_LOG_FILENAME
              Coordinator will dump its logs to the given file
  -q, --quiet 
      Skip startup msg; Skip NOTE msgs; if given twice, also skip WARNINGs
  --help:
      Print this message and exit.
  --version:
      Print version information and exit.

COMMANDS:
      type '?<return>' at runtime for list

Report bugs to: dmtcp-forum@lists.sourceforge.net
DMTCP home page: <http://dmtcp.sourceforge.net>

Currently Loaded Modulefiles:
  1) dot                           10) intel/libs/idb/2017.4
  2) slurm                         11) intel/libs/tbb/2017.4
  3) turbovnc/2.0.1                12) intel/libs/ipp/2017.4
  4) vgl/2.5.1/64                  13) intel/libs/daal/2017.4
  5) singularity/current           14) intel/bundles/complib/2017.4
  6) rhel7/global                  15) cmake/latest
  7) intel/compilers/2017.4        16) rhel7/default-peta4
  8) intel/mkl/2017.4              17) dmtcp/2.6.0-intel-17.0.4
  9) intel/impi/2017.4/intel       18) gaussian/009e01
Error: segmentation violation
   rax 0000000000000000, rbx 0000000000000008, rcx ffffffffffffffff
   rdx 00007f8fb067e640, rsp 00007ffdefb08388, rbp 0000000000009c40
   rsi 000000000000000b, rdi 00000000000147ad, r8  0000000000000000
   r9  00007f8fb253d100, r10 00007ffdefb077a0, r11 0000000000000206
   r12 000000000000000b, r13 00007ffdefb0f9e0, r14 0000000000000000
   r15 0000000000000000
  /lib64/libpthread.so.0(+0xf630) [0x7f8fb0d27630]
  /lib64/libc.so.6(kill+0x7) [0x7f8fb067e647]
  /usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp_pid.so(kill+0x22) [0x7f8fb0f77b12]
  g09() [0x407627]
  g09() [0x4039f5]
  g09() [0x40358d]
  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f8fb066a545]
  g09() [0x403489]
/var/spool/slurm/slurmd/job24247438/slurm_script: line 44: 83885 Aborted                 dmtcp_launch --rm --infiniband --no-gzip -h localhost --port-file cport_0.txt -p 0 g09 1000 < IR_775_TD_dmtcp_p0.com > IR_775_TD_dmtcp_p0.log
Start coordinator
Fri 22 May 00:02:47 BST 2020
cpu-e-653
Start the application
dmtcp_launch --rm --infiniband --no-gzip -h localhost --port-file cport_0.txt -p 0 g09 < IR_775_TD_dmtcp_p0.com > IR_775_TD_dmtcp_p0.log 1000
Stopped program execution
Fri 22 May 00:02:52 BST 2020
