Good morning all,

I am working on reducing the checkpoint time of MPI applications.
Our basic scenario is a multi-node cluster. Every node has local
storage (fast) and network storage (slow). The network storage
provides shared folders, accessible by all the nodes in the cluster;
local storage is visible only to its own node.

A basic approach is to configure DMTCP to use this shared storage.
When a checkpoint operation is requested, all nodes write their data
to the specified folder. Then, on restart, the data is accessible by
everyone (most importantly, by the coordinator) and it works great.
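
For reference, the shared-storage setup is just a matter of pointing
DMTCP at the shared folder, along these lines (/shared/ckpt is only an
example path; as far as I know, dmtcp_launch's --ckptdir option and the
DMTCP_CHECKPOINT_DIR variable from the usage text in the trace below
are equivalent):

  # coordinator on a node reachable by all workers
  dmtcp_coordinator --daemon --port 7779
  # launch with checkpoint images directed to the shared folder
  dmtcp_launch --coord-host host1 --coord-port 7779 \
      --ckptdir /shared/ckpt mpiexec -n 2 ./helloWorldMPI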

However, a second, and more interesting, step would be to store the
checkpoints on local storage. This way, both checkpoint and restart
operations are faster.

This is not a problem with serial applications: since the coordinator
can access the local files, any location can be chosen and the restart
works too. This is not only faster but also reduces network overhead,
which is good.
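
(For the serial case it boils down to something like the following,
with /tmp standing in for any node-local directory and mySerialApp
being a made-up example binary:)

  dmtcp_launch --ckptdir /tmp ./mySerialApp
  dmtcp_command -c        # request a checkpoint; images land in /tmp
  # later, on the same node:
  dmtcp_restart /tmp/ckpt_mySerialApp_*.dmtcp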

The problem comes with MPI jobs. The checkpoint operation is fine: if
a given folder (let's say /tmp) exists on all nodes, the checkpoint
files are written there. If I look at the dmtcp_restart_script.sh
file, I can see (simplified for legibility):

worker_ckpts='
 :: host1 :bg: /tmp/my_files....
 :: host2 :bg: /tmp/my_files...

which is correct. All these files exist and are accessible by the user.
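
(Something like this quick loop over the nodes confirms it:

  for h in host1 host2; do
      ssh "$h" 'ls -l /tmp/ckpt_*.dmtcp'   # glob expands on each node
  done
)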

However, when a restart operation is requested, it does not work.
Looking into dmtcp_restart_script.sh, this is the operation where it
gets stuck:

+ dmtcp_restart --join --coord-host host1 --coord-port 7779
/tmp/ckpt_dmtcp_srun_helper_1b69d09fb3238b30-41000-5617e813.dmtcp
/tmp/ckpt_mpiexec.hydra_1b69d09fb3238b30-40000-5617e813.dmtcp

If I go to host1:

# dmtcp_command -s
Coordinator:
  Host: localhost
  Port: 7779 (default port)
Status...
  NUM_PEERS=2
  RUNNING=no
  CKPT_INTERVAL=0 (checkpoint manually)

# dmtcp_command -k

This causes the dmtcp_restart operation to finish, so I assume it is
at least partially working.
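
In case it is relevant: the coordinator can also be queried remotely,
e.g. from host2, to check that both nodes reach the same instance
(ports and hosts as in the script):

  dmtcp_command --coord-host host1 --coord-port 7779 -s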

So, any ideas or suggestions?

Just in case it is useful, please find the full dmtcp_restart_script.sh trace below.

Cheers,


Manuel


''''''

+ /bin/bash ./dmtcp_restart_script.sh
+ usage_str='USAGE:
  dmtcp_restart_script.sh [OPTIONS]

OPTIONS:
  --coord-host, -h, (environment variable DMTCP_COORD_HOST):
      Hostname where dmtcp_coordinator is running
  --coord-port, -p, (environment variable DMTCP_COORD_PORT):
      Port where dmtcp_coordinator is running
  --hostfile <arg0> :
      Provide a hostfile (One host per line, "#" indicates comments)
  --ckptdir, -d, (environment variable DMTCP_CHECKPOINT_DIR):
      Directory to store checkpoint images
      (default: use the same directory used in previous checkpoint)
  --restartdir, -d, (environment variable DMTCP_RESTART_DIR):
      Directory to read checkpoint images from
  --tmpdir, -t, (environment variable DMTCP_TMPDIR):
      Directory to store temporary files (default: $TMDPIR or /tmp)
  --no-strict-uid-checking:
      Disable uid checking for the checkpoint image.  This allows the
        checkpoint image to be restarted by a different user than the one
        that create it. (environment variable DMTCP_DISABLE_UID_CHECKING)
  --interval, -i, (environment variable DMTCP_CHECKPOINT_INTERVAL):
      Time in seconds between automatic checkpoints
      (Default: Use pre-checkpoint value)
  --help:
      Print this message and exit.'
+ ckpt_timestamp='Fri Oct  9 18:15:39 2015'
+ coord_host=acme12.ciemat.es
+ test -z acme12.ciemat.es
+ coord_port=7779
+ test -z 7779
+ checkpoint_interval=
+ test -z ''
+ checkpoint_interval=0
+ export DMTCP_CHECKPOINT_INTERVAL=0
+ DMTCP_CHECKPOINT_INTERVAL=0
+ '[' 0 -gt 0 ']'
+ dmt_rstr_cmd=/home/localsoft/dmtcp/bin/dmtcp_restart
+ which /home/localsoft/dmtcp/bin/dmtcp_restart
+ which /home/localsoft/dmtcp/bin/dmtcp_restart
+ which /home/localsoft/dmtcp/bin/dmtcp_restart
+ worker_ckpts='
 :: acme12.ciemat.es :bg:
/tmp/ckpt_dmtcp_srun_helper_1b69d09fb3238b30-41000-5617e813.dmtcp
/tmp/ckpt_hydra_pmi_proxy_1b69d09fb3238b30-43000-5617e813.dmtcp
/tmp/ckpt_mpiexec.hydra_1b69d09fb3238b30-40000-5617e813.dmtcp
/tmp/ckpt_helloWorldMPI_1b69d09fb3238b30-44000-5617e813.dmtcp
 :: acme13.ciemat.es :bg:
/tmp/ckpt_hydra_pmi_proxy_54385264162a2589-45000-5617e813.dmtcp
/tmp/ckpt_helloWorldMPI_54385264162a2589-46000-5617e814.dmtcp
'
++ which ibrun
+ ibrun_path=
+ '[' '!' -n '' ']'
++ which dmtcp_discover_rm
+ discover_rm_path=/home/localsoft/dmtcp/bin/dmtcp_discover_rm
+ '[' -n /home/localsoft/dmtcp/bin/dmtcp_discover_rm ']'
++ dmtcp_discover_rm -t
+ eval RES_MANAGER=SLURM 'manager_resources="*acme12:2' '"'
++ RES_MANAGER=SLURM
++ manager_resources='*acme12:2 '
++ which srun
+ srun_path=/home/localsoft/slurm/soft/bin/srun
++ which dmtcp_rm_loclaunch
+ llaunch=/home/localsoft/dmtcp/bin/dmtcp_rm_loclaunch
+ '[' SLURM = SLURM ']'
+ '[' -n /home/localsoft/slurm/soft/bin/srun ']'
++ dmtcp_discover_rm -n '
 :: acme12.ciemat.es :bg:
/tmp/ckpt_dmtcp_srun_helper_1b69d09fb3238b30-41000-5617e813.dmtcp
/tmp/ckpt_hydra_pmi_proxy_1b69d09fb3238b30-43000-5617e813.dmtcp
/tmp/ckpt_mpiexec.hydra_1b69d09fb3238b30-40000-5617e813.dmtcp
/tmp/ckpt_helloWorldMPI_1b69d09fb3238b30-44000-5617e813.dmtcp
 :: acme13.ciemat.es :bg:
/tmp/ckpt_hydra_pmi_proxy_54385264162a2589-45000-5617e813.dmtcp
/tmp/ckpt_helloWorldMPI_54385264162a2589-46000-5617e814.dmtcp
'
+ eval RES_MANAGER=SLURM 'manager_resources="*acme12:2' '"'
'input_config="*acme12.ciemat.es:1' acme13.ciemat.es:1 '"'
'DMTCP_DISCOVER_PM_TYPE='\''HYDRA'\''' 'DMTCP_LAUNCH_CKPTS='\'''
/tmp/ckpt_dmtcp_srun_helper_1b69d09fb3238b30-41000-5617e813.dmtcp
'/tmp/ckpt_mpiexec.hydra_1b69d09fb3238b30-40000-5617e813.dmtcp'\'''
DMTCP_REMLAUNCH_NODES=1 DMTCP_REMLAUNCH_0_SLOTS=1
'DMTCP_REMLAUNCH_0_0='\''/tmp/ckpt_helloWorldMPI_1b69d09fb3238b30-44000-5617e813.dmtcp'
/tmp/ckpt_hydra_pmi_proxy_1b69d09fb3238b30-43000-5617e813.dmtcp
/tmp/ckpt_helloWorldMPI_54385264162a2589-46000-5617e814.dmtcp
/tmp/ckpt_hydra_pmi_proxy_54385264162a2589-45000-5617e813.dmtcp ''\'''
++ RES_MANAGER=SLURM
++ manager_resources='*acme12:2 '
++ input_config='*acme12.ciemat.es:1 acme13.ciemat.es:1 '
++ DMTCP_DISCOVER_PM_TYPE=HYDRA
++ DMTCP_LAUNCH_CKPTS='
/tmp/ckpt_dmtcp_srun_helper_1b69d09fb3238b30-41000-5617e813.dmtcp
/tmp/ckpt_mpiexec.hydra_1b69d09fb3238b30-40000-5617e813.dmtcp'
++ DMTCP_REMLAUNCH_NODES=1
++ DMTCP_REMLAUNCH_0_SLOTS=1
++ DMTCP_REMLAUNCH_0_0='/tmp/ckpt_helloWorldMPI_1b69d09fb3238b30-44000-5617e813.dmtcp
/tmp/ckpt_hydra_pmi_proxy_1b69d09fb3238b30-43000-5617e813.dmtcp
/tmp/ckpt_helloWorldMPI_54385264162a2589-46000-5617e814.dmtcp
/tmp/ckpt_hydra_pmi_proxy_54385264162a2589-45000-5617e813.dmtcp '
+ '[' -n '' ']'
+ export DMTCP_REMLAUNCH_NODES=1
+ DMTCP_REMLAUNCH_NODES=1
+ bound=0
++ seq 0 0
+ for i in '$(seq 0 $bound)'
+ eval 'val=${DMTCP_REMLAUNCH_0_SLOTS}'
++ val=1
+ export DMTCP_REMLAUNCH_0_SLOTS=1
+ DMTCP_REMLAUNCH_0_SLOTS=1
+ bound2=0
++ seq 0 0
+ for j in '$(seq 0 $bound2)'
+ eval 'ckpts=${DMTCP_REMLAUNCH_0_0}'
++ ckpts='/tmp/ckpt_helloWorldMPI_1b69d09fb3238b30-44000-5617e813.dmtcp
/tmp/ckpt_hydra_pmi_proxy_1b69d09fb3238b30-43000-5617e813.dmtcp
/tmp/ckpt_helloWorldMPI_54385264162a2589-46000-5617e814.dmtcp
/tmp/ckpt_hydra_pmi_proxy_54385264162a2589-45000-5617e813.dmtcp '
+ export 'DMTCP_REMLAUNCH_0_0=/tmp/ckpt_helloWorldMPI_1b69d09fb3238b30-44000-5617e813.dmtcp
/tmp/ckpt_hydra_pmi_proxy_1b69d09fb3238b30-43000-5617e813.dmtcp
/tmp/ckpt_helloWorldMPI_54385264162a2589-46000-5617e814.dmtcp
/tmp/ckpt_hydra_pmi_proxy_54385264162a2589-45000-5617e813.dmtcp '
+ DMTCP_REMLAUNCH_0_0='/tmp/ckpt_helloWorldMPI_1b69d09fb3238b30-44000-5617e813.dmtcp
/tmp/ckpt_hydra_pmi_proxy_1b69d09fb3238b30-43000-5617e813.dmtcp
/tmp/ckpt_helloWorldMPI_54385264162a2589-46000-5617e814.dmtcp
/tmp/ckpt_hydra_pmi_proxy_54385264162a2589-45000-5617e813.dmtcp '
+ '[' HYDRA = HYDRA ']'
++ mktemp ./tmp.XXXXXXXXXX
+ export DMTCP_SRUN_HELPER_SYNCFILE=./tmp.wCpQ2NYVjA
+ DMTCP_SRUN_HELPER_SYNCFILE=./tmp.wCpQ2NYVjA
+ rm ./tmp.wCpQ2NYVjA
+ dmtcp_srun_helper -r /home/localsoft/slurm/soft/bin/srun /home/localsoft/dmtcp/bin/dmtcp_rm_loclaunch
+ '[' '!' -f ./tmp.wCpQ2NYVjA ']'
+ . ./tmp.wCpQ2NYVjA
++ export DMTCP_SRUN_HELPER_ADDR=/tmp/srun_helper_usock.ezhOdw
++ DMTCP_SRUN_HELPER_ADDR=/tmp/srun_helper_usock.ezhOdw
+ pass_slurm_helper_contact '
/tmp/ckpt_dmtcp_srun_helper_1b69d09fb3238b30-41000-5617e813.dmtcp
/tmp/ckpt_mpiexec.hydra_1b69d09fb3238b30-40000-5617e813.dmtcp'
+ LOCAL_FILES='
/tmp/ckpt_dmtcp_srun_helper_1b69d09fb3238b30-41000-5617e813.dmtcp
/tmp/ckpt_mpiexec.hydra_1b69d09fb3238b30-40000-5617e813.dmtcp'
+ '[' -n '' ']'
+ '[' -n '' ']'
++ whoami
++ hostname
+ CURRENT_TMPDIR=/tmp/dmtcp-sl...@acme12.ciemat.es
+ '[' '!' -d /tmp/dmtcp-sl...@acme12.ciemat.es ']'
+ for CKPT_FILE in '$LOCAL_FILES'
+ SUFFIX=/tmp/ckpt_dmtcp_srun_helper_1b69d09fb3238b30-41000-5617e813
+ SLURM_ENV_FILE=/tmp/dmtcp-sl...@acme12.ciemat.es/slurm_env_1b69d09fb3238b30-41000-5617e813
+ echo DMTCP_SRUN_HELPER_ADDR=/tmp/srun_helper_usock.ezhOdw
+ for CKPT_FILE in '$LOCAL_FILES'
+ SUFFIX=/tmp/ckpt_mpiexec.hydra_1b69d09fb3238b30-40000-5617e813
+ SLURM_ENV_FILE=/tmp/dmtcp-sl...@acme12.ciemat.es/slurm_env_1b69d09fb3238b30-40000-5617e813
+ echo DMTCP_SRUN_HELPER_ADDR=/tmp/srun_helper_usock.ezhOdw
+ rm ./tmp.wCpQ2NYVjA
+ dmtcp_restart --join --coord-host acme12.ciemat.es --coord-port 7779 /tmp/ckpt_dmtcp_srun_helper_1b69d09fb3238b30-41000-5617e813.dmtcp /tmp/ckpt_mpiexec.hydra_1b69d09fb3238b30-40000-5617e813.dmtcp
+ exit 0


''''''





-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
