Dear DMTCP developers,

I'm trying to set up checkpoint/restart of MPI applications using MVAPICH.

I tried several options to launch DMTCP with MVAPICH, but none succeeded.

I use ****** markers to delimit lengthy dumps of debugging information.

Below I show my most successful attempt; I can report the results of the other 
attempts on request.

In the end, the restart script seems to complain about a shared-memory file 
that it cannot unlink, and the restarted application then aborts with an 
assertion failure in ibvctx.c. Could you tell me how I can work around this 
issue?


First I launch dmtcp_coordinator in a separate window.
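Its invocation is just the following (the port matches the --coord-port seen 
in the launch log):

$ dmtcp_coordinator --port 7779

Then I start the application as follows: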

******
$ dmtcp_launch --rm --ib srun --mpi=pmi2    ./wrapper.sh  ./bin/lu.A.2 
[40000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
[40000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
[40000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under SLURM!'
[40000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON= tid_offset: 720
[40000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON= pid_offset: 724
[42000] TRACE at rm_slurm.cpp:131 in print_args; REASON='Init CMD:'
     cmdline = /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 ./wrapper.sh 
./bin/lu.A.2 
[42000] TRACE at rm_slurm.cpp:160 in patch_srun_cmdline; REASON='Expand 
dmtcp_launch path'
     dmtcpCkptPath = dmtcp_launch
[42000] TRACE at rm_slurm.cpp:253 in execve; REASON='How command looks from 
exec*:'
[42000] TRACE at rm_slurm.cpp:254 in execve; REASON='CMD:'
     cmdline = dmtcp_srun_helper dmtcp_nocheckpoint 
/opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 dmtcp_launch --coord-host 
127.0.0.1 --coord-port 7779 --ckptdir 
/home/s9951545/dmtcp-app/NPB3.3/NPB3.3-MPI --infiniband --batch-queue 
--explicit-srun ./wrapper.sh ./bin/lu.A.2 
[42000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON= tid_offset: 720
[42000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON= pid_offset: 724
[42000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
[42000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
[42000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under SLURM!'


 NAS Parallel Benchmarks 3.3 -- LU Benchmark

 Size:   64x  64x  64
 Iterations:  250
 Number of processes:     2

 Time step    1
 Time step   20
 Time step   40
 Time step   60
[42000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start, internal pmi 
capable'
[40000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start, internal pmi 
capable'
[42000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no sockets left'
[40000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no sockets left'
[40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi 
capable'
[42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi 
capable'
 Time step   80
******
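
The checkpoint itself is requested through the coordinator, e.g. with:

$ dmtcp_command --coord-port 7779 --checkpoint

(pressing 'c' in the coordinator window does the same).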

I manage to create a checkpoint, but when I try to restart, the restart script 
stops at this point:

******
$ ./dmtcp_restart_script.sh 
<SKIPPED>
     dir = /tmp/dmtcp-s9951545@taurusi4043
[45000] TRACE at jfilesystem.cpp:172 in mkdir_r; REASON='Directory already 
exists'
     dir = /tmp/dmtcp-s9951545@taurusi4043
[45000] WARNING at fileconnlist.cpp:192 in resume; 
REASON='JWARNING(unlink(missingUnlinkedShmFiles[i].name) != -1) failed'
     missingUnlinkedShmFiles[i].name = 
/dev/shm/cm_shmem-1003236.42-taurusi4043-1074916.tmp
     (strerror((*__errno_location ()))) = No such file or directory
Message: The file was unlinked at the time of checkpoint. Unlinking it after 
restart failed
[42000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open 
SLURM environment file. Environment won't be restored!'
     filename = 
/tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-42000-5845bc0a
[44000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open 
SLURM environment file. Environment won't be restored!'
     filename = 
/tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-44000-323cf5bc0749
[40000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open 
SLURM environment file. Environment won't be restored!'
     filename = 
/tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-40000-323cd8b79b6f
[45000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open 
SLURM environment file. Environment won't be restored!'
     filename = 
/tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-45000-5845bc0a
[44000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi 
capable'
[42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi 
capable'
[42000] TRACE at rm_slurm.cpp:522 in slurmRestoreHelper; REASON='This is srun 
helper. Restore it'
[40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi 
capable'
[45000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi 
capable'
lu.A.2: ibvctx.c:273: query_qp_info: Assertion `size == sizeof(ibv_qp_id_t)' 
failed.
******


Before starting, I set the following environment variables for MVAPICH:

export MV2_USE_SHARED_MEM=0         # disable intra-node shared memory; probably the most relevant one
export MV2_USE_BLOCKING=0           # stay in polling mode, do not block on interrupts
export MV2_ENABLE_AFFINITY=0        # disable CPU core affinity
export MV2_RDMA_NUM_EXTRA_POLLS=1
export MV2_CM_MAX_SPIN_COUNT=1      # limit busy-wait spinning
export MV2_SPIN_COUNT=100           # limit busy-wait spinning
export MV2_DEBUG_SHOW_BACKTRACE=1   # print a backtrace on failure
export MV2_DEBUG_CORESIZE=unlimited # allow full core dumps
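
In case it is relevant, wrapper.sh is essentially of this shape (a minimal 
sketch; assuming it only does per-node setup and execs the real binary):

#!/bin/bash
# wrapper.sh (minimal sketch): per-node setup, then exec the real
# MPI binary that srun passes as arguments (./bin/lu.A.2 above).
export MV2_USE_SHARED_MEM=0   # plus the other MV2_* settings above
exec "$@"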



-- 
Regards,
Maksym Planeta
