Dear DMTCP developers,

I'm trying to set up checkpoint/restart of MPI applications using MVAPICH. I tried several options to launch DMTCP with MVAPICH, but none succeeded. I mark lengthy dumps of debugging information with the symbols ******. Below I show my most successful attempt; I can report the results of the other attempts on request. In the end the restart script seems to complain about a shared-memory file which it cannot open. Could you tell me how I can work around this issue?

First I launch dmtcp_coordinator in a separate window, then I start the application as follows:

******
$ dmtcp_launch --rm --ib srun --mpi=pmi2 ./wrapper.sh ./bin/lu.A.2
[40000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
[40000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
[40000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under SLURM!'
[40000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON= tid_offset: 720
[40000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON= pid_offset: 724
[42000] TRACE at rm_slurm.cpp:131 in print_args; REASON='Init CMD:'
  cmdline = /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 ./wrapper.sh ./bin/lu.A.2
[42000] TRACE at rm_slurm.cpp:160 in patch_srun_cmdline; REASON='Expand dmtcp_launch path'
  dmtcpCkptPath = dmtcp_launch
[42000] TRACE at rm_slurm.cpp:253 in execve; REASON='How command looks from exec*:'
[42000] TRACE at rm_slurm.cpp:254 in execve; REASON='CMD:'
  cmdline = dmtcp_srun_helper dmtcp_nocheckpoint /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 dmtcp_launch --coord-host 127.0.0.1 --coord-port 7779 --ckptdir /home/s9951545/dmtcp-app/NPB3.3/NPB3.3-MPI --infiniband --batch-queue --explicit-srun ./wrapper.sh ./bin/lu.A.2
[42000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON= tid_offset: 720
[42000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON= pid_offset: 724
[42000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
[42000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
[42000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under SLURM!'
NAS Parallel Benchmarks 3.3 -- LU Benchmark
Size: 64x 64x 64
Iterations: 250
Number of processes: 2
Time step 1
Time step 20
Time step 40
Time step 60
[42000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start, internal pmi capable'
[40000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start, internal pmi capable'
[42000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no sockets left'
[40000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no sockets left'
[40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
[42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
Time step 80
******
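In case the exact coordinator invocation matters, it is roughly the following (the port matches the --coord-port 7779 visible in the dump above; the checkpoint itself I trigger from the coordinator window):

$ dmtcp_coordinator --port 7779
# while the benchmark is running, typing 'c' (plus Enter) in this
# window requests a checkpoint of all connected processes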
I manage to create a checkpoint, but when I try to restart, the restart script stops at this point:

******
$ ./dmtcp_restart_script.sh
<SKIPPED>
  dir = /tmp/dmtcp-s9951545@taurusi4043
[45000] TRACE at jfilesystem.cpp:172 in mkdir_r; REASON='Directory already exists'
  dir = /tmp/dmtcp-s9951545@taurusi4043
[45000] WARNING at fileconnlist.cpp:192 in resume; REASON='JWARNING(unlink(missingUnlinkedShmFiles[i].name) != -1) failed'
  missingUnlinkedShmFiles[i].name = /dev/shm/cm_shmem-1003236.42-taurusi4043-1074916.tmp
  (strerror((*__errno_location ()))) = No such file or directory
  Message: The file was unlinked at the time of checkpoint. Unlinking it after restart failed
[42000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
  filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-42000-5845bc0a
[44000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
  filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-44000-323cf5bc0749
[40000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
  filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-40000-323cd8b79b6f
[45000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
  filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-45000-5845bc0a
[44000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
[42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
[42000] TRACE at rm_slurm.cpp:522 in slurmRestoreHelper; REASON='This is srun helper. Restore it'
[40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
[45000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
lu.A.2: ibvctx.c:273: query_qp_info: Assertion `size == sizeof(ibv_qp_id_t)' failed.
******

Before starting, I set the following environment variables for MVAPICH:

export MV2_USE_SHARED_MEM=0 # This one is probably the most relevant
export MV2_USE_BLOCKING=0
export MV2_ENABLE_AFFINITY=0
export MV2_RDMA_NUM_EXTRA_POLLS=1
export MV2_CM_MAX_SPIN_COUNT=1
export MV2_SPIN_COUNT=100
export MV2_DEBUG_SHOW_BACKTRACE=1
export MV2_DEBUG_CORESIZE=unlimited
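In case it helps, these are the plain sanity checks I can run inside the same allocation, to confirm that the MV2_* settings actually reach the remote ranks and to look for leftover shared-memory segments of the kind the restart warning mentions (the cm_shmem pattern is taken from the warning above; nothing DMTCP-specific):

$ srun --mpi=pmi2 env | grep '^MV2_'
$ srun sh -c 'ls -l /dev/shm/cm_shmem-* 2>/dev/null || echo "no cm_shmem files"'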
--
Regards,
Maksym Planeta