Hi Maksym,
Thanks for writing to us. Can you provide the following info: the DMTCP
version, the Slurm version, and the MVAPICH2 version, and whether MVAPICH2 is
configured with srun as the process launcher?
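If it helps, the following commands usually report those versions (a rough
sketch assuming standard installs; mpiname is the utility shipped with MVAPICH2):

dmtcp_launch --version    # DMTCP version
srun --version            # Slurm version
mpiname -a                # MVAPICH2 version and its configure options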
Also, how did you run the jobs: by submitting batch scripts or by running
them interactively?
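For instance, a batch run would look roughly like the sketch below (the
coordinator port and paths are only placeholders taken from your log), while
an interactive run would execute the same dmtcp_launch line by hand inside
the allocation:

#!/bin/bash
#SBATCH --ntasks=2
# Hypothetical batch variant of the run shown in your log: start a
# coordinator on the batch node, then launch the MPI ranks under it.
dmtcp_coordinator --daemon --exit-on-last -p 7779
dmtcp_launch --rm --ib --coord-port 7779 srun --mpi=pmi2 ./wrapper.sh ./bin/lu.A.2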
Best,
Jiajun
On Mon, Dec 5, 2016 at 2:21 PM, Maksym Planeta <
mplan...@os.inf.tu-dresden.de> wrote:
> Dear DMTCP developers,
>
> I'm trying to set up checkpoint/restart of MPI applications using MVAPICH.
>
> I tried several options to launch DMTCP with MVAPICH, but none succeeded.
>
> I use ****** markers to delimit lengthy dumps of debugging information.
>
> Below I show my most successful attempt; I can report the results of other
> attempts on request.
>
> In the end, the restart script seems to complain about a shared memory file
> that it cannot open. Could you tell me how I can work around this issue?
>
>
> First I launch dmtcp_coordinator in a separate window, then I start the
> application as follows:
>
> ******
> $ dmtcp_launch --rm --ib srun --mpi=pmi2 ./wrapper.sh ./bin/lu.A.2
> [40000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
> [40000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
> [40000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under
> SLURM!'
> [40000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON= tid_offset: 720
> [40000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON= pid_offset: 724
> [42000] TRACE at rm_slurm.cpp:131 in print_args; REASON='Init CMD:'
> cmdline = /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2
> ./wrapper.sh ./bin/lu.A.2
> [42000] TRACE at rm_slurm.cpp:160 in patch_srun_cmdline; REASON='Expand
> dmtcp_launch path'
> dmtcpCkptPath = dmtcp_launch
> [42000] TRACE at rm_slurm.cpp:253 in execve; REASON='How command looks
> from exec*:'
> [42000] TRACE at rm_slurm.cpp:254 in execve; REASON='CMD:'
> cmdline = dmtcp_srun_helper dmtcp_nocheckpoint
> /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 dmtcp_launch
> --coord-host 127.0.0.1 --coord-port 7779 --ckptdir
> /home/s9951545/dmtcp-app/NPB3.3/NPB3.3-MPI --infiniband --batch-queue
> --explicit-srun ./wrapper.sh ./bin/lu.A.2
> [42000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON= tid_offset: 720
> [42000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON= pid_offset: 724
> [42000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
> [42000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
> [42000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under
> SLURM!'
>
>
> NAS Parallel Benchmarks 3.3 -- LU Benchmark
>
> Size: 64x 64x 64
> Iterations: 250
> Number of processes: 2
>
> Time step 1
> Time step 20
> Time step 40
> Time step 60
> [42000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start,
> internal pmi capable'
> [40000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start,
> internal pmi capable'
> [42000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no sockets
> left'
> [40000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no sockets
> left'
> [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal
> pmi capable'
> [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal
> pmi capable'
> Time step 80
> ******
>
> I manage to create a checkpoint, but when I try to restart, the restart
> script stops at this point:
>
> ******
> $ ./dmtcp_restart_script.sh
> <SKIPPED>
> dir = /tmp/dmtcp-s9951545@taurusi4043
> [45000] TRACE at jfilesystem.cpp:172 in mkdir_r; REASON='Directory already
> exists'
> dir = /tmp/dmtcp-s9951545@taurusi4043
> [45000] WARNING at fileconnlist.cpp:192 in resume; REASON='JWARNING(unlink(
> missingUnlinkedShmFiles[i].name) != -1) failed'
> missingUnlinkedShmFiles[i].name = /dev/shm/cm_shmem-1003236.42-taurusi4043-1074916.tmp
> (strerror((*__errno_location ()))) = No such file or directory
> Message: The file was unlinked at the time of checkpoint. Unlinking it
> after restart failed
> [42000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open
> SLURM environment file. Environment won't be restored!'
> filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-42000-5845bc0a
> [44000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open
> SLURM environment file. Environment won't be restored!'
> filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-44000-323cf5bc0749
> [40000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open
> SLURM environment file. Environment won't be restored!'
> filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-40000-323cd8b79b6f
> [45000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open
> SLURM environment file. Environment won't be restored!'
> filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-45000-5845bc0a
> [44000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal
> pmi capable'
> [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal
> pmi capable'
> [42000] TRACE at rm_slurm.cpp:522 in slurmRestoreHelper; REASON='This is
> srun helper. Restore it'
> [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal
> pmi capable'
> [45000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal
> pmi capable'
> lu.A.2: ibvctx.c:273: query_qp_info: Assertion `size == sizeof(ibv_qp_id_t)' failed.
> ******
>
>
> Before starting, I set the following environment variables for MVAPICH:
>
> export MV2_USE_SHARED_MEM=0 # This one is probably the most relevant
> export MV2_USE_BLOCKING=0
> export MV2_ENABLE_AFFINITY=0
> export MV2_RDMA_NUM_EXTRA_POLLS=1
> export MV2_CM_MAX_SPIN_COUNT=1
> export MV2_SPIN_COUNT=100
> export MV2_DEBUG_SHOW_BACKTRACE=1
> export MV2_DEBUG_CORESIZE=unlimited
>
>
>
> --
> Regards,
> Maksym Planeta
>