Which DMTCP version are you using? Could you try the following patch, please?
https://github.com/jiajuncao/dmtcp/commit/8d693636e4a0fce87fb4d96e685e4336831d50ea
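
In case it helps, here is a rough sketch of one way to apply that commit to a
source build (this assumes a git checkout of DMTCP; GitHub serves the commit
as a patch when .patch is appended to the URL):

$ curl -LO https://github.com/jiajuncao/dmtcp/commit/8d693636e4a0fce87fb4d96e685e4336831d50ea.patch
$ git am 8d693636e4a0fce87fb4d96e685e4336831d50ea.patch
$ ./configure && make && make install   # then rebuild and reinstall as usual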
Best,
Jiajun
On Mon, Dec 5, 2016 at 5:52 PM, Maksym Planeta
<mplan...@os.inf.tu-dresden.de> wrote:
> I was running the application inside interactive job allocation. One shell
> was running coordinator, another one was launching the application.
>
> Both shells were inside the same working directory.
>
> Normally I use mpirun_rsh to launch applications. If I use srun, I have to
> pass --mpi=pmi2 as well.
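>
> For illustration, the two launch styles look roughly like this (the binary is
> the one from the run below; the hostfile for mpirun_rsh is a placeholder):
>
> $ mpirun_rsh -np 2 -hostfile hosts ./bin/lu.A.2   # usual launch
> $ srun --mpi=pmi2 -n 2 ./bin/lu.A.2               # launch via srun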
>
> MVAPICH is configured for mpirun.
>
> $ srun --version
> slurm 16.05.5-Bull.1.1-20161010-0700
>
>
> $ mpiname -a
> MVAPICH2 2.2 Thu Sep 08 22:00:00 EST 2016 ch3:mrail
>
> Compilation
> CC: gcc -g -O0
> CXX: g++ -g -O0
> F77: gfortran -L/lib -L/lib -g -O0
> FC: gfortran -g -O0
>
> Configuration
> --enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo
> --enable-mpit-pvars=all --enable-check-compiler-flags
> --enable-threads=multiple --enable-weak-symbols
> --disable-dependency-tracking --enable-fast-install --disable-rdma-cm
> --with-pm=mpirun:hydra --with-rdma=gen2 --with-device=ch3:mrail
> --enable-alloca --enable-hwloc --disable-fast --enable-g=dbg
> --enable-error-messages=all --enable-error-checking=all --prefix=<dir>
>
>
> On 12/05/2016 11:39 PM, Jiajun Cao wrote:
>
>> Hi Maksym,
>>
>> Thanks for writing to us. Can you provide the following info:
>>
>> DMTCP version, Slurm version, MVAPICH2 version, and whether MVAPICH2 is
>> configured with srun as the process launcher?
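>>
>> For reference, rough ways to gather those (assuming the tools are in your
>> PATH; exact flags may vary by install):
>>
>> $ dmtcp_launch --version   # DMTCP version
>> $ srun --version           # Slurm version
>> $ mpiname -a               # MVAPICH2 version; --with-pm=... shows the launcher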
>>
>> Also, how did you run the jobs: by submitting batch scripts or by running
>> interactive jobs?
>>
>>
>> Best,
>> Jiajun
>>
>> On Mon, Dec 5, 2016 at 2:21 PM, Maksym Planeta
>> <mplan...@os.inf.tu-dresden.de> wrote:
>>
>> Dear DMTCP developers,
>>
>> I'm trying to set up checkpoint/restart of MPI applications using
>> MVAPICH.
>>
>> I tried several options to launch DMTCP with MVAPICH, but none
>> succeeded.
>>
>> I use ****** markers around lengthy dumps of debugging information.
>>
>> I show my most successful attempt below; I can report the results of the
>> other attempts on request.
>>
>> In the end, the restart script seems to complain about a shared memory
>> file that it cannot open. Could you tell me how I can work around this
>> issue?
>>
>>
>> First I launch dmtcp_coordinator in a separate window, and then I start
>> the application as shown in the dump below.
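>>
>> A minimal sketch of the coordinator invocation (assuming the port is given
>> on the command line; it matches the --coord-port 7779 that dmtcp_launch
>> reports in the log below):
>>
>> $ dmtcp_coordinator --port 7779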
>>
>> ******
>> $ dmtcp_launch --rm --ib srun --mpi=pmi2 ./wrapper.sh ./bin/lu.A.2
>> [40000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
>> [40000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
>> [40000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under
>> SLURM!'
>> [40000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON=
>> tid_offset: 720
>> [40000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON=
>> pid_offset: 724
>> [42000] TRACE at rm_slurm.cpp:131 in print_args; REASON='Init CMD:'
>> cmdline = /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2
>> ./wrapper.sh ./bin/lu.A.2
>> [42000] TRACE at rm_slurm.cpp:160 in patch_srun_cmdline;
>> REASON='Expand dmtcp_launch path'
>> dmtcpCkptPath = dmtcp_launch
>> [42000] TRACE at rm_slurm.cpp:253 in execve; REASON='How command
>> looks from exec*:'
>> [42000] TRACE at rm_slurm.cpp:254 in execve; REASON='CMD:'
>> cmdline = dmtcp_srun_helper dmtcp_nocheckpoint
>> /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 dmtcp_launch
>> --coord-host 127.0.0.1 --coord-port 7779 --ckptdir
>> /home/s9951545/dmtcp-app/NPB3.3/NPB3.3-MPI --infiniband
>> --batch-queue --explicit-srun ./wrapper.sh ./bin/lu.A.2
>> [42000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON=
>> tid_offset: 720
>> [42000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON=
>> pid_offset: 724
>> [42000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
>> [42000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
>> [42000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under
>> SLURM!'
>>
>>
>> NAS Parallel Benchmarks 3.3 -- LU Benchmark
>>
>> Size: 64x 64x 64
>> Iterations: 250
>> Number of processes: 2
>>
>> Time step 1
>> Time step 20
>> Time step 40
>> Time step 60
>> [42000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start,
>> internal pmi capable'
>> [40000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start,
>> internal pmi capable'
>> [42000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no
>> sockets left'
>> [40000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no
>> sockets left'
>> [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>> internal pmi capable'
>> [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>> internal pmi capable'
>> Time step 80
>> ******
>>
>> I manage to create a checkpoint, but when I try to restart, the restart
>> script stops at the point shown in the dump below.
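>>
>> For reference, a minimal sketch of how the checkpoint is requested
>> (assuming dmtcp_command is pointed at the coordinator from the log above):
>>
>> $ dmtcp_command --coord-port 7779 --checkpoint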
>>
>> ******
>> $ ./dmtcp_restart_script.sh
>> <SKIPPED>
>> dir = /tmp/dmtcp-s9951545@taurusi4043
>> [45000] TRACE at jfilesystem.cpp:172 in mkdir_r; REASON='Directory
>> already exists'
>> dir = /tmp/dmtcp-s9951545@taurusi4043
>> [45000] WARNING at fileconnlist.cpp:192 in resume;
>> REASON='JWARNING(unlink(missingUnlinkedShmFiles[i].name) != -1)
>> failed'
>> missingUnlinkedShmFiles[i].name =
>> /dev/shm/cm_shmem-1003236.42-taurusi4043-1074916.tmp
>> (strerror((*__errno_location ()))) = No such file or directory
>> Message: The file was unlinked at the time of checkpoint. Unlinking
>> it after restart failed
>> [42000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>> REASON='Cannot open SLURM environment file. Environment won't be
>> restored!'
>> filename =
>> /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-
>> 42000-5845bc0a
>> [44000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>> REASON='Cannot open SLURM environment file. Environment won't be
>> restored!'
>> filename =
>> /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-
>> 44000-323cf5bc0749
>> [40000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>> REASON='Cannot open SLURM environment file. Environment won't be
>> restored!'
>> filename =
>> /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-
>> 40000-323cd8b79b6f
>> [45000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>> REASON='Cannot open SLURM environment file. Environment won't be
>> restored!'
>> filename =
>> /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-
>> 45000-5845bc0a
>> [44000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>> internal pmi capable'
>> [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>> internal pmi capable'
>> [42000] TRACE at rm_slurm.cpp:522 in slurmRestoreHelper;
>> REASON='This is srun helper. Restore it'
>> [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>> internal pmi capable'
>> [45000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>> internal pmi capable'
>> lu.A.2: ibvctx.c:273: query_qp_info: Assertion `size ==
>> sizeof(ibv_qp_id_t)' failed.
>> ******
>>
>>
>> Before starting, I set the following environment variables for MVAPICH:
>>
>> export MV2_USE_SHARED_MEM=0 # This one is probably the most relevant
>> export MV2_USE_BLOCKING=0
>> export MV2_ENABLE_AFFINITY=0
>> export MV2_RDMA_NUM_EXTRA_POLLS=1
>> export MV2_CM_MAX_SPIN_COUNT=1
>> export MV2_SPIN_COUNT=100
>> export MV2_DEBUG_SHOW_BACKTRACE=1
>> export MV2_DEBUG_CORESIZE=unlimited
>>
>>
>>
>> --
>> Regards,
>> Maksym Planeta
>>
>>
> --
> Regards,
> Maksym Planeta
>
>
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum