Which DMTCP version are you using? Could you please try the following patch?

https://github.com/jiajuncao/dmtcp/commit/8d693636e4a0fce87fb4d96e685e4336831d50ea
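
If you build DMTCP from a git checkout, one way to pull that commit in
(a sketch; adjust the remote name and the rebuild/install step to however
you built DMTCP before):

  # inside your DMTCP source tree
  git remote add jiajuncao https://github.com/jiajuncao/dmtcp.git
  git fetch jiajuncao
  git cherry-pick 8d693636e4a0fce87fb4d96e685e4336831d50ea
  # rebuild and reinstall with the same configure options as before
  make && make install

Alternatively, appending .patch to the commit URL above gives a patch file
that can be applied with git am.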

Best,
Jiajun

On Mon, Dec 5, 2016 at 5:52 PM, Maksym Planeta
<mplan...@os.inf.tu-dresden.de> wrote:

> I was running the application inside an interactive job allocation. One
> shell was running the coordinator, another was launching the application.
>
> Both shells were inside the same working directory.
>
> Normally I use mpirun_rsh to launch applications. If I use srun, I
> additionally have to pass --mpi=pmi2.
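>
> (Roughly, the two launch forms are, with the process count and host file
> as placeholders:
>
> mpirun_rsh -np 2 -hostfile ./hosts ./bin/lu.A.2
> srun --mpi=pmi2 ./bin/lu.A.2
>
> For the DMTCP runs quoted below I wrap the srun form with dmtcp_launch.)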
>
> MVAPICH is configured for mpirun.
>
> $ srun --version
> slurm 16.05.5-Bull.1.1-20161010-0700
>
>
> $ mpiname -a
> MVAPICH2 2.2 Thu Sep 08 22:00:00 EST 2016 ch3:mrail
>
> Compilation
> CC: gcc    -g -O0
> CXX: g++   -g -O0
> F77: gfortran -L/lib -L/lib   -g -O0
> FC: gfortran   -g -O0
>
> Configuration
> --enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo
> --enable-mpit-pvars=all --enable-check-compiler-flags
> --enable-threads=multiple --enable-weak-symbols
> --disable-dependency-tracking --enable-fast-install --disable-rdma-cm
> --with-pm=mpirun:hydra --with-rdma=gen2 --with-device=ch3:mrail
> --enable-alloca --enable-hwloc --disable-fast --enable-g=dbg
> --enable-error-messages=all --enable-error-checking=all --prefix=<dir>
>
>
> On 12/05/2016 11:39 PM, Jiajun Cao wrote:
>
>> Hi Maksym,
>>
>> Thanks for writing to us. Can you provide the following info:
>>
>> DMTCP version, Slurm version, and MVAPICH2 version. Also, is MVAPICH2
>> configured with srun as the process launcher?
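>>
>> (For example, the first three can be read off with:
>>
>>     dmtcp_launch --version
>>     srun --version
>>     mpiname -a
>>
>> and the launcher question from MVAPICH2's configure line, e.g. whether it
>> was built with --with-pm=slurm --with-pmi=pmi2.)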
>>
>> Also, how did you run the jobs? Did you do it by submitting scripts or
>> by running interactive jobs?
>>
>>
>> Best,
>> Jiajun
>>
>> On Mon, Dec 5, 2016 at 2:21 PM, Maksym Planeta
>> <mplan...@os.inf.tu-dresden.de> wrote:
>>
>>     Dear DMTCP developers,
>>
>>     I'm trying to set up checkpoint/restart of MPI applications using
>>     MVAPICH.
>>
>>     I tried several options to launch DMTCP with MVAPICH, but none
>>     succeeded.
>>
>>     I use ****** markers around lengthy dumps of debugging information.
>>
>>     Below I show my most successful attempt; I can report the results of
>>     other attempts on request.
>>
>>     In the end the restart script seems to complain about a shared memory
>>     file that it can't open. Could you tell me how I can work around
>>     this issue?
>>
>>
>>     First I launch dmtcp_coordinator in a separate window, then I start
>>     the application as follows:
>>
>>     ******
>>     $ dmtcp_launch --rm --ib srun --mpi=pmi2    ./wrapper.sh  ./bin/lu.A.2
>>     [40000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
>>     [40000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
>>     [40000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under
>>     SLURM!'
>>     [40000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON=
>>     tid_offset: 720
>>     [40000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON=
>>     pid_offset: 724
>>     [42000] TRACE at rm_slurm.cpp:131 in print_args; REASON='Init CMD:'
>>          cmdline = /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2
>>     ./wrapper.sh ./bin/lu.A.2
>>     [42000] TRACE at rm_slurm.cpp:160 in patch_srun_cmdline;
>>     REASON='Expand dmtcp_launch path'
>>          dmtcpCkptPath = dmtcp_launch
>>     [42000] TRACE at rm_slurm.cpp:253 in execve; REASON='How command
>>     looks from exec*:'
>>     [42000] TRACE at rm_slurm.cpp:254 in execve; REASON='CMD:'
>>          cmdline = dmtcp_srun_helper dmtcp_nocheckpoint
>>     /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 dmtcp_launch
>>     --coord-host 127.0.0.1 --coord-port 7779 --ckptdir
>>     /home/s9951545/dmtcp-app/NPB3.3/NPB3.3-MPI --infiniband
>>     --batch-queue --explicit-srun ./wrapper.sh ./bin/lu.A.2
>>     [42000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON=
>>     tid_offset: 720
>>     [42000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON=
>>     pid_offset: 724
>>     [42000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
>>     [42000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
>>     [42000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under
>>     SLURM!'
>>
>>
>>      NAS Parallel Benchmarks 3.3 -- LU Benchmark
>>
>>      Size:   64x  64x  64
>>      Iterations:  250
>>      Number of processes:     2
>>
>>      Time step    1
>>      Time step   20
>>      Time step   40
>>      Time step   60
>>     [42000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start,
>>     internal pmi capable'
>>     [40000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start,
>>     internal pmi capable'
>>     [42000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no
>>     sockets left'
>>     [40000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no
>>     sockets left'
>>     [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>>     internal pmi capable'
>>     [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>>     internal pmi capable'
>>      Time step   80
>>     ******
>>
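>>     (For reference, the checkpoint itself can be requested either with
>>     the interactive 'c' command in the dmtcp_coordinator window or from
>>     another shell with dmtcp_command, e.g. against the coordinator
>>     address shown in the launch line above:
>>
>>     dmtcp_command --coord-host 127.0.0.1 --coord-port 7779 --checkpoint)
>>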
>>     I manage to create a checkpoint, but when I try to restart, the
>>     restart script stops at this point:
>>
>>     ******
>>     $ ./dmtcp_restart_script.sh
>>     <SKIPPED>
>>          dir = /tmp/dmtcp-s9951545@taurusi4043
>>     [45000] TRACE at jfilesystem.cpp:172 in mkdir_r; REASON='Directory
>>     already exists'
>>          dir = /tmp/dmtcp-s9951545@taurusi4043
>>     [45000] WARNING at fileconnlist.cpp:192 in resume;
>>     REASON='JWARNING(unlink(missingUnlinkedShmFiles[i].name) != -1)
>> failed'
>>          missingUnlinkedShmFiles[i].name =
>>     /dev/shm/cm_shmem-1003236.42-taurusi4043-1074916.tmp
>>          (strerror((*__errno_location ()))) = No such file or directory
>>     Message: The file was unlinked at the time of checkpoint. Unlinking
>>     it after restart failed
>>     [42000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>>     REASON='Cannot open SLURM environment file. Environment won't be
>>     restored!'
>>          filename =
>>     /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-
>> 42000-5845bc0a
>>     [44000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>>     REASON='Cannot open SLURM environment file. Environment won't be
>>     restored!'
>>          filename =
>>     /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-
>> 44000-323cf5bc0749
>>     [40000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>>     REASON='Cannot open SLURM environment file. Environment won't be
>>     restored!'
>>          filename =
>>     /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-
>> 40000-323cd8b79b6f
>>     [45000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>>     REASON='Cannot open SLURM environment file. Environment won't be
>>     restored!'
>>          filename =
>>     /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-
>> 45000-5845bc0a
>>     [44000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>>     internal pmi capable'
>>     [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>>     internal pmi capable'
>>     [42000] TRACE at rm_slurm.cpp:522 in slurmRestoreHelper;
>>     REASON='This is srun helper. Restore it'
>>     [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>>     internal pmi capable'
>>     [45000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start,
>>     internal pmi capable'
>>     lu.A.2: ibvctx.c:273: query_qp_info: Assertion `size ==
>>     sizeof(ibv_qp_id_t)' failed.
>>     ******
>>
>>
>>     Before starting, I set the following environment variables for MVAPICH:
>>
>>     export MV2_USE_SHARED_MEM=0 # This one is probably the most relevant
>>     export MV2_USE_BLOCKING=0
>>     export MV2_ENABLE_AFFINITY=0
>>     export MV2_RDMA_NUM_EXTRA_POLLS=1
>>     export MV2_CM_MAX_SPIN_COUNT=1
>>     export MV2_SPIN_COUNT=100
>>     export MV2_DEBUG_SHOW_BACKTRACE=1
>>     export MV2_DEBUG_CORESIZE=unlimited
>>
>>
>>
>>     --
>>     Regards,
>>     Maksym Planeta
>>
> --
> Regards,
> Maksym Planeta
>
>