Dear Jiajun,

Thank you for the response.

On 12/08/2016 12:39 AM, Jiajun Cao wrote:
> Which dmtcp version are you using? Could you try the following patch please?
> 
> https://github.com/jiajuncao/dmtcp/commit/8d693636e4a0fce87fb4d96e685e4336831d50ea
> 

Originally I was using dmtcp master (8f3754); then I switched to your branch 
not-connected-qp (8d69363).
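
For reference, I switched to your branch roughly like this (the remote name is 
my own choice, and afterwards I rebuilt DMTCP with the usual ./configure && make):

$ git remote add jiajuncao https://github.com/jiajuncao/dmtcp.git
$ git fetch jiajuncao
$ git checkout -b not-connected-qp jiajuncao/not-connected-qp   # HEAD at 8d69363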

With both versions I kept getting the same error every time:

$ ./dmtcp_restart_script.sh
[45000] WARNING at fileconnlist.cpp:192 in resume; REASON='JWARNING(unlink(missingUnlinkedShmFiles[i].name) != -1) failed'
     missingUnlinkedShmFiles[i].name = /dev/shm/cm_shmem-1014477.5-taurusi5478-1074916.tmp
     (strerror((*__errno_location ()))) = No such file or directory
Message: The file was unlinked at the time of checkpoint. Unlinking it after restart failed
lu.A.2: ibvctx.c:273: query_qp_info: Assertion `size == sizeof(ibv_qp_id_t)' failed.

I also tried switching to tag 2.4.5, both on its own and with the commit you 
suggested cherry-picked on top. In both cases the error was slightly different 
from the first set of trials:

$ ./dmtcp_restart_script.sh 
size = 2
[45000] ERROR at connection.cpp:79 in restoreOptions; REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
     _fds[0] = 3
     _fcntlFlags = 557058
     (strerror((*__errno_location ()))) = Bad file descriptor
lu.A.2 (45000): Terminating...
select failed: Bad file descriptor

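For completeness, the cherry-picked 2.4.5 variant was prepared roughly like this 
(the branch name is just a local label; I rebuilt the same way afterwards):

$ git checkout -b 2.4.5-qp-patch 2.4.5
$ git cherry-pick 8d693636e4a0fce87fb4d96e685e4336831d50ea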

> Best,
> Jiajun
> 
> On Mon, Dec 5, 2016 at 5:52 PM, Maksym Planeta
> <mplan...@os.inf.tu-dresden.de> wrote:
> 
>     I was running the application inside an interactive job allocation. One
>     shell was running the coordinator, another one was launching the
>     application.
> 
>     Both shells were inside the same working directory.
> 
>     Normally I use mpirun_rsh to launch applications. If I use srun, I
>     have to provide --mpi=pmi2 additionally.
> 
>     MVAPICH is configured for mpirun.
> 
>     $ srun --version
>     slurm 16.05.5-Bull.1.1-20161010-0700
> 
> 
>     $ mpiname -a
>     MVAPICH2 2.2 Thu Sep 08 22:00:00 EST 2016 ch3:mrail
> 
>     Compilation
>     CC: gcc    -g -O0
>     CXX: g++   -g -O0
>     F77: gfortran -L/lib -L/lib   -g -O0
>     FC: gfortran   -g -O0
> 
>     Configuration
>     --enable-fortran=all --enable-cxx --enable-timing=none
>     --enable-debuginfo --enable-mpit-pvars=all
>     --enable-check-compiler-flags --enable-threads=multiple
>     --enable-weak-symbols --disable-dependency-tracking
>     --enable-fast-install --disable-rdma-cm --with-pm=mpirun:hydra
>     --with-rdma=gen2 --with-device=ch3:mrail --enable-alloca
>     --enable-hwloc --disable-fast --enable-g=dbg
>     --enable-error-messages=all --enable-error-checking=all --prefix=<dir>
> 
> 
>     On 12/05/2016 11:39 PM, Jiajun Cao wrote:
> 
>         Hi Maksym,
> 
>         Thanks for writing to us. Can you provide the following info:
> 
>         DMTCP version, Slurm version, Mvapich2 version, and is Mvapich2
>         configured with srun as the process launcher?
> 
>         Also, how did you run the jobs? Did you do it by submitting
>         scripts or
>         by running interactive jobs?
> 
> 
>         Best,
>         Jiajun
> 
>         On Mon, Dec 5, 2016 at 2:21 PM, Maksym Planeta
>         <mplan...@os.inf.tu-dresden.de> wrote:
> 
>             Dear DMTCP developers,
> 
>             I'm trying to set up checkpoint/restart of MPI applications using
>             MVAPICH.
> 
>             I tried several options to launch DMTCP with MVAPICH, but none
>             succeeded.
> 
>             I use symbols ****** around lengthy dumps of debugging
>             information.
> 
>             I show my most successful attempt; I can report the results of
>             other attempts on request.
> 
>             In the end the restart script seems to complain about a shared
>             memory file which it can't open. Could you tell me how I can work
>             around this issue?
> 
> 
>             First I launch dmtcp_coordinator in a separate window, then I
>             start the application as follows:
> 
>             ******
>             $ dmtcp_launch --rm --ib srun --mpi=pmi2    ./wrapper.sh 
>         ./bin/lu.A.2
>             [40000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
>             [40000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
>             [40000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We
>         run under
>             SLURM!'
>             [40000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON=
>             tid_offset: 720
>             [40000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON=
>             pid_offset: 724
>             [42000] TRACE at rm_slurm.cpp:131 in print_args;
>         REASON='Init CMD:'
>                  cmdline = /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2
>             ./wrapper.sh ./bin/lu.A.2
>             [42000] TRACE at rm_slurm.cpp:160 in patch_srun_cmdline;
>             REASON='Expand dmtcp_launch path'
>                  dmtcpCkptPath = dmtcp_launch
>             [42000] TRACE at rm_slurm.cpp:253 in execve; REASON='How command
>             looks from exec*:'
>             [42000] TRACE at rm_slurm.cpp:254 in execve; REASON='CMD:'
>                  cmdline = dmtcp_srun_helper dmtcp_nocheckpoint
>             /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 dmtcp_launch
>             --coord-host 127.0.0.1 --coord-port 7779 --ckptdir
>             /home/s9951545/dmtcp-app/NPB3.3/NPB3.3-MPI --infiniband
>             --batch-queue --explicit-srun ./wrapper.sh ./bin/lu.A.2
>             [42000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON=
>             tid_offset: 720
>             [42000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON=
>             pid_offset: 724
>             [42000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
>             [42000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
>             [42000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We
>         run under
>             SLURM!'
> 
> 
>              NAS Parallel Benchmarks 3.3 -- LU Benchmark
> 
>              Size:   64x  64x  64
>              Iterations:  250
>              Number of processes:     2
> 
>              Time step    1
>              Time step   20
>              Time step   40
>              Time step   60
>             [42000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi;
>         REASON='Start,
>             internal pmi capable'
>             [40000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi;
>         REASON='Start,
>             internal pmi capable'
>             [42000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no
>             sockets left'
>             [40000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no
>             sockets left'
>             [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi;
>         REASON='Start,
>             internal pmi capable'
>             [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi;
>         REASON='Start,
>             internal pmi capable'
>              Time step   80
>             ******
> 
>             I manage to create a checkpoint, but when I try to restart, the
>             restart script stops at this point:
> 
>             ******
>             $ ./dmtcp_restart_script.sh
>             <SKIPPED>
>                  dir = /tmp/dmtcp-s9951545@taurusi4043
>             [45000] TRACE at jfilesystem.cpp:172 in mkdir_r;
>         REASON='Directory
>             already exists'
>                  dir = /tmp/dmtcp-s9951545@taurusi4043
>             [45000] WARNING at fileconnlist.cpp:192 in resume;
>             REASON='JWARNING(unlink(missingUnlinkedShmFiles[i].name) !=
>         -1) failed'
>                  missingUnlinkedShmFiles[i].name =
>             /dev/shm/cm_shmem-1003236.42-taurusi4043-1074916.tmp
>                  (strerror((*__errno_location ()))) = No such file or
>         directory
>             Message: The file was unlinked at the time of checkpoint.
>         Unlinking
>             it after restart failed
>             [42000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>             REASON='Cannot open SLURM environment file. Environment won't be
>             restored!'
>                  filename =
>            
>         
> /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-42000-5845bc0a
>             [44000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>             REASON='Cannot open SLURM environment file. Environment won't be
>             restored!'
>                  filename =
>            
>         
> /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-44000-323cf5bc0749
>             [40000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>             REASON='Cannot open SLURM environment file. Environment won't be
>             restored!'
>                  filename =
>            
>         
> /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-40000-323cd8b79b6f
>             [45000] TRACE at rm_slurm.cpp:74 in slurm_restore_env;
>             REASON='Cannot open SLURM environment file. Environment won't be
>             restored!'
>                  filename =
>            
>         
> /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-45000-5845bc0a
>             [44000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi;
>         REASON='Start,
>             internal pmi capable'
>             [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi;
>         REASON='Start,
>             internal pmi capable'
>             [42000] TRACE at rm_slurm.cpp:522 in slurmRestoreHelper;
>             REASON='This is srun helper. Restore it'
>             [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi;
>         REASON='Start,
>             internal pmi capable'
>             [45000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi;
>         REASON='Start,
>             internal pmi capable'
>             lu.A.2: ibvctx.c:273: query_qp_info: Assertion `size ==
>             sizeof(ibv_qp_id_t)' failed.
>             ******
> 
> 
>             Before starting, I set up the following environment variables
>             for MVAPICH:
> 
>             export MV2_USE_SHARED_MEM=0 # This one is probably the most
>         relevant
>             export MV2_USE_BLOCKING=0
>             export MV2_ENABLE_AFFINITY=0
>             export MV2_RDMA_NUM_EXTRA_POLLS=1
>             export MV2_CM_MAX_SPIN_COUNT=1
>             export MV2_SPIN_COUNT=100
>             export MV2_DEBUG_SHOW_BACKTRACE=1
>             export MV2_DEBUG_CORESIZE=unlimited
> 
> 
> 
>             --
>             Regards,
>             Maksym Planeta
> 
> 
>     -- 
>     Regards,
>     Maksym Planeta
> 
> 

-- 
Regards,
Maksym Planeta
