Dear Jiajun, thank you for the response.
On 12/08/2016 12:39 AM, Jiajun Cao wrote:
> Which dmtcp version are you using? Could you try the following patch
> please?
>
> https://github.com/jiajuncao/dmtcp/commit/8d693636e4a0fce87fb4d96e685e4336831d50ea

Originally I was using dmtcp master (8f3754); then I switched to your
branch not-connected-qp (8d69363). I was getting the same error every
time:

./dmtcp_restart_script.sh
[45000] WARNING at fileconnlist.cpp:192 in resume;
REASON='JWARNING(unlink(missingUnlinkedShmFiles[i].name) != -1) failed'
     missingUnlinkedShmFiles[i].name = /dev/shm/cm_shmem-1014477.5-taurusi5478-1074916.tmp
     (strerror((*__errno_location ()))) = No such file or directory
Message: The file was unlinked at the time of checkpoint. Unlinking it after restart failed
lu.A.2: ibvctx.c:273: query_qp_info: Assertion `size == sizeof(ibv_qp_id_t)' failed.

I also tried switching to tag 2.4.5 and then cherry-picking the commit
you suggested. In both cases the error was slightly different from the
first set of trials:

$ ./dmtcp_restart_script.sh
size = 2
[45000] ERROR at connection.cpp:79 in restoreOptions;
REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
     _fds[0] = 3
     _fcntlFlags = 557058
     (strerror((*__errno_location ()))) = Bad file descriptor
lu.A.2 (45000): Terminating...
select failed: Bad file descriptor
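For completeness, these are roughly the commands I used to set up the
two trees (the remote name is arbitrary, and the configure/make steps
are omitted):

$ git clone https://github.com/dmtcp/dmtcp.git && cd dmtcp
$ git remote add jiajun https://github.com/jiajuncao/dmtcp.git
$ git fetch jiajun
# first set of trials: master (8f3754), then your branch
$ git checkout jiajun/not-connected-qp      # HEAD at 8d69363
# second set of trials: the release tag plus your commit
$ git checkout 2.4.5
$ git cherry-pick 8d693636e4a0fce87fb4d96e685e4336831d50ea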
> Best,
> Jiajun
>
> On Mon, Dec 5, 2016 at 5:52 PM, Maksym Planeta
> <mplan...@os.inf.tu-dresden.de> wrote:
>
>> I was running the application inside an interactive job allocation.
>> One shell was running the coordinator, another one was launching the
>> application.
>>
>> Both shells were inside the same working directory.
>>
>> Normally I use mpirun_rsh to launch applications. If I use srun, I
>> have to provide --mpi=pmi2 additionally.
>>
>> MVAPICH is configured for mpirun.
>>
>> $ srun --version
>> slurm 16.05.5-Bull.1.1-20161010-0700
>>
>> $ mpiname -a
>> MVAPICH2 2.2 Thu Sep 08 22:00:00 EST 2016 ch3:mrail
>>
>> Compilation
>> CC: gcc -g -O0
>> CXX: g++ -g -O0
>> F77: gfortran -L/lib -L/lib -g -O0
>> FC: gfortran -g -O0
>>
>> Configuration
>> --enable-fortran=all --enable-cxx --enable-timing=none
>> --enable-debuginfo --enable-mpit-pvars=all
>> --enable-check-compiler-flags --enable-threads=multiple
>> --enable-weak-symbols --disable-dependency-tracking
>> --enable-fast-install --disable-rdma-cm --with-pm=mpirun:hydra
>> --with-rdma=gen2 --with-device=ch3:mrail --enable-alloca
>> --enable-hwloc --disable-fast --enable-g=dbg
>> --enable-error-messages=all --enable-error-checking=all --prefix=<dir>
>>
>> On 12/05/2016 11:39 PM, Jiajun Cao wrote:
>>> Hi Maksym,
>>>
>>> Thanks for writing to us. Can you provide the following info:
>>> DMTCP version, Slurm version, Mvapich2 version, and is Mvapich2
>>> configured with srun as the process launcher?
>>>
>>> Also, how did you run the jobs? Did you do it by submitting scripts
>>> or by running interactive jobs?
>>>
>>> Best,
>>> Jiajun
>>>
>>> On Mon, Dec 5, 2016 at 2:21 PM, Maksym Planeta
>>> <mplan...@os.inf.tu-dresden.de> wrote:
>>>
>>>> Dear DMTCP developers,
>>>>
>>>> I'm trying to set up checkpoint/restart of MPI applications using
>>>> MVAPICH.
>>>>
>>>> I tried several options to launch DMTCP with MVAPICH, but none
>>>> succeeded.
>>>>
>>>> I use the marker ****** around lengthy dumps of debugging
>>>> information. I show my most successful attempt below; I can report
>>>> the results of the other attempts on request.
>>>>
>>>> In the end, the restart script seems to complain about a shared
>>>> memory file that it cannot open. Could you tell me how I can work
>>>> around this issue?
>>>>
>>>> First I launch dmtcp_coordinator in a separate window, then I start
>>>> the application as follows:
>>>>
>>>> ******
>>>> $ dmtcp_launch --rm --ib srun --mpi=pmi2 ./wrapper.sh ./bin/lu.A.2
>>>> [40000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
>>>> [40000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
>>>> [40000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under SLURM!'
>>>> [40000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON= tid_offset: 720
>>>> [40000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON= pid_offset: 724
>>>> [42000] TRACE at rm_slurm.cpp:131 in print_args; REASON='Init CMD:'
>>>>      cmdline = /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 ./wrapper.sh ./bin/lu.A.2
>>>> [42000] TRACE at rm_slurm.cpp:160 in patch_srun_cmdline; REASON='Expand dmtcp_launch path'
>>>>      dmtcpCkptPath = dmtcp_launch
>>>> [42000] TRACE at rm_slurm.cpp:253 in execve; REASON='How command looks from exec*:'
>>>> [42000] TRACE at rm_slurm.cpp:254 in execve; REASON='CMD:'
>>>>      cmdline = dmtcp_srun_helper dmtcp_nocheckpoint /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 dmtcp_launch --coord-host 127.0.0.1 --coord-port 7779 --ckptdir /home/s9951545/dmtcp-app/NPB3.3/NPB3.3-MPI --infiniband --batch-queue --explicit-srun ./wrapper.sh ./bin/lu.A.2
>>>> [42000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON= tid_offset: 720
>>>> [42000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON= pid_offset: 724
>>>> [42000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
>>>> [42000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
>>>> [42000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under SLURM!'
>>>>
>>>> NAS Parallel Benchmarks 3.3 -- LU Benchmark
>>>>
>>>> Size: 64x 64x 64
>>>> Iterations: 250
>>>> Number of processes: 2
>>>>
>>>> Time step 1
>>>> Time step 20
>>>> Time step 40
>>>> Time step 60
>>>> [42000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start, internal pmi capable'
>>>> [40000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start, internal pmi capable'
>>>> [42000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no sockets left'
>>>> [40000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no sockets left'
>>>> [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>>>> [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>>>> Time step 80
>>>> ******
>>>>
>>>> I manage to create a checkpoint, but when I try to restart, the
>>>> restart script stops at this point:
>>>>
>>>> ******
>>>> $ ./dmtcp_restart_script.sh
>>>> <SKIPPED>
>>>>      dir = /tmp/dmtcp-s9951545@taurusi4043
>>>> [45000] TRACE at jfilesystem.cpp:172 in mkdir_r; REASON='Directory already exists'
>>>>      dir = /tmp/dmtcp-s9951545@taurusi4043
>>>> [45000] WARNING at fileconnlist.cpp:192 in resume; REASON='JWARNING(unlink(missingUnlinkedShmFiles[i].name) != -1) failed'
>>>>      missingUnlinkedShmFiles[i].name = /dev/shm/cm_shmem-1003236.42-taurusi4043-1074916.tmp
>>>>      (strerror((*__errno_location ()))) = No such file or directory
>>>> Message: The file was unlinked at the time of checkpoint. Unlinking it after restart failed
>>>> [42000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
>>>>      filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-42000-5845bc0a
>>>> [44000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
>>>>      filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-44000-323cf5bc0749
>>>> [40000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
>>>>      filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-40000-323cd8b79b6f
>>>> [45000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
>>>>      filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-45000-5845bc0a
>>>> [44000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>>>> [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>>>> [42000] TRACE at rm_slurm.cpp:522 in slurmRestoreHelper; REASON='This is srun helper. Restore it'
>>>> [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>>>> [45000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>>>> lu.A.2: ibvctx.c:273: query_qp_info: Assertion `size == sizeof(ibv_qp_id_t)' failed.
>>>> ******
>>>>
>>>> Before starting, I set up the following environment variables for
>>>> MVAPICH:
>>>>
>>>> export MV2_USE_SHARED_MEM=0 # This one is probably the most relevant
>>>> export MV2_USE_BLOCKING=0
>>>> export MV2_ENABLE_AFFINITY=0
>>>> export MV2_RDMA_NUM_EXTRA_POLLS=1
>>>> export MV2_CM_MAX_SPIN_COUNT=1
>>>> export MV2_SPIN_COUNT=100
>>>> export MV2_DEBUG_SHOW_BACKTRACE=1
>>>> export MV2_DEBUG_CORESIZE=unlimited
>>>>
>>>> --
>>>> Regards,
>>>> Maksym Planeta
>>
>> --
>> Regards,
>> Maksym Planeta

--
Regards,
Maksym Planeta
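P.S. If it would help the diagnosis, I can rerun the failing restart
under strace to see exactly which descriptor the fcntl() in
restoreOptions and the subsequent select() are failing on, and check
whether MVAPICH's cm_shmem segment is present on the restart node.
Something like:

$ strace -ff -e trace=fcntl,select,open,close -o restart-trace ./dmtcp_restart_script.sh
$ ls -l /dev/shm/cm_shmem-*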