Hi Team,

My team and I are using DMTCP to checkpoint our applications, and we have run into some issues. Checkpointing applications that run locally works without problems. On clusters, however, we hit several problems that we have not yet been able to solve:

First of all, here is our environment:
- We have a sample application that loops indefinitely: on each iteration it increments a counter, sleeps one second, and displays the value (nothing complex).
- We use DMTCP 2.0 or 2.1, depending on the cluster (most of our DMTCP tests use DMTCP 2.1).
- We use SLURM as the job manager to submit our jobs to the clusters (salloc, sbatch, srun and so on).
- We have two terminals, one running dmtcp_coordinator and the other launching the command.
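
For reference, the sample application behaves roughly like the sketch below. This is only a Python stand-in for the compiled ./a.out described above; the function name run_counter and its steps and delay parameters are hypothetical and exist only so the loop can be bounded when testing.

```python
import time


def run_counter(steps=None, delay=1.0):
    """Increment a counter, sleep, then display it on every iteration.

    Hypothetical stand-in for ./a.out. With steps=None the loop never ends,
    which matches the application described above; a finite steps value
    stops the loop after that many iterations.
    """
    counter = 0
    while steps is None or counter < steps:
        counter += 1              # increment the counter
        time.sleep(delay)         # sleep (one second by default)
        print(counter, flush=True)  # display the current value
    return counter
```

Calling run_counter() with no arguments loops indefinitely, printing one value per second, which is the kind of long-running process we checkpoint.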

 * The first, minor SLURM issue is how DMTCP parses the srun command
   line when the exec() functions are overridden. Only long-format
   options are detected, not short-format ones. Thus, when we use
   "srun -N 1 ./a.out", DMTCP takes "1" to be the application name and
   we get the command "srun -N dmtcp_launch <options> 1 ./a.out". (It
   is not a big deal, but it is good to know before using it.)
 * The second issue is how the SLURM plugin is loaded. DMTCP checks
   whether some SLURM environment variables are set before loading it.
   The problem appears when we use DMTCP to launch srun without
   already being inside a SLURM environment: the SLURM plugin is not
   loaded, and DMTCP is unable to checkpoint applications launched
   through the job manager. Instead of calling srun directly, we
   currently use sbatch, as described in your documentation. Would it
   be possible to use salloc instead of sbatch (in order to keep
   interactive mode)? Moreover, when we attempt to launch jobs like
   "salloc -N 1 dmtcp_launch <options> srun --nodes=1 ./a.out", we get
   the following error:

       [46000] ERROR at fileconnlist.cpp:363 in processFileConnection;
             path = /proc/self/fd/socket:[132529151]
       Message: Unimplemented file type.
       tmp (46000): Terminating...

 * Finally, with SLURM, we launch our job through the following chain:
     o sbatch
         + dmtcp_launch <options>
             # myMainScript.sh
                 * srun <options>
                     o ./a.out
     o When we do it this way, checkpointing seems to work (even with
       --enable-debug, there are no particular warnings), but on
       restart we get the following output (and the application stops):

       [45000] TRACE at pid.cpp:121 in openSharedFile;
       REASON='_real_open: '
             strerror((*__errno_location ())) = File exists
             fd = -1
       [45000] ERROR at pid.cpp:130 in openSharedFile;
       REASON='JASSERT(false) failed'
       name = /tmp/dmtcp-login@clusterNode5/dmtcpPidMap.57d889deebbd7d0c-45000-53187ae9.53187b323
       strerror((*__errno_location ())) = Bad file descriptor
       Message: Cannot open file
       bash (45000): Terminating...
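
To illustrate the first point (short-format srun options), here is a minimal sketch of the behaviour we expected when the launcher is inserted into an srun command line. It is not DMTCP's actual code: the OPTS_WITH_VALUE table is an illustrative subset of srun options that take a separate value, and the names find_program_index and wrap_with_launcher are hypothetical.

```python
# Illustrative subset of srun options that consume the NEXT argument as
# their value. This table is an assumption for the sketch, not the list
# DMTCP uses internally.
OPTS_WITH_VALUE = {
    "-N", "--nodes",
    "-n", "--ntasks",
    "-c", "--cpus-per-task",
    "-p", "--partition",
    "-t", "--time",
}


def find_program_index(argv):
    """Return the index of the first argument that is the program to run.

    Returns len(argv) when no program is found.
    """
    i = 1  # skip argv[0], which is "srun" itself
    while i < len(argv):
        arg = argv[i]
        if arg == "--":
            return i + 1          # everything after "--" is the program
        if not arg.startswith("-"):
            return i              # first non-option argument
        if arg in OPTS_WITH_VALUE:
            i += 2                # skip the option and its separate value
        else:
            i += 1                # plain flag, "--opt=value" or "-N1"
    return len(argv)


def wrap_with_launcher(argv, launcher):
    """Insert the launcher words just before the program name."""
    idx = find_program_index(argv)
    if idx >= len(argv):
        return list(argv)         # no program found: leave unchanged
    return argv[:idx] + list(launcher) + argv[idx:]
```

With this sketch, wrap_with_launcher(["srun", "-N", "1", "./a.out"], ["dmtcp_launch"]) keeps "1" attached to "-N" and places the launcher directly before ./a.out, instead of treating "1" as the application name.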

If you have no immediate ideas, we can provide you with complete logs and backtraces.

Thanks in advance for your help, and congratulations on what you have achieved so far :)
Regards,

--

*Julien Adam*
Information Systems Engineering student

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum