Hi Husen,

I'll start with some basic questions about your setup.

 - Could you share your launch scripts for MPICH and Open MPI with us?
 - What DMTCP version are you using?
 - Do you have InfiniBand on your setup? If yes, then you'd need to
   configure DMTCP with InfiniBand support
   (`./configure --enable-infiniband-support`) and use the `--ib` flag
   with dmtcp_launch, as sketched below.

Next, you wrote:

 > I have tried to use --rm in mpich-3.2, and it doesn't work. mpich-3.2
 > doesn't recognize --rm option.

The `--rm` flag is a `dmtcp_launch` option, not an MPICH option.
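
In other words, it goes on the dmtcp_launch command line, before the
MPI launcher. Reusing the mpiexec line from your sbatch script:

    # --rm is consumed by dmtcp_launch; mpiexec/MPICH never sees it:
    dmtcp_launch --rm mpiexec ./mm.o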

You seem to be seeing two kinds of warnings:

 a) "Still draining socket... perhaps remote host is not running under
     DMTCP"; and
 b) "Datagram Sockets not supported. Hopefully, this is a short lived
     connection".

The first one indicates that there are sockets in your process
connected to peers that are not running under DMTCP. I think this
could be specific to your SLURM/MPI setup.

The second warning could imply many different things. I haven't
usually seen MPI implementations use datagram sockets, and datagram
sockets are not supported in DMTCP out of the box. Is your application
using them? Are you trying to checkpoint a GUI-based application?
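
One way to check (a hedged suggestion; the exact output varies by
system) is to list the processes holding datagram sockets while the
job is running:

    # Internet (UDP) datagram sockets, with owning processes:
    lsof -i UDP

    # Unix-domain sockets (GUI apps often use these); datagram ones
    # show up as u_dgr:
    ss -xap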

In either case, the warnings are not fatal, or at least not
immediately fatal. However, they could lead to other issues that
surface at restart time.

Moving forward ...

I think the first thing you need to do is to verify whether the
checkpoint was "successful".

If the checkpoint was "successful", you should see checkpoint images
corresponding to each MPI rank, i.e., there should be one checkpoint
image (a *.dmtcp file) per MPI process. Do you see that? Do you see a
restart script?
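
For example, with your sbatch settings (2 nodes x 4 tasks per node =
8 ranks), a quick check could look like the following; the file names
assume DMTCP's default naming, so adjust if yours differ:

    # Expect one checkpoint image per MPI rank -- 8 in this case:
    ls ckpt_*.dmtcp | wc -l

    # And a generated restart script:
    ls dmtcp_restart_script*.sh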

The next step would be to verify the restart.

The restart script is a little tricky and might need some modifications
depending on your setup. In other words, don't rely on it to work
out-of-the-box. You could try to restart the computation manually
to isolate the issue. Here's how I would do it:

 - Allocate N interactive nodes. N could be 1 or more; it's easier to
   debug with 1 node, assuming you have enough RAM on the node.
 - Start dmtcp_coordinator: you could start it on the head node or on
   one of the allocated compute nodes.
 - ssh to an allocated node, and manually execute the restart
   commands:

     dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image1.dmtcp
     dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image2.dmtcp
     ...

   The only thing you need to ensure when doing this manually is that
   the MPI ranks that were sharing a node prior to checkpointing are
   restarted together on the same node. This is because the MPI
   processes might be using (SysV) shared memory for intra-node
   communication; on restart, DMTCP will try to restore the
   shared-memory region and will fail if those processes end up on
   different nodes. (A combined sketch follows after this list.)
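
Putting the list above together, here is a rough end-to-end sketch.
The hostnames, paths, and port file are placeholders, and I'm assuming
salloc for the interactive allocation. Note that dmtcp_restart also
accepts several images in one invocation, which is convenient when
restarting all ranks of one node together:

    # 1. Allocate an interactive node (N=1 is easiest to debug):
    salloc -N 1

    # 2. Start the coordinator, e.g. on the head node; -p 0 picks a
    #    free port and --port-file records it:
    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file /tmp/port
    p=$(cat /tmp/port)

    # 3. On the allocated node, restart all images that shared a node
    #    before the checkpoint:
    ssh <compute-node>
    cd /path/to/checkpoint/images      # placeholder path
    dmtcp_restart -h <coord-host> -p $p ckpt_image*.dmtcp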

Finally, I think what you are seeing is due to some configuration
issue. We have tested with different MPI implementations recently,
and checkpoint/restart works. I could be wrong, though.

Would it be possible for you to give us a guest account for debugging
on your setup? That would be the most efficient way of resolving this.

-Rohan

On Fri, May 20, 2016 at 06:32:09PM +0700, Husen R wrote:
> Hi Gene,
> 
> Thank you for your reply!
> 
> I have tried to use --rm in mpich-3.2, and it doesn't work. mpich-3.2
> doesn't recognize --rm option.
> I don't know exactly, what's the difference between mpich-3.2 and mpich2 ?
> 
> recently I tried to use openmpi-1.6 to checkpoint mpi application using
> dmtcp and slurm.
> but I got the following error :
> 
> [40000] WARNING at socketconnection.cpp:187 in TcpConnection;
> REASON='JWARNING(false) failed'
>      type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived
> connection!
> [46000] WARNING at socketconnection.cpp:187 in TcpConnection;
> REASON='JWARNING(false) failed'
>      type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived
> connection!
> [50000] WARNING at socketconnection.cpp:187 in TcpConnection;
> REASON='JWARNING(false) failed'
>      type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived
> connection!
> [45000] WARNING at socketconnection.cpp:187 in TcpConnection;
> REASON='JWARNING(false) failed'
>      type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived
> connection!
> [48000] WARNING at socketconnection.cpp:187 in TcpConnection;
> REASON='JWARNING(false) failed'
>      type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived
> connection!
> ...
> ...
> ...
> [41000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
> REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>      _magicBits =
> Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator
> die uncleanly?
> dmtcp_srun_helper (41000): Terminating...
> 
> 
> 
> in addition, the slurm_restart.job seems not working at all.
> I need help.
> Thank you in advance,
> 
> 
> Regards,
> 
> Husen
> 
> On Fri, May 20, 2016 at 5:36 PM, Gene Cooperman <g...@ccs.neu.edu> wrote:
> 
> > Hi William and Husen,
> >     As far as I know, the combination "--rm --ib" should work with
> > the major MPI implementations:  Open MPI, MVAPICH2, Intel MPI, MPICH2.
> > But I'm not sure which ones we've tested with very recently.
> > I'm pretty sure that we've used MVAPICH2 and Open MPI in this way.
> >
> > Jiajun and Rohan,
> >     Could you confirm which implementations you've used _with the
> > "--rm --ib" combination_?  If it's not working with one of the
> > major MPI implementations, we need to fix that.
> >
> > Thanks,
> > - Gene
> >
> > On Thu, May 19, 2016 at 03:42:06PM -0700, William Fox wrote:
> > > At least for me (I am not a developer for dmtcp) I was forced to
> > > switch to openmpi (version 1.6 specifically) in order to get --rm
> > > to work correctly.
> > > What version of mpi are you running? In addition, if you are using
> > > infiniband, --ib will need to be installed and utilized in order to
> > > accomplish a restart.
> > >
> > > On Wed, May 18, 2016 at 1:15 AM, Husen R <hus...@gmail.com> wrote:
> > >
> > > > dear all,
> > > >
> > > > I have tried to checkpoint mpi application using dmtcp but I
> > > > failed with the error message as follows :
> > > >
> > > >
> > > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
> > > > REASON='JWARNING(false) failed'
> > > >      _dataSockets[i]->socket().sockfd() = 9
> > > >      buffer.size() = 0
> > > >      WARN_INTERVAL_SEC = 10
> > > > Message: Still draining socket... perhaps remote host is not
> > > > running under DMTCP?
> > > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
> > > > REASON='JWARNING(false) failed'
> > > >      _dataSockets[i]->socket().sockfd() = 7
> > > >      buffer.size() = 0
> > > >      WARN_INTERVAL_SEC = 10
> > > > Message: Still draining socket... perhaps remote host is not
> > > > running under DMTCP?
> > > > ......
> > > > ......
> > > > ......
> > > >
> > > > I use this sbatch script to submit job :
> > > >
> > > > #####################################SBATCH###########################
> > > > #!/bin/bash
> > > > # Put your SLURM options here
> > > > #SBATCH --partition=comeon
> > > > #SBATCH --time=01:15:00
> > > > #SBATCH --nodes=2
> > > > #SBATCH --ntasks-per-node=4
> > > > #SBATCH --job-name="dmtcp_job"
> > > > #SBATCH --output=dmtcp_ckpt_img/dmtcp-%j.out
> > > >
> > > > start_coordinator()
> > > > {
> > > >
> > > >     fname=dmtcp_command.$SLURM_JOBID
> > > >     h=$(hostname)
> > > >     check_coordinator=$(which dmtcp_coordinator)
> > > >
> > > >     if [ -z "$check_coordinator" ]; then
> > > >         echo "No dmtcp_coordinator found. Check your DMTCP installation
> > > > and PATH settings."
> > > >         exit 0
> > > >     fi
> > > >
> > > >     dmtcp_coordinator --daemon --exit-on-last -p 0 \
> > > >         --port-file $fname $@ 1>/dev/null 2>&1
> > > >
> > > >     p=`cat $fname`
> > > >     chmod +x $fname
> > > >     echo "#!/bin/bash" > $fname
> > > >     echo >> $fname
> > > >     echo "export PATH=$PATH" >> $fname
> > > >     echo "export DMTCP_COORD_HOST=$h" >> $fname
> > > >     echo "export DMTCP_COORD_PORT=$p" >> $fname
> > > >     echo "dmtcp_command \$@" >> $fname
> > > >
> > > >     # Set up local environment for DMTCP
> > > >     export DMTCP_COORD_HOST=$h
> > > >     export DMTCP_COORD_PORT=$p
> > > > }
> > > >
> > > > cd $SLURM_SUBMIT_DIR
> > > > start_coordinator -i 240
> > > > dmtcp_launch -h $h -p $p mpiexec ./mm.o
> > > >
> > > >
> > > > #########################################################################
> > > >
> > > > I also have tried using --rm option in dmtcp_launch but it doesn't work
> > > > and no output at all.
> > > >
> > > > anybody tell me how to solve this please ? I need help
> > > >
> > > >
> > > > Regards,
> > > >
> > > >
> > > >
> > > > Husen
> > > >
> > >
> > >
> > > --
> > > William Fox
> > >
> > > Lawrence Berkeley National Laboratory
> > > Computational Research Division
> >
