Hi Gene,
Thank you for your reply!
I tried using --rm with mpich-3.2, but it doesn't work: mpich-3.2
doesn't recognize the --rm option.
I don't know exactly what the difference is between mpich-3.2 and MPICH2.
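
One possible reading of the "doesn't recognize --rm" symptom is that the flag reached mpiexec instead of dmtcp_launch. --rm and --ib are dmtcp_launch options, so they belong before the MPI launcher on the command line. The sketch below is only a dry run under that assumption: build_launch_cmd and the USE_IB switch are hypothetical names, node01 and 7779 are placeholder coordinator values, and the command is printed rather than executed.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) a dmtcp_launch command line.
# DMTCP flags go before the MPI launcher; everything after the
# coordinator host and port is passed through as the MPI command.
build_launch_cmd() {
    coord_host=$1
    coord_port=$2
    shift 2
    dmtcp_flags="--rm"
    # Add --ib only on InfiniBand clusters (USE_IB is an assumed switch).
    if [ "${USE_IB:-0}" = "1" ]; then
        dmtcp_flags="$dmtcp_flags --ib"
    fi
    echo "dmtcp_launch $dmtcp_flags -h $coord_host -p $coord_port $*"
}

# Dry run with placeholder coordinator values.
build_launch_cmd node01 7779 mpiexec ./mm.o
```

Printing the command first makes it easy to confirm in the job output which program each flag is attached to before running it for real.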
Recently I tried openmpi-1.6 to checkpoint an MPI application using
DMTCP and Slurm,
but I got the following errors:
[40000] WARNING at socketconnection.cpp:187 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived
connection!
[46000] WARNING at socketconnection.cpp:187 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived
connection!
[50000] WARNING at socketconnection.cpp:187 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived
connection!
[45000] WARNING at socketconnection.cpp:187 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived
connection!
[48000] WARNING at socketconnection.cpp:187 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived
connection!
...
...
...
[41000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
_magicBits =
Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator
die uncleanly?
dmtcp_srun_helper (41000): Terminating...
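
The last message asks whether the coordinator died. One cheap thing to rule out before dmtcp_launch runs is a bad port file, since the batch script reads the coordinator port from the file given to --port-file. The sketch below assumes that file holds a plain port number on its first line; read_coord_port is a hypothetical helper, and it does not diagnose the _magicBits mismatch itself.

```shell
#!/bin/sh
# Hypothetical helper: read the coordinator port from a port file and
# fail with a message if the file is missing, empty, or not a port number.
read_coord_port() {
    port_file=$1
    if [ ! -s "$port_file" ]; then
        echo "port file '$port_file' is missing or empty" >&2
        return 1
    fi
    # Take the first line and strip any whitespace around the number.
    port=$(head -n 1 "$port_file" | tr -d '[:space:]')
    case $port in
        ''|*[!0-9]*)
            echo "port file '$port_file' does not contain a number" >&2
            return 1
            ;;
    esac
    if [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
        echo "port $port is out of range" >&2
        return 1
    fi
    echo "$port"
}

# Possible use inside the batch script (fname as in the original script):
#   p=$(read_coord_port "$fname") || exit 1
```

Failing early here keeps a missing or malformed port from silently reaching DMTCP_COORD_PORT and dmtcp_launch.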
In addition, slurm_restart.job does not seem to work at all.
I need help.
Thank you in advance,
Regards,
Husen
On Fri, May 20, 2016 at 5:36 PM, Gene Cooperman <g...@ccs.neu.edu> wrote:
> Hi William and Husen,
> As far as I know, the combination "--rm --ib" should work with
> the major MPI implementations: Open MPI, MVAPICH2, Intel MPI, MPICH2.
> But I'm not sure which ones we've tested with very recently.
> I'm pretty sure that we've used MVAPICH2 and Open MPI in this way.
>
> Jiajun and Rohan,
> Could you confirm which implementations you've used _with the
> "--rm --ib" combination_? If it's not working with one of the
> major MPI implementations, we need to fix that.
>
> Thanks,
> - Gene
>
> On Thu, May 19, 2016 at 03:42:06PM -0700, William Fox wrote:
> > At least for me (I am not a developer for DMTCP), I was forced to switch
> > to Open MPI (version 1.6 specifically) in order to get --rm to work correctly.
> > Which version of MPI are you running? In addition, if you are using
> > InfiniBand, --ib will need to be installed and utilized in order to
> > accomplish a restart.
> >
> > On Wed, May 18, 2016 at 1:15 AM, Husen R <hus...@gmail.com> wrote:
> >
> > > Dear all,
> > >
> > > I tried to checkpoint an MPI application using DMTCP, but it failed
> > > with the following error messages:
> > >
> > >
> > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
> > > REASON='JWARNING(false) failed'
> > > _dataSockets[i]->socket().sockfd() = 9
> > > buffer.size() = 0
> > > WARN_INTERVAL_SEC = 10
> > > Message: Still draining socket... perhaps remote host is not running under DMTCP?
> > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
> > > REASON='JWARNING(false) failed'
> > > _dataSockets[i]->socket().sockfd() = 7
> > > buffer.size() = 0
> > > WARN_INTERVAL_SEC = 10
> > > Message: Still draining socket... perhaps remote host is not running under DMTCP?
> > > ......
> > > ......
> > > ......
> > >
> > > I use this sbatch script to submit the job:
> > >
> > > #####################################SBATCH###########################
> > > #!/bin/bash
> > > # Put your SLURM options here
> > > #SBATCH --partition=comeon
> > > #SBATCH --time=01:15:00
> > > #SBATCH --nodes=2
> > > #SBATCH --ntasks-per-node=4
> > > #SBATCH --job-name="dmtcp_job"
> > > #SBATCH --output=dmtcp_ckpt_img/dmtcp-%j.out
> > >
> > > start_coordinator()
> > > {
> > >     fname=dmtcp_command.$SLURM_JOBID
> > >     h=$(hostname)
> > >     check_coordinator=$(which dmtcp_coordinator)
> > >
> > >     if [ -z "$check_coordinator" ]; then
> > >         echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
> > >         exit 1
> > >     fi
> > >
> > >     # Start the coordinator in the background; its port is written to $fname
> > >     dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file "$fname" "$@" 1>/dev/null 2>&1
> > >
> > >     p=$(cat "$fname")
> > >
> > >     # Reuse $fname as a wrapper script that calls dmtcp_command for this coordinator
> > >     chmod +x "$fname"
> > >     echo "#!/bin/bash" > "$fname"
> > >     echo >> "$fname"
> > >     echo "export PATH=$PATH" >> "$fname"
> > >     echo "export DMTCP_COORD_HOST=$h" >> "$fname"
> > >     echo "export DMTCP_COORD_PORT=$p" >> "$fname"
> > >     echo "dmtcp_command \$@" >> "$fname"
> > >
> > >     # Set up local environment for DMTCP
> > >     export DMTCP_COORD_HOST=$h
> > >     export DMTCP_COORD_PORT=$p
> > > }
> > >
> > > cd "$SLURM_SUBMIT_DIR"
> > > start_coordinator -i 240
> > > dmtcp_launch -h "$h" -p "$p" mpiexec ./mm.o
> > >
> > >
> > > #########################################################################
> > >
> > > I have also tried the --rm option of dmtcp_launch, but it doesn't work
> > > and produces no output at all.
> > >
> > > Could anybody tell me how to solve this, please? I need help.
> > >
> > >
> > > Regards,
> > >
> > >
> > >
> > > Husen
> > >
> > >
> > >
> > > _______________________________________________
> > > Dmtcp-forum mailing list
> > > Dmtcp-forum@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> > >
> > >
> >
> >
> > --
> > William Fox
> >
> > Lawrence Berkeley National Laboratory
> > Computational Research Division
>