Hi Jiajun,

Sorry for the delayed response.
I switched the order of dmtcp_launch and mpirun/mpiexec, and the checkpoint
now works successfully!
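
For reference, after switching the order the launch line in the job script is
along these lines (mm.o and the $h/$p coordinator variables are the same ones
set up in the script quoted further down in this thread):

    mpiexec dmtcp_launch -h $h -p $p ./mm.o
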
However, when I try to restart using slurm_rstr.job, I get the following
error:

head-node: Will not use SLURM_LOCALID=4 for launch, max is 1
head-node: Will not use SLURM_LOCALID=1 for launch, max is 1
head-node: Will not use SLURM_LOCALID=5 for launch, max is 1
head-node: Will not use SLURM_LOCALID=3 for launch, max is 1
head-node: Will not use SLURM_LOCALID=6 for launch, max is 1
head-node: Will not use SLURM_LOCALID=2 for launch, max is 1
head-node: Will not use SLURM_LOCALID=7 for launch, max is 1
[cli_8]: [cli_12]: write_line error; fd=18 buf=:cmd=finalize
:
system msg for write_line failure : Bad file descriptor
[cli_11]: write_line error; fd=14 buf=:cmd=finalize
:
system msg for write_line failure : Bad file descriptor
[cli_10]: write_line error; fd=10 buf=:cmd=finalize
:
system msg for write_line failure : Bad file descriptor
[cli_9]: write_line error; fd=7 buf=:cmd=finalize
:
system msg for write_line failure : Bad file descriptor
[cli_14]: write_line error; fd=26 buf=:cmd=finalize
:
system msg for write_line failure : Bad file descriptor
[cli_13]: write_line error; fd=22 buf=:cmd=finalize
:
system msg for write_line failure : Bad file descriptor
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(367).....: MPI_Finalize failed
MPI_Finalize(288).....:
MPID_Finalize(172)....:
MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(367).....: MPI_Finalize failed
MPI_Finalize(288).....:
MPID_Finalize(172)....:
MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1
Received results from task 15
Time : 2346.923935
[cli_15]: write_line error; fd=30 buf=:cmd=finalize
:
system msg for write_line failure : Bad file descriptor
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(367).....: MPI_Finalize failed
MPI_Finalize(288).....:
MPID_Finalize(172)....:
MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1
[cli_0]: write_line error; fd=6 buf=:cmd=finalize
:
system msg for write_line failure : Bad file descriptor
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(367).....: MPI_Finalize failed
MPI_Finalize(288).....:
MPID_Finalize(172)....:
MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1


Any clue how to fix this?

Thank you in advance.

regards,


Husen

On Mon, May 23, 2016 at 5:15 AM, Jiajun Cao <jia...@ccs.neu.edu> wrote:

> Hi Husen,
>
> The scripts look okay. Just out of curiosity, could you try to switch the
> order of dmtcp_launch and mpirun/mpiexec? It may produce something
> different, if it's a Slurm-related issue.
>
> Best,
> Jiajun
>
> On Sat, May 21, 2016 at 1:44 AM, Husen R <hus...@gmail.com> wrote:
>
>> by the way,
>>
>> If I use MPICH, no checkpoint files are created.
>>
>> regards,
>>
>>
>> Husen
>>
>> On Fri, May 20, 2016 at 7:40 PM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
>>
>>> Hi Husen,
>>>
>>> I'll start with some basic questions about your setup.
>>>
>>>  - Could you share with us your launch scripts for MPICH and OpenMPI?
>>>  - What DMTCP version are you using?
>>>  - Do you have InfiniBand on your setup? If yes, then you'd need to
>>>    configure DMTCP with the IB support (`./configure
>>> --enable-infiniband-support`),
>>>    and use the `--ib` flag with dmtcp_launch.
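>>>
>>>    For example, an InfiniBand-enabled build and launch might look roughly
>>>    like this (just a sketch; the install steps, MPI launcher, and binary
>>>    name are placeholders for your setup):
>>>
>>>      ./configure --enable-infiniband-support && make && make install
>>>      dmtcp_launch --rm --ib mpiexec ./your_app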
>>>
>>> Next, you wrote:
>>>
>>>  > I have tried to use --rm in mpich-3.2, and it doesn't work. mpich-3.2
>>>  > doesn't recognize --rm option.
>>>
>>> The `--rm` flag is a `dmtcp_launch` option; it's not an MPICH option.
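>>> (That is, it goes on the dmtcp_launch command line, e.g. something like
>>> `dmtcp_launch --rm mpiexec ./mm.o`, rather than being passed to
>>> mpiexec/mpirun.)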
>>>
>>> You seem to be seeing two kinds of warnings:
>>>
>>>  a) "Still draining socket... perhaps remote host is not running under
>>>      DMTCP"; and
>>>  b) "Datagram Sockets not supported. Hopefully, this is a short lived
>>>      connection".
>>>
>>> The first one indicates that there are sockets in your process going
>>> out to entities not running under DMTCP. I think this could be specific
>>> to your SLURM/MPI setup.
>>>
>>> The second warning could imply many different things. I haven't usually
>>> seen MPI implementations use datagram sockets. Datagram sockets are not
>>> supported in DMTCP out of the box. Is your application using them?
>>> Are you trying to checkpoint a GUI-based application?
>>>
>>> In either case, the warnings are not fatal, or at least, not
>>> immediately fatal. However, the warnings could lead to other issues
>>> that arise at restart time.
>>>
>>> Moving forward ...
>>>
>>> I think the first thing you need to do is to verify if the checkpoint
>>> was "successful".
>>>
>>> If the checkpoint was "successful", you should see checkpoint images
>>> corresponding to each MPI rank, i.e., there should be one checkpoint
>>> image (a *.dmtcp file) per MPI process. Do you see that? Do you see a
>>> restart script?
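>>>
>>> As a quick sanity check, something like the following should report one
>>> image per MPI rank plus the generated restart script (the exact file names
>>> below are only indicative of DMTCP's default naming scheme):
>>>
>>>     ls ckpt_*.dmtcp | wc -l        # should equal the total number of MPI ranks
>>>     ls dmtcp_restart_script*.sh    # the generated restart script(s)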
>>>
>>> The next step would be to verify the restart part.
>>>
>>> The restart script is a little tricky and might need some modifications
>>> depending on your setup. In other words, don't rely on it to work
>>> out-of-the-box. You could try to restart the computation manually
>>> to isolate the issue. Here's how I would do it:
>>>
>>>  - Allocate N interactive nodes. N could be 1 or more; it's easier to
>>>    debug with 1 node, assuming you have enough RAM on the node.
>>>  - Start dmtcp_coordinator: you could start it on the head node or on one
>>>    of the allocated compute nodes.
>>>  - ssh to an allocated node and manually execute the restart commands:
>>>
>>>      dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image1.dmtcp
>>>      dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image2.dmtcp
>>>      ...
>>>
>>>    The only thing you need to ensure when doing this manually is that
>>>    the MPI ranks that were sharing a node prior to checkpointing are
>>>    restarted on one node. This is because the MPI processes might be
>>>    using (SysV) shared-memory for intra-node communication. On restart,
>>>    DMTCP will try to restore the shared-memory region and fail if the
>>>    processes are not restarted on one node.
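>>>
>>>    Putting the above together, a minimal manual restart on a single node
>>>    might look roughly like this (coordinator host/port, partition, and the
>>>    image names are placeholders; adjust to your cluster):
>>>
>>>      # allocate a node interactively, e.g.: salloc -N 1 -p <partition>
>>>      # start the coordinator on the head node (or on the allocated node):
>>>      dmtcp_coordinator --daemon --exit-on-last -p <coord-port>
>>>      # on the allocated node, restart the images of all ranks that
>>>      # previously shared a node:
>>>      dmtcp_restart -h <coord-host> -p <coord-port> ckpt_*.dmtcp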
>>>
>>> Finally, I think what you are seeing is due to some configuration
>>> issue. We have tested with different MPI implementations recently and it
>>> works. I could be wrong, though.
>>>
>>> Would it be possible for you to give us a guest account for debugging
>>> on your setup? It'll be the most efficient way of resolving this.
>>>
>>> -Rohan
>>>
>>> On Fri, May 20, 2016 at 06:32:09PM +0700, Husen R wrote:
>>> > Hi Gene,
>>> >
>>> > Thank you for your reply!
>>> >
>>> > I have tried to use --rm in mpich-3.2, and it doesn't work; mpich-3.2
>>> > doesn't recognize the --rm option.
>>> > I don't know exactly what the difference is between mpich-3.2 and mpich2.
>>> >
>>> > recently I tried to use openmpi-1.6 to checkpoint mpi application using
>>> > dmtcp and slurm.
>>> > but I got the following error :
>>> >
>>> > [40000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>> > REASON='JWARNING(false) failed'
>>> >      type = 2
>>> > Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
>>> > [46000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>> > REASON='JWARNING(false) failed'
>>> >      type = 2
>>> > Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
>>> > [50000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>> > REASON='JWARNING(false) failed'
>>> >      type = 2
>>> > Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
>>> > [45000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>> > REASON='JWARNING(false) failed'
>>> >      type = 2
>>> > Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
>>> > [48000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>> > REASON='JWARNING(false) failed'
>>> >      type = 2
>>> > Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
>>> > ...
>>> > ...
>>> > ...
>>> > [41000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>>> > REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>>> >      _magicBits =
>>> > Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator die uncleanly?
>>> > dmtcp_srun_helper (41000): Terminating...
>>> >
>>> >
>>> >
>>> > In addition, the slurm_restart.job script does not seem to be working at all.
>>> > I need help.
>>> > Thank you in advance,
>>> >
>>> >
>>> > Regards,
>>> >
>>> > Husen
>>> >
>>> > On Fri, May 20, 2016 at 5:36 PM, Gene Cooperman <g...@ccs.neu.edu> wrote:
>>> >
>>> > > Hi William and Husen,
>>> > >     As far as I know, the combination "--rm --ib" should work with
>>> > > the major MPI implementations: Open MPI, MVAPICH2, Intel MPI, MPICH2.
>>> > > But I'm not sure which ones we've tested with very recently.
>>> > > I'm pretty sure that we've used MVAPICH2 and Open MPI in this way.
>>> > >
>>> > > Jiajun and Rohan,
>>> > >     Could you confirm which implementations you've used _with the
>>> > > "--rm --ib" combination_?  If it's not working with one of the
>>> > > major MPI implementations, we need to fix that.
>>> > >
>>> > > Thanks,
>>> > > - Gene
>>> > >
>>> > > On Thu, May 19, 2016 at 03:42:06PM -0700, William Fox wrote:
>>> > > > At least for me (I am not a developer for DMTCP), I was forced to switch
>>> > > > to OpenMPI (version 1.6 specifically) in order to get --rm to work
>>> > > > correctly. What version of MPI are you running? In addition, if you are
>>> > > > using InfiniBand, DMTCP's InfiniBand support will need to be installed
>>> > > > and the --ib flag used in order to accomplish a restart.
>>> > > >
>>> > > > On Wed, May 18, 2016 at 1:15 AM, Husen R <hus...@gmail.com> wrote:
>>> > > >
>>> > > > > Dear all,
>>> > > > >
>>> > > > > I have tried to checkpoint an MPI application using DMTCP, but I
>>> > > > > failed with the following error message:
>>> > > > >
>>> > > > >
>>> > > > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
>>> > > > > REASON='JWARNING(false) failed'
>>> > > > >      _dataSockets[i]->socket().sockfd() = 9
>>> > > > >      buffer.size() = 0
>>> > > > >      WARN_INTERVAL_SEC = 10
>>> > > > > Message: Still draining socket... perhaps remote host is not running under DMTCP?
>>> > > > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
>>> > > > > REASON='JWARNING(false) failed'
>>> > > > >      _dataSockets[i]->socket().sockfd() = 7
>>> > > > >      buffer.size() = 0
>>> > > > >      WARN_INTERVAL_SEC = 10
>>> > > > > Message: Still draining socket... perhaps remote host is not running under DMTCP?
>>> > > > > ......
>>> > > > > ......
>>> > > > > ......
>>> > > > >
>>> > > > > I use this sbatch script to submit the job:
>>> > > > >
>>> > > > >
>>> > > > > #####################################SBATCH###########################
>>> > > > > #!/bin/bash
>>> > > > > # Put your SLURM options here
>>> > > > > #SBATCH --partition=comeon
>>> > > > > #SBATCH --time=01:15:00
>>> > > > > #SBATCH --nodes=2
>>> > > > > #SBATCH --ntasks-per-node=4
>>> > > > > #SBATCH --job-name="dmtcp_job"
>>> > > > > #SBATCH --output=dmtcp_ckpt_img/dmtcp-%j.out
>>> > > > >
>>> > > > > start_coordinator()
>>> > > > > {
>>> > > > >
>>> > > > >     fname=dmtcp_command.$SLURM_JOBID
>>> > > > >     h=$(hostname)
>>> > > > >     check_coordinator=$(which dmtcp_coordinator)
>>> > > > >
>>> > > > >     if [ -z "$check_coordinator" ]; then
>>> > > > >         echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
>>> > > > >         exit 0
>>> > > > >     fi
>>> > > > >
>>> > > > >     dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1
>>> > > > >
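>>> > > > >     # The coordinator wrote its port into $fname (via --port-file);
>>> > > > >     # read it, then reuse the same file as an executable dmtcp_command wrapper.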
>>> > > > >     p=`cat $fname`
>>> > > > >     chmod +x $fname
>>> > > > >     echo "#!/bin/bash" > $fname
>>> > > > >     echo >> $fname
>>> > > > >     echo "export PATH=$PATH" >> $fname
>>> > > > >     echo "export DMTCP_COORD_HOST=$h" >> $fname
>>> > > > >     echo "export DMTCP_COORD_PORT=$p" >> $fname
>>> > > > >     echo "dmtcp_command \$@" >> $fname
>>> > > > >
>>> > > > >     # Set up local environment for DMTCP
>>> > > > >     export DMTCP_COORD_HOST=$h
>>> > > > >     export DMTCP_COORD_PORT=$p
>>> > > > > }
>>> > > > >
>>> > > > > cd $SLURM_SUBMIT_DIR
>>> > > > > start_coordinator -i 240
>>> > > > > dmtcp_launch -h $h -p $p mpiexec ./mm.o
>>> > > > >
>>> > > > >
>>> > > > > #########################################################################
>>> > > > >
>>> > > > > I have also tried using the --rm option with dmtcp_launch, but it
>>> > > > > doesn't work and there is no output at all.
>>> > > > >
>>> > > > > Can anybody tell me how to solve this, please? I need help.
>>> > > > >
>>> > > > >
>>> > > > > Regards,
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > Husen
>>> > > > >
>>> > > > >
>>> > > > >
>>> > >
>>> > > > >
>>> > > >
>>> > > >
>>> > > > --
>>> > > > William Fox
>>> > > >
>>> > > > Lawrence Berkeley National Laboratory
>>> > > > Computational Research Division