Hi Jiajun

Thank you for your reply!

I restarted the job using the restart script provided by DMTCP in the
plugin/batch-queue/job_examples directory.
Yes, I used the same number of nodes and processes. Does DMTCP also support
using a different number of nodes and processes?

Attached is my restart script.

Thank you in advance


Husen

On Wed, Jun 8, 2016 at 10:31 AM, Jiajun Cao <jia...@ccs.neu.edu> wrote:

> Hi Husen,
>
> How did you restart the job? Did you use the same number of nodes and
> processes? Can you send me your restart script please?
>
>
> Best,
> Jiajun
>
> On Mon, May 30, 2016 at 11:29 PM, Husen R <hus...@gmail.com> wrote:
>
>> Hi Jiajun,
>>
>> Sorry for the delayed response.
>> I switched the order of dmtcp_launch and mpirun/mpiexec, and the checkpoint
>> now works successfully!
>> However, when I try to restart using slurm_rstr.job, I get the following
>> error:
>>
>> head-node: Will not use SLURM_LOCALID=4 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=1 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=5 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=3 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=6 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=2 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=7 for launch, max is 1
>> [cli_8]: [cli_12]: write_line error; fd=18 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> [cli_11]: write_line error; fd=14 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> [cli_10]: write_line error; fd=10 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> [cli_9]: write_line error; fd=7 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> [cli_14]: write_line error; fd=26 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> [cli_13]: write_line error; fd=22 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> Fatal error in MPI_Finalize: Other MPI error, error stack:
>> MPI_Finalize(367).....: MPI_Finalize failed
>> MPI_Finalize(288).....:
>> MPID_Finalize(172)....:
>> MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1
>> Fatal error in MPI_Finalize: Other MPI error, error stack:
>> MPI_Finalize(367).....: MPI_Finalize failed
>> MPI_Finalize(288).....:
>> MPID_Finalize(172)....:
>> MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1
>> Received results from task 15
>> Time : 2346.923935
>> [cli_15]: write_line error; fd=30 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> Fatal error in MPI_Finalize: Other MPI error, error stack:
>> MPI_Finalize(367).....: MPI_Finalize failed
>> MPI_Finalize(288).....:
>> MPID_Finalize(172)....:
>> MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1
>> [cli_0]: write_line error; fd=6 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> Fatal error in MPI_Finalize: Other MPI error, error stack:
>> MPI_Finalize(367).....: MPI_Finalize failed
>> MPI_Finalize(288).....:
>> MPID_Finalize(172)....:
>> MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1
>>
>>
>> Any clue how to fix this?
>>
>> Thank you in advance.
>>
>> regards,
>>
>>
>> Husen
>>
>> On Mon, May 23, 2016 at 5:15 AM, Jiajun Cao <jia...@ccs.neu.edu> wrote:
>>
>>> Hi Husen,
>>>
>>> The scripts look okay. Just out of curiosity, could you try to switch
>>> the order of dmtcp_launch and mpirun/mpiexec? It may produce something
>>> different, if it's a Slurm-related issue.
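>>> For example (a sketch based on the launch line in your earlier sbatch
>>> script), instead of
>>>
>>>     dmtcp_launch -h $h -p $p mpiexec ./mm.o
>>>
>>> you would try something like
>>>
>>>     mpiexec dmtcp_launch -h $h -p $p ./mm.o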
>>>
>>> Best,
>>> Jiajun
>>>
>>> On Sat, May 21, 2016 at 1:44 AM, Husen R <hus...@gmail.com> wrote:
>>>
>>>> By the way,
>>>>
>>>> If I use MPICH, no checkpoint files are created.
>>>>
>>>> regards,
>>>>
>>>>
>>>> Husen
>>>>
>>>> On Fri, May 20, 2016 at 7:40 PM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
>>>>
>>>>> Hi Husen,
>>>>>
>>>>> I'll start with some basic questions about your setup.
>>>>>
>>>>>  - Could you share with us your launch scripts for MPICH and OpenMPI?
>>>>>  - What DMTCP version are you using?
>>>>>  - Do you have InfiniBand on your setup? If yes, then you'd need to
>>>>>    configure DMTCP with the IB support (`./configure --enable-infiniband-support`)
>>>>>    and use the `--ib` flag with dmtcp_launch.
>>>>>
>>>>> Next, you wrote:
>>>>>
>>>>>  > I have tried to use --rm in mpich-3.2, and it doesn't work. mpich-3.2
>>>>>  > doesn't recognize --rm option.
>>>>>
>>>>> The `--rm` flag is a `dmtcp_launch` option, not an MPICH option.
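>>>>> For example (a sketch; adapt the flags and program name to your actual
>>>>> launch line), the flag goes on the dmtcp_launch command itself, not on
>>>>> mpiexec:
>>>>>
>>>>>     dmtcp_launch --rm -h $h -p $p mpiexec ./mm.o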
>>>>>
>>>>> You seem to be seeing two kinds of warnings:
>>>>>
>>>>>  a) "Still draining socket... perhaps remote host is not running under
>>>>>      DMTCP"; and
>>>>>  b) "Datagram Sockets not supported. Hopefully, this is a short lived
>>>>>      connection".
>>>>>
>>>>> The first one indicates that there are sockets in your process going
>>>>> out to entities not running under DMTCP. I think this could be specific
>>>>> to your SLURM/MPI setup.
>>>>>
>>>>> The second warning could imply many different things. I haven't usually
>>>>> seen MPI implementations using datagram sockets, and datagram sockets are
>>>>> not supported in DMTCP out of the box. Is your application using them?
>>>>> Are you trying to checkpoint a GUI-based application?
>>>>>
>>>>> In either case, the warnings are not fatal, or at least, not
>>>>> immediately fatal. However, the warnings could lead to other issues
>>>>> that arise at restart time.
>>>>>
>>>>> Moving forward ...
>>>>>
>>>>> I think the first thing you need to do is to verify if the checkpoint
>>>>> was "successful".
>>>>>
>>>>> If the checkpoint was "successful", you should see checkpoint images
>>>>> corresponding to each MPI rank, i.e., there should be one checkpoint
>>>>> image (a *.dmtcp file) per MPI process. Do you see that? Do you see a
>>>>> restart script?
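>>>>> A quick sanity check in the checkpoint directory (a sketch; the exact
>>>>> file names depend on your setup) would be something like:
>>>>>
>>>>>     ls ckpt_*.dmtcp | wc -l        # should equal the total number of MPI ranks
>>>>>     ls dmtcp_restart_script*.sh    # the generated restart script(s)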
>>>>>
>>>>> The next step would be to verify the restart part.
>>>>>
>>>>> The restart script is a little tricky and might need some modifications
>>>>> depending on your setup. In other words, don't rely on it to work
>>>>> out-of-the-box. You could try to restart the computation manually
>>>>> to isolate the issue. Here's how I would do it:
>>>>>
>>>>>  - Allocate N interactive nodes. N could be 1 or more; it's easier to
>>>>>    debug with 1 node, assuming you have enough RAM on the node.
>>>>>  - Start dmtcp_coordinator: you could start it on the head node or on one
>>>>>    of the allocated compute nodes.
>>>>>  - ssh to an allocated node and manually execute the restart commands:
>>>>>
>>>>>      dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image1.dmtcp
>>>>>      dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image2.dmtcp
>>>>>      ...
>>>>>
>>>>>    The only thing you need to ensure when doing this manually is that
>>>>>    the MPI ranks that were sharing a node prior to checkpointing are
>>>>>    restarted on one node. This is because the MPI processes might be
>>>>>    using (SysV) shared-memory for intra-node communication. On restart,
>>>>>    DMTCP will try to restore the shared-memory region, and this will fail
>>>>>    if those processes are not restarted together on the same node.
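>>>>> As a concrete sketch (node1/node2 and the image names below are
>>>>> hypothetical; group the ckpt_*.dmtcp files by the node their ranks
>>>>> originally ran on):
>>>>>
>>>>>      ssh node1          # then, on node1:
>>>>>      dmtcp_restart -h <coord-host> -p <coord-port> ckpt_a.dmtcp ckpt_b.dmtcp &
>>>>>
>>>>>      ssh node2          # then, on node2:
>>>>>      dmtcp_restart -h <coord-host> -p <coord-port> ckpt_c.dmtcp ckpt_d.dmtcp &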
>>>>>
>>>>> Finally, I think what you are seeing is due to some configuration
>>>>> issue. We have tested with different MPI implementations recently and it
>>>>> works. I could be wrong, though.
>>>>>
>>>>> Would it be possible for you to give us a guest account for debugging
>>>>> on your setup? It'll be the most efficient way of resolving this.
>>>>>
>>>>> -Rohan
>>>>>
>>>>> On Fri, May 20, 2016 at 06:32:09PM +0700, Husen R wrote:
>>>>> > Hi Gene,
>>>>> >
>>>>> > Thank you for your reply!
>>>>> >
>>>>> > I have tried to use --rm in mpich-3.2, and it doesn't work; mpich-3.2
>>>>> > doesn't recognize the --rm option.
>>>>> > I don't know exactly what the difference is between mpich-3.2 and mpich2.
>>>>> >
>>>>> > Recently I tried to use openmpi-1.6 to checkpoint an MPI application
>>>>> > using DMTCP and Slurm, but I got the following error:
>>>>> >
>>>>> > [40000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>>>> > REASON='JWARNING(false) failed'
>>>>> >      type = 2
>>>>> > Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
>>>>> > [46000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>>>> > REASON='JWARNING(false) failed'
>>>>> >      type = 2
>>>>> > Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
>>>>> > [50000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>>>> > REASON='JWARNING(false) failed'
>>>>> >      type = 2
>>>>> > Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
>>>>> > [45000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>>>> > REASON='JWARNING(false) failed'
>>>>> >      type = 2
>>>>> > Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
>>>>> > [48000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>>>> > REASON='JWARNING(false) failed'
>>>>> >      type = 2
>>>>> > Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
>>>>> > ...
>>>>> > ...
>>>>> > ...
>>>>> > [41000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>>>>> > REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>>>>> >      _magicBits =
>>>>> > Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator die uncleanly?
>>>>> > dmtcp_srun_helper (41000): Terminating...
>>>>> >
>>>>> >
>>>>> >
>>>>> > In addition, the slurm_restart.job script does not seem to work at all.
>>>>> > I need help.
>>>>> > Thank you in advance,
>>>>> >
>>>>> >
>>>>> > Regards,
>>>>> >
>>>>> > Husen
>>>>> >
>>>>> > On Fri, May 20, 2016 at 5:36 PM, Gene Cooperman <g...@ccs.neu.edu> wrote:
>>>>> >
>>>>> > > Hi William and Husen,
>>>>> > >     As far as I know, the combination "--rm --ib" should work with
>>>>> > > the major MPI implementations: Open MPI, MVAPICH2, Intel MPI, MPICH2.
>>>>> > > But I'm not sure which ones we've tested with very recently.
>>>>> > > I'm pretty sure that we've used MVAPICH2 and Open MPI in this way.
>>>>> > >
>>>>> > > Jiajun and Rohan,
>>>>> > >     Could you confirm which implementations you've used _with the
>>>>> > > "--rm --ib" combination_?  If it's not working with one of the
>>>>> > > major MPI implementations, we need to fix that.
>>>>> > >
>>>>> > > Thanks,
>>>>> > > - Gene
>>>>> > >
>>>>> > > On Thu, May 19, 2016 at 03:42:06PM -0700, William Fox wrote:
>>>>> > > > At least for me (I am not a developer for DMTCP), I was forced to
>>>>> > > > switch to Open MPI (version 1.6 specifically) in order to get --rm
>>>>> > > > to work correctly. What version of MPI are you running? In addition,
>>>>> > > > if you are using InfiniBand, DMTCP's InfiniBand support will need to
>>>>> > > > be built in and the --ib flag used in order to accomplish a restart.
>>>>> > > >
>>>>> > > > On Wed, May 18, 2016 at 1:15 AM, Husen R <hus...@gmail.com> wrote:
>>>>> > > >
>>>>> > > > > Dear all,
>>>>> > > > >
>>>>> > > > > I have tried to checkpoint an MPI application using DMTCP, but I
>>>>> > > > > failed with the following error message:
>>>>> > > > >
>>>>> > > > >
>>>>> > > > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
>>>>> > > > > REASON='JWARNING(false) failed'
>>>>> > > > >      _dataSockets[i]->socket().sockfd() = 9
>>>>> > > > >      buffer.size() = 0
>>>>> > > > >      WARN_INTERVAL_SEC = 10
>>>>> > > > > Message: Still draining socket... perhaps remote host is not running under DMTCP?
>>>>> > > > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
>>>>> > > > > REASON='JWARNING(false) failed'
>>>>> > > > >      _dataSockets[i]->socket().sockfd() = 7
>>>>> > > > >      buffer.size() = 0
>>>>> > > > >      WARN_INTERVAL_SEC = 10
>>>>> > > > > Message: Still draining socket... perhaps remote host is not running under DMTCP?
>>>>> > > > > ......
>>>>> > > > > ......
>>>>> > > > > ......
>>>>> > > > >
>>>>> > > > > I use this sbatch script to submit the job:
>>>>> > > > >
>>>>> > > > >
>>>>> > > > > #####################################SBATCH###########################
>>>>> > > > > #!/bin/bash
>>>>> > > > > # Put your SLURM options here
>>>>> > > > > #SBATCH --partition=comeon
>>>>> > > > > #SBATCH --time=01:15:00
>>>>> > > > > #SBATCH --nodes=2
>>>>> > > > > #SBATCH --ntasks-per-node=4
>>>>> > > > > #SBATCH --job-name="dmtcp_job"
>>>>> > > > > #SBATCH --output=dmtcp_ckpt_img/dmtcp-%j.out
>>>>> > > > >
>>>>> > > > > start_coordinator()
>>>>> > > > > {
>>>>> > > > >
>>>>> > > > >     fname=dmtcp_command.$SLURM_JOBID
>>>>> > > > >     h=$(hostname)
>>>>> > > > >     check_coordinator=$(which dmtcp_coordinator)
>>>>> > > > >
>>>>> > > > >     if [ -z "$check_coordinator" ]; then
>>>>> > > > >         echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
>>>>> > > > >         exit 0
>>>>> > > > >     fi
>>>>> > > > >
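>>>>> > > > >     # Start the coordinator as a daemon; "-p 0" lets it pick a free
>>>>> > > > >     # port, which it writes to $fname via --port-file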
>>>>> > > > >     dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1
>>>>> > > > >
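>>>>> > > > >     # Read the coordinator's port, then rewrite $fname as a small
>>>>> > > > >     # wrapper script around dmtcp_command for later use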
>>>>> > > > >     p=`cat $fname`
>>>>> > > > >     chmod +x $fname
>>>>> > > > >     echo "#!/bin/bash" > $fname
>>>>> > > > >     echo >> $fname
>>>>> > > > >     echo "export PATH=$PATH" >> $fname
>>>>> > > > >     echo "export DMTCP_COORD_HOST=$h" >> $fname
>>>>> > > > >     echo "export DMTCP_COORD_PORT=$p" >> $fname
>>>>> > > > >     echo "dmtcp_command \$@" >> $fname
>>>>> > > > >
>>>>> > > > >     # Set up local environment for DMTCP
>>>>> > > > >     export DMTCP_COORD_HOST=$h
>>>>> > > > >     export DMTCP_COORD_PORT=$p
>>>>> > > > > }
>>>>> > > > >
>>>>> > > > > cd $SLURM_SUBMIT_DIR
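>>>>> > > > > # "-i 240" asks the coordinator to checkpoint automatically every 240 seconds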
>>>>> > > > > start_coordinator -i 240
>>>>> > > > > dmtcp_launch -h $h -p $p mpiexec ./mm.o
>>>>> > > > >
>>>>> > > > >
>>>>> > > > >
>>>>> > > > > #########################################################################
>>>>> > > > >
>>>>> > > > > I have also tried using the --rm option in dmtcp_launch, but it
>>>>> > > > > doesn't work and there is no output at all.
>>>>> > > > >
>>>>> > > > > Could anybody tell me how to solve this, please? I need help.
>>>>> > > > >
>>>>> > > > >
>>>>> > > > > Regards,
>>>>> > > > >
>>>>> > > > >
>>>>> > > > >
>>>>> > > > > Husen
>>>>> > > > >
>>>>> > > >
>>>>> > > >
>>>>> > > > --
>>>>> > > > William Fox
>>>>> > > >
>>>>> > > > Lawrence Berkeley National Laboratory
>>>>> > > > Computational Research Division
>>>>> > >
>>>
>>
>

Attachment: restart.job
Description: Binary data
