Hi Jiajun,

Thank you for your reply!
I restarted the job using the restart script provided by DMTCP in the
plugin/batch-queue/job_examples directory. Yes, I used the same number of
nodes and processes. Does DMTCP also support restarting with a different
number of nodes and processes? My restart script is attached.

Thank you in advance,

Husen

On Wed, Jun 8, 2016 at 10:31 AM, Jiajun Cao <jia...@ccs.neu.edu> wrote:

> Hi Husen,
>
> How did you restart the job? Did you use the same number of nodes and
> processes? Can you send me your restart script, please?
>
> Best,
> Jiajun
>
> On Mon, May 30, 2016 at 11:29 PM, Husen R <hus...@gmail.com> wrote:
>
>> Hi Jiajun,
>>
>> Sorry for the delayed response.
>> I switched the order of dmtcp_launch and mpirun/mpiexec, and the
>> checkpoint now works successfully!
>> However, when I try to restart using slurm_rstr.job, I get the
>> following error:
>>
>> head-node: Will not use SLURM_LOCALID=4 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=1 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=5 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=3 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=6 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=2 for launch, max is 1
>> head-node: Will not use SLURM_LOCALID=7 for launch, max is 1
>> [cli_8]: [cli_12]: write_line error; fd=18 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> [cli_11]: write_line error; fd=14 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> [cli_10]: write_line error; fd=10 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> [cli_9]: write_line error; fd=7 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> [cli_14]: write_line error; fd=26 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> [cli_13]: write_line error; fd=22 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> Fatal error in MPI_Finalize: Other MPI error, error stack:
>> MPI_Finalize(367).....: MPI_Finalize failed
>> MPI_Finalize(288).....:
>> MPID_Finalize(172)....:
>> MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1
>> Fatal error in MPI_Finalize: Other MPI error, error stack:
>> MPI_Finalize(367).....: MPI_Finalize failed
>> MPI_Finalize(288).....:
>> MPID_Finalize(172)....:
>> MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1
>> Received results from task 15
>> Time : 2346.923935
>> [cli_15]: write_line error; fd=30 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> Fatal error in MPI_Finalize: Other MPI error, error stack:
>> MPI_Finalize(367).....: MPI_Finalize failed
>> MPI_Finalize(288).....:
>> MPID_Finalize(172)....:
>> MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1
>> [cli_0]: write_line error; fd=6 buf=:cmd=finalize
>> :
>> system msg for write_line failure : Bad file descriptor
>> Fatal error in MPI_Finalize: Other MPI error, error stack:
>> MPI_Finalize(367).....: MPI_Finalize failed
>> MPI_Finalize(288).....:
>> MPID_Finalize(172)....:
>> MPIDI_PG_Finalize(109): PMI_Finalize failed, error -1
>>
>> Any clue how to fix this?
>>
>> Thank you in advance.
>>
>> Regards,
>>
>> Husen
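For reference, a minimal Slurm restart job along these lines might look as
follows. This is only a sketch, not the attached restart.job or the stock
slurm_rstr.job: the partition name and node/task geometry are copied from the
launch script quoted further down this thread, and dmtcp_restart_script.sh is
the wrapper that DMTCP normally writes next to the *.dmtcp images at
checkpoint time.

    #!/bin/bash
    #SBATCH --partition=comeon          # same geometry as the original run
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    #SBATCH --time=01:15:00
    #SBATCH --job-name="dmtcp_restart_job"

    # Start a fresh coordinator on the first allocated node and record its port.
    portfile=dmtcp_port.$SLURM_JOBID
    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $portfile
    export DMTCP_COORD_HOST=$(hostname)
    export DMTCP_COORD_PORT=$(cat $portfile)

    cd $SLURM_SUBMIT_DIR
    # Point the generated restart script at the new coordinator.
    ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT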
>>
>> On Mon, May 23, 2016 at 5:15 AM, Jiajun Cao <jia...@ccs.neu.edu> wrote:
>>
>>> Hi Husen,
>>>
>>> The scripts look okay. Just out of curiosity, could you try to switch
>>> the order of dmtcp_launch and mpirun/mpiexec? It may produce something
>>> different, if it's a Slurm-related issue.
>>>
>>> Best,
>>> Jiajun
>>>
>>> On Sat, May 21, 2016 at 1:44 AM, Husen R <hus...@gmail.com> wrote:
>>>
>>>> By the way,
>>>>
>>>> if I use MPICH, no checkpoint files are created.
>>>>
>>>> Regards,
>>>>
>>>> Husen
>>>>
>>>> On Fri, May 20, 2016 at 7:40 PM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
>>>>
>>>>> Hi Husen,
>>>>>
>>>>> I'll start with some basic questions about your setup.
>>>>>
>>>>> - Could you share with us your launch scripts for MPICH and OpenMPI?
>>>>> - What DMTCP version are you using?
>>>>> - Do you have InfiniBand on your setup? If yes, then you'd need to
>>>>>   configure DMTCP with IB support (`./configure --enable-infiniband-support`)
>>>>>   and use the `--ib` flag with dmtcp_launch.
>>>>>
>>>>> Next, you wrote:
>>>>>
>>>>> > I have tried to use --rm in mpich-3.2, and it doesn't work. mpich-3.2
>>>>> > doesn't recognize --rm option.
>>>>>
>>>>> The `--rm` flag is a `dmtcp_launch` option; it's not an MPICH option.
>>>>>
>>>>> You seem to be seeing two kinds of warnings:
>>>>>
>>>>> a) "Still draining socket... perhaps remote host is not running under
>>>>>    DMTCP"; and
>>>>> b) "Datagram Sockets not supported. Hopefully, this is a short lived
>>>>>    connection".
>>>>>
>>>>> The first one indicates that there are sockets in your process going
>>>>> out to entities not running under DMTCP. I think this could be specific
>>>>> to your SLURM/MPI setup.
>>>>>
>>>>> The second warning could imply many different things. I haven't
>>>>> usually seen MPI implementations using datagram sockets. Datagram
>>>>> sockets are not supported in DMTCP out of the box. Is your application
>>>>> doing that? Are you trying to checkpoint a GUI-based application?
>>>>>
>>>>> In either case, the warnings are not fatal, or at least not
>>>>> immediately fatal. However, they could lead to other issues that
>>>>> arise at restart time.
>>>>>
>>>>> Moving forward...
>>>>>
>>>>> I think the first thing you need to do is to verify whether the
>>>>> checkpoint was "successful".
>>>>>
>>>>> If the checkpoint was "successful", you should see checkpoint images
>>>>> corresponding to each MPI rank, i.e., there should be one checkpoint
>>>>> image (a *.dmtcp file) per MPI process. Do you see that? Do you see a
>>>>> restart script?
>>>>>
>>>>> The next step would be to verify the restart part.
>>>>>
>>>>> The restart script is a little tricky and might need some modifications
>>>>> depending on your setup. In other words, don't rely on it to work
>>>>> out of the box. You could try to restart the computation manually
>>>>> to isolate the issue. Here's how I would do it:
>>>>>
>>>>> - Allocate N interactive nodes. N could be 1 or more; it's easier to
>>>>>   debug with 1 node, assuming you have enough RAM on the node.
>>>>> - Start dmtcp_coordinator: you could start it on the head node or on
>>>>>   one of the allocated compute nodes.
>>>>> - ssh to the allocated node, and manually execute the restart commands:
>>>>>
>>>>>     dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image1.dmtcp
>>>>>     dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image2.dmtcp
>>>>>     ...
>>>>>
>>>>> The only thing you need to ensure when doing this manually is that
>>>>> the MPI ranks that were sharing a node prior to checkpointing are
>>>>> restarted on one node. This is because the MPI processes might be
>>>>> using (SysV) shared memory for intra-node communication. On restart,
>>>>> DMTCP will try to restore the shared-memory region and will fail if
>>>>> the processes are not restarted on one node.
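To make the manual procedure above concrete, a session could look roughly like
the following. The node names, the port number, the image directory, and the
salloc parameters are placeholders, not values taken from Husen's cluster;
only the dmtcp_coordinator and dmtcp_restart invocations follow the pattern
Rohan describes.

    # 1. Sanity check: one checkpoint image per MPI rank, plus a restart script.
    ls ckpt_*.dmtcp dmtcp_restart_script*.sh

    # 2. Allocate one interactive node (with enough RAM for all ranks) and
    #    start a coordinator on the current machine.
    salloc -N 1 -n 8 -p comeon
    dmtcp_coordinator --daemon -p 7779

    # 3. On the allocated compute node, restart all ranks that shared a node
    #    before the checkpoint, pointing them at the coordinator.
    ssh compute-0
    cd /path/to/ckpt/images
    dmtcp_restart -h head-node -p 7779 ckpt_*.dmtcp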
>>>>>
>>>>> Finally, I think what you are seeing is because of some configuration
>>>>> issue. We have tested with different MPI implementations recently and
>>>>> it works. I could be wrong, though.
>>>>>
>>>>> Would it be possible for you to give us a guest account for debugging
>>>>> on your setup? It would be the most efficient way of resolving this.
>>>>>
>>>>> -Rohan
>>>>>
>>>>> On Fri, May 20, 2016 at 06:32:09PM +0700, Husen R wrote:
>>>>> > Hi Gene,
>>>>> >
>>>>> > Thank you for your reply!
>>>>> >
>>>>> > I have tried to use --rm in mpich-3.2, and it doesn't work; mpich-3.2
>>>>> > doesn't recognize the --rm option.
>>>>> > I don't know exactly what the difference is between mpich-3.2 and
>>>>> > mpich2.
>>>>> >
>>>>> > Recently I tried to use openmpi-1.6 to checkpoint an MPI application
>>>>> > using dmtcp and slurm,
>>>>> > but I got the following error:
>>>>> >
>>>>> > [40000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>>>> > REASON='JWARNING(false) failed'
>>>>> >      type = 2
>>>>> > Message: Datagram Sockets not supported. Hopefully, this is a short
>>>>> > lived connection!
>>>>> > [46000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>>>> > REASON='JWARNING(false) failed'
>>>>> >      type = 2
>>>>> > Message: Datagram Sockets not supported. Hopefully, this is a short
>>>>> > lived connection!
>>>>> > [50000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>>>> > REASON='JWARNING(false) failed'
>>>>> >      type = 2
>>>>> > Message: Datagram Sockets not supported. Hopefully, this is a short
>>>>> > lived connection!
>>>>> > [45000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>>>> > REASON='JWARNING(false) failed'
>>>>> >      type = 2
>>>>> > Message: Datagram Sockets not supported. Hopefully, this is a short
>>>>> > lived connection!
>>>>> > [48000] WARNING at socketconnection.cpp:187 in TcpConnection;
>>>>> > REASON='JWARNING(false) failed'
>>>>> >      type = 2
>>>>> > Message: Datagram Sockets not supported. Hopefully, this is a short
>>>>> > lived connection!
>>>>> > ...
>>>>> > ...
>>>>> > ...
>>>>> > [41000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>>>>> > REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>>>>> >      _magicBits =
>>>>> > Message: read invalid message, _magicBits mismatch. Did DMTCP
>>>>> > coordinator die uncleanly?
>>>>> > dmtcp_srun_helper (41000): Terminating...
>>>>> >
>>>>> > In addition, the slurm_restart.job script does not seem to be working
>>>>> > at all.
>>>>> > I need help.
>>>>> > Thank you in advance.
>>>>> >
>>>>> > Regards,
>>>>> >
>>>>> > Husen
>>>>> >
>>>>> > On Fri, May 20, 2016 at 5:36 PM, Gene Cooperman <g...@ccs.neu.edu>
>>>>> > wrote:
>>>>> >
>>>>> > > Hi William and Husen,
>>>>> > >     As far as I know, the combination "--rm --ib" should work with
>>>>> > > the major MPI implementations: Open MPI, MVAPICH2, Intel MPI, and
>>>>> > > MPICH2. But I'm not sure which ones we've tested very recently.
>>>>> > > I'm pretty sure that we've used MVAPICH2 and Open MPI in this way.
>>>>> > >
>>>>> > > Jiajun and Rohan,
>>>>> > >     Could you confirm which implementations you've used _with the
>>>>> > > "--rm --ib" combination_? If it's not working with one of the
>>>>> > > major MPI implementations, we need to fix that.
>>>>> > >
>>>>> > > Thanks,
>>>>> > > - Gene
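As a point of reference, the "--rm --ib" combination Gene mentions attaches to
the same launch line that appears in the sbatch script quoted at the bottom of
this thread. Whether mpiexec or srun is the right launcher for a given MPI is
setup-dependent, so treat this only as an illustration of where the flags go:

    # dmtcp_launch wraps the MPI launcher; --rm enables the batch-queue
    # (resource-manager) plugin and --ib the InfiniBand plugin.  ./mm.o is the
    # example MPI binary from this thread, and DMTCP must have been configured
    # with --enable-infiniband-support for --ib to be usable.
    dmtcp_launch --rm --ib -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT mpiexec ./mm.o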
>>>>> > >
>>>>> > > On Thu, May 19, 2016 at 03:42:06PM -0700, William Fox wrote:
>>>>> > > > At least for me (I am not a developer for dmtcp), I was forced to
>>>>> > > > switch to openmpi (version 1.6, specifically) in order to get --rm
>>>>> > > > to work correctly. What version of mpi are you running? In
>>>>> > > > addition, if you are using infiniband, InfiniBand support will
>>>>> > > > need to be installed and the --ib flag used in order to
>>>>> > > > accomplish a restart.
>>>>> > > >
>>>>> > > > On Wed, May 18, 2016 at 1:15 AM, Husen R <hus...@gmail.com> wrote:
>>>>> > > >
>>>>> > > > > Dear all,
>>>>> > > > >
>>>>> > > > > I have tried to checkpoint an MPI application using dmtcp, but I
>>>>> > > > > failed with the following error message:
>>>>> > > > >
>>>>> > > > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
>>>>> > > > > REASON='JWARNING(false) failed'
>>>>> > > > >      _dataSockets[i]->socket().sockfd() = 9
>>>>> > > > >      buffer.size() = 0
>>>>> > > > >      WARN_INTERVAL_SEC = 10
>>>>> > > > > Message: Still draining socket... perhaps remote host is not
>>>>> > > > > running under DMTCP?
>>>>> > > > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
>>>>> > > > > REASON='JWARNING(false) failed'
>>>>> > > > >      _dataSockets[i]->socket().sockfd() = 7
>>>>> > > > >      buffer.size() = 0
>>>>> > > > >      WARN_INTERVAL_SEC = 10
>>>>> > > > > Message: Still draining socket... perhaps remote host is not
>>>>> > > > > running under DMTCP?
>>>>> > > > > ......
>>>>> > > > > ......
>>>>> > > > > ......
>>>>> > > > >
>>>>> > > > > I use this sbatch script to submit the job:
>>>>> > > > >
>>>>> > > > > ################################SBATCH###################################
>>>>> > > > > #!/bin/bash
>>>>> > > > > # Put your SLURM options here
>>>>> > > > > #SBATCH --partition=comeon
>>>>> > > > > #SBATCH --time=01:15:00
>>>>> > > > > #SBATCH --nodes=2
>>>>> > > > > #SBATCH --ntasks-per-node=4
>>>>> > > > > #SBATCH --job-name="dmtcp_job"
>>>>> > > > > #SBATCH --output=dmtcp_ckpt_img/dmtcp-%j.out
>>>>> > > > >
>>>>> > > > > start_coordinator()
>>>>> > > > > {
>>>>> > > > >     fname=dmtcp_command.$SLURM_JOBID
>>>>> > > > >     h=$(hostname)
>>>>> > > > >     check_coordinator=$(which dmtcp_coordinator)
>>>>> > > > >
>>>>> > > > >     if [ -z "$check_coordinator" ]; then
>>>>> > > > >         echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
>>>>> > > > >         exit 0
>>>>> > > > >     fi
>>>>> > > > >
>>>>> > > > >     dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1
>>>>> > > > >
>>>>> > > > >     p=`cat $fname`
>>>>> > > > >     chmod +x $fname
>>>>> > > > >     echo "#!/bin/bash" > $fname
>>>>> > > > >     echo >> $fname
>>>>> > > > >     echo "export PATH=$PATH" >> $fname
>>>>> > > > >     echo "export DMTCP_COORD_HOST=$h" >> $fname
>>>>> > > > >     echo "export DMTCP_COORD_PORT=$p" >> $fname
>>>>> > > > >     echo "dmtcp_command \$@" >> $fname
>>>>> > > > >
>>>>> > > > >     # Set up local environment for DMTCP
>>>>> > > > >     export DMTCP_COORD_HOST=$h
>>>>> > > > >     export DMTCP_COORD_PORT=$p
>>>>> > > > > }
>>>>> > > > >
>>>>> > > > > cd $SLURM_SUBMIT_DIR
>>>>> > > > > start_coordinator -i 240
>>>>> > > > > dmtcp_launch -h $h -p $p mpiexec ./mm.o
>>>>> > > > > ########################################################################
>>>>> > > > >
>>>>> > > > > I have also tried using the --rm option with dmtcp_launch, but
>>>>> > > > > it doesn't work and there is no output at all.
>>>>> > > > >
>>>>> > > > > Can anybody tell me how to solve this, please? I need help.
>>>>> > > > >
>>>>> > > > > Regards,
>>>>> > > > >
>>>>> > > > > Husen
>>>>> > > >
>>>>> > > > --
>>>>> > > > William Fox
>>>>> > > > Lawrence Berkeley National Laboratory
>>>>> > > > Computational Research Division
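A small usage note on the sbatch script quoted above: its start_coordinator
helper writes a wrapper named dmtcp_command.$SLURM_JOBID into the submit
directory, so the running computation can be controlled from the login node.
The job id below is purely illustrative; the flags are standard dmtcp_command
options.

    ./dmtcp_command.12345 -s   # show the coordinator's view of the job
    ./dmtcp_command.12345 -c   # request a checkpoint right now
    ./dmtcp_command.12345 -q   # shut the coordinator down when finished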
Attachment: restart.job (binary data)