I think what Artem meant is that you need to keep the allocation shape consistent between checkpoint and restart. For instance, if you used 4 nodes * 2 processes per node before the checkpoint, you should request the same configuration (same number of nodes, same number of processes per node) in the restart script. The node names themselves do not have to match.
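To make the point above concrete, here is a rough sketch of one way to "remember" the allocation shape between the two jobs. The helper names (save_alloc_shape, check_alloc_shape) and the file name alloc_shape.txt are made up for illustration and are not part of DMTCP. It assumes both jobs are submitted with --nodes and --ntasks-per-node, so that Slurm sets SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE.

```shell
#!/bin/bash
# Both the launch job and the restart job would carry the same directives, e.g.:
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2

# Hypothetical file used to remember the shape (an assumption, not a DMTCP file).
ALLOC_FILE="${ALLOC_FILE:-alloc_shape.txt}"

# Call in the launch job: store "<nodes> <procs-per-node>".
# SLURM_NTASKS_PER_NODE is only set when --ntasks-per-node was requested.
save_alloc_shape() {
    echo "${SLURM_JOB_NUM_NODES} ${SLURM_NTASKS_PER_NODE}" > "$ALLOC_FILE"
}

# Call in the restart job before restarting: returns 0 only if the current
# allocation has the same number of nodes and processes per node.
check_alloc_shape() {
    local saved_nodes saved_ppn
    read -r saved_nodes saved_ppn < "$ALLOC_FILE" || return 2
    if [ "$saved_nodes" != "$SLURM_JOB_NUM_NODES" ] ||
       [ "$saved_ppn" != "$SLURM_NTASKS_PER_NODE" ]; then
        echo "allocation mismatch: saved ${saved_nodes}x${saved_ppn}," \
             "current ${SLURM_JOB_NUM_NODES}x${SLURM_NTASKS_PER_NODE}" >&2
        return 1
    fi
}
```

In the restart script you could call check_alloc_shape and exit early on failure, which gives a clearer message than the "cannot map initial resources" error.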

On Fri, Oct 9, 2015 at 10:06 AM, Artem Polyakov <artpo...@gmail.com> wrote:

> P.S. no way to avoid that for now and near future IMO.
>
> 2015-10-09 17:01 GMT+03:00 Artem Polyakov <artpo...@gmail.com>:
>
>> You don't need "exact" allocation in terms of nodenames but you do need
>> to remember how many nodes and how many procs per node you had in original
>> allocation.
>>
>> 2015-10-09 16:39 GMT+03:00 MR.AB <denilson...@yahoo.fr>:
>>
>>> Hey
>>> Thank you for the email, is there a way to make it work or i have tot
>>> have variables to "remember" the exact allocations?
>>>
>>> On Friday, October 9, 2015 4:34 AM, Artem Polyakov <artpo...@gmail.com>
>>> wrote:
>>>
>>> Hello,
>>> Please note, that one of the reasons may be non-equivalent allocations.
>>> DMTCP cannot restore processes that was originally running on the same node
>>> to be on different nodes. This means that if you originally requested the
>>> following allocation: cn[0-1], ppn = 4
>>> and trying to restart on cn[0-4], ppn = 2
>>> this won't work even though the allocations are logically equivalent.
>>>
>>> 2015-10-08 16:00 GMT+03:00 abderrahmane <denilson...@yahoo.fr>:
>>>
>>> Hello
>>>
>>> I did it and still got Restart error : cannot map initial resources into
>>> the restart allocation.
>>>
>>> Also i used openmpi 1.8.8 and got the same error msg.
>>>
>>> On 10/06/2015 07:06 PM, Jiajun Cao wrote:
>>>
>>> Hi,
>>>
>>> Could you replace
>>>
>>> dmtcp_launch --rm mpirun --mca btl self,tcp ./<your binary>
>>>
>>> with the following:
>>>
>>> srun dmtcp_launch --rm ./<your binary>
>>>
>>> Also, add the following env vars to the script:
>>>
>>> export OMPI_MCA_mtl=^psm
>>> export OMPI_MCA_btl=self,tcp
>>>
>>> and try again?
>>>
>>> On Tue, Oct 6, 2015 at 4:41 PM, abderrahmane <denilson...@yahoo.fr>
>>> wrote:
>>>
>>> Hello
>>> Thanks for the respond.
>>>
>>> On 10/06/2015 02:18 PM, Jiajun Cao wrote:
>>>
>>> Hi,
>>>
>>> 1. What kind of application are you running? Is there an integration of
>>> matlab and mpi? I'm asking because I haven't run any mpi-based matlab
>>> applications before.
>>>
>>> i just created a script that calculate fibonacci number a prints it out.
>>>
>>> 2. What kind of environment are you using? Specifically, I'd like to
>>> know the MPI version, interconnect network type (Ethernet or InfiniBand),
>>> and how MPI and Slurm are integrated (i.e., in the cluster, what command do
>>> you use to run the application, srun or mpirun).
>>>
>>> I am using rhel7 and openmpi 1.8 inbiniband. for the slurm it is
>>> integrated in a cluster environment, I used the script here :
>>>
>>> https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job
>>>
>>> 3. Do you get a valid checkpoint image(s)? Also, please attach your job
>>> scripts.
>>>
>>> I get the checkpoint needed but when i restart i received the error i
>>> sent
>>>
>>> Thanks
>>>
>>> On Tue, Oct 6, 2015 at 1:29 PM, Kapil Arya <kapil.arya...@gmail.com> wrote:
>>>
>>> Jiajun, Artem,
>>>
>>> Can one of you take a look at this one?
>>>
>>> Kapil
>>>
>>> On Tue, Oct 6, 2015 at 12:31 PM, abderrahmane <denilson...@yahoo.fr> wrote:
>>>
>>> Hello
>>>
>>> Thank you for the effort and work (dmtcp), I do have some questions:
>>> ( P.S :I run my matlab code using --rm mpirun and slurm.)
>>>
>>> 1- is there a good way to run matlab code? I created a bash file in
>>> added the following :
>>> matlab -nojvm < file.m
>>>
>>> 2- running the code above with dmtcp and matlab worked fine, but when i
>>> tried to restart the code using slurm_restart.job code from your github
>>> and using --rm mpirun , I received the following error:
>>>
>>> restart error: cannot map initial resources into the restart allocation.
>>> Allocated resources : *nodex:4 nodey:4
>>>
>>> any ideas? please feel free to ask me more questions.
>>>
>>> best regards;
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> Dmtcp-forum mailing list
>>> Dmtcp-forum@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>>
>>> --
>>> С Уважением, Поляков Артем Юрьевич
>>> Best regards, Artem Y. Polyakov
>>>
>>
>> --
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
>>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>