The error message indicates that the RM (resource manager) plugin couldn't map the checkpoint images onto the set of restart resources (nodes, processes). If possible, could you provide us with a guest account on your cluster so we can verify the issue? I think this is the most efficient way to figure out what is going on.
Best,
Jiajun

On Thu, Oct 8, 2015 at 9:00 AM, abderrahmane <denilson...@yahoo.fr> wrote:
> Hello
>
> I did it and still got "Restart error : cannot map initial resources into
> the restart allocation."
>
> I also used Open MPI 1.8.8 and got the same error message.
>
> On 10/06/2015 07:06 PM, Jiajun Cao wrote:
>> Hi,
>>
>> Could you replace
>>
>> dmtcp_launch --rm mpirun --mca btl self,tcp ./<your binary>
>>
>> with the following:
>>
>> srun dmtcp_launch --rm ./<your binary>
>>
>> Also, add the following env vars to the script:
>>
>> export OMPI_MCA_mtl=^psm
>> export OMPI_MCA_btl=self,tcp
>>
>> and try again?
>>
>> On Tue, Oct 6, 2015 at 4:41 PM, abderrahmane <denilson...@yahoo.fr> wrote:
>>> Hello
>>>
>>> Thanks for the response.
>>>
>>> On 10/06/2015 02:18 PM, Jiajun Cao wrote:
>>>
>>> Hi,
>>>
>>> 1. What kind of application are you running? Is there an integration of
>>> MATLAB and MPI? I'm asking because I haven't run any MPI-based MATLAB
>>> applications before.
>>>
>>> I just created a script that calculates a Fibonacci number and prints it out.
>>>
>>> 2. What kind of environment are you using? Specifically, I'd like to know
>>> the MPI version, interconnect network type (Ethernet or InfiniBand), and
>>> how MPI and Slurm are integrated (i.e., in the cluster, what command do you
>>> use to run the application, srun or mpirun).
>>>
>>> I am using RHEL 7 and Open MPI 1.8 over InfiniBand. As for Slurm, it is
>>> integrated in the cluster environment; I used the script here:
>>>
>>> https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job
>>>
>>> 3. Do you get valid checkpoint image(s)? Also, please attach your job
>>> scripts.
>>>
>>> I get the checkpoint images, but when I restart I receive the error I sent.
>>>
>>> Thanks
>>>
>>> On Tue, Oct 6, 2015 at 1:29 PM, Kapil Arya <kapil.arya...@gmail.com> wrote:
>>>> Jiajun, Artem,
>>>>
>>>> Can one of you take a look at this one?
>>>>
>>>> Kapil
>>>>
>>>> On Tue, Oct 6, 2015 at 12:31 PM, abderrahmane <denilson...@yahoo.fr> wrote:
>>>>> Hello
>>>>>
>>>>> Thank you for your effort and work on DMTCP. I do have some questions:
>>>>> (P.S.: I run my MATLAB code using --rm mpirun and Slurm.)
>>>>>
>>>>> 1- Is there a good way to run MATLAB code? I created a bash file and
>>>>> added the following:
>>>>> matlab -nojvm < file.m
>>>>>
>>>>> 2- Running the code above with DMTCP and MATLAB worked fine, but when I
>>>>> tried to restart it using the slurm_restart.job script from your GitHub
>>>>> with --rm mpirun, I received the following error:
>>>>>
>>>>> restart error: cannot map initial resources into the restart allocation.
>>>>> Allocated resources : nodex:4 nodey:4
>>>>>
>>>>> Any ideas? Please feel free to ask me more questions.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> _______________________________________________
>>>>> Dmtcp-forum mailing list
>>>>> Dmtcp-forum@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
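For reference, the launch change Jiajun suggested on Oct 6 can be collected into a single job-script fragment. This is a minimal sketch, not an official DMTCP example: APP is a hypothetical placeholder for the MPI binary, the fragment assumes a DMTCP coordinator has already been started earlier in the job script, and the dry-run branch exists only so the fragment can be tried outside a Slurm allocation.

```shell
#!/bin/bash
# Sketch of the launch step suggested in the thread (not an official DMTCP script).
# APP is a hypothetical placeholder for the MPI application binary.
# Assumes a DMTCP coordinator was already started earlier in the job script.
APP="${APP:-./your_binary}"

# Exclude the PSM MTL and restrict Open MPI to the self and TCP BTLs,
# exactly as suggested in the thread.
export OMPI_MCA_mtl=^psm
export OMPI_MCA_btl=self,tcp

# Let Slurm start the ranks directly under dmtcp_launch instead of wrapping mpirun.
LAUNCH=(srun dmtcp_launch --rm "$APP")

if command -v srun >/dev/null 2>&1; then
  "${LAUNCH[@]}"
else
  # Outside a Slurm environment, only show the command that would be run.
  echo "would run: ${LAUNCH[*]}"
fi
```

The fragment mirrors the suggestion in the thread: srun starts the ranks instead of mpirun, and the two OMPI_MCA_* variables limit Open MPI's transports. As the Oct 8 reply shows, this change alone did not resolve the restart error for the original poster.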