Hey, thank you for the email. Is there a way to make it work, or do I have to keep variables that "remember" the exact allocation?
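For reference, the constraint described in the reply quoted below (processes that shared a node at checkpoint time cannot be split across nodes at restart) can be sketched as a small check. This is only an illustration, not DMTCP's actual mapping code: the function name `can_map` and the list-of-counts representation are hypothetical, and it assumes each original node is mapped onto its own distinct restart node.

```python
def can_map(original, restart):
    """Return True if every original per-node process group fits on a
    distinct node of the restart allocation.

    original: processes per node at checkpoint time, e.g. [4, 4]
    restart:  slots per node in the restart allocation, e.g. [2, 2, 2, 2, 2]

    Assumption (not taken from DMTCP): one original node maps to exactly
    one restart node, and a group of processes is never split.
    """
    if len(original) > len(restart):
        return False
    # Pair the largest groups with the largest nodes; if this pairing
    # fails, no other one-to-one pairing can succeed.
    groups = sorted(original, reverse=True)
    slots = sorted(restart, reverse=True)
    return all(g <= s for g, s in zip(groups, slots))


# cn[0-1], ppn = 4 restarted on cn[0-4], ppn = 2: groups of 4 do not fit.
print(can_map([4, 4], [2, 2, 2, 2, 2]))  # False
# Same shape as the original allocation: fits.
print(can_map([4, 4], [4, 4]))  # True
```

Under that reading, what has to be remembered is the shape of the allocation (how many processes ran together on each node), which the restart request then has to be able to accommodate.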
On Friday, October 9, 2015 4:34 AM, Artem Polyakov <artpo...@gmail.com> wrote:

Hello,

Please note that one of the reasons may be non-equivalent allocations. DMTCP cannot restore processes that were originally running on the same node onto different nodes. This means that if you originally requested the allocation

    cn[0-1], ppn = 4

and are trying to restart on

    cn[0-4], ppn = 2

this won't work, even though the allocations are logically equivalent.

2015-10-08 16:00 GMT+03:00 abderrahmane <denilson...@yahoo.fr>:

Hello,

I did it and still got "Restart error : cannot map initial resources into the restart allocation." I also used openmpi 1.8.8 and got the same error message.

On 10/06/2015 07:06 PM, Jiajun Cao wrote:

Hi,

Could you replace

    dmtcp_launch --rm mpirun --mca btl self,tcp ./<your binary>

with the following:

    srun dmtcp_launch --rm ./<your binary>

Also, add the following env vars to the script:

    export OMPI_MCA_mtl=^psm
    export OMPI_MCA_btl=self,tcp

and try again?

On Tue, Oct 6, 2015 at 4:41 PM, abderrahmane <denilson...@yahoo.fr> wrote:

Hello,

Thanks for the response.

On 10/06/2015 02:18 PM, Jiajun Cao wrote:

Hi,

1. What kind of application are you running? Is there an integration of matlab and mpi? I'm asking because I haven't run any mpi-based matlab applications before.

I just created a script that calculates a Fibonacci number and prints it out.

2. What kind of environment are you using? Specifically, I'd like to know the MPI version, interconnect network type (Ethernet or InfiniBand), and how MPI and Slurm are integrated (i.e., in the cluster, what command do you use to run the application, srun or mpirun).

I am using rhel7 and openmpi 1.8 with InfiniBand. As for Slurm, it is integrated in a cluster environment; I used the script here: https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job

3. Do you get a valid checkpoint image(s)? Also, please attach your job scripts.
I get the checkpoint needed, but when I restart I receive the error I sent.

Thanks

On Tue, Oct 6, 2015 at 1:29 PM, Kapil Arya <kapil.arya...@gmail.com> wrote:

Jiajun, Artem,

Can one of you take a look at this one?

Kapil

On Tue, Oct 6, 2015 at 12:31 PM, abderrahmane <denilson...@yahoo.fr> wrote:

Hello,

Thank you for the effort and work (dmtcp). I do have some questions (P.S.: I run my matlab code using --rm mpirun and slurm):

1- Is there a good way to run matlab code? I created a bash file and added the following:

    matlab -nojvm < file.m

2- Running the code above with dmtcp and matlab worked fine, but when I tried to restart the code using the slurm_restart.job script from your github and using --rm mpirun, I received the following error:

    restart error: cannot map initial resources into the restart allocation.
    Allocated resources : *nodex:4 nodey:4

Any ideas? Please feel free to ask me more questions.

Best regards;

------------------------------------------------------------------------------
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum

--
Best regards, Artem Y. Polyakov