Hi, I have been testing dmtcp for c/r on a cluster. When I test manually, it runs fine: jobs restart as expected, and it is generally impressive.
When I try to integrate it with Slurm, I run into issues when restarting a process. It seems to be a cgroup issue. We have Slurm create a cgroup for each job, so the job accesses resources within that cgroup. When a job restarts, it looks for those resources in the previous job's cgroup. What is the best way to map them to the new cgroup created after resubmitting the job?

Here is the error when trying to restart from a checkpoint file:

[naveed@hpc-90-21 cp]$ dmtcp_restart --interval 120 --new-coordinator ckpt_openssl_2ad5fc20c8bb8d9-40000-7ffcf636c0aab.dmtcp
[40000] ERROR at fileconnection.cpp:737 in refill; REASON='JASSERT(jalib::Filesystem::FileExists(_path)) failed'
_path = /sys/fs/cgroup/blkio,cpuacct,memory,freezer/slurm/uid_8688/job_509451/step_0/memory.oom_control
Message: File not found.
openssl (40000): Terminating...

(This is just a test with openssl speed.)

The original job was 509451 and it was restarted as job 509452, so it has a different cgroup.

I would imagine this is a solved problem given the Slurm integration work, and I am just missing something. What do others do in this situation?

Naveed

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum