Hi,

I have been testing DMTCP for checkpoint/restart (c/r) on a cluster.  When I 
test manually, it runs fine: jobs restart as expected, and it is generally 
impressive.

When I try to integrate it with Slurm, I run into issues when restarting a 
process.  It seems to be a cgroup issue.  We have Slurm create a cgroup for 
each job, so the job accesses resources within that cgroup.  When a job 
restarts, it looks for those resources in the previous job's cgroup.  What is 
the best way to map them to the new cgroup created after resubmitting the 
job?

Here is the error when trying to restart from a checkpoint file:

[naveed@hpc-90-21 cp]$  dmtcp_restart  --interval 120 --new-coordinator ckpt_openssl_2ad5fc20c8bb8d9-40000-7ffcf636c0aab.dmtcp
[40000] ERROR at fileconnection.cpp:737 in refill; REASON='JASSERT(jalib::Filesystem::FileExists(_path)) failed'
     _path = /sys/fs/cgroup/blkio,cpuacct,memory,freezer/slurm/uid_8688/job_509451/step_0/memory.oom_control
Message: File not found.
openssl (40000): Terminating...

(this is just a test with openssl speed)

The original job was 509451 and it was restarted as job 509452, so it has 
a different cgroup.
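
To make the mismatch concrete, here is a rough sketch of the path mapping I 
have in mind.  This is only my own illustration: remap_cgroup_path is a 
hypothetical helper, not part of DMTCP or Slurm, and it assumes the cgroup 
layout seen in the error above (.../slurm/uid_<uid>/job_<id>/step_<n>/...).

```python
import re


def remap_cgroup_path(path: str, new_job_id: int) -> str:
    """Replace the job_<id> component of a Slurm cgroup path.

    Hypothetical helper for illustration only. It assumes the path
    contains exactly one "/job_<digits>" component, as in the error
    message above, and leaves every other component untouched.
    """
    # Match "/job_<digits>" only when it is a whole path component.
    return re.sub(r"/job_\d+(?=/|$)", f"/job_{new_job_id}", path, count=1)


# Path recorded in the checkpoint image (taken from the error above).
old_path = (
    "/sys/fs/cgroup/blkio,cpuacct,memory,freezer/slurm/"
    "uid_8688/job_509451/step_0/memory.oom_control"
)

# Path that would exist under the resubmitted job's cgroup.
new_path = remap_cgroup_path(old_path, 509452)
print(new_path)
```

Something equivalent would presumably have to happen at restart time, before 
the file is reopened, for the restored process to find its cgroup files.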

I would imagine this is already a solved problem given the Slurm integration 
work, and that I am just missing something.  What do others do in this 
situation?

Naveed

