I ran into a similar problem. Our cluster uses the PBS Pro job scheduler, and each job owns a directory and files in local storage on the node it is running on, such as /jobfs/local/jobid. When the application is restarted with DMTCP, the jobid has changed, so that directory and its files no longer exist. The pathvirt plugin doesn't work. I found the file descriptors (3 and 4) for this directory and these files, but appending "3>&-" to the end of the "dmtcp_launch" line to close the file descriptor didn't resolve the issue.
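For reference, here is a minimal sketch of the descriptor check described above. It assumes Linux /proc and that the descriptors are inherited from the launching shell rather than opened by the application itself (I have not confirmed which is the case here). The helper names fds_under and close_fds_under are made up for illustration and are not part of DMTCP.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of DMTCP. They rely on Linux /proc.

# Print the descriptor numbers of process $2 (default: the current shell)
# whose target path starts with the prefix given in $1.
fds_under() {
    local prefix="$1" pid="${2:-$$}" link target
    for link in /proc/"$pid"/fd/*; do
        # Skip entries that vanish or cannot be read.
        target=$(readlink "$link" 2>/dev/null) || continue
        case $target in
            "$prefix"*) printf '%s\n' "${link##*/}" ;;
        esac
    done
}

# Close every descriptor of the current shell that points under prefix $1,
# so that commands started afterwards do not inherit it.
close_fds_under() {
    local fd
    for fd in $(fds_under "$1"); do
        eval "exec $fd>&-"
    done
}

# Possible use in a job script (paths and program name are placeholders):
#   close_fds_under /jobfs/local
#   dmtcp_launch my_program
```

Closing the descriptors in the launching shell has the same effect on the child as writing one "N>&-" redirection per descriptor on the dmtcp_launch line. It will not help if the application opens the files itself after it starts.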
Cheers,
Yuanyuan

> On 7 Sep 2018, at 9:39 pm, Sven Willner <sven.will...@gmail.com> wrote:
>
> Hey Xiaoge,
>
> I had the same problem and in my case it turned out that the open cgroup files
> were inherited file descriptors used for watching events. The pathvirt plugin
> did not help as it only handles newly opened files.
>
> I worked around the problem by preventing bash (which starts my program) from
> passing the cgroup file descriptors on to my program. Using lsof I found
> their numbers were 10 (memory) and 11 (cpu) inherited from the slurm starting
> process itself. Thus, I used the call
>
>   dmtcp_launch my_program 10>&- 11>&-
>
> which works fine. I hope that helps you, too.
>
> Sven
>
> Wang, Xiaoge <wangx...@msu.edu> writes:
>
>> Hello,
>>
>> I have been trying to run a batch job (using slurm) with checkpointing. I ran
>> into the same issue as already reported in this forum, see
>> https://sourceforge.net/p/dmtcp/mailman/message/36347021/ .
>>
>> I am wondering if it is resolved. If it is resolved, what is the solution? I
>> would like to try on my end. Thanks.
>>
>> -Xiaoge
>> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Dmtcp-forum mailing list
>> Dmtcp-forum@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum