I ran into a similar problem. Our cluster uses the PBS Pro job scheduler, and each 
job owns a directory and files on the local storage of the node it is running 
on, such as /jobfs/local/jobid.
When the application is restarted with DMTCP, the jobid has changed, so the 
directory and files are no longer there. The pathvirt plugin doesn't work. 
I found the file descriptors (3 and 4) for this directory and these files, but 
appending "3>&-" to the end of the "dmtcp_launch" line to close the file 
descriptor didn't resolve the issue.
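
For anyone debugging the same thing, here is a minimal sketch of how one might check which descriptors of a process point into the job-local directory, using /proc instead of lsof. The helper name list_fds_under is hypothetical (it is not part of DMTCP or PBS Pro), and the sketch assumes Linux with /proc mounted.

```shell
# list_fds_under PID PREFIX
# Hypothetical helper (not part of DMTCP): prints "fd target" for every
# open descriptor of PID whose target path starts with PREFIX.
# Assumes Linux, where /proc/PID/fd holds one symlink per open descriptor.
list_fds_under() {
    pid=$1
    prefix=$2
    for entry in /proc/"$pid"/fd/*; do
        # Skip entries that vanish or cannot be read.
        target=$(readlink "$entry" 2>/dev/null) || continue
        case $target in
            "$prefix"*) printf '%s %s\n' "${entry##*/}" "$target" ;;
        esac
    done
}
```

Running "list_fds_under $$ /jobfs/local" in the job script just before the dmtcp_launch line shows what the shell itself would pass on. If a descriptor only shows up when the helper is pointed at the application's PID, it was presumably opened by the application rather than inherited, in which case a redirection such as "3>&-" on the launch line has nothing to close.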

Cheers,
Yuanyuan

> On 7 Sep 2018, at 9:39 pm, Sven Willner <sven.will...@gmail.com> wrote:
> 
> Hey Xiaoge,
> 
> I had the same problem, and in my case it turned out that the open cgroup files 
> were inherited file descriptors used for watching events. The pathvirt plugin 
> did not help as it only handles newly opened files.
> 
> I worked around the problem by preventing bash (which starts my program) from 
> passing the cgroup file descriptors on to my program. Using lsof I found 
> their numbers were 10 (memory) and 11 (cpu) inherited from the slurm starting 
> process itself. Thus, I used the call
> 
> dmtcp_launch my_program 10>&- 11>&-
> 
> which works fine. I hope that helps you, too.
> 
> Sven
> 
> Wang, Xiaoge <wangx...@msu.edu> writes:
> 
>> Hello,
>> 
>> 
>> I have been trying to run a batch job (using Slurm) with checkpointing. I ran 
>> into the same issue as already reported in this forum, see 
>> https://sourceforge.net/p/dmtcp/mailman/message/36347021/ .
>> 
>> 
>> I am wondering if it is resolved. If it is resolved, what is the solution? I 
>> would like to try on my end. Thanks.
>> 
>> 
>> -Xiaoge
>> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Dmtcp-forum mailing list
>> Dmtcp-forum@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> 


