Hey Xiaoge,

I had the same problem, and in my case it turned out that the open cgroup files were inherited file descriptors used for watching events. The pathvirt plugin did not help, as it only handles newly opened files.

I worked around the problem by preventing bash (which starts my program) from passing the cgroup file descriptors on to my program. Using lsof, I found that their numbers were 10 (memory) and 11 (cpu), inherited from the slurm process that starts the job. I therefore used the call

dmtcp_launch my_program 10>&- 11>&-

which works fine. I hope that helps you, too.
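
The descriptor numbers 10 and 11 are what lsof showed on my system and may differ elsewhere. As a rough sketch only, assuming a Linux /proc filesystem and that the inherited descriptors point at paths containing "cgroup", something like the following could find and close them in the launching shell. The function names list_fds_matching and close_fds_matching are made up for this illustration and are not part of DMTCP or slurm.

```shell
#!/bin/bash
# Hypothetical helpers (not part of DMTCP or slurm).
# Assumes Linux, where /proc/<pid>/fd holds one symlink per open descriptor.

# Print the numbers of this shell's open file descriptors whose
# target path contains the given pattern (default: "cgroup").
list_fds_matching() {
    local pattern="${1:-cgroup}" fd target
    for fd in /proc/$$/fd/*; do
        # Skip entries that vanish or cannot be resolved.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *"$pattern"*) echo "${fd##*/}" ;;
        esac
    done
}

# Close every matching descriptor in the current shell, so that
# programs started afterwards no longer inherit it.
close_fds_matching() {
    local fd
    for fd in $(list_fds_matching "${1:-cgroup}"); do
        eval "exec ${fd}>&-"
    done
}

# Possible usage in the batch script, before launching:
#   close_fds_matching cgroup
#   dmtcp_launch my_program
```

This closes the descriptors in the shell itself rather than per command, so the explicit 10>&- 11>&- redirections are no longer needed on the dmtcp_launch line. Check the output of list_fds_matching first to make sure nothing you still need is matched.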

Sven

Wang, Xiaoge <wangx...@msu.edu> writes:

Hello,


I have been trying to run a batch job (using slurm) with checkpointing. I ran into the same issue as already reported in this forum, see https://sourceforge.net/p/dmtcp/mailman/message/36347021/ .


I am wondering whether it has been resolved. If so, what is the solution? I would like to try it on my end. Thanks.


-Xiaoge
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
