Hey Xiaoge,
I had the same problem, and in my case it turned out that the open
cgroup files were inherited file descriptors used for watching
events. The pathvirt plugin did not help as it only handles newly
opened files.
I worked around the problem by preventing bash (which starts my
program) from passing the cgroup file descriptors on to my
program. Using lsof I found their numbers were 10 (memory) and 11
(cpu) inherited from the slurm starting process itself. Thus, I
used the call
    dmtcp_launch my_program 10>&- 11>&-
which works fine. I hope that helps you, too.
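In case it is useful, here is a minimal sketch of the same idea that can be tried outside of slurm and DMTCP, assuming Linux with /proc mounted. The helper name child_has_fd is made up for this illustration, and descriptors 8 and 9 stand in for the cgroup descriptors 10 and 11 from my case, so that the sketch also runs in shells that only accept single-digit descriptor numbers in redirections. It only shows that a "N>&-" redirection keeps a child process from inheriting descriptor N; it does not involve dmtcp_launch itself.

```shell
# Open two descriptors in this shell. They stand in for the inherited
# cgroup descriptors (10 and 11 in the message above).
exec 8</dev/null 9</dev/null

# Show which descriptors this shell currently holds (lsof -p $$ gives
# a similar listing where lsof is installed).
ls -l "/proc/$$/fd"

# Hypothetical helper: start a child shell and report (via exit status)
# whether descriptor $1 is open inside that child.
child_has_fd() {
    sh -c '[ -e "/proc/self/fd/$1" ]' sh "$1"
}

# Without any redirection the child inherits descriptor 8.
if child_has_fd 8; then before=open; else before=closed; fi

# With 8>&- 9>&- the descriptors are closed for the duration of the
# call, so the child never sees them.
if child_has_fd 8 8>&- 9>&-; then after=open; else after=closed; fi

echo "fd 8 in child without redirection: $before"
echo "fd 8 in child with 8>&- 9>&-:      $after"
```

The redirection only affects the command it is attached to, so the descriptors stay open in the starting shell afterwards.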
Sven
Wang, Xiaoge <wangx...@msu.edu> writes:
Hello,
I have been trying to run a batch job (using slurm) with
checkpointing. I ran into the same issue as already reported in
this forum, see
https://sourceforge.net/p/dmtcp/mailman/message/36347021/ .
I am wondering if it is resolved. If it is resolved, what is the
solution? I would like to try on my end. Thanks.
-Xiaoge
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org!
http://sdm.link/slashdot
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum