I have been using DMTCP successfully for a long-running optim() task. This is a
single-core process running on a large Linux cluster with Slurm as the job
manager. The cluster places an 8-hour limit on individual jobs, and since my
cost function takes 11 minutes to compute, I need many such jobs run
sequentially. To make DMTCP work, I had to rework file I/O to avoid
references to temporary files written to /tmp, but otherwise it works: optim()
is checkpointed just before the 8 hours are up, and then resumed successfully
in a subsequent batch job running on a different core of the cluster.

While I have an answer for my particular task, it would still be useful to 
checkpoint using the scheme Henrik suggests. Thanks all for the interesting 
conversation!

-Andy



On 12/14/21 5:39 PM, Henrik Bengtsson wrote:
On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson <a...@yovo.org> wrote:

Those are good points, Duncan. I am experimenting with a nice checkpointing 
tool called DMTCP. It operates on the system level but is quite OS-dependent. 
It can be found at http://dmtcp.sourceforge.net/index.html.

Still, it would be nice to be able to checkpoint calls within R to potentially 
long-running processes like optim().

Teasing idea. Imagine if we could come up with some de-facto standard
API for this and that such a framework could be called automatically
by R. Something similar to how user interrupts are checked (e.g.
R_CheckUserInterrupt()) on a regular basis by the R engine and
throughout the R code. That could help with troubleshooting and debugging,
e.g. sending the checkpoint to someone else or going backwards in
time.

Pasting in the below since I failed to hit Reply *All* the other day,
and it was only Richard who got it:

A few weeks ago, I played around with DMTCP (Distributed MultiThreaded
CheckPointing) for Linux (https://github.com/dmtcp/dmtcp).  I'm
sharing in case someone is interested in investigating this further.
Also, somewhere on the DMTCP wiki, they asked for testing with R by
more experienced users.

"DMTCP is a tool to transparently checkpoint the state of multiple
simultaneous applications, including multi-threaded and distributed
applications. It operates directly on the user binary executable,
without any Linux kernel modules or other kernel modifications."

They seem to be able to run this with HPC jobs, open files, Linux
containers, and even MPI, and so on.  I've only tested it very quickly
with interactive R and it seems to work.  Obviously more testing needs
to be done to identify when it doesn't work.  For example, I'd have a
hard time believing it would work out of the box with local parallel
PSOCK workers.  They mention "plug-ins", so maybe there's a way to add
support for specific use cases one by one.

Various academic HPC environments appear to use it, e.g.

* https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
* http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/
* https://wiki.york.ac.uk/display/RCS/VK21%29+Checkpointing+with+DMTCP

That's all I have time for now,

Henrik


-Andy

On 12/13/21 11:51 AM, Duncan Murdoch wrote:
On 13/12/2021 12:58 p.m., Greg Minshall wrote:
Jeff,

This sounds like an OS feature, not an R feature... certainly not a
portable R feature.

i'm not arguing for it, but this seems to me like something that could
be a language feature.


R functions can call libraries written in other languages, and can start 
processes, etc.  R doesn't know everything going on in every function call, and 
would have a lot of trouble saving it.

If you added some limitations, e.g. a process that periodically has its entire 
state stored in R variables, then it would be a lot easier.

Duncan Murdoch
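
As an illustration of the restricted case Duncan describes, here is a minimal sketch in which the whole optimization state lives in ordinary R variables and is written to disk between short optim() runs. The helper name checkpointed_optim() and its arguments file, chunk_maxit and max_chunks are hypothetical, not part of R or of any package mentioned in this thread. One caveat: restarting optim() from the saved parameters discards the optimizer's internal state (for example the Nelder-Mead simplex), so the search path will generally differ from that of a single uninterrupted run.

```r
# Hypothetical helper: run optim() in short chunks and save the current
# parameters to disk after each chunk so that a later job can resume.
checkpointed_optim <- function(fn, par, file, chunk_maxit = 50L,
                               max_chunks = 100L, ...) {
  # Resume from an earlier checkpoint if one exists; otherwise start fresh.
  state <- if (file.exists(file)) {
    readRDS(file)
  } else {
    list(par = par, value = NA_real_, chunks = 0L, converged = FALSE)
  }
  while (!state$converged && state$chunks < max_chunks) {
    # Arguments in ... (e.g. method = "BFGS") are passed on to optim().
    res <- optim(state$par, fn, ..., control = list(maxit = chunk_maxit))
    state <- list(par = res$par, value = res$value,
                  chunks = state$chunks + 1L,
                  converged = (res$convergence == 0L))
    # Write to a temporary file and then rename it, so that an interrupted
    # write does not damage the last good checkpoint (assumes a POSIX-like
    # file system where the rename replaces the target).
    tmp <- paste0(file, ".tmp")
    saveRDS(state, tmp)
    file.rename(tmp, file)
  }
  state
}

# Usage: a first "job" runs three chunks; a later "job" picks up the file.
rosenbrock <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
ckpt <- tempfile(fileext = ".rds")
first <- checkpointed_optim(rosenbrock, c(-1.2, 1), ckpt, max_chunks = 3L)
# The starting values are ignored here because the checkpoint file exists.
second <- checkpointed_optim(rosenbrock, c(-1.2, 1), ckpt, max_chunks = 6L)
```

With an expensive cost function, chunk_maxit would be chosen so that one chunk fits comfortably inside the job's time limit. Unlike DMTCP, this only protects work completed at chunk boundaries; anything computed inside an interrupted chunk is lost.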

--
Andy Jacobson
a...@yovo.org

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Andy Jacobson
andy.jacob...@noaa.gov

NOAA Global Monitoring Lab
325 Broadway
Boulder, Colorado 80305

303/497-4916
