Re: [R] checkpointing
Colleagues,

I am late to this thread. (It brings me back to my days running checkpoint/restart on an IBM 370, which was very useful for very, very long jobs.)

A search for "linux checkpoint restore" retrieved information about CRIU (Checkpoint/Restore In Userspace), which sounds a lot like the facility I used on the IBM 370. It appears to allow a user's process to be stopped, have its state backed up, and then be restarted. Perhaps this would address (at least for Linux users of R or RStudio) the request for a checkpoint/restart ability in an R program. Please let me know if you agree.

John

From: R-help on behalf of Andy Jacobson via R-help
Sent: Tuesday, December 14, 2021 8:59 PM
To: Henrik Bengtsson
Cc: Greg Minshall; Andy Jacobson via R-help; Andy Jacobson
Subject: Re: [R] checkpointing

I have been using DMTCP successfully for a long-running optim() task. This is a single-core process running on a large linux cluster with slurm as the job manager. This cluster places an 8-hour limit on individual jobs, and since my cost function takes 11 minutes to compute, I need many such jobs run sequentially.

To make DMTCP work, I have had to rework file I/O to avoid references to temporary files written to /tmp, but other than that...optim() is checkpointed just before 8 hours is up, and then resumed successfully in a subsequent batch job running on a different core of the cluster.

While I have an answer for my particular task, it would still be useful to checkpoint using the scheme Henrik suggests.

Thanks all for the interesting conversation!

-Andy
Re: [R] checkpointing
I have been using DMTCP successfully for a long-running optim() task. This is a single-core process running on a large Linux cluster with Slurm as the job manager. This cluster places an 8-hour limit on individual jobs, and since my cost function takes 11 minutes to compute, I need many such jobs run sequentially.

To make DMTCP work, I have had to rework file I/O to avoid references to temporary files written to /tmp, but other than that... optim() is checkpointed just before 8 hours is up, and then resumed successfully in a subsequent batch job running on a different core of the cluster.

While I have an answer for my particular task, it would still be useful to checkpoint using the scheme Henrik suggests.

Thanks all for the interesting conversation!

-Andy

On 12/14/21 5:39 PM, Henrik Bengtsson wrote:
> Teasing idea. Imagine if we could come up with some de-facto standard
> API for this and that such a framework could be called automatically
> by R. [...]

--
Andy Jacobson
andy.jacob...@noaa.gov

NOAA Global Monitoring Lab
325 Broadway
Boulder, Colorado 80305

303/497-4916

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
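The arithmetic behind the constraint described in this message (an 8-hour job limit and 11 minutes per cost-function evaluation) can be worked out as a rough sketch. The 15-minute safety margin and the total of 1000 evaluations are assumptions made up for illustration; neither figure comes from the thread.

```r
# Figures taken from the message above.
job_limit_min <- 8 * 60   # wall-clock limit per batch job, in minutes
eval_min      <- 11       # minutes per cost-function evaluation

# Assumption: leave some time at the end of each job to write the checkpoint.
margin_min <- 15

# Whole evaluations that fit into one job.
evals_per_job <- floor((job_limit_min - margin_min) / eval_min)

# Number of sequential jobs needed for a given number of evaluations.
jobs_needed <- function(total_evals) ceiling(total_evals / evals_per_job)

evals_per_job        # 42
jobs_needed(1000)    # 24 (1000 is a made-up total)
```

With no margin at all the same calculation gives 43 evaluations per job, so an optimisation needing more than a few dozen evaluations cannot finish inside a single job and has to be carried across several.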
Re: [R] checkpointing
On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson wrote:
>
> Those are good points, Duncan. I am experimenting with a nice checkpointing
> tool called DMTCP. It operates on the system level but is quite OS-dependent.
> It can be found at http://dmtcp.sourceforge.net/index.html.
>
> Still, it would be nice to be able to checkpoint calls within R to
> potentially long-running processes like optim().

Teasing idea. Imagine if we could come up with some de facto standard API for this, and that such a framework could be called automatically by R. Something similar to how user interrupts are checked (e.g. R_CheckUserInterrupt()) on a regular basis by the R engine and throughout the R code. That could help troubleshooting and debugging, e.g. sending the checkpoint to someone else or going backwards in time.

Pasting in the below since I failed to hit Reply *All* the other day, and it was only Richard who got it:

A few weeks ago, I played around with DMTCP (Distributed MultiThreaded CheckPointing) for Linux (https://github.com/dmtcp/dmtcp). I'm sharing in case someone is interested in investigating this further. Also, somewhere on the DMTCP wiki, they asked for testing with R by more experienced users.

"DMTCP is a tool to transparently checkpoint the state of multiple simultaneous applications, including multi-threaded and distributed applications. It operates directly on the user binary executable, without any Linux kernel modules or other kernel modifications."

They seem to be able to run this with HPC jobs, open files, Linux containers, and even MPI, and so on. I've only tested it very quickly with interactive R and it seems to work. Obviously more testing needs to be done to identify when it doesn't work. For example, I'd have a hard time believing it would work out of the box with local parallel PSOCK workers. They mention "plug-ins", so maybe there's a way to add support for specific use cases one by one.

Several academic HPC environments appear to use it, e.g.

* https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
* http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/
* https://wiki.york.ac.uk/display/RCS/VK21%29+Checkpointing+with+DMTCP

That's all I have time for now,

Henrik
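To make the "de facto standard API" idea a little more concrete, here is a minimal sketch in plain R of what a polling-style checkpoint hook might look like. Every name in it (checkpoint_register, checkpoint_poll, the .ckpt environment, the 60-second interval) is hypothetical; nothing like this exists in base R, and a real design would have to live much deeper in the engine to be called automatically.

```r
# Hypothetical registry of checkpoint handlers (not part of base R).
.ckpt <- new.env()
.ckpt$handlers <- list()
.ckpt$last     <- Sys.time()
.ckpt$interval <- 60   # seconds between automatic checkpoints (assumed)

# Register a function that knows how to persist a state object.
checkpoint_register <- function(name, handler) {
  stopifnot(is.function(handler))
  .ckpt$handlers[[name]] <- handler
  invisible(name)
}

# Meant to be called often from long-running code, in the spirit of
# R_CheckUserInterrupt(): cheap unless a checkpoint is due or forced.
checkpoint_poll <- function(state, force = FALSE) {
  now <- Sys.time()
  elapsed <- as.numeric(difftime(now, .ckpt$last, units = "secs"))
  if (!force && elapsed < .ckpt$interval) return(invisible(FALSE))
  for (h in .ckpt$handlers) h(state)
  .ckpt$last <- now
  invisible(TRUE)
}

# Usage: a toy loop that forces a checkpoint every 25 iterations.
# tempfile() is only for the demo; a real job needs a persistent path.
ckpt_file <- tempfile(fileext = ".rds")
checkpoint_register("rds", function(state) saveRDS(state, ckpt_file))

total <- 0
for (i in 1:100) {
  total <- total + i
  checkpoint_poll(list(i = i, total = total), force = (i %% 25 == 0))
}
restored <- readRDS(ckpt_file)
```

The sketch only saves what the caller passes in as `state`, which is exactly the limitation raised elsewhere in the thread: R code cannot see state held by external libraries or child processes.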
Re: [R] checkpointing
Those are good points, Duncan. I am experimenting with a nice checkpointing tool called DMTCP. It operates on the system level but is quite OS-dependent. It can be found at http://dmtcp.sourceforge.net/index.html.

Still, it would be nice to be able to checkpoint calls within R to potentially long-running processes like optim().

-Andy

On 12/13/21 11:51 AM, Duncan Murdoch wrote:
> On 13/12/2021 12:58 p.m., Greg Minshall wrote:
>> Jeff,
>>
>>> This sounds like an OS feature, not an R feature... certainly not a
>>> portable R feature.
>>
>> i'm not arguing for it, but this seems to me like something that could
>> be a language feature.
>
> R functions can call libraries written in other languages, and can start
> processes, etc. R doesn't know everything going on in every function call,
> and would have a lot of trouble saving it.
>
> If you added some limitations, e.g. a process that periodically has its
> entire state stored in R variables, then it would be a lot easier.
>
> Duncan Murdoch

--
Andy Jacobson
a...@yovo.org
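One way to approximate checkpointing of optim() from within R, without any system-level tool, is to run it in short chunks and save the current best parameters after each chunk. The sketch below is an illustration under stated assumptions, not the approach used in this thread: checkpointed_optim, chunk_iter and max_chunks are hypothetical names, and the Rosenbrock function stands in for a real cost function. Note also that restarting Nelder-Mead from saved parameters rebuilds the simplex, so the search path is not identical to a single uninterrupted run.

```r
# Hypothetical wrapper: run optim() in chunks, saving progress to `file`.
checkpointed_optim <- function(par, fn, file, chunk_iter = 50,
                               max_chunks = 20, ...) {
  # Resume from an earlier checkpoint if one exists.
  state <- if (file.exists(file)) {
    readRDS(file)
  } else {
    list(par = par, chunk = 0, done = FALSE)
  }
  while (!state$done && state$chunk < max_chunks) {
    res <- optim(state$par, fn, ..., control = list(maxit = chunk_iter))
    state$par   <- res$par
    state$value <- res$value
    state$chunk <- state$chunk + 1
    state$done  <- res$convergence == 0  # 0: optim() reports convergence
    saveRDS(state, file)                 # checkpoint after every chunk
  }
  state
}

# Usage with a stand-in cost function. tempdir() is only for the demo;
# on a cluster the checkpoint would need to live on persistent storage.
rosenbrock <- function(x) (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
ckpt <- file.path(tempdir(), "optim_checkpoint.rds")
fit  <- checkpointed_optim(c(-1.2, 1), rosenbrock, ckpt)

# A second call picks up the saved state instead of starting over.
fit2 <- checkpointed_optim(c(-1.2, 1), rosenbrock, ckpt)
```

If a batch job is killed mid-chunk, at most one chunk of work is lost; the next job calls the same function with the same file and continues from the last saved parameters.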
Re: [R] checkpointing
Jeff,

I am wondering what it even means to do what you say.

In a compiled language, I can imagine wrapping up an executable along with some kind of run-time image (which may actually contain the parts of the executable that have not run yet) and reviving it elsewhere. But even there, how would it work if, say, the executable kept opening more files for reading or appending and you moved it to a place where those files did not exist or had different contents, or other such scenarios? What happens to open pipes with another process attached, for an OS that supports pipes? When you restart, the other processes are not there even if you supply an image of a pipe, and I am sure others can imagine much more.

R is interpreted. You could say the main interpreter may be like an executable and there may be multiple threads active at the time you stop the process and bundle it to be restarted later. But R has many fairly dynamic features, including some the interpreter has not even looked at yet. Besides files it may want to open, there are any number of statements like library(filename) it may encounter and of course other files it may source(code). In general, the info on what may be needed later is not in any serious way bundled with the file, and many things may be hard to predict even with a look ahead, as often arguments to functions are not evaluated till some indefinite later time or even never.

I am trying to imagine how you stop and restore, say, an R program running connected to something like RStudio which is also connected to a Python program, with data and instructions flowing back and forth. It does not strike me as easy to make a reliable method to do this, albeit, as noted, there are operating systems that do allow you to suspend arbitrary processes and restart them locally, perhaps only before the system reboots.

But I can think of exceptions, including some I see others have thought of.
An example might be an R program that reads in lots of data, then makes objects like data.frames and then pauses in some kind of nested loop that will process the data while having the current indices saved in variables. It could literally ask to be frozen so it starts up from there when asked to. R can be set to intercept some signals and perhaps voluntarily save all the variables as they are (including the data it may be operating on and what it is making from it (as in what search items it has already found) as well as the needed index info) and exit gracefully. If the application is restarted, it might note the file with saved info and read in all the data and continue from there.

The above is not a serious proposal and has lots of things that can go wrong, but I can imagine it as an app that sets itself up doing heavy lifting once and later, every time you want to do a search, it loads the data and gets from you something to search for and does it quickly and resuspends till needed.

But this example is not exactly what you asked for. I have actually done weird things like the above, including things that simply start up again after a reboot as if nothing happened.

What is a more interesting question for me is what R features might make sense that help construct a program that is in some sense re-startable if used right. I can imagine a package that lets you set things like a "level" for debugging so that your code when started at some point says:

    # Initialize.
    # Load any left-in-file data (including `level`) if it exists.
    if (level < 2) {
      # do stuff
      level <- 2
    }
    if (level < 3) {
      # do more stuff
      level <- 3
    }
    # ...

Something like the above might wrap parts in something like a "try()" that intercepts some interrupt condition and saves the needed status info.
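The "level" scheme described above can be sketched as a small runnable helper that records which stage has completed and skips finished stages on restart. This is only an illustration: run_stages and the three toy stages are hypothetical, and the sketch assumes each stage is a function taking the previous stage's result.

```r
# Hypothetical helper: run numbered stages, skipping completed ones.
run_stages <- function(stages, file) {
  # Load any left-in-file state if it exists, else start at level 1.
  state <- if (file.exists(file)) readRDS(file) else list(level = 1, data = NULL)
  for (k in seq_along(stages)) {
    if (state$level <= k) {
      state$data  <- stages[[k]](state$data)
      state$level <- k + 1
      saveRDS(state, file)   # record progress after each stage
    }
  }
  state
}

# Toy stages standing in for "do stuff" and "do more stuff".
stages <- list(
  function(d) seq_len(10),   # stage 1: "load" the data
  function(d) d^2,           # stage 2: transform it
  function(d) sum(d)         # stage 3: summarise
)

ckpt   <- file.path(tempdir(), "stages_checkpoint.rds")
result <- run_stages(stages, ckpt)
result$data   # 385
```

Checkpoints fall only on stage boundaries here, so a stage that is interrupted halfway is rerun from its beginning; finer granularity needs the stage itself to save state.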
What I wonder is whether long-running processes that can be up for months, say in a web server, may already have ways to save all kinds of status info so that when they start up again after a normal reboot, they are able to continue almost as if nothing had happened.

-----Original Message-----
From: R-help On Behalf Of Jeff Newmiller
Sent: Monday, December 13, 2021 11:54 AM
To: Andy Jacobson; Andy Jacobson via R-help; r-help@r-project.org
Subject: Re: [R] checkpointing

This sounds like an OS feature, not an R feature... certainly not a portable R feature.

On December 13, 2021 8:37:30 AM PST, Andy Jacobson via R-help wrote:
>Has anyone ever considered what it would take to implement checkpointing in R, so that long-running processes could be interrupted and resumed later, from a different process or even a different machine?
>
>Thanks,
>
>Andy

--
Sent from my phone. Please excuse my brevity.
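The suggestion earlier in this message, that a program could catch an interrupt, save its variables and loop index, and exit gracefully, can be sketched with tryCatch(). This is a rough illustration under assumptions: process_all and its arguments are hypothetical, and the sketch handles only R's "interrupt" condition (what Ctrl-C raises); it does not claim to handle other signals such as a scheduler's termination signal.

```r
# Hypothetical resumable loop: processes items one at a time and saves
# its position if interrupted, so that a later call can continue.
process_all <- function(items, file, step = function(x) x^2) {
  # Resume from saved state if a checkpoint file exists.
  state <- if (file.exists(file)) readRDS(file) else list(i = 0, acc = numeric(0))
  tryCatch({
    while (state$i < length(items)) {
      nxt <- state$i + 1
      state$acc[nxt] <- step(items[nxt])
      state$i <- nxt            # advance only after the step completed
    }
    state$finished <- TRUE
  }, interrupt = function(e) {
    message("Interrupted after item ", state$i, "; saving state")
  })
  saveRDS(state, file)          # written on completion and on interrupt
  state
}

# Usage: an uninterrupted run over six items.
ckpt <- tempfile(fileext = ".rds")
out  <- process_all(1:6, ckpt)
out$acc
```

Because the index is advanced only after a step finishes, an interrupt in the middle of a step means that step is simply redone on the next call.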
Re: [R] checkpointing
Use VirtualBox. You can take a 'snapshot' of a running virtual machine, either from the GUI or from the CLI (vboxmanage snapshot ...) and restore it later. This requires NO changes to R. Snapshots can be restored on another machine of the same kind with the same system software.

VirtualBox is free. Parallels is (just) under USD 100.

On Tue, 14 Dec 2021 at 05:38, Andy Jacobson via R-help wrote:
> Has anyone ever considered what it would take to implement checkpointing
> in R, so that long-running processes could be interrupted and resumed
> later, from a different process or even a different machine?
>
> Thanks,
>
> Andy
Re: [R] checkpointing
I used to work on a Prolog implementation that did something similar. At any point you could explicitly save a snapshot of the current state and then, from the operating system command line, resume it. This wasn't really for checkpointing. It was so that you could load up a customised environment, initialise it, and then future runs would be able to start up much quicker.

Initially it was all very simple. (Hah!) As long as you only expected a snapshot to work on the same kind of machine with exactly the same operating system and libraries and the same user files in the same (logical) places.

Then as operating systems added features, it got harder and harder. First we had to cut back to discarding the stacks and restarting from top level on resumption. Finally, when address space layout randomisation came along (and by the way, this was back in the 1980s) we gave up.

Bring in clusters with MPI and I for one don't want to go there.

On Tue, 14 Dec 2021 at 06:59, Greg Minshall wrote:
> Jeff,
>
> > This sounds like an OS feature, not an R feature... certainly not a
> > portable R feature.
>
> i'm not arguing for it, but this seems to me like something that could
> be a language feature.
>
> cheers, Greg
Re: [R] checkpointing
Duncan,

> R functions can call libraries written in other languages, and can
> start processes, etc. R doesn't know everything going on in every
> function call, and would have a lot of trouble saving it.

indeed, that obvious fact hadn't occurred to me. i withdraw my contention. :)

cheers, Greg
Re: [R] checkpointing
On 13/12/2021 12:58 p.m., Greg Minshall wrote:
> Jeff,
>
>> This sounds like an OS feature, not an R feature... certainly not a
>> portable R feature.
>
> i'm not arguing for it, but this seems to me like something that could
> be a language feature.

R functions can call libraries written in other languages, and can start processes, etc. R doesn't know everything going on in every function call, and would have a lot of trouble saving it.

If you added some limitations, e.g. a process that periodically has its entire state stored in R variables, then it would be a lot easier.

Duncan Murdoch
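The restricted case mentioned above, a process whose entire state is periodically held in R variables, can be sketched as follows. One detail that is easy to miss is that the random number generator is part of that state. The function run_sim and its arguments are hypothetical names for illustration, and the sketch assumes the computation touches nothing outside R (no open connections, external pointers or child processes).

```r
# Hypothetical simulation loop whose whole state, including the RNG
# stream, lives in one list that is saved every `save_every` iterations.
run_sim <- function(n_iter, file, save_every = 10, stop_after = Inf) {
  if (file.exists(file)) {
    state <- readRDS(file)
    # Restore the RNG stream (this overwrites the global .Random.seed).
    assign(".Random.seed", state$seed, envir = globalenv())
  } else {
    set.seed(1)
    state <- list(iter = 0, total = 0)
  }
  while (state$iter < n_iter && state$iter < stop_after) {
    state$total <- state$total + runif(1)
    state$iter  <- state$iter + 1
    if (state$iter %% save_every == 0 || state$iter == n_iter) {
      state$seed <- get(".Random.seed", envir = globalenv())
      saveRDS(state, file)
    }
  }
  state
}

# One uninterrupted run ...
f1 <- tempfile(fileext = ".rds")
a  <- run_sim(50, f1)

# ... versus a run stopped after 20 iterations and resumed later.
f2 <- tempfile(fileext = ".rds")
invisible(run_sim(50, f2, stop_after = 20))
b  <- run_sim(50, f2)

identical(a$total, b$total)
```

Because the seed is saved together with the counters, the resumed run draws the same random numbers as the uninterrupted one, which is what makes the two totals comparable at all.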
Re: [R] checkpointing
Jeff,

> This sounds like an OS feature, not an R feature... certainly not a
> portable R feature.

i'm not arguing for it, but this seems to me like something that could be a language feature.

cheers, Greg
Re: [R] checkpointing
This sounds like an OS feature, not an R feature... certainly not a portable R feature.

On December 13, 2021 8:37:30 AM PST, Andy Jacobson via R-help wrote:
>Has anyone ever considered what it would take to implement checkpointing in R,
>so that long-running processes could be interrupted and resumed later, from a
>different process or even a different machine?
>
>Thanks,
>
>Andy

--
Sent from my phone. Please excuse my brevity.