Re: [R] checkpointing

2021-12-17 Thread Sorkin, John
Colleagues,

I am late to this thread. (It brings me back to my days running checkpoint 
restart on an IBM 370, which was very useful for very, very long jobs.) A search 
for "linux checkpoint restore" retrieved information about CRIU 
(Checkpoint/Restore In Userspace), which sounds a lot like the facility I used 
on the IBM 370. It appears to allow a user's process to be stopped, have its 
state backed up and then restarted. Perhaps this would solve (at least for 
Linux users of R or RStudio) the request to have checkpoint restart ability in 
an R program.

Please let me know if you agree.

John


From: R-help  on behalf of Andy Jacobson via 
R-help 
Sent: Tuesday, December 14, 2021 8:59 PM
To: Henrik Bengtsson
Cc: Greg Minshall; Andy Jacobson via R-help; Andy Jacobson
Subject: Re: [R] checkpointing

I have been using DMTCP successfully for a long-running optim() task. This is a 
single-core process running on a large linux cluster with slurm as the job 
manager. This cluster places an 8-hour limit on individual jobs, and since my 
cost function takes 11 minutes to compute, I need many such jobs run 
sequentially. To make DMTCP work, I have had to rework file I/O to avoid 
references to temporary files written to /tmp, but other than that...optim() is 
checkpointed just before 8 hours is up, and then resumed successfully in a 
subsequent batch job running on a different core of the cluster.

While I have an answer for my particular task, it would still be useful to 
checkpoint using the scheme Henrik suggests. Thanks all for the interesting 
conversation!

-Andy



On 12/14/21 5:39 PM, Henrik Bengtsson wrote:
> On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson  wrote:
>>
>> Those are good points, Duncan. I am experimenting with a nice checkpointing 
>> tool called DMTCP. It operates on the system level but is quite 
>> OS-dependent. It can be found at 
>> http://dmtcp.sourceforge.net/index.html.
>>
>> Still, it would be nice to be able to checkpoint calls within R to 
>> potentially long-running processes like optim().
>
> Teasing idea. Imagine if we could come up with some de-facto standard
> API for this and that such a framework could be called automatically
> by R. Something similar to how user interrupts are checked (e.g.
> R_CheckUserInterrupt()) on a regular basis by the R engine and
> throughout the R code. That could help troubleshooting and debugging,
> e.g. sending the checkpoint to someone else or going backwards in
> time.
>
> Pasting in the below since I failed to hit Reply *All* the other day,
> and it was only Richard who got it:
>
> A few weeks ago, I played around with DMTCP (Distributed MultiThreaded
> CheckPointing) for Linux (https://github.com/dmtcp/dmtcp).  I'm
> sharing in case someone is interested in investigating this further.
> Also, somewhere on the DMTCP wiki, they asked for testing with R by
> more experienced users.
>
> "DMTCP is a tool to transparently checkpoint the state of multiple
> simultaneous applications, including multi-threaded and distributed
> applications. It operates directly on the user binary executable,
> without any Linux kernel modules or other kernel modifications."
>
> They seem to be able to run this with HPC jobs, open files, Linux
> containers, and even MPI, and so on.  I've only tested it very quickly
> with interactive R and it seems to work.  Obviously more testing needs
> to be done to identify when it doesn't work.  For example, I'd have a
> hard time believing it would work out of the box with local parallel PSOCK
> workers.  They mention "plug-ins", so maybe there's a way to add
> support for specific use cases one by one.
>
> Different academic HPC environments appear to use it, e.g.
>
> * https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
> * http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/

Re: [R] checkpointing

2021-12-14 Thread Andy Jacobson via R-help

I have been using DMTCP successfully for a long-running optim() task. This is a 
single-core process running on a large linux cluster with slurm as the job 
manager. This cluster places an 8-hour limit on individual jobs, and since my 
cost function takes 11 minutes to compute, I need many such jobs run 
sequentially. To make DMTCP work, I have had to rework file I/O to avoid 
references to temporary files written to /tmp, but other than that...optim() is 
checkpointed just before 8 hours is up, and then resumed successfully in a 
subsequent batch job running on a different core of the cluster.

While I have an answer for my particular task, it would still be useful to 
checkpoint using the scheme Henrik suggests. Thanks all for the interesting 
conversation!

-Andy



On 12/14/21 5:39 PM, Henrik Bengtsson wrote:

On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson  wrote:


Those are good points, Duncan. I am experimenting with a nice checkpointing 
tool called DMTCP. It operates on the system level but is quite OS-dependent. 
It can be found at http://dmtcp.sourceforge.net/index.html.

Still, it would be nice to be able to checkpoint calls within R to potentially 
long-running processes like optim().


Teasing idea. Imagine if we could come up with some de-facto standard
API for this and that such a framework could be called automatically
by R. Something similar to how user interrupts are checked (e.g.
R_CheckUserInterrupt()) on a regular basis by the R engine and
throughout the R code. That could help troubleshooting and debugging,
e.g. sending the checkpoint to someone else or going backwards in
time.

Pasting in the below since I failed to hit Reply *All* the other day,
and it was only Richard who got it:

A few weeks ago, I played around with DMTCP (Distributed MultiThreaded
CheckPointing) for Linux (https://github.com/dmtcp/dmtcp).  I'm
sharing in case someone is interested in investigating this further.
Also, somewhere on the DMTCP wiki, they asked for testing with R by
more experienced users.

"DMTCP is a tool to transparently checkpoint the state of multiple
simultaneous applications, including multi-threaded and distributed
applications. It operates directly on the user binary executable,
without any Linux kernel modules or other kernel modifications."

They seem to be able to run this with HPC jobs, open files, Linux
containers, and even MPI, and so on.  I've only tested it very quickly
with interactive R and it seems to work.  Obviously more testing needs
to be done to identify when it doesn't work.  For example, I'd have a
hard time believing it would work out of the box with local parallel PSOCK
workers.  They mention "plug-ins", so maybe there's a way to add
support for specific use cases one by one.

Different academic HPC environments appear to use it, e.g.

* https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
* http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/
* https://wiki.york.ac.uk/display/RCS/VK21%29+Checkpointing+with+DMTCP

That's all I have time for now,

Henrik



-Andy

On 12/13/21 11:51 AM, Duncan Murdoch wrote:

On 13/12/2021 12:58 p.m., Greg Minshall wrote:

Jeff,


This sounds like an OS feature, not an R feature... certainly not a
portable R feature.


i'm not arguing for it, but this seems to me like something that could
be a language feature.



R functions can call libraries written in other languages, and can start 
processes, etc.  R doesn't know everything going on in every function call, and 
would have a lot of trouble saving it.

If you added some limitations, e.g. a process that periodically has its entire 
state stored in R variables, then it would be a lot easier.

Duncan Murdoch


--
Andy Jacobson
a...@yovo.org

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Andy Jacobson
andy.jacob...@noaa.gov

NOAA Global Monitoring Lab
325 Broadway
Boulder, Colorado 80305

303/497-4916



Re: [R] checkpointing

2021-12-14 Thread Henrik Bengtsson
On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson  wrote:
>
> Those are good points, Duncan. I am experimenting with a nice checkpointing 
> tool called DMTCP. It operates on the system level but is quite OS-dependent. 
> It can be found at http://dmtcp.sourceforge.net/index.html.
>
> Still, it would be nice to be able to checkpoint calls within R to 
> potentially long-running processes like optim().

Teasing idea. Imagine if we could come up with some de-facto standard
API for this and that such a framework could be called automatically
by R. Something similar to how user interrupts are checked (e.g.
R_CheckUserInterrupt()) on a regular basis by the R engine and
throughout the R code. That could help troubleshooting and debugging,
e.g. sending the checkpoint to someone else or going backwards in
time.
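
One way to picture such an API at the R level, purely as a sketch: long-running code calls a checkpoint hook at regular points, much as C code calls R_CheckUserInterrupt(), and the hook decides whether a save is due. The names make_checkpointer(), min_interval_sec and force are hypothetical; R itself provides nothing of the kind, and this only saves whatever state the caller hands over.

```r
# Hypothetical cooperative checkpoint hook (not part of R).
# Returns a function that the caller invokes regularly with its current state;
# the state is written to `file` only if enough time has passed since the
# last save, or if `force = TRUE`.
make_checkpointer <- function(file, min_interval_sec = 60) {
  last <- Sys.time()
  function(state, force = FALSE) {
    now <- Sys.time()
    elapsed <- as.numeric(difftime(now, last, units = "secs"))
    due <- force || elapsed >= min_interval_sec
    if (due) {
      saveRDS(state, file)
      last <<- now   # remember when the last checkpoint was written
    }
    invisible(due)   # TRUE if a checkpoint was written on this call
  }
}
```

A loop would create the hook once and then call it on every iteration, for example checkpoint(list(i = i, par = par)); most calls return immediately without touching the disk.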

Pasting in the below since I failed to hit Reply *All* the other day,
and it was only Richard who got it:

A few weeks ago, I played around with DMTCP (Distributed MultiThreaded
CheckPointing) for Linux (https://github.com/dmtcp/dmtcp).  I'm
sharing in case someone is interested in investigating this further.
Also, somewhere on the DMTCP wiki, they asked for testing with R by
more experienced users.

"DMTCP is a tool to transparently checkpoint the state of multiple
simultaneous applications, including multi-threaded and distributed
applications. It operates directly on the user binary executable,
without any Linux kernel modules or other kernel modifications."

They seem to be able to run this with HPC jobs, open files, Linux
containers, and even MPI, and so on.  I've only tested it very quickly
with interactive R and it seems to work.  Obviously more testing needs
to be done to identify when it doesn't work.  For example, I'd have a
hard time believing it would work out of the box with local parallel PSOCK
workers.  They mention "plug-ins", so maybe there's a way to add
support for specific use cases one by one.

Different academic HPC environments appear to use it, e.g.

* https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
* http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/
* https://wiki.york.ac.uk/display/RCS/VK21%29+Checkpointing+with+DMTCP

That's all I have time for now,

Henrik

>
> -Andy
>
> On 12/13/21 11:51 AM, Duncan Murdoch wrote:
> > On 13/12/2021 12:58 p.m., Greg Minshall wrote:
> >> Jeff,
> >>
> >>> This sounds like an OS feature, not an R feature... certainly not a
> >>> portable R feature.
> >>
> >> i'm not arguing for it, but this seems to me like something that could
> >> be a language feature.
> >>
> >
> > R functions can call libraries written in other languages, and can start 
> > processes, etc.  R doesn't know everything going on in every function call, 
> > and would have a lot of trouble saving it.
> >
> > If you added some limitations, e.g. a process that periodically has its 
> > entire state stored in R variables, then it would be a lot easier.
> >
> > Duncan Murdoch
>
> --
> Andy Jacobson
> a...@yovo.org


Re: [R] checkpointing

2021-12-14 Thread Andy Jacobson

Those are good points, Duncan. I am experimenting with a nice checkpointing 
tool called DMTCP. It operates on the system level but is quite OS-dependent. 
It can be found at http://dmtcp.sourceforge.net/index.html.

Still, it would be nice to be able to checkpoint calls within R to potentially 
long-running processes like optim().

-Andy
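
Short of system-level tools, one rough approximation within R is to run optim() in short chunks and save the result after each chunk. This is only a sketch under stated assumptions: checkpointed_optim() and its arguments are hypothetical names, and every restart discards the optimizer's internal state (the BFGS Hessian approximation, or the Nelder-Mead simplex), so the path taken can differ from a single uninterrupted run.

```r
# Hypothetical wrapper: run optim() in chunks of `chunk_iter` iterations,
# saving the latest result to `file` after every chunk.
checkpointed_optim <- function(par, fn, file, method = "BFGS",
                               chunk_iter = 50L, max_chunks = 100L) {
  # Resume from the last saved parameter vector, if there is one.
  if (file.exists(file)) {
    par <- readRDS(file)$par
  }
  res <- NULL
  for (k in seq_len(max_chunks)) {
    res <- optim(par, fn, method = method,
                 control = list(maxit = chunk_iter))
    par <- res$par
    saveRDS(res, file)
    # convergence == 0 means optim() stopped on its own criteria;
    # 1 means the chunk hit maxit, so another chunk is needed.
    if (res$convergence == 0L) break
  }
  res
}
```

If the job is killed between chunks, calling the function again with the same file continues from the last saved parameters rather than from the original starting point.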

On 12/13/21 11:51 AM, Duncan Murdoch wrote:

On 13/12/2021 12:58 p.m., Greg Minshall wrote:

Jeff,


This sounds like an OS feature, not an R feature... certainly not a
portable R feature.


i'm not arguing for it, but this seems to me like something that could
be a language feature.



R functions can call libraries written in other languages, and can start 
processes, etc.  R doesn't know everything going on in every function call, and 
would have a lot of trouble saving it.

If you added some limitations, e.g. a process that periodically has its entire 
state stored in R variables, then it would be a lot easier.

Duncan Murdoch


--
Andy Jacobson
a...@yovo.org



Re: [R] checkpointing

2021-12-13 Thread Avi Gross via R-help
Jeff,

I am wondering what it even means to do what you say. In a compiled
language, I can imagine wrapping up an executable along with some kind of
run-time image (which may actually contain the parts of the executable that
include what has not run yet) and reviving it elsewhere.

But even there, how would it work if, say, the executable kept opening more
files for reading or appending and you moved it to a place where those files
did not exist or had different contents, or other such scenarios? What happens
to open pipes with another process attached, for an OS that supports pipes?
When you restart, the other processes are not there even if you supply an
image of a pipe, and I am sure others can imagine much more.

R is interpreted. You could say the main interpreter may be like an
executable and there may be multiple threads active at the time you stop the
process and bundle it to be restarted later. But R has many fairly dynamic
features including some the interpreter has not even looked at yet. Besides
files it may want to open, there are any number of statements like
library(filename) it may encounter and of course other files it may
source(code). In general, the info on what may be needed later is not in
any serious way bundled with the file and many things may be hard to predict
even with a look ahead as often arguments to functions are not evaluated
till some indefinite later time or even never. 

I am trying to imagine how you stop and restore, say, an R program running
connected to something like RStudio, which is also connected to a Python
program with data and instructions flowing back and forth.

It does not strike me as easy to make a reliable method to do this, albeit
as noted, there are operating systems that do allow you to suspend arbitrary
processes and restart them locally perhaps only before the system reboots.

But I can think of exceptions, including some I see others have thought of.
An example might be an R program that reads in lots of data, then makes
objects like data.frames and then pauses in some kind of nested loop that
will process the data while having the current indices saved in variables.
It could literally ask to be frozen so it starts up from there when asked
to. R can be set to intercept some signals and perhaps voluntarily save all
the variables as they are (including the data it may be operating on and
what it is making from it (as in what search items it has already found) as
well as the needed index info) and exit gracefully. If the application is
restarted, it might note the file with saved info and read in all the data
and continue from there. The above is not a serious proposal and has lots of
things that can go wrong, but I can imagine it as an app that sets itself up
doing heavy lifting once and later every time you want to do a search, it
loads the data and gets from you something to search for and does it quickly
and resuspends till needed. But this example is not exactly what you asked
for.
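
The "intercept a signal, save the variables, exit gracefully" idea can be sketched in R roughly as follows. This is an illustration under assumptions, not a robust recipe: process_items(), state_file and interrupt_at are hypothetical names, the work done per item is a stand-in (a running sum), and interrupt_at exists only to simulate a Ctrl-C at a known point. A real interrupt can arrive at less convenient moments, so the index and the accumulated result are updated in a single assignment.

```r
# Hypothetical example: process `items` one by one, saving progress to
# `state_file` if an interrupt condition is caught, and resuming from that
# file on the next call.
process_items <- function(items, state_file, interrupt_at = NULL) {
  if (file.exists(state_file)) {
    st <- readRDS(state_file)          # continue where the last run stopped
  } else {
    st <- list(i = 1L, acc = 0)        # next index and result so far
  }
  tryCatch({
    while (st$i <= length(items)) {
      if (!is.null(interrupt_at) && st$i == interrupt_at) {
        # Simulate a user interrupt by signalling a condition of class
        # "interrupt"; a real Ctrl-C is caught by the same handler.
        cond <- structure(list(message = "simulated", call = NULL),
                          class = c("interrupt", "condition"))
        signalCondition(cond)
      }
      # Update index and result together so the saved state stays consistent.
      st <- list(i = st$i + 1L, acc = st$acc + items[[st$i]])
    }
    if (file.exists(state_file)) file.remove(state_file)  # finished: clean up
    st$acc
  }, interrupt = function(cond) {
    saveRDS(st, state_file)            # save progress, then exit gracefully
    NULL
  })
}
```

The first call returns NULL if it is interrupted; a second call with the same state_file picks up at the saved index and returns the final result.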

I have actually done weird things like the above including things that
simply start up again after a reboot as if nothing happened. 

What is a more interesting question for me is what R features might make
sense that help construct a program that is in some sense re-startable if
used right. I can imagine a package that lets you set things like a "level"
for debugging so that your code when started at some point says:

# initialize.
# load any left-in-file data if it exists.

if (level < 2) {
  # do stuff
  level <- 2
}

if (level < 3) {
  # do more stuff
  level <- 3
}

...

Something like the above might wrap parts in something like a "try()" that
intercepts some interrupt condition and saves the needed status info.
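
A slightly more concrete version of the "level" outline, as a minimal sketch: the status is kept in one list, saved with saveRDS() after each stage, and reloaded on startup so that completed stages are skipped. The function name run_stages(), the file argument, and the toy work done in each stage are all hypothetical stand-ins for "do stuff" and "do more stuff".

```r
# Hypothetical staged job: each stage runs at most once across restarts.
run_stages <- function(state_file) {
  # initialize, or load any left-in-file data if it exists
  if (file.exists(state_file)) {
    st <- readRDS(state_file)
  } else {
    st <- list(level = 1L)
  }

  if (st$level < 2L) {
    st$data <- seq_len(10)         # stand-in for "do stuff"
    st$level <- 2L
    saveRDS(st, state_file)        # record that stage 1 is done
  }

  if (st$level < 3L) {
    st$total <- sum(st$data)       # stand-in for "do more stuff"
    st$level <- 3L
    saveRDS(st, state_file)        # record that stage 2 is done
  }

  st
}
```

If the process dies during the second stage, the next run finds level 2 in the file, skips the first block, and redoes only the unfinished stage.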

What I wonder is if long-running processes that can be up for months say in
a web-server, may already have ways to save all kinds of status info so when
they start up again after a normal reboot, are able to continue almost as if
nothing had happened.


-Original Message-
From: R-help  On Behalf Of Jeff Newmiller
Sent: Monday, December 13, 2021 11:54 AM
To: Andy Jacobson ; Andy Jacobson via R-help
; r-help@r-project.org
Subject: Re: [R] checkpointing

This sounds like an OS feature, not an R feature... certainly not a portable
R feature.

On December 13, 2021 8:37:30 AM PST, Andy Jacobson via R-help
 wrote:
>Has anyone ever considered what it would take to implement checkpointing in
R, so that long-running processes could be interrupted and resumed later,
from a different process or even a different machine?
>
>Thanks,
>
>Andy
>

--
Sent from my phone. Please excuse my brevity.


Re: [R] checkpointing

2021-12-13 Thread Richard O'Keefe
Use VirtualBox.  You can take a 'snapshot' of a running virtual
machine, either from the GUI or from the CLI (vboxmanage snapshot ...)
and restore it later.

This requires NO changes to R.  Snapshots can be restored on
another machine of the same kind with the same system software.

VirtualBox is free.  Parallels is (just) under USD100.

On Tue, 14 Dec 2021 at 05:38, Andy Jacobson via R-help 
wrote:

> Has anyone ever considered what it would take to implement checkpointing
> in R, so that long-running processes could be interrupted and resumed
> later, from a different process or even a different machine?
>
> Thanks,
>
> Andy
>
> --
> Andy Jacobson
> andy.jacob...@noaa.gov
>
> NOAA Global Monitoring Lab
> 325 Broadway
> Boulder, Colorado 80305
>
> 303/497-4916


Re: [R] checkpointing

2021-12-13 Thread Richard O'Keefe
I used to work on a Prolog implementation that did something similar.
At any point you could explicitly save a snapshot of the current state
and then from the operating system command line, resume it.
This wasn't really for checkpointing.  It was so that you could load
up a customised environment, initialise it, and then future runs would
be able to start up much quicker.

Initially it was all very simple.
(Hah!)  As long as you only expected a snapshot to work on the same kind
of machine with exactly the same operating system and libraries and the
same user files in the same (logical) places.

Then as operating systems added features, it got harder and harder.
First we had to cut back to discarding the stacks and restarting from
top level on resumption.
Finally, when address space layout randomisation came along (and by the
way, this was back in the 1980s) we gave up.

Bring in clusters with MPI and I for one don't want to go there.


On Tue, 14 Dec 2021 at 06:59, Greg Minshall  wrote:

> Jeff,
>
> > This sounds like an OS feature, not an R feature... certainly not a
> > portable R feature.
>
> i'm not arguing for it, but this seems to me like something that could
> be a language feature.
>
> cheers, Greg


Re: [R] checkpointing

2021-12-13 Thread Greg Minshall
Duncan,

> R functions can call libraries written in other languages, and can
> start processes, etc.  R doesn't know everything going on in every
> function call, and would have a lot of trouble saving it.

indeed, that obvious fact hadn't occurred to me.  i withdraw my
contention.  :)

cheers, Greg



Re: [R] checkpointing

2021-12-13 Thread Duncan Murdoch

On 13/12/2021 12:58 p.m., Greg Minshall wrote:

Jeff,


This sounds like an OS feature, not an R feature... certainly not a
portable R feature.


i'm not arguing for it, but this seems to me like something that could
be a language feature.



R functions can call libraries written in other languages, and can start 
processes, etc.  R doesn't know everything going on in every function 
call, and would have a lot of trouble saving it.


If you added some limitations, e.g. a process that periodically has its 
entire state stored in R variables, then it would be a lot easier.


Duncan Murdoch
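
The restricted case described above, where the entire state lives in R variables, can be sketched in a few lines. This is an assumption-laden illustration rather than a general facility: run_with_checkpoints(), step, init and every are hypothetical names, and it only works when step() depends on nothing but the state it is given (no open connections, external pointers, or child processes).

```r
# Hypothetical helper: apply `step` to the state `n_steps` times, persisting
# the full state to `file` every `every` iterations so a later call can resume.
run_with_checkpoints <- function(step, init, n_steps, file, every = 10L) {
  # Resume from the checkpoint if one exists, otherwise start fresh.
  if (file.exists(file)) {
    ckpt <- readRDS(file)
  } else {
    ckpt <- list(i = 0L, state = init)
  }
  while (ckpt$i < n_steps) {
    ckpt$state <- step(ckpt$state)
    ckpt$i <- ckpt$i + 1L
    if (ckpt$i %% every == 0L || ckpt$i == n_steps) {
      # Write to a temporary file first and then rename it, so that an
      # interruption during the write does not damage the previous checkpoint.
      tmp <- paste0(file, ".tmp")
      saveRDS(ckpt, tmp)
      file.rename(tmp, file)
    }
  }
  ckpt$state
}
```

Because the checkpoint is an ordinary RDS file, it can in principle be copied to another machine and resumed there, provided the same packages and the same step() function are available.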



Re: [R] checkpointing

2021-12-13 Thread Greg Minshall
Jeff,

> This sounds like an OS feature, not an R feature... certainly not a
> portable R feature.

i'm not arguing for it, but this seems to me like something that could
be a language feature.

cheers, Greg



Re: [R] checkpointing

2021-12-13 Thread Jeff Newmiller
This sounds like an OS feature, not an R feature... certainly not a portable R 
feature.

On December 13, 2021 8:37:30 AM PST, Andy Jacobson via R-help 
 wrote:
>Has anyone ever considered what it would take to implement checkpointing in R, 
>so that long-running processes could be interrupted and resumed later, from a 
>different process or even a different machine?
>
>Thanks,
>
>Andy
>

-- 
Sent from my phone. Please excuse my brevity.
