Yep. For the checkpoint/continue that patch looks good.
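
To make the continue/restart flag from the quoted discussion concrete, here is a minimal sketch of a checkpoint routine that reports which mode the process is in when control returns from the dump. Every name in it (sketch_state_t, sketch_dump_fn, sketch_checkpoint, and the stand-in dump functions) is hypothetical and made up for illustration; none of it is the Open MPI CRS interface or the CRIU API, and it assumes the checkpointer can report somehow whether the process continued or was restored.

```c
/* Hypothetical sketch; nothing here is real Open MPI or CRIU API. */

/* State a checkpoint routine reports to the rest of the library. */
typedef enum {
    SKETCH_STATE_ERROR = -1,
    SKETCH_STATE_CONTINUE = 0,  /* same process kept running after the dump */
    SKETCH_STATE_RESTART = 1    /* process was recreated from an image */
} sketch_state_t;

/* What the stand-in checkpointer reports when control returns from it. */
typedef enum {
    SKETCH_DUMP_FAILED = -1,
    SKETCH_DUMP_CONTINUED = 0,
    SKETCH_DUMP_RESTORED = 1
} sketch_dump_result_t;

/* Stand-in for the checkpointer call (assumed, not a real signature). */
typedef sketch_dump_result_t (*sketch_dump_fn)(void);

/* Three fake checkpointers, one per outcome, so the logic can be exercised. */
sketch_dump_result_t sketch_dump_continued(void) { return SKETCH_DUMP_CONTINUED; }
sketch_dump_result_t sketch_dump_restored(void)  { return SKETCH_DUMP_RESTORED; }
sketch_dump_result_t sketch_dump_failed(void)    { return SKETCH_DUMP_FAILED; }

/*
 * Run the dump and record the resulting mode in *state.
 * Returns 0 on success and -1 if the dump failed.
 */
int sketch_checkpoint(sketch_dump_fn dump, sketch_state_t *state)
{
    switch (dump()) {
    case SKETCH_DUMP_CONTINUED:
        /* Cheap path: hostname, daemon pid, etc. are unchanged. */
        *state = SKETCH_STATE_CONTINUE;
        return 0;
    case SKETCH_DUMP_RESTORED:
        /* Expensive path: the caller must refresh the whole library. */
        *state = SKETCH_STATE_RESTART;
        return 0;
    default:
        *state = SKETCH_STATE_ERROR;
        return -1;
    }
}
```

The point is only that the restarted process wakes up inside the checkpoint routine, so that routine is where the state has to be set; the restart entry point never sees it.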

On Tue, Feb 18, 2014 at 11:30 AM, Adrian Reber <adr...@lisas.de> wrote:

> On Tue, Feb 18, 2014 at 10:21:23AM -0600, Josh Hursey wrote:
> > So when a process is restarted with CRIU, does it resume execution after
> > the criu_dump() or somewhere else?
>
> The process is resumed at the same point it was checkpointed with
> criu_dump().
>
> > In a continue/leave-running mode after checkpoint the MPI library does not
> > need to do quite as much work since we can depend on some things not
> > changing (such as the machine name, orted pid, ...).
>
> During criu_dump() nothing changes.
>
> > In a restart mode the entire library has to be updated, which is much more
> > expensive than the continue mode.
>
> Ah. If I understand you correctly there are C/R methods which require
> that the checkpointed process is terminated and needs to be restarted to
> continue running. CRIU is completely transparent for the process. It
> needs no special environment (LD_PRELOAD) nor any special handling.
> criu_dump() pauses the process, checkpoints it and (if desired) lets it
> continue in the same state it was before.
>
> > The CRS components that we have supported emerge from their checkpointing
> > function (criu_dump in your case) knowing if they are in the continue or
> > restart mode. So that CRS function sets the flag accordingly so the rest of
> > the library can do the right thing afterwards.
>
> So, I would say CRIU CRS is in continue mode after criu_dump().
>
> > The restart function is called by the opal_restart tool to restart the
> > process from an image. Some checkpointers have a library call to restart a
> > process; others use external tools to do so. So that interface just lets
> > the checkpointer decide, given a snapshot image, how it should restart that
> > process. The restarted process is assumed to wake up in the
> > opal_crs_*_checkpoint function, not opal_crs_*_restart. So the restart
> > function name can be a bit misleading.
> >
> > Does that help?
>
> That helps a lot. Thanks. I am not 100% sure I understand the restart
> case, but I will try to implement it and probably then I will understand
> how it works.
>
> Would you say that, for the checkpoint-only functionality in continue
> mode, the patch can be checked in?
>
> Adrian
>
> > On Tue, Feb 18, 2014 at 4:08 AM, Adrian Reber <adr...@lisas.de> wrote:
> >
> > > I think I do not understand your question. So far I have only implemented
> > > the checkpoint part and not the restart part.
> > >
> > > Using criu_dump() the process can be left in three different
> > > states. Without any special handling the process is dumped and then
> > > killed. I can also tell criu to leave the process stopped (--leave-stopped)
> > > or running (--leave-running). I decided to default to --leave-running so
> > > that after the checkpoint has been performed the process continues
> > > running where it stopped.
> > >
> > > What would be the difference between 'being restarted versus continuing
> > > after checkpointing'? Right now only 'continuing after checkpoint' is
> > > implemented. I do not understand how the process 'is being restarted' fits
> > > in the checkpoint function.
> > >
> > > In opal_crs_criu_checkpoint() I am using criu_dump() to
> > > checkpoint the process and the plan is to use criu_restore() in
> > > opal_crs_criu_restart() (which I have not yet implemented).
> > >
> > > On Mon, Feb 17, 2014 at 03:45:49PM -0600, Josh Hursey wrote:
> > > > It looks fine except that the restart state is not flagged. When a process
> > > > is restarted does it resume execution inside the criu_dump() function? If
> > > > so, is there a way to tell from its return code (or some other mechanism)
> > > > that it is being restarted versus continuing after checkpointing?
> > > >
> > > > On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > > >
> > > > > Great - looks fine to me!!
> > > > >
> > > > > On Feb 17, 2014, at 11:39 AM, Adrian Reber <adr...@lisas.de> wrote:
> > > > >
> > > > > > I have prepared a patch I would like to commit which adds the code to
> > > > > > actually checkpoint a process. Thanks for the pointers about the string
> > > > > > variables; I tried to implement it correctly.
> > > > > >
> > > > > > CRIU currently has problems with the new OOB usock but I will contact
> > > > > > the CRIU developers about this error. Using tcp, checkpointing works.
> > > > > >
> > > > > > CRIU also has problems with --np > 1, but I am sure this can also be
> > > > > > resolved.
> > > > > >
> > > > > > The patch is at:
> > > > > >
> > > > > > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> > > > > >
> > > > > > Adrian
> > > > > > _______________________________________________
> > > > > > devel mailing list
> > > > > > de...@open-mpi.org
> > > > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey