Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Josh Hursey
Yep. For the checkpoint/continue case, that patch looks good.



Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Adrian Reber
On Tue, Feb 18, 2014 at 10:21:23AM -0600, Josh Hursey wrote:
> So when a process is restarted with CRIU, does it resume execution after
> the criu_dump() or somewhere else?

The process is resumed at the same point it was checkpointed with
criu_dump().

> In a continue/leave-running mode after checkpoint, the MPI library does not
> need to do quite as much work, since we can depend on some things not
> changing (such as the machine name, orted pid, ...).

During criu_dump() nothing changes.

> In restart mode, the entire library has to be updated, which is much more
> expensive than the continue mode.

Ah. If I understand you correctly, there are C/R methods which require
that the checkpointed process is terminated and then restarted to
continue running. CRIU is completely transparent to the process. It
needs neither a special environment (LD_PRELOAD) nor any special handling.
criu_dump() pauses the process, checkpoints it and (if desired) lets it
continue in the same state it was in before.

> The CRS components that we have supported emerge from their checkpointing
> function (criu_dump in your case) knowing if they are in the continue or
> restart mode. So that CRS function sets the flag accordingly so the rest of
> the library can do the right thing afterwards.

So, I would say CRIU CRS is in continue mode after criu_dump().

> The restart function is called by the opal_restart tool to restart the
> process from an image. Some checkpointers have a library call to restart a
> process; others use external tools to do so. So that interface just lets
> the checkpointer decide, given a snapshot image, how it should restart that
> process. The restarted process is assumed to wake up in the
> opal_crs_*_checkpoint function, not opal_crs_*_restart. So the restart
> function name can be a bit misleading.
> 
> Does that help?

That helps a lot. Thanks. I am not 100% sure I understand the restart
case, but I will try to implement it and probably then I will understand
how it works.

Would you say that, for the checkpoint-only functionality in continue
mode, the patch can be checked in?

Adrian



Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Josh Hursey
So when a process is restarted with CRIU, does it resume execution after
the criu_dump() or somewhere else?

In a continue/leave-running mode after checkpoint, the MPI library does not
need to do quite as much work, since we can depend on some things not
changing (such as the machine name, orted pid, ...).

In restart mode, the entire library has to be updated, which is much more
expensive than the continue mode.

The CRS components that we have supported emerge from their checkpointing
function (criu_dump in your case) knowing if they are in the continue or
restart mode. So that CRS function sets the flag accordingly so the rest of
the library can do the right thing afterwards.
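
To make the continue/restart distinction concrete, here is a minimal sketch in C of a checkpoint wrapper that sets such a flag. Everything in it is an assumption for illustration: the `crs_state` enum, the `dump_fn` hook, and the convention that the hook returns 0 on the continue path and 1 when execution resumes from a restored image are hypothetical stand-ins, not Open MPI's or CRIU's actual API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state flag; the real OPAL CRS constants may differ. */
enum crs_state {
    CRS_STATE_ERROR,
    CRS_STATE_CONTINUE,  /* same process keeps running after the dump */
    CRS_STATE_RESTART    /* execution resumed from a restored image */
};

/* Hypothetical dump hook standing in for the checkpointer's call.
 * Assumed convention: <0 error, 1 resumed after restart, otherwise continue. */
typedef int (*dump_fn)(void);

/* Run the dump and report which path the process woke up on. */
enum crs_state crs_checkpoint(dump_fn dump)
{
    assert(dump != NULL);
    int rc = dump();
    if (rc < 0) {
        return CRS_STATE_ERROR;
    }
    return (rc == 1) ? CRS_STATE_RESTART : CRS_STATE_CONTINUE;
}

/* Only the restart path has to refresh things such as the machine name
 * or the daemon pid; the continue path can rely on them being unchanged. */
int needs_full_reinit(enum crs_state state)
{
    return state == CRS_STATE_RESTART;
}

/* Stand-ins used to exercise the sketch without a real checkpointer. */
int fake_dump_continue(void) { return 0; }
int fake_dump_restart(void)  { return 1; }
int fake_dump_fail(void)     { return -1; }
```

The point of the design is that the caller never inspects the checkpointer directly; it only looks at the returned state and decides how much of the library to update.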

The restart function is called by the opal_restart tool to restart the
process from an image. Some checkpointers have a library call to restart a
process; others use external tools to do so. So that interface just lets
the checkpointer decide, given a snapshot image, how it should restart that
process. The restarted process is assumed to wake up in the
opal_crs_*_checkpoint function, not opal_crs_*_restart. So the restart
function name can be a bit misleading.
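
As a rough illustration of that interface, the sketch below models a restart entry point that only decides how to launch the restore; the restored process itself would resume inside the checkpoint function. The `snapshot` struct, the `restart_method` enum and the tool name in the usage note are hypothetical, not the real opal_restart or CRS definitions.

```c
#include <stddef.h>
#include <stdio.h>

/* How a (hypothetical) checkpointer brings a snapshot back to life. */
enum restart_method {
    RESTART_VIA_LIBRARY,  /* checkpointer offers a library call */
    RESTART_VIA_TOOL      /* checkpointer relies on an external command */
};

/* Hypothetical snapshot description handed to the restart function. */
struct snapshot {
    const char *image_dir;        /* where the checkpoint image lives */
    enum restart_method method;
    const char *tool;             /* only used for RESTART_VIA_TOOL */
};

/* Format "<tool> <image_dir>" into buf for the external-tool case.
 * Returns 0 on success, -1 if the snapshot does not use an external
 * tool, a field is missing, or the buffer is too small. */
int format_restart_command(const struct snapshot *s, char *buf, size_t len)
{
    if (s == NULL || buf == NULL || len == 0) {
        return -1;
    }
    if (s->method != RESTART_VIA_TOOL || s->tool == NULL || s->image_dir == NULL) {
        return -1;
    }
    int n = snprintf(buf, len, "%s %s", s->tool, s->image_dir);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, a snapshot with `method = RESTART_VIA_TOOL`, `tool = "my_restart_tool"` and `image_dir = "/tmp/snap"` yields the command string `my_restart_tool /tmp/snap`; a snapshot using `RESTART_VIA_LIBRARY` is rejected here because it would call into the checkpointer's library instead.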

Does that help?

-- Josh








-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey


Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Adrian Reber
I think I do not understand your question. So far I have only implemented the
checkpoint part and not the restart part.

With criu_dump() the process can be left in three different
states. Without any special handling the process is dumped and then
killed. I can also tell CRIU to leave the process stopped (--leave-stopped)
or running (--leave-running). I decided to default to --leave-running so
that after the checkpoint has been performed the process continues
running where it stopped.
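
The three outcomes can be summarized in a small C sketch. The enum and struct names are made up for illustration, and the rule that leave-running takes precedence when both options are set is an assumption of this sketch, not documented CRIU behavior.

```c
#include <stdbool.h>

/* What happens to the process once the dump has completed. */
enum post_dump_state {
    PROC_KILLED,   /* default: dumped, then killed */
    PROC_STOPPED,  /* corresponds to --leave-stopped */
    PROC_RUNNING   /* corresponds to --leave-running */
};

/* Hypothetical option set mirroring the two command-line switches. */
struct dump_opts {
    bool leave_stopped;
    bool leave_running;
};

/* Map the chosen options to the resulting process state.
 * Assumption: leave_running wins if both flags are set. */
enum post_dump_state state_after_dump(struct dump_opts opts)
{
    if (opts.leave_running) {
        return PROC_RUNNING;
    }
    if (opts.leave_stopped) {
        return PROC_STOPPED;
    }
    return PROC_KILLED;
}
```

With the default described above (leave-running enabled), `state_after_dump` reports `PROC_RUNNING`, which is the continue case.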

What would be the difference between 'being restarted' and 'continuing
after checkpointing'? Right now only 'continuing after checkpoint' is
implemented. I do not understand how a process 'being restarted' fits
into the checkpoint function.

In opal_crs_criu_checkpoint() I am using criu_dump() to
checkpoint the process and the plan is to use criu_restore() in
opal_crs_criu_restart() (which I have not yet implemented).



Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-17 Thread Josh Hursey
It looks fine, except that the restart state is not flagged. When a process
is restarted, does it resume execution inside the criu_dump() function? If
so, is there a way to tell from its return code (or some other mechanism)
that it is being restarted versus continuing after checkpointing?







Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-17 Thread Ralph Castain
Great - looks fine to me!!





[OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-17 Thread Adrian Reber
I have prepared a patch I would like to commit which adds the code to
actually checkpoint a process. Thanks for the pointers about the string
variables; I tried to implement them correctly.

CRIU currently has problems with the new OOB usock but I will contact
the CRIU developers about this error. Using tcp, checkpointing works.

CRIU also has problems with --np > 1, but I am sure this can also be
resolved.

The patch is at:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492

Adrian