Re: [OMPI devel] RFC: optimize probe in ob1

2014-02-18 Thread Nathan Hjelm
On Tue, Feb 11, 2014 at 01:43:37AM +0100, George Bosilca wrote:
> 
> The class is only usable in the context of a single .c file. As a code 
> protection it makes perfect sense to me.

Ah, yes. So it is. Fixed in the latest patch.

> It’s not yet, and I did not notice an RFC about it. The event I was referring to
> is only generated when the message is first noticed. In the particular
> instance affected by your patch it has been delayed until the communicator is
> created locally, but it still has to be generated once.

The problem is that the message is generated not once but twice with
add_fragment_to_unexpected where it is. One message is generated when
an out-of-order packet is processed by the outer loop (it is put into
the out-of-order list), then another when it is processed by the
inner loop jumping to add_fragment_to_unexpected. This has no effect
on the iprobe optimization, so I have dropped it from my proposed patch.

> The size check and the removal from the list is still in the critical path. 
> At some point we were down to few hundreds of nano-sec, enough to get bugged 
> by one extra memory reference.

I modified the patch to only remove procs from the unexpected_procs list
when matching wildcard receive requests. This way there are no extra
instructions in the critical path. It will make probe a little slower
than the previous patch, but that is OK. I see no degradation with simple
pt2pt benchmarks with vader. Please take a look and let me know what you
think.
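
To make the idea concrete, here is a minimal sketch of the lazy-removal scheme, not the patch itself. The type and function names (sketch_proc_t, sketch_note_unexpected, and so on) are hypothetical stand-ins; only the in_unexpected_list flag and the unexpected_procs list come from the patch. A proc is appended to the list at most once, the specific-source path never touches the list, and drained procs are pruned only while a wildcard match walks it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-peer structure; only the fields used here. */
typedef struct sketch_proc {
    struct sketch_proc *prev, *next; /* links in the unexpected_procs list */
    bool in_unexpected_list;         /* mirrors the flag added by the patch */
    int unexpected_count;            /* queued unexpected fragments */
} sketch_proc_t;

/* Hypothetical communicator: a circular doubly linked list with a sentinel. */
typedef struct {
    sketch_proc_t head;
} sketch_comm_t;

static void sketch_comm_init(sketch_comm_t *c)
{
    c->head.prev = c->head.next = &c->head;
}

/* Called when an unexpected fragment from p is queued. */
static void sketch_note_unexpected(sketch_comm_t *c, sketch_proc_t *p)
{
    assert(c != NULL && p != NULL);
    p->unexpected_count++;
    if (!p->in_unexpected_list) {    /* append at most once */
        p->prev = c->head.prev;
        p->next = &c->head;
        c->head.prev->next = p;
        c->head.prev = p;
        p->in_unexpected_list = true;
    }
}

/* Specific-source match: consumes a fragment, never touches the list. */
static bool sketch_match_specific(sketch_proc_t *p)
{
    if (p->unexpected_count == 0) {
        return false;
    }
    p->unexpected_count--;
    return true;
}

/* Wildcard match: walks the list and prunes procs whose queue has drained. */
static sketch_proc_t *sketch_match_wildcard(sketch_comm_t *c)
{
    sketch_proc_t *p = c->head.next;
    while (p != &c->head) {
        sketch_proc_t *next = p->next;
        if (p->unexpected_count > 0) {
            p->unexpected_count--;
            return p;
        }
        p->prev->next = p->next;     /* lazy removal happens only here */
        p->next->prev = p->prev;
        p->in_unexpected_list = false;
        p = next;
    }
    return NULL;
}
```

The trade-off is the one described above: a stale entry can stay on the list until the next wildcard receive or probe walks over it, in exchange for keeping the specific-source path free of list bookkeeping.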

-Nathan
diff --git a/ompi/mca/pml/ob1/pml_ob1.c b/ompi/mca/pml/ob1/pml_ob1.c
index bfb975a..f41cba1 100644
--- a/ompi/mca/pml/ob1/pml_ob1.c
+++ b/ompi/mca/pml/ob1/pml_ob1.c
@@ -192,8 +192,7 @@ int mca_pml_ob1_add_comm(ompi_communicator_t* comm)
 {
 /* allocate pml specific comm data */
 mca_pml_ob1_comm_t* pml_comm = OBJ_NEW(mca_pml_ob1_comm_t);
-opal_list_item_t *item, *next_item;
-mca_pml_ob1_recv_frag_t* frag;
+mca_pml_ob1_recv_frag_t* frag, *next_frag;
 mca_pml_ob1_comm_proc_t* pml_proc;
 mca_pml_ob1_match_hdr_t* hdr;
 int i;
@@ -215,12 +214,9 @@ int mca_pml_ob1_add_comm(ompi_communicator_t* comm)
 pml_comm->procs[i].ompi_proc = ompi_group_peer_lookup(comm->c_remote_group,i);
 OBJ_RETAIN(pml_comm->procs[i].ompi_proc);
 }
+
 /* Grab all related messages from the non_existing_communicator pending queue */
-for( item = opal_list_get_first(&mca_pml_ob1.non_existing_communicator_pending);
- item != opal_list_get_end(&mca_pml_ob1.non_existing_communicator_pending);
- item = next_item ) {
-frag = (mca_pml_ob1_recv_frag_t*)item;
-next_item = opal_list_get_next(item);
+OPAL_LIST_FOREACH_SAFE(frag, next_frag, &mca_pml_ob1.non_existing_communicator_pending, mca_pml_ob1_recv_frag_t) {
 hdr = &frag->hdr.hdr_match;
 
 /* Is this fragment for the current communicator ? */
@@ -231,7 +227,7 @@ int mca_pml_ob1_add_comm(ompi_communicator_t* comm)
  * we should remove it from the
  * non_existing_communicator_pending list. */
 opal_list_remove_item( &mca_pml_ob1.non_existing_communicator_pending, 
-   item );
+   (opal_list_item_t *) frag);
 
   add_fragment_to_unexpected:
 
@@ -255,6 +251,11 @@ int mca_pml_ob1_add_comm(ompi_communicator_t* comm)
 if( ((uint16_t)hdr->hdr_seq) == ((uint16_t)pml_proc->expected_sequence) ) {
 /* We're now expecting the next sequence number. */
 pml_proc->expected_sequence++;
+/* add this proc to the list of procs with unexpected messages */
+if (!pml_proc->in_unexpected_list) {
+opal_list_append (&pml_comm->unexpected_procs, &pml_proc->super);
+pml_proc->in_unexpected_list = true;
+}
 opal_list_append( &pml_proc->unexpected_frags, (opal_list_item_t*)frag );
 PERUSE_TRACE_MSG_EVENT(PERUSE_COMM_MSG_INSERT_IN_UNEX_Q, comm,
hdr->hdr_src, hdr->hdr_tag, PERUSE_RECV);
@@ -264,9 +265,7 @@ int mca_pml_ob1_add_comm(ompi_communicator_t* comm)
  * situation as the cant_match is only checked when a new fragment is received from
  * the network.
  */
-   for(frag = (mca_pml_ob1_recv_frag_t *)opal_list_get_first(&pml_proc->frags_cant_match);
-   frag != (mca_pml_ob1_recv_frag_t *)opal_list_get_end(&pml_proc->frags_cant_match);
-   frag = (mca_pml_ob1_recv_frag_t *)opal_list_get_next(frag)) {
+OPAL_LIST_FOREACH(frag, &pml_proc->frags_cant_match, mca_pml_ob1_recv_frag_t) {
hdr = &frag->hdr.hdr_match;
/* If the message has the next expected seq from that proc...  */
if(hdr->hdr_seq != pml_proc->expected_sequence)
@@ -579,9 +578,9 @@ int mca_pml_ob1_dump(struct ompi_communicator_t* comm, int verbose)
 
 /* TODO: don't forget to dump 
mca_pm

Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Josh Hursey
Yep. For the checkpoint/continue that patch looks good.


On Tue, Feb 18, 2014 at 11:30 AM, Adrian Reber  wrote:

> On Tue, Feb 18, 2014 at 10:21:23AM -0600, Josh Hursey wrote:
> > So when a process is restarted with CRIU, does it resume execution after
> > the criu_dump() or somewhere else?
>
> The process is resumed at the same point it was checkpointed with
> criu_dump().
>
> > In a continue/leave-running mode after checkpoint the MPI library does not
> > need to do quite as much work since we can depend on some things not
> > changing (such as the machine name, orted pid, ...).
>
> During criu_dump() nothing changes.
>
> > In a restart mode then the entire library has to be updated - much more
> > expensive than the continue mode.
>
> Ah. If I understand you correctly there are C/R methods which require
> that the checkpointed process is terminated and needs to be restarted to
> continue running. CRIU is completely transparent for the process. It
> needs no special environment (LD_PRELOAD) nor any special handling.
> criu_dump() pauses the process, checkpoints it and (if desired) lets it
> continue in the same state it was before.
>
> > The CRS components that we have supported emerge from their checkpointing
> > function (criu_dump in your case) knowing if they are in the continue or
> > restart mode. So that CRS function sets the flag accordingly so the rest of
> > the library can do the right thing afterwards.
>
> So, I would say CRIU CRS is in continue mode after criu_dump().
>
> > The restart function is called by the opal_restart tool to restart the
> > process from an image. Some checkpointers have a library call to restart a
> > process; others use external tools to do so. So that interface just lets
> > the checkpointer decide, given a snapshot image, how it should restart that
> > process. The restarted process is assumed to wake up in the
> > opal_crs_*_checkpoint function, not opal_crs_*_restart. So the restart
> > function name can be a bit misleading.
> >
> > Does that help?
>
> That helps a lot. Thanks. I am not 100% sure I understand the restart
> case, but I will try to implement it and probably then I will understand
> how it works.
>
> Would you say that, for the checkpoint-only functionality in continue
> mode, the patch can be checked in?
>
> Adrian
>
> > On Tue, Feb 18, 2014 at 4:08 AM, Adrian Reber  wrote:
> >
> > > I think I do not understand your question. So far I have only implemented
> > > the checkpoint part and not the restart part.
> > >
> > > Using criu_dump() the process can be left in three different
> > > states. Without any special handling the process is dumped and then
> > > killed. I can also tell criu to leave the process stopped (--leave-stopped)
> > > or running (--leave-running). I decided to default to --leave-running so
> > > that after the checkpoint has been performed the process continues
> > > running where it stopped.
> > >
> > > What would be the difference between 'being restarted versus continuing
> > > after checkpointing'? Right now only 'continuing after checkpoint' is
> > > implemented. I do not understand how process 'is being restarted' fits
> > > in the checkpoint function.
> > >
> > > In opal_crs_criu_checkpoint() I am using criu_dump() to
> > > checkpoint the process and the plan is to use criu_restore() in
> > > opal_crs_criu_restart() (which I have not yet implemented).
> > >
> > > On Mon, Feb 17, 2014 at 03:45:49PM -0600, Josh Hursey wrote:
> > > > It looks fine except that the restart state is not flagged. When a process
> > > > is restarted does it resume execution inside the criu_dump() function? If
> > > > so, is there a way to tell from its return code (or some other mechanism)
> > > > that it is being restarted versus continuing after checkpointing?
> > > >
> > > >
> > > > On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain wrote:
> > > >
> > > > > Great - looks fine to me!!
> > > > >
> > > > >
> > > > > On Feb 17, 2014, at 11:39 AM, Adrian Reber wrote:
> > > > >
> > > > > > I have prepared a patch I would like to commit which adds the code to
> > > > > > actually checkpoint a process. Thanks for the pointers about the string
> > > > > > variables; I tried to implement it correctly.
> > > > > >
> > > > > > CRIU currently has problems with the new OOB usock but I will contact
> > > > > > the CRIU developers about this error. Using tcp, checkpointing works.
> > > > > >
> > > > > > CRIU also has problems with --np > 1, but I am sure this can also be
> > > > > > resolved.
> > > > > >
> > > > > > The patch is at:
> > > > > >
> > > > > > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> > > > > >
> > > > > >   Adrian
> > > > > > ___
> > > > > > devel mailing list
> > > > > > de...@open-mpi.org
> > > > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > > >
> > 

Re: [OMPI devel] RFC: Changing 32-bit build behavior/sizes for MPI_Count and MPI_Offset

2014-02-18 Thread Jeff Squyres (jsquyres)
Just a reminder -- this RFC timed out today.

If there are no objections to this, I'll commit the patch on #4205 to the trunk 
tomorrow evening.

No one has come up with a patch yet for the v1.7 branch (because of ABI 
reasons, it must be different than what we do on the trunk), but since that is 
definitely a bug fix, it can go in at any time.
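
For illustration only, a v1.7-style check might look something like the sketch below. This is not a patch for #4205: the helper names are hypothetical, and using INT32_MAX as the limit is my assumption for the "2 billion" bound mentioned in the RFC. The comparison is done by division so that the test itself cannot overflow a 32-bit size_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if count elements of type_size bytes fit in a
 * signed 32-bit byte count. The INT32_MAX limit is an assumption. */
static bool sketch_fits_in_int32(int count, size_t type_size)
{
    if (count < 0) {
        return false;                /* negative counts are invalid */
    }
    if (count == 0 || type_size == 0) {
        return true;                 /* zero bytes always fit */
    }
    /* count * type_size <= INT32_MAX, rewritten to avoid overflow */
    return type_size <= (size_t)INT32_MAX / (size_t)count;
}

/* Byte count for a request already validated by sketch_fits_in_int32(). */
static int32_t sketch_total_bytes(int count, size_t type_size)
{
    assert(sketch_fits_in_int32(count, type_size));
    return (int32_t)((size_t)count * type_size);
}
```

An API entry point would call the first helper during parameter checking and return an MPI error class if it fails; as the RFC says, other checks would also be needed.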


On Feb 10, 2014, at 7:14 PM, Jeff Squyres (jsquyres)  wrote:

> WHAT: On trunk, force MPI_Count/MPI_Offset to be 32 bits when building in 32 
> bit mode (they are currently 64 bit, even in a 32 bit build).  On v1.7, leave 
> the sizes at 64 bit (for ABI reasons), but put error checking in the MPI API 
> layer to ensure we won't over/underflow 32 bits.
> 
> WHY: See ticket #4205 (https://svn.open-mpi.org/trac/ompi/ticket/4205)
> 
> WHERE: On trunk, this can be solved entirely in configury.  In v1.7/v1.8, 
> make changes in the MPI API layer (e.g., check MPI_Send to ensure 
> (count*size_of_datatype)<2B)
> 
> TIMEOUT: I'll tentatively say next Tuesday teleconf, Feb 18, 2014, but it can 
> be pushed back -- there's no real rush; this isn't a hot issue (but it is 
> wrong and should be fixed).
> 
> MORE DETAIL:
> 
> I noticed that MPI_Get_elements_x() and MPI_Type_size_x() were giving wrong 
> answers when compiled in 32 bit mode on a 64 bit machine.  This is because in 
> that build:
> 
> - size_t: 4 bytes
> - ptrdiff_t: 4 bytes
> - MPI_Aint: 4 bytes
> - MPI_Offset: 8 bytes
> - MPI_Count: 8 bytes
> 
> Some data points:
> 
> 1. MPI-3 says that MPI_Count must be big enough to hold both an MPI_Aint and 
> MPI_Offset.
> 
> 2. The entire PML/BML/BTL/convertor infrastructure uses size_t as its 
> underlying computation type.
> 
> 3. The _x tests were failing in 32 bit builds because they take 
> (count,datatype) input that intentionally results in a number of bytes that 
> is larger than 2 billion, assign that value to a size_t (which is 32 bits), 
> cause an overflow, and therefore get the wrong answer.
> 
> To solve this:
> 
> - On the trunk, we can just not allow MPI_Count (and therefore MPI_Offset) to 
> be larger than size_t.  This means that on 32 bit builds -- on both 32 and 64 
> bit systems -- sizeof(MPI_Aint) == sizeof(MPI_Offset) == sizeof(MPI_Count) == 
> 4.  There is a patch for this on #4205.
> 
> - Because of ABI issues, we cannot change the size of MPI_Count/MPI_Offset on 
> v1.7, so we can just check for over/underflow in the MPI API.  For example, 
> we can check that (count * size_of_datatype) < 2 billion (other checks will 
> also be necessary; this is just an example).  I have no patch for this yet.
> 
> As a side effect, this means that -- for 32 bit builds -- we will not support 
> large filesystems well (e.g., filesystems with 64 bit offsets).  BlueGene is 
> an example of such a system (not that OMPI supports BlueGene, but...).  
> Specifically: for 32 bit builds, we'll only allow MPI_Offset to be 32 bits.  
> I don't think that this is a major issue, because 32 bit builds are not a 
> huge issue for the OMPI community, but I raise the point in the spirit of 
> full disclosure.  Fixing it to allow 32 bit MPI_Aint but 64 bit MPI_Offset 
> and MPI_Count would likely mean re-tooling the PML/BML/BTL/convertor 
> infrastructure to use something other than size_t, and I have zero desire to 
> do that!  (please, no OMPI vendor reveal that they're going to seriously 
> build giant 32 bit systems...)
> 
> Also, while investigating this issue, I discovered that the configury for 
> determining the Fortran MPI_ADDRESS_KIND, MPI_OFFSET_KIND, and MPI_COUNT_KIND 
> values was unrelated to the C types that we discovered for these concepts.  
> The patch on #4205 fixes this issue as well -- the Fortran MPI_*_KIND values 
> are now directly correlated with the C types that were discovered.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Adrian Reber
On Tue, Feb 18, 2014 at 10:21:23AM -0600, Josh Hursey wrote:
> So when a process is restarted with CRIU, does it resume execution after
> the criu_dump() or somewhere else?

The process is resumed at the same point it was checkpointed with
criu_dump().

> In a continue/leave-running mode after checkpoint the MPI library does not
> need to do quite as much work since we can depend on some things not
> changing (such as the machine name, orted pid, ...).

During criu_dump() nothing changes.

> In a restart mode then the entire library has to be updated - much more
> expensive than the continue mode.

Ah. If I understand you correctly there are C/R methods which require
that the checkpointed process is terminated and needs to be restarted to
continue running. CRIU is completely transparent for the process. It
needs no special environment (LD_PRELOAD) nor any special handling.
criu_dump() pauses the process, checkpoints it and (if desired) lets it
continue in the same state it was before.

> The CRS components that we have supported emerge from their checkpointing
> function (criu_dump in your case) knowing if they are in the continue or
> restart mode. So that CRS function sets the flag accordingly so the rest of
> the library can do the right thing afterwards.

So, I would say CRIU CRS is in continue mode after criu_dump().

> The restart function is called by the opal_restart tool to restart the
> process from an image. Some checkpointers have a library call to restart a
> process; others use external tools to do so. So that interface just lets
> the checkpointer decide, given a snapshot image, how it should restart that
> process. The restarted process is assumed to wake up in the
> opal_crs_*_checkpoint function, not opal_crs_*_restart. So the restart
> function name can be a bit misleading.
> 
> Does that help?

That helps a lot. Thanks. I am not 100% sure I understand the restart
case, but I will try to implement it and probably then I will understand
how it works.

Would you say that, for the checkpoint-only functionality in continue
mode, the patch can be checked in?

Adrian

> On Tue, Feb 18, 2014 at 4:08 AM, Adrian Reber  wrote:
> 
> > I think I do not understand your question. So far I have only implemented
> > the checkpoint part and not the restart part.
> >
> > Using criu_dump() the process can be left in three different
> > states. Without any special handling the process is dumped and then
> > killed. I can also tell criu to leave the process stopped (--leave-stopped)
> > or running (--leave-running). I decided to default to --leave-running so
> > that after the checkpoint has been performed the process continues
> > running where it stopped.
> >
> > What would be the difference between 'being restarted versus continuing
> > after checkpointing'? Right now only 'continuing after checkpoint' is
> > implemented. I do not understand how process 'is being restarted' fits
> > in the checkpoint function.
> >
> > In opal_crs_criu_checkpoint() I am using criu_dump() to
> > checkpoint the process and the plan is to use criu_restore() in
> > opal_crs_criu_restart() (which I have not yet implemented).
> >
> > On Mon, Feb 17, 2014 at 03:45:49PM -0600, Josh Hursey wrote:
> > > It looks fine except that the restart state is not flagged. When a process
> > > is restarted does it resume execution inside the criu_dump() function? If
> > > so, is there a way to tell from its return code (or some other mechanism)
> > > that it is being restarted versus continuing after checkpointing?
> > >
> > >
> > > On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain  wrote:
> > >
> > > > Great - looks fine to me!!
> > > >
> > > >
> > > > On Feb 17, 2014, at 11:39 AM, Adrian Reber  wrote:
> > > >
> > > > > I have prepared a patch I would like to commit which adds the code to
> > > > > actually checkpoint a process. Thanks for the pointers about the string
> > > > > variables; I tried to implement it correctly.
> > > > >
> > > > > CRIU currently has problems with the new OOB usock but I will contact
> > > > > the CRIU developers about this error. Using tcp, checkpointing works.
> > > > >
> > > > > CRIU also has problems with --np > 1, but I am sure this can also be
> > > > > resolved.
> > > > >
> > > > > The patch is at:
> > > > >
> > > > >
> > > >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> > > > >
> > > > >   Adrian


Re: [OMPI devel] OPAL_CRS_* meaning

2014-02-18 Thread Josh Hursey
Just replied to your other email before seeing this. Take a look at those
comments and let me know if that helps differentiate those interfaces.


On Tue, Feb 18, 2014 at 5:28 AM, Jeff Squyres (jsquyres)  wrote:

> opal_crs.checkpoint() is not used to restart the process, but it does
> return in two different cases:
>
> - in the "continue" case, opal_crs.checkpoint() returns in the original
> process and keeps executing the same process and then, IIRC, invokes
> opal_crs.continue().
>
> - in the "restart" case, opal_crs.checkpoint() returns into a new process
> and then, IIRC, invokes opal_crs.restart().
>
>
> On Feb 18, 2014, at 5:29 AM, Adrian Reber  wrote:
>
> > I should have read this email before answering the other.
> >
> > So opal_crs.checkpoint() is used to checkpoint the process as well as
> > restart the process? I would have expected opal_crs.restart() is used
> > for restart. I am confused. Looking at CRS/BLCR checkpoint() seems to
> > only checkpoint and restart() seems to only restart. The comment in
> > opal/mca/crs/crs.h says the same as you say.
> >
> >
> > On Mon, Feb 17, 2014 at 03:43:08PM -0600, Josh Hursey wrote:
> >> These values indicate the current state of the checkpointing lifecycle. In
> >> particular CONTINUE/RESTART are set by the checkpointer in the CRS (all
> >> others are used by the INC mechanism). In the opal_crs.checkpoint() call
> >> the checkpointer will capture the program state and it is possible to
> >> emerge from this function in one of two scenarios. Either we are continuing
> >> execution in the original process (Continue state), or we are resuming
> >> execution from a checkpointed state (Restart state).
> >>
> >> So if the checkpoint was successful, and you are not restarting the process
> >> then you want OPAL_CRS_CONTINUE.
> >>
> >> If the process is being restarted from a checkpoint file, then we should
> >> emerge from this function setting the state to OPAL_CRS_RESTART.
> >>
> >> The OPAL_CR_CHECKPOINT state is used in the INC mechanism to notify all of
> >> the components to prepare for checkpoint (we probably should have called it
> >> OPAL_CR_PREPARE_FOR_CKPT). So not really used by the CRS mechanisms at all.
> >> You can see it used in the opal_cr_inc_core_prep() function in
> >> opal/runtime/opal_cr.c
> >>
> >> -- Josh
> >>
> >>
> >>
> >> On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber  wrote:
> >>
> >>> This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
> >>>
> >>> They are probably used to communicate the state of the CRS modules.
> >>> OPAL_CRS_ERROR seems to be used in case an error happened. What is the
> >>> CRS module supposed to set this to if the checkpoint was successful?
> >>>
> >>> OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?
> >>>
> >>>Adrian
> >>>
> >>
> >>
> >>
> >> --
> >> Joshua Hursey
> >> Assistant Professor of Computer Science
> >> University of Wisconsin-La Crosse
> >> http://cs.uwlax.edu/~jjhursey
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey


Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Josh Hursey
So when a process is restarted with CRIU, does it resume execution after
the criu_dump() or somewhere else?

In a continue/leave-running mode after checkpoint the MPI library does not
need to do quite as much work since we can depend on some things not
changing (such as the machine name, orted pid, ...).

In a restart mode then the entire library has to be updated - much more
expensive than the continue mode.

The CRS components that we have supported emerge from their checkpointing
function (criu_dump in your case) knowing if they are in the continue or
restart mode. So that CRS function sets the flag accordingly so the rest of
the library can do the right thing afterwards.

The restart function is called by the opal_restart tool to restart the
process from an image. Some checkpointers have a library call to restart a
process; others use external tools to do so. So that interface just lets
the checkpointer decide, given a snapshot image, how it should restart that
process. The restarted process is assumed to wake up in the
opal_crs_*_checkpoint function, not opal_crs_*_restart. So the restart
function name can be a bit misleading.
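
As a rough illustration of "emerging from the checkpoint function in one of two modes", here is a self-contained sketch. Everything in it is hypothetical: the sketch_* names, and in particular the return convention of the dump hook (0 when the original process continues, a positive value when execution resumes from an image, negative on failure). That convention is an assumption for the example and is not a statement about what criu_dump() returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the OPAL_CRS_* states discussed in this thread. */
typedef enum {
    SKETCH_CRS_ERROR,
    SKETCH_CRS_CONTINUE,
    SKETCH_CRS_RESTART
} sketch_crs_state_t;

/* Hypothetical checkpointer hook. Assumed convention: 0 = original process
 * continues, > 0 = resumed from an image, < 0 = failure. */
typedef int (*sketch_dump_fn)(void);

/* The checkpoint function is where both the continuing and the restarted
 * process wake up, so it is the place that flags the state. */
static sketch_crs_state_t sketch_checkpoint(sketch_dump_fn dump)
{
    assert(dump != NULL);
    int rc = dump();
    if (rc < 0) {
        return SKETCH_CRS_ERROR;
    }
    return (rc == 0) ? SKETCH_CRS_CONTINUE : SKETCH_CRS_RESTART;
}

/* Stub hooks standing in for a real checkpointer. */
static int sketch_dump_continues(void) { return 0; }
static int sketch_dump_restored(void)  { return 1; }
static int sketch_dump_fails(void)     { return -1; }
```

The caller can then choose the cheap continue path or the expensive restart path based on the returned state.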

Does that help?

-- Josh





On Tue, Feb 18, 2014 at 4:08 AM, Adrian Reber  wrote:

> I think I do not understand your question. So far I have only implemented
> the checkpoint part and not the restart part.
>
> Using criu_dump() the process can be left in three different
> states. Without any special handling the process is dumped and then
> killed. I can also tell criu to leave the process stopped (--leave-stopped)
> or running (--leave-running). I decided to default to --leave-running so
> that after the checkpoint has been performed the process continues
> running where it stopped.
>
> What would be the difference between 'being restarted versus continuing
> after checkpointing'? Right now only 'continuing after checkpoint' is
> implemented. I do not understand how process 'is being restarted' fits
> in the checkpoint function.
>
> In opal_crs_criu_checkpoint() I am using criu_dump() to
> checkpoint the process and the plan is to use criu_restore() in
> opal_crs_criu_restart() (which I have not yet implemented).
>
> On Mon, Feb 17, 2014 at 03:45:49PM -0600, Josh Hursey wrote:
> > It looks fine except that the restart state is not flagged. When a process
> > is restarted does it resume execution inside the criu_dump() function? If
> > so, is there a way to tell from its return code (or some other mechanism)
> > that it is being restarted versus continuing after checkpointing?
> >
> >
> > On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain  wrote:
> >
> > > Great - looks fine to me!!
> > >
> > >
> > > On Feb 17, 2014, at 11:39 AM, Adrian Reber  wrote:
> > >
> > > > I have prepared a patch I would like to commit which adds the code to
> > > > actually checkpoint a process. Thanks for the pointers about the string
> > > > variables; I tried to implement it correctly.
> > > >
> > > > CRIU currently has problems with the new OOB usock but I will contact
> > > > the CRIU developers about this error. Using tcp, checkpointing works.
> > > >
> > > > CRIU also has problems with --np > 1, but I am sure this can also be
> > > > resolved.
> > > >
> > > > The patch is at:
> > > >
> > > >
> > >
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> > > >
> > > >   Adrian
> > >
> >
> >
> >
> > --
> > Joshua Hursey
> > Assistant Professor of Computer Science
> > University of Wisconsin-La Crosse
> > http://cs.uwlax.edu/~jjhursey
>
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey


Re: [OMPI devel] [PATCH] Fix typo defining macro _WORD_MASK_

2014-02-18 Thread Nathan Hjelm
_WORD_MASK_ violates C99 § 7.1.3:

"All identifiers that begin with an underscore and either an uppercase letter or
another underscore are always reserved for any use."


So we should probably rename the identifier.
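
One possible shape for the rename, as a sketch only: the name SKETCH_CRC_WORD_MASK and the helper are hypothetical, and deriving the mask from _Alignof(long) (C11) is my assumption for the example rather than what crc.c does. The point is just that the identifier no longer starts with an underscore followed by an uppercase letter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, non-reserved replacement for _WORD_MASK_. For a power-of-two
 * alignment A, the low-bit mask is A - 1 (for example 0x3 when A is 4). */
#define SKETCH_CRC_WORD_MASK ((uintptr_t)(_Alignof(long) - 1))

/* Returns nonzero if p is aligned on a long boundary. */
static int sketch_is_word_aligned(const void *p)
{
    assert(p != NULL);
    return ((uintptr_t)p & SKETCH_CRC_WORD_MASK) == 0;
}
```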

-Nathan

On Mon, Feb 17, 2014 at 04:37:34PM +, Jeff Squyres (jsquyres) wrote:
> +1
> 
> On Feb 16, 2014, at 4:55 PM, Andreas Schwab  wrote:
> 
> > diff --git a/opal/util/crc.c b/opal/util/crc.c
> > index 9cfae94..c2112de 100644
> > --- a/opal/util/crc.c
> > +++ b/opal/util/crc.c
> > @@ -41,7 +41,7 @@
> > #elif (OPAL_ALIGNMENT_LONG == 4)
> > #define _WORD_MASK_ 0x3
> > #else
> > -#define _WORD_MASK 0x
> > +#define _WORD_MASK_ 0x
> > #endif
> > 
> > 
> > -- 
> > 1.9.0
> > 
> > -- 
> > Andreas Schwab, sch...@linux-m68k.org
> > GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
> > "And now for something completely different."
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 




Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Adrian Reber
On Tue, Feb 18, 2014 at 06:39:12AM -0800, Ralph Castain wrote:
> On Feb 18, 2014, at 6:24 AM, Adrian Reber  wrote:
> 
> > On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote:
> >> On Feb 13, 2014, at 11:26 AM, Adrian Reber  wrote:
> >>> I tried to implement something like you described. It is not yet event
> >>> driven, but before continuing I wanted to get some feedback if it is at
> >>> least the right start:
> >>> 
> >>> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706
> >>> 
> >>> I looked at the other ORTE_OOB_* macros and tried to model my
> >>> functionality a bit after what I have seen there. Right now it is still
> >>> a simple function which just tries to call ft_event() on all oob
> >>> components. Does this look right so far?
> >> 
> >> Sorry for delay - yes, that looks like the right direction. I would 
> >> suggest doing it via the current state machine, though, by simply defining 
> >> another job or proc state in orte/mca/plm/plm_types.h, and then 
> >> registering a callback function using the 
> >> orte_state.add_job[proc]_state(state, function to be called, 
> >> ORTE_ERR_PRI). Then you can activate it by calling 
> >> ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in the 
> >> proper order.
> > 
> > What is a job/proc in the Open MPI context?
> 
> A "job" is the entire application, while a "proc" is just one process in that 
> application. In this case you could use either one as you are checkpointing 
> the entire job, but all this activity is occurring inside each proc. So I'd 
> suggest defining it as a proc state since it only really involves local 
> actions.
> 
> If you like, I can define the required code in the trunk and let you fill in 
> the event functionality.

That would be great.

Adrian


Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Ralph Castain

On Feb 18, 2014, at 6:24 AM, Adrian Reber  wrote:

> On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote:
>> On Feb 13, 2014, at 11:26 AM, Adrian Reber  wrote:
>>> I tried to implement something like you described. It is not yet event
>>> driven, but before continuing I wanted to get some feedback if it is at
>>> least the right start:
>>> 
>>> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706
>>> 
>>> I looked at the other ORTE_OOB_* macros and tried to model my
>>> functionality a bit after what I have seen there. Right now it is still
>>> a simple function which just tries to call ft_event() on all oob
>>> components. Does this look right so far?
>> 
>> Sorry for delay - yes, that looks like the right direction. I would suggest 
>> doing it via the current state machine, though, by simply defining another 
>> job or proc state in orte/mca/plm/plm_types.h, and then registering a 
>> callback function using the orte_state.add_job[proc]_state(state, function 
>> to be called, ORTE_ERR_PRI). Then you can activate it by calling 
>> ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in the 
>> proper order.
> 
> What is a job/proc in the Open MPI context?

A "job" is the entire application, while a "proc" is just one process in that 
application. In this case you could use either one as you are checkpointing the 
entire job, but all this activity is occurring inside each proc. So I'd suggest 
defining it as a proc state since it only really involves local actions.

If you like, I can define the required code in the trunk and let you fill in 
the event functionality.
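
To show the general pattern (define a state, register a callback for it, activate it), here is a toy sketch. It does not use the real ORTE API: every sketch_* name and the state value are hypothetical, and the callback is invoked synchronously here, whereas the real state machine would queue the activation and run it in order.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_STATES 8

/* Hypothetical callback signature: receives the state and caller data. */
typedef void (*sketch_state_cb)(int state, void *cbdata);

/* Toy state machine: one callback slot per state. */
typedef struct {
    sketch_state_cb cb[SKETCH_MAX_STATES];
} sketch_state_machine_t;

/* Register a callback for a state; returns 0 on success, -1 on bad input. */
static int sketch_add_state(sketch_state_machine_t *sm, int state,
                            sketch_state_cb cb)
{
    if (state < 0 || state >= SKETCH_MAX_STATES || cb == NULL) {
        return -1;
    }
    sm->cb[state] = cb;
    return 0;
}

/* Activate a state; returns -1 if nothing is registered for it. */
static int sketch_activate_state(sketch_state_machine_t *sm, int state,
                                 void *cbdata)
{
    if (state < 0 || state >= SKETCH_MAX_STATES || sm->cb[state] == NULL) {
        return -1;
    }
    sm->cb[state](state, cbdata);    /* synchronous in this sketch */
    return 0;
}

/* Hypothetical proc state for the fault-tolerance event. */
enum { SKETCH_PROC_STATE_FT_EVENT = 3 };

/* Example callback: counts how many times the state was activated. */
static void sketch_ft_event_cb(int state, void *cbdata)
{
    assert(state == SKETCH_PROC_STATE_FT_EVENT);
    ++*(int *)cbdata;
}
```

In the real code the callback body would be where ft_event() is called on the oob components.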


> 
>   Adrian



Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Adrian Reber
On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote:
> On Feb 13, 2014, at 11:26 AM, Adrian Reber  wrote:
> > I tried to implement something like you described. It is not yet event
> > driven, but before continuing I wanted to get some feedback if it is at
> > least the right start:
> > 
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706
> > 
> > I looked at the other ORTE_OOB_* macros and tried to model my
> > functionality a bit after what I have seen there. Right now it is still
> > a simple function which just tries to call ft_event() on all oob
> > components. Does this look right so far?
> 
> Sorry for delay - yes, that looks like the right direction. I would suggest 
> doing it via the current state machine, though, by simply defining another 
> job or proc state in orte/mca/plm/plm_types.h, and then registering a 
> callback function using the orte_state.add_job[proc]_state(state, function to 
> be called, ORTE_ERR_PRI). Then you can activate it by calling 
> ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in the 
> proper order.

What is a job/proc in the Open MPI context?

Adrian


Re: [OMPI devel] RFC: new OMPI RTE define:

2014-02-18 Thread Jeff Squyres (jsquyres)
On Feb 18, 2014, at 8:18 AM, George Bosilca  wrote:

>> For the longer term (i.e., 1.9), should we add a little opal infrastructure 
>> that contains an event base that is run in its own progress thread?  This 
>> would allow the MPI layer to consolidate into one progress thread (for 
>> things that are event based, at least).  I don’t believe much work would be 
>> needed here.
> 
> +1. All frameworks/components with a need for event triggering without other 
> constraints must use it. In other words, the proposed infrastructure might not 
> be the most effective for high density fd listeners such as the TCP BTL.

Ok.  I can add an RFC/proposal to my to-do list... after v1.7.5 (and possibly 
after v1.8 -- we'll see how my to-do list plays out).

> Btw, now that we’re talking about this, I wonder how we deal with signals 
> in a non-ORTE environment. Who is registering the signal callbacks, such as 
> USR1?


Don't know / no one...?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RFC: new OMPI RTE define:

2014-02-18 Thread George Bosilca

On Feb 18, 2014, at 13:16 , Jeff Squyres (jsquyres)  wrote:

> Ok, fair enough.  My goal was not to spin up another progress thread in my 
> BTL, but I can certainly do so (to meet the 1.7.5 timeframe).
> 
> For the longer term (i.e., 1.9), should we add a little opal infrastructure 
> that contains an event base that is run in its own progress thread?  This 
> would allow the MPI layer to consolidate into one progress thread (for things 
> that are event based, at least).  I don’t believe much work would be needed 
> here.

+1. All frameworks/components with a need for event triggering without other 
constraints must use it. In other words, the proposed infrastructure might not 
be the most effective for high density fd listeners such as the TCP BTL.

> For example, the openib BTL could use this async-thread event-driven 
> infrastructure, too (vs. spinning up 2 progress threads of its own).
> 
> FWIW: the usNIC BTL events I need will be driven by timers and fd's, so it 
> fits into the libevent model just fine (although I have some thoughts of 
> possibly adapting this functionality to run in the orted when possible in the 
> 1.9 timeframe, but I haven't thought that through yet... 1.7.5 first!).

Btw, now that we’re talking about this, I wonder how we deal with signals in 
a non-ORTE environment. Who is registering the signal callbacks, such as USR1?

  George.


> 
> 
> 
> On Feb 18, 2014, at 5:21 AM, George Bosilca  wrote:
> 
>> I concur with Brian, you should not expect the runtime to provide a default 
>> event base, especially if you want some level of quality-of-service out of 
>> it. Moreover, with the soon-to-happen move of the BTLs down into OPAL, this 
>> approach will definitely not be suitable.
>> 
>> George.
>> 
>> 
>> On Feb 18, 2014, at 07:03 , Brian Barrett  wrote:
>> 
>>> And what will you do for RTE components that aren't ORTE?  This really 
>>> isn't a feature of a run-time, so it doesn't seem like it should be part of 
>>> the RTE interface...
>>> 
>>> Brian
>>> 
>>> On Feb 17, 2014, at 3:03 PM, Jeff Squyres (jsquyres)  
>>> wrote:
>>> 
>>>> WHAT: New OMPI_RTE_EVENT_BASE define
>>>> 
>>>> WHY: The usnic BTL needs to run some events asynchronously; the ORTE event 
>>>> base already exists and is running asynchronously in MPI processes
>>>> 
>>>> WHERE: in ompi/mca/rte/rte.h and rte_orte.h
>>>> 
>>>> TIMEOUT: COB Friday, 21 Feb 2014
>>>> 
>>>> MORE DETAIL:
>>>> 
>>>> The WHY line described it pretty well: we want to run some things 
>>>> asynchronously in the usnic BTL and we don't really want to re-invent the 
>>>> wheel (or add yet another thread in each MPI process).  The ORTE event 
>>>> base is already there, there's already a thread servicing it, and Ralph 
>>>> tells me that it is safe to add our own events on to it.
>>>> 
>>>> The patch below adds the new OMPI_RTE_EVENT_BASE #define.
>>>> 
>>>> 
>>>> diff --git a/ompi/mca/rte/orte/rte_orte.h b/ompi/mca/rte/orte/rte_orte.h
>>>> index 3c88c6d..3ceadb8 100644
>>>> --- a/ompi/mca/rte/orte/rte_orte.h
>>>> +++ b/ompi/mca/rte/orte/rte_orte.h
>>>> @@ -142,6 +142,9 @@ typedef struct {
>>>> } ompi_orte_tracker_t;
>>>> OBJ_CLASS_DECLARATION(ompi_orte_tracker_t);
>>>> 
>>>> +/* define the event base that the RTE exports */
>>>> +#define OMPI_RTE_EVENT_BASE orte_event_base
>>>> +
>>>> END_C_DECLS
>>>> 
>>>> #endif /* MCA_OMPI_RTE_ORTE_H */
>>>> diff --git a/ompi/mca/rte/rte.h b/ompi/mca/rte/rte.h
>>>> index 69ad488..de10dff 100644
>>>> --- a/ompi/mca/rte/rte.h
>>>> +++ b/ompi/mca/rte/rte.h
>>>> @@ -150,7 +150,9 @@
>>>> *a. OMPI_DB_HOSTNAME
>>>> *b. OMPI_DB_LOCALITY
>>>> *
>>>> - * (g) Communication support
>>>> + * (g) Asynchronous / event support
>>>> + * 1. OMPI_RTE_EVENT_BASE - the libevent base that executes in a
>>>> + *separate thread
>>>> *
>>>> */
>>>> 
>>>> @@ -162,6 +164,7 @@
>>>> #include "opal/dss/dss_types.h"
>>>> #include "opal/mca/mca.h"
>>>> #include "opal/mca/base/base.h"
>>>> +#include "opal/mca/event/event.h"
>>>> 
>>>> BEGIN_C_DECLS
>>>> 
>>>> 
>>>> 
>>> 
>>> -- 
>>> Brian Barrett
>>> 
>>> There is an art . . . to flying. The knack lies in learning how to
>>> throw yourself at the ground and miss.
>>>   Douglas Adams, 'The Hitchhikers Guide to the Galaxy'
>>> 
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 

Re: [OMPI devel] RFC: new OMPI RTE define:

2014-02-18 Thread Jeff Squyres (jsquyres)
Ok, fair enough.  My goal was not to spin up another progress thread in my BTL, 
but I can certainly do so (to meet the 1.7.5 timeframe).

For the longer term (i.e., 1.9), should we add a little opal infrastructure 
that contains an event base that is run in its own progress thread?  This would 
allow the MPI layer to consolidate into one progress thread (for things that 
are event based, at least).  I don't believe much work would be needed here.

For example, the openib BTL could use this async-thread event-driven 
infrastructure, too (vs. spinning up 2 progress threads of its own).

FWIW: the usNIC BTL events I need will be driven by timers and fd's, so it fits 
into the libevent model just fine (although I have some thoughts of possibly 
adapting this functionality to run in the orted when possible in the 1.9 
timeframe, but I haven't thought that through yet... 1.7.5 first!).
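
As a rough illustration of the "one event base serviced by its own progress thread" idea, here is a minimal sketch built on plain POSIX poll() and pthreads rather than libevent or the OPAL event framework. All demo_* names are hypothetical. It assumes every event is registered before the thread starts; a real shared base would need locking and timer support.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

#define DEMO_MAX_EVENTS 8

typedef void (*demo_event_cb_t)(int fd, void *cbdata);

/* Hypothetical shared event base. Slot 0 holds a wake-up pipe used for
 * shutdown; slots 1..count hold fds registered by components. */
typedef struct {
    struct pollfd   fds[DEMO_MAX_EVENTS + 1];
    demo_event_cb_t cbs[DEMO_MAX_EVENTS + 1];
    void           *cbdata[DEMO_MAX_EVENTS + 1];
    int             count;    /* number of registered component fds */
    int             wake_wr;  /* write end of the wake-up pipe */
    pthread_t       thread;
} demo_event_base_t;

int demo_base_init(demo_event_base_t *base)
{
    int p[2];
    assert(NULL != base);
    if (0 != pipe(p)) return -1;
    base->fds[0].fd = p[0];
    base->fds[0].events = POLLIN;
    base->fds[0].revents = 0;
    base->wake_wr = p[1];
    base->count = 0;
    return 0;
}

/* Register a "readable" event. Must be called before demo_base_start(). */
int demo_base_add(demo_event_base_t *base, int fd,
                  demo_event_cb_t cb, void *cbdata)
{
    if (base->count >= DEMO_MAX_EVENTS || NULL == cb) return -1;
    int slot = ++base->count;
    base->fds[slot].fd = fd;
    base->fds[slot].events = POLLIN;
    base->fds[slot].revents = 0;
    base->cbs[slot] = cb;
    base->cbdata[slot] = cbdata;
    return 0;
}

/* The single progress thread: block in poll() and dispatch callbacks. */
static void *demo_progress_loop(void *arg)
{
    demo_event_base_t *base = (demo_event_base_t *)arg;
    for (;;) {
        int n = poll(base->fds, (nfds_t)(base->count + 1), -1);
        if (n < 0) {
            if (EINTR == errno) continue;
            break;
        }
        /* Service component events before checking for shutdown, so data
         * written before the stop request is still delivered. */
        for (int i = 1; i <= base->count; i++) {
            if (base->fds[i].revents & POLLIN) {
                base->cbs[i](base->fds[i].fd, base->cbdata[i]);
            }
        }
        if (base->fds[0].revents & POLLIN) break;  /* shutdown requested */
    }
    return NULL;
}

int demo_base_start(demo_event_base_t *base)
{
    return pthread_create(&base->thread, NULL, demo_progress_loop, base);
}

/* Ask the progress thread to exit, wait for it, and release the pipe. */
int demo_base_stop(demo_event_base_t *base)
{
    char c = 'q';
    if (1 != write(base->wake_wr, &c, 1)) return -1;
    if (0 != pthread_join(base->thread, NULL)) return -1;
    close(base->fds[0].fd);
    close(base->wake_wr);
    return 0;
}

/* Example component callback: read one byte and store it in *cbdata. */
void demo_read_byte_cb(int fd, void *cbdata)
{
    char *out = (char *)cbdata;
    if (1 != read(fd, out, 1)) *out = '\0';
}
```

Several components could register their fds on the same base and share the one thread, which is the consolidation being proposed. Whether that is acceptable for latency-sensitive or high-fd-count components is a separate question that this sketch does not address.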



On Feb 18, 2014, at 5:21 AM, George Bosilca  wrote:

> I concur with Brian, you should not expect the runtime to provide a default 
> event base, especially if you want some level of quality-of-service out of 
> it. Moreover, with the soon-to-happen move of the BTLs down into OPAL, this 
> approach will definitely not be suitable.
> 
>  George.
> 
> 
> On Feb 18, 2014, at 07:03 , Brian Barrett  wrote:
> 
>> And what will you do for RTE components that aren't ORTE?  This really isn't 
>> a feature of a run-time, so it doesn't seem like it should be part of the 
>> RTE interface...
>> 
>> Brian
>> 
>> On Feb 17, 2014, at 3:03 PM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>>> WHAT: New OMPI_RTE_EVENT_BASE define
>>> 
>>> WHY: The usnic BTL needs to run some events asynchronously; the ORTE event 
>>> base already exists and is running asynchronously in MPI processes
>>> 
>>> WHERE: in ompi/mca/rte/rte.h and rte_orte.h
>>> 
>>> TIMEOUT: COB Friday, 21 Feb 2014
>>> 
>>> MORE DETAIL:
>>> 
>>> The WHY line described it pretty well: we want to run some things 
>>> asynchronously in the usnic BTL and we don't really want to re-invent the 
>>> wheel (or add yet another thread in each MPI process).  The ORTE event base 
>>> is already there, there's already a thread servicing it, and Ralph tells me 
>>> that it is safe to add our own events on to it.
>>> 
>>> The patch below adds the new OMPI_RTE_EVENT_BASE #define.
>>> 
>>> 
>>> diff --git a/ompi/mca/rte/orte/rte_orte.h b/ompi/mca/rte/orte/rte_orte.h
>>> index 3c88c6d..3ceadb8 100644
>>> --- a/ompi/mca/rte/orte/rte_orte.h
>>> +++ b/ompi/mca/rte/orte/rte_orte.h
>>> @@ -142,6 +142,9 @@ typedef struct {
>>> } ompi_orte_tracker_t;
>>> OBJ_CLASS_DECLARATION(ompi_orte_tracker_t);
>>> 
>>> +/* define the event base that the RTE exports */
>>> +#define OMPI_RTE_EVENT_BASE orte_event_base
>>> +
>>> END_C_DECLS
>>> 
>>> #endif /* MCA_OMPI_RTE_ORTE_H */
>>> diff --git a/ompi/mca/rte/rte.h b/ompi/mca/rte/rte.h
>>> index 69ad488..de10dff 100644
>>> --- a/ompi/mca/rte/rte.h
>>> +++ b/ompi/mca/rte/rte.h
>>> @@ -150,7 +150,9 @@
>>> *a. OMPI_DB_HOSTNAME
>>> *b. OMPI_DB_LOCALITY
>>> *
>>> - * (g) Communication support
>>> + * (g) Asynchronous / event support
>>> + * 1. OMPI_RTE_EVENT_BASE - the libevent base that executes in a
>>> + *separate thread
>>> *
>>> */
>>> 
>>> @@ -162,6 +164,7 @@
>>> #include "opal/dss/dss_types.h"
>>> #include "opal/mca/mca.h"
>>> #include "opal/mca/base/base.h"
>>> +#include "opal/mca/event/event.h"
>>> 
>>> BEGIN_C_DECLS
>>> 
>>> 
>>> 
>>> 
>> 
>> -- 
>> Brian Barrett
>> 
>> There is an art . . . to flying. The knack lies in learning how to
>> throw yourself at the ground and miss.
>>   Douglas Adams, 'The Hitchhikers Guide to the Galaxy'
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] OPAL_CRS_* meaning

2014-02-18 Thread Jeff Squyres (jsquyres)
opal_crs.checkpoint() is not used to restart the process, but it does return in 
two different cases:

- in the "continue" case, opal_crs.checkpoint() returns in the original process 
and keeps executing the same process and then, IIRC, invokes 
opal_crs.continue().

- in the "restart" case, opal_crs.checkpoint() returns into a new process and 
then, IIRC, invokes opal_crs.restart().
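
The "one call, two possible returns" behavior can be mimicked in plain C with setjmp()/longjmp(), which is only an analogy: the saved jmp_buf stands in for the checkpoint image, and longjmp() stands in for restarting from it. The demo_* names and the two-value enum are hypothetical and are not the OPAL CRS types.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-ins for the "continue" and "restart" states. */
typedef enum { DEMO_CRS_CONTINUE, DEMO_CRS_RESTART } demo_crs_state_t;

/* Simulate one checkpoint followed by one restart from that checkpoint.
 * seen[i] records which state was observed after the i-th return from the
 * "checkpoint" call. Returns how many times that call returned. */
int demo_checkpoint_twice(demo_crs_state_t seen[2])
{
    jmp_buf image;             /* stands in for the checkpoint image */
    volatile int returns = 0;  /* volatile: modified between setjmp/longjmp */

    assert(NULL != seen);

    if (0 == setjmp(image)) {
        /* First return: same "process", execution simply continues. */
        seen[returns] = DEMO_CRS_CONTINUE;
    } else {
        /* Second return: we got here by restoring the saved context. */
        seen[returns] = DEMO_CRS_RESTART;
    }
    returns++;

    if (1 == returns) {
        /* Pretend the job is restarted later from the saved image. */
        longjmp(image, 1);
    }
    return returns;
}
```

The caller cannot tell the two cases apart from where it resumes; it has to look at the value handed back, which is the role the state argument plays in the explanation above.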


On Feb 18, 2014, at 5:29 AM, Adrian Reber  wrote:

> I should have read this email before answering the other.
> 
> So opal_crs.checkpoint() is used to checkpoint the process as well as
> restart the process? I would have expected opal_crs.restart() is used
> for restart. I am confused. Looking at CRS/BLCR, checkpoint() seems to
> only checkpoint and restart() seems to only restart. The comment in
> opal/mca/crs/crs.h says the same as you say.
> 
> 
> On Mon, Feb 17, 2014 at 03:43:08PM -0600, Josh Hursey wrote:
>> These values indicate the current state of the checkpointing lifecycle. In
>> particular CONTINUE/RESTART are set by the checkpointer in the CRS (all
>> others are used by the INC mechanism). In the opal_crs.checkpoint() call
>> the checkpointer will capture the program state and it is possible to
>> emerge from this function in one of two scenarios. Either we are continuing
>> execution in the original process (Continue state), or we are resuming
>> execution from a checkpointed state (Restart state).
>> 
>> So if the checkpoint was successful, and you are not restarting the process
>> then you want OPAL_CRS_CONTINUE.
>> 
>> If the process is being restarted from a checkpoint file, then we should
>> emerge from this function setting the state to OPAL_CRS_RESTART.
>> 
>> The OPAL_CR_CHECKPOINT state is used in the INC mechanism to notify all of
>> the components to prepare for checkpoint (we probably should have called it
>> OPAL_CR_PREPARE_FOR_CKPT). So not really used by the CRS mechanisms at all.
>> You can see it used in the opal_cr_inc_core_prep() function in
>> opal/runtime/opal_cr.c
>> 
>> -- Josh
>> 
>> 
>> 
>> On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber  wrote:
>> 
>>> This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
>>> 
>>> They are probably used to communicate the state of the CRS modules.
>>> OPAL_CRS_ERROR seems to be used in case an error happened. What is the
>>> CRS module supposed to set this to if the checkpoint was successful?
>>> 
>>> OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?
>>> 
>>>Adrian
>>> 
>> 
>> 
>> 
>> -- 
>> Joshua Hursey
>> Assistant Professor of Computer Science
>> University of Wisconsin-La Crosse
>> http://cs.uwlax.edu/~jjhursey
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] OPAL_CRS_* meaning

2014-02-18 Thread Adrian Reber
I should have read this email before answering the other.

So opal_crs.checkpoint() is used to checkpoint the process as well as
restart the process? I would have expected opal_crs.restart() is used
for restart. I am confused. Looking at CRS/BLCR, checkpoint() seems to
only checkpoint and restart() seems to only restart. The comment in
opal/mca/crs/crs.h says the same as you say.


On Mon, Feb 17, 2014 at 03:43:08PM -0600, Josh Hursey wrote:
> These values indicate the current state of the checkpointing lifecycle. In
> particular CONTINUE/RESTART are set by the checkpointer in the CRS (all
> others are used by the INC mechanism). In the opal_crs.checkpoint() call
> the checkpointer will capture the program state and it is possible to
> emerge from this function in one of two scenarios. Either we are continuing
> execution in the original process (Continue state), or we are resuming
> execution from a checkpointed state (Restart state).
> 
> So if the checkpoint was successful, and you are not restarting the process
> then you want OPAL_CRS_CONTINUE.
> 
> If the process is being restarted from a checkpoint file, then we should
> emerge from this function setting the state to OPAL_CRS_RESTART.
> 
> The OPAL_CR_CHECKPOINT state is used in the INC mechanism to notify all of
> the components to prepare for checkpoint (we probably should have called it
> OPAL_CR_PREPARE_FOR_CKPT). So not really used by the CRS mechanisms at all.
> You can see it used in the opal_cr_inc_core_prep() function in
> opal/runtime/opal_cr.c
> 
> -- Josh
> 
> 
> 
> On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber  wrote:
> 
> > This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
> >
> > They are probably used to communicate the state of the CRS modules.
> > OPAL_CRS_ERROR seems to be used in case an error happened. What is the
> > CRS module supposed to set this to if the checkpoint was successful?
> >
> > OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?
> >
> > Adrian
> >
> 
> 
> 
> -- 
> Joshua Hursey
> Assistant Professor of Computer Science
> University of Wisconsin-La Crosse
> http://cs.uwlax.edu/~jjhursey



Re: [OMPI devel] RFC: new OMPI RTE define:

2014-02-18 Thread George Bosilca
I concur with Brian, you should not expect the runtime to provide a default 
event base, especially if you want some level of quality-of-service out of it. 
Moreover, with the soon-to-happen move of the BTLs down into OPAL, this approach 
will definitely not be suitable.

  George.


On Feb 18, 2014, at 07:03 , Brian Barrett  wrote:

> And what will you do for RTE components that aren't ORTE?  This really isn't 
> a feature of a run-time, so it doesn't seem like it should be part of the RTE 
> interface...
> 
> Brian
> 
> On Feb 17, 2014, at 3:03 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> WHAT: New OMPI_RTE_EVENT_BASE define
>> 
>> WHY: The usnic BTL needs to run some events asynchronously; the ORTE event 
>> base already exists and is running asynchronously in MPI processes
>> 
>> WHERE: in ompi/mca/rte/rte.h and rte_orte.h
>> 
>> TIMEOUT: COB Friday, 21 Feb 2014
>> 
>> MORE DETAIL:
>> 
>> The WHY line described it pretty well: we want to run some things 
>> asynchronously in the usnic BTL and we don't really want to re-invent the 
>> wheel (or add yet another thread in each MPI process).  The ORTE event base 
>> is already there, there's already a thread servicing it, and Ralph tells me 
>> that it is safe to add our own events on to it.
>> 
>> The patch below adds the new OMPI_RTE_EVENT_BASE #define.
>> 
>> 
>> diff --git a/ompi/mca/rte/orte/rte_orte.h b/ompi/mca/rte/orte/rte_orte.h
>> index 3c88c6d..3ceadb8 100644
>> --- a/ompi/mca/rte/orte/rte_orte.h
>> +++ b/ompi/mca/rte/orte/rte_orte.h
>> @@ -142,6 +142,9 @@ typedef struct {
>> } ompi_orte_tracker_t;
>> OBJ_CLASS_DECLARATION(ompi_orte_tracker_t);
>> 
>> +/* define the event base that the RTE exports */
>> +#define OMPI_RTE_EVENT_BASE orte_event_base
>> +
>> END_C_DECLS
>> 
>> #endif /* MCA_OMPI_RTE_ORTE_H */
>> diff --git a/ompi/mca/rte/rte.h b/ompi/mca/rte/rte.h
>> index 69ad488..de10dff 100644
>> --- a/ompi/mca/rte/rte.h
>> +++ b/ompi/mca/rte/rte.h
>> @@ -150,7 +150,9 @@
>> *a. OMPI_DB_HOSTNAME
>> *b. OMPI_DB_LOCALITY
>> *
>> - * (g) Communication support
>> + * (g) Asynchronous / event support
>> + * 1. OMPI_RTE_EVENT_BASE - the libevent base that executes in a
>> + *separate thread
>> *
>> */
>> 
>> @@ -162,6 +164,7 @@
>> #include "opal/dss/dss_types.h"
>> #include "opal/mca/mca.h"
>> #include "opal/mca/base/base.h"
>> +#include "opal/mca/event/event.h"
>> 
>> BEGIN_C_DECLS
>> 
>> 
>> 
>> 
> 
> -- 
> Brian Barrett
> 
> There is an art . . . to flying. The knack lies in learning how to
> throw yourself at the ground and miss.
> Douglas Adams, 'The Hitchhikers Guide to the Galaxy'
> 



Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Adrian Reber
I think I do not understand your question. So far I have only implemented the
checkpoint part and not the restart part.

Using criu_dump() the process can be left in three different
states. Without any special handling the process is dumped and then
killed. I can also tell criu to leave the process stopped (--leave-stopped)
or running (--leave-running). I decided to default to --leave-running so
that after the checkpoint has been performed the process continues
running where it stopped.

What would be the difference between 'being restarted versus continuing
after checkpointing'? Right now only 'continuing after checkpoint' is
implemented. I do not understand how a process that 'is being restarted'
fits into the checkpoint function.

In opal_crs_criu_checkpoint() I am using criu_dump() to
checkpoint the process and the plan is to use criu_restore() in
opal_crs_criu_restart() (which I have not yet implemented).

On Mon, Feb 17, 2014 at 03:45:49PM -0600, Josh Hursey wrote:
> It looks fine except that the restart state is not flagged. When a process
> is restarted does it resume execution inside the criu_dump() function? If
> so, is there a way to tell from its return code (or some other mechanism)
> that it is being restarted versus continuing after checkpointing?
> 
> 
> On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain  wrote:
> 
> > Great - looks fine to me!!
> >
> >
> > On Feb 17, 2014, at 11:39 AM, Adrian Reber  wrote:
> >
> > > I have prepared a patch I would like to commit which adds the code to
> > > actually checkpoint a process. Thanks for the pointers about the string
> > > variables; I tried to implement it correctly.
> > >
> > > CRIU currently has problems with the new OOB usock but I will contact
> > > the CRIU developers about this error. Using tcp, checkpointing works.
> > >
> > > CRIU also has problems with --np > 1, but I am sure this can also be
> > > resolved.
> > >
> > > The patch is at:
> > >
> > >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=89c9c27c87598706e8f798f84fe9520ee5884492
> > >
> > >   Adrian
> >
> >
> 
> 
> 
> -- 
> Joshua Hursey
> Assistant Professor of Computer Science
> University of Wisconsin-La Crosse
> http://cs.uwlax.edu/~jjhursey



Re: [OMPI devel] RFC: new OMPI RTE define:

2014-02-18 Thread Brian Barrett
And what will you do for RTE components that aren't ORTE?  This really isn't a 
feature of a run-time, so it doesn't seem like it should be part of the RTE 
interface...

Brian

On Feb 17, 2014, at 3:03 PM, Jeff Squyres (jsquyres)  wrote:

> WHAT: New OMPI_RTE_EVENT_BASE define
> 
> WHY: The usnic BTL needs to run some events asynchronously; the ORTE event 
> base already exists and is running asynchronously in MPI processes
> 
> WHERE: in ompi/mca/rte/rte.h and rte_orte.h
> 
> TIMEOUT: COB Friday, 21 Feb 2014
> 
> MORE DETAIL:
> 
> The WHY line described it pretty well: we want to run some things 
> asynchronously in the usnic BTL and we don't really want to re-invent the 
> wheel (or add yet another thread in each MPI process).  The ORTE event base 
> is already there, there's already a thread servicing it, and Ralph tells me 
> that it is safe to add our own events on to it.
> 
> The patch below adds the new OMPI_RTE_EVENT_BASE #define.
> 
> 
> diff --git a/ompi/mca/rte/orte/rte_orte.h b/ompi/mca/rte/orte/rte_orte.h
> index 3c88c6d..3ceadb8 100644
> --- a/ompi/mca/rte/orte/rte_orte.h
> +++ b/ompi/mca/rte/orte/rte_orte.h
> @@ -142,6 +142,9 @@ typedef struct {
> } ompi_orte_tracker_t;
> OBJ_CLASS_DECLARATION(ompi_orte_tracker_t);
> 
> +/* define the event base that the RTE exports */
> +#define OMPI_RTE_EVENT_BASE orte_event_base
> +
> END_C_DECLS
> 
> #endif /* MCA_OMPI_RTE_ORTE_H */
> diff --git a/ompi/mca/rte/rte.h b/ompi/mca/rte/rte.h
> index 69ad488..de10dff 100644
> --- a/ompi/mca/rte/rte.h
> +++ b/ompi/mca/rte/rte.h
> @@ -150,7 +150,9 @@
>  *a. OMPI_DB_HOSTNAME
>  *b. OMPI_DB_LOCALITY
>  *
> - * (g) Communication support
> + * (g) Asynchronous / event support
> + * 1. OMPI_RTE_EVENT_BASE - the libevent base that executes in a
> + *separate thread
>  *
>  */
> 
> @@ -162,6 +164,7 @@
> #include "opal/dss/dss_types.h"
> #include "opal/mca/mca.h"
> #include "opal/mca/base/base.h"
> +#include "opal/mca/event/event.h"
> 
> BEGIN_C_DECLS
> 
> 
> 
> 

-- 
 Brian Barrett

 There is an art . . . to flying. The knack lies in learning how to
 throw yourself at the ground and miss.
 Douglas Adams, 'The Hitchhikers Guide to the Galaxy'