Re: [OMPI devel] openib error for message size 1.5 GB

2011-06-07 Thread Mike Dubman
Please try with "--mca mpi_leave_pinned 0"
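For example, added to an otherwise unchanged command line (the process count
and program name below are placeholders):

  mpirun --mca mpi_leave_pinned 0 -np 2 ./your_app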

On Mon, Jun 6, 2011 at 4:16 PM, Sebastian Rinke  wrote:

> Dear all,
>
> While trying to send a message of size 1610612736 B (1.5 GB), I get the
> following error:
>
> [[52363,1],1][../../../../../../ompi/mca/btl/openib/btl_openib_component.c:2951:handle_wc]
> from grsacc20 to: grsacc19 error polling LP CQ with status LOCAL LENGTH
> ERROR status number 1 for wr_id 18729360 opcode 128  vendor error 105 qp_idx
> 3
>
> Reducing the size to 1 GB works fine. I assume that this is related
> to InfiniBand itself rather than to Open MPI.
> I'm using Open MPI 1.4.1.
>
> Any ideas on that?
>
> Thank you very much.
> Sebastian.
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] openib error for message size 1.5 GB

2011-06-07 Thread Sebastian Rinke
Worked.

Thanks a lot!

On Jun 7, 2011, at 6:43 AM, Mike Dubman wrote:

> 
> Please try with "--mca mpi_leave_pinned 0"
> 
> On Mon, Jun 6, 2011 at 4:16 PM, Sebastian Rinke  wrote:
> Dear all,
> 
> While trying to send a message of size 1610612736 B (1.5 GB), I get the 
> following error:
> 
> [[52363,1],1][../../../../../../ompi/mca/btl/openib/btl_openib_component.c:2951:handle_wc]
>  from grsacc20 to: grsacc19 error polling LP CQ with status LOCAL LENGTH 
> ERROR status number 1 for wr_id 18729360 opcode 128  vendor error 105 qp_idx 3
> 
> Reducing the size to 1 GB works fine. I assume that this is related to 
> InfiniBand itself rather than to Open MPI.
> I'm using Open MPI 1.4.1.
> 
> Any ideas on that?
> 
> Thank you very much.
> Sebastian.
> 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
I'm on travel this week, but will look this over when I return. From the
description, it sounds nearly identical to what we did in ORCM, so I expect
there won't be many issues. You do get some race conditions that the new
state machine code should help resolve.

Only difference I can quickly see is that we chose not to modify the process
name structure, keeping the "epoch" (we called it "incarnation") as a
separate value. Since we aren't terribly concerned about backward
compatibility, I don't consider this a significant issue - but something the
community should recognize.

My main concern will be to ensure that the new code contains enough
flexibility to allow integration with other layers such as ORCM without
creating potential conflict over "double protection" - i.e., if the layer
above ORTE wants to provide a certain level of fault protection, then ORTE
needs to get out of the way.


On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca  wrote:

> WHAT: Allow the runtime to handle fail-stop failures for both runtime
> (daemons) or application level processes. This patch extends the
> orte_process_name_t structure with a field to store the process epoch (the
> number of times it died so far), and adds an application failure notification
> callback function to be registered in the runtime.
>
> WHY: Necessary to correctly implement the error handling in the MPI 2.2
> standard. In addition, such a resilient runtime is a cornerstone for any
> level of fault tolerance support we want to provide in the future (such as
> the MPI-3 Run-Through Stabilization or FT-MPI).
>
> WHEN:
>
> WHERE: Patch attached to this email, based on trunk r24747.
> TIMEOUT: 2 weeks from now, on Monday 20 June.
>
> --
>
> MORE DETAILS:
>
> Currently the infrastructure required to enable any kind of fault tolerance
> development in Open MPI (with the exception of the checkpoint/restart) is
> missing. However, before developing any fault tolerant support at the
> application (MPI) level, we need to have a resilient runtime. The changes in
> this patch address this lack of support and would allow anyone to implement
> a fault tolerance protocol at the MPI layer without having to worry about
> the ORTE stabilization.
>
> This patch will allow the runtime to drop any dead daemons, and re-route
> all communications around the holes in order to __ALWAYS__ deliver a message
> as long as the destination process is alive. The application is informed
> (via a callback) about the loss of the processes with the same jobid. In
> this patch we do not address the MPI_ERROR_RETURN type of failures, we
> focused on the MPI_ERROR_ABORT ones. Moreover, we empowered the application
> level with the decision, instead of taking it down in the runtime.
>
> NEW STUFF:
>
> Epoch - A counter that tracks the number of times a process has been
> detected to have terminated, either from a failure or an expected
> termination. After the termination is detected, the HNP coordinates all
> other process’s knowledge of the new epoch. Each ORTED will know the epoch
> of the other processes in the job, but it will not actually store anything
> until the epochs change.
>
> Run-Through Stabilization - When an ORTED (or HNP) detects that another
> process has terminated, it repairs the routing layer and informs the HNP.
> The HNP tells all other processes about the failure so they can also repair
> their routing layers and update their internal bookkeeping. The processes do
> not abort after the termination is detected.
>
> Callback Function - When the HNP tells all the ORTEDs about the failures,
> they tell the ORTE layers within the applications. The application level
> ORTE layers have a callback function that they use to inform the OMPI layer
> about the error. Currently the OMPI errhandler code fills in this callback
> function so it is informed when there is an error and it aborts (to maintain
> the current default behavior of MPI). This callback function can also be
> used in an ORTE only application to perform application based fault
> tolerance (ABFT) and allow the application to continue.
>
> NECESSARY FOR IMPLEMENTATION:
>
> Epoch - The orte_process_name_t struct now has a field for epoch. This
> means that whenever sending a message, the most current version of the epoch
> needs to be in this field. This is a simple look up using the function in
> orte/util/nidmap.c: orte_util_lookup_epoch(). In the orte/orted/orted_comm.c
> code, there is a check to make sure that it isn’t trying to send messages to
> a process that has already terminated (don’t send to a process with an epoch
> less than the current epoch). Make sure that if you are sending a message,
> you have the most up to date data here.
>
> Routing - So far, only the binomial routing layer has been updated to use
> the new resilience features. To modify other routing layers to be able to
> continue running after a process failure, they need to be able to detect
> which processes are not currently running an
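
Below is a minimal, self-contained sketch (plain C) of the send-time epoch
guard described in the quoted RFC. The names are illustrative stand-ins, not
the actual ORTE symbols; only the idea -- refresh the epoch before sending and
never send to a peer whose stored epoch is older than the current one -- is
taken from the text above.

/* Illustrative model of the send-time epoch guard; stand-in names only. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    unsigned int jobid;
    unsigned int vpid;
    unsigned int epoch;   /* number of times this process has terminated */
} proc_name_t;

/* Stand-in for the nidmap lookup of a peer's current epoch. */
static unsigned int current_epoch[4] = { 0, 1, 0, 0 };   /* vpid 1 failed once */

static unsigned int lookup_epoch(const proc_name_t *p)
{
    return current_epoch[p->vpid];
}

/* Don't send to a process whose epoch is less than the current epoch. */
static bool ok_to_send(const proc_name_t *dest)
{
    return dest->epoch >= lookup_epoch(dest);
}

int main(void)
{
    proc_name_t dest = { .jobid = 1, .vpid = 1, .epoch = 0 };   /* stale name */
    printf("send with stale epoch? %s\n", ok_to_send(&dest) ? "yes" : "no");
    dest.epoch = lookup_epoch(&dest);   /* refresh the epoch before sending */
    printf("send after refresh?    %s\n", ok_to_send(&dest) ? "yes" : "no");
    return 0;
}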

[OMPI devel] Nightly tarball problem: fixed

2011-06-07 Thread Jeff Squyres
FYI: Terry discovered yesterday that the nightlies hadn't been made in a while 
for v1.4 and trunk.  There was a filesystem permissions issue on the build 
server that has been fixed -- there are new nightly tarballs today for v1.4, 
v1.5, and trunk.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] VT support for 1.5

2011-06-07 Thread George Bosilca
I can't compile the 1.5 if I do not disable VT. Using the following configure 
line:

../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug 
--enable-mpirun-prefix-by-default --with-knem=/usr/local/knem 
--with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug

I get:

ar: /home/bosilca/unstable/1.5/debug/ompi/contrib/vt/vt/util/.libs/libutil.a: 
No such file or directory

Any ideas?
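(The workaround for now is to skip VT entirely at configure time, e.g. with
--enable-contrib-no-build=vt, assuming I have the option name right.)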

  george.




Re: [OMPI devel] VT support for 1.5

2011-06-07 Thread Jeff Squyres
I've seen VT builds get confused sometimes.  I'm not sure of the exact cause, 
but if I get a new checkout, all the problems seem to go away.  I've never had 
the time to track it down.

Can you get a clean / new checkout and see if that fixes the problem?


On Jun 7, 2011, at 10:27 AM, George Bosilca wrote:

> I can't compile the 1.5 is I do not disable VT. Using the following configure 
> line:
> 
> ../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug 
> --enable-mpirun-prefix-by-default --with-knem=/usr/local/knem 
> --with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug
> 
> I get:
> 
> ar: /home/bosilca/unstable/1.5/debug/ompi/contrib/vt/vt/util/.libs/libutil.a: 
> No such file or directory
> 
> Any ideas?
> 
>  george.
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] VT support for 1.5

2011-06-07 Thread George Bosilca
My 'svn status' indicates no differences. I always build using a VPATH, and in 
this case I did remove the build directory. However, the issue persisted.

  george.

On Jun 7, 2011, at 10:31 , Jeff Squyres wrote:

> I've seen VT builds get confused sometimes.  I'm not sure of the exact cause, 
> but if I get a new checkout, all the problems seem to go away.  I've never 
> had the time to track it down.
> 
> Can you get a clean / new checkout and see if that fixes the problem?
> 
> 
> On Jun 7, 2011, at 10:27 AM, George Bosilca wrote:
> 
>> I can't compile the 1.5 is I do not disable VT. Using the following 
>> configure line:
>> 
>> ../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug 
>> --enable-mpirun-prefix-by-default --with-knem=/usr/local/knem 
>> --with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug
>> 
>> I get:
>> 
>> ar: 
>> /home/bosilca/unstable/1.5/debug/ompi/contrib/vt/vt/util/.libs/libutil.a: No 
>> such file or directory
>> 
>> Any ideas?
>> 
>> george.
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
This could certainly work alongside ORCM or any other fault 
detection/prediction/recovery mechanism. Most of the code is just dedicated to 
keeping the epoch up to date and tracking the status of the processes. The 
underlying idea was to provide a way for the application to decide what its 
fault policy would be rather than trying to dictate one in the runtime. If any 
other layer wanted to register a callback function with this code, it could do 
anything it wanted to on top of it.
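
A rough, self-contained illustration (plain C) of the registration pattern
described above -- the upper layer registers a callback and decides the fault
policy when the runtime reports a failure. All names here are invented for the
sketch; they are not the actual ORTE interface.

/* Model of the failure-callback registration pattern; invented names. */
#include <stdio.h>
#include <stddef.h>

typedef struct { unsigned int jobid, vpid, epoch; } proc_name_t;

/* The upper layer (OMPI, ORCM, or an application) registers one of these. */
typedef void (*fault_cb_t)(const proc_name_t *failed_proc);

static fault_cb_t registered_cb = NULL;

static void register_fault_callback(fault_cb_t cb)
{
    registered_cb = cb;
}

/* Called by the runtime model when a failure notification arrives. */
static void notify_failure(const proc_name_t *p)
{
    if (registered_cb != NULL) {
        registered_cb(p);   /* the upper layer decides the policy */
    }
}

/* Example policy: report the failure and keep running (ABFT-style). */
static void my_policy(const proc_name_t *p)
{
    printf("process [%u,%u] failed (epoch now %u); continuing\n",
           p->jobid, p->vpid, p->epoch);
}

int main(void)
{
    register_fault_callback(my_policy);
    proc_name_t failed = { .jobid = 1, .vpid = 3, .epoch = 1 };
    notify_failure(&failed);
    return 0;
}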

Wesley

On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:

> I'm on travel this week, but will look this over when I return. From the 
> description, it sounds nearly identical to what we did in ORCM, so I expect 
> there won't be many issues. You do get some race conditions that the new 
> state machine code should help resolve.
> 
> Only difference I can quickly see is that we chose not to modify the process 
> name structure, keeping the "epoch" (we called it "incarnation") as a 
> separate value. Since we aren't terribly concerned about backward 
> compatibility, I don't consider this a significant issue - but something the 
> community should recognize. 
> 
> My main concern will be to ensure that the new code contains enough 
> flexibility to allow integration with other layers such as ORCM without 
> creating potential conflict over "double protection" - i.e., if the layer 
> above ORTE wants to provide a certain level of fault protection, then ORTE 
> needs to get out of the way. 
> 
> 
> On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca  (mailto:bosi...@eecs.utk.edu)> wrote:
> >  WHAT: Allow the runtime to handle fail-stop failures for both runtime 
> > (daemons) or application level processes. This patch extends the 
> > orte_process_name_t structure with a field to store the process epoch (the 
> > number of times it died so far), and add an application failure 
> > notification callback function to be registered in the runtime.
> > 
> >  WHY: Necessary to correctly implement the error handling in the MPI 2.2 
> > standard. In addition, such a resilient runtime is a cornerstone for any 
> > level of fault tolerance support we want to provide in the future (such as 
> > the MPI-3 Run-Through Stabilization or FT-MPI).
> > 
> >  WHEN:
> > 
> >  WHERE: Patch attached to this email, based on trunk r24747.
> >  TIMEOUT: 2 weeks from now, on Monday 20 June.
> > 
> >  --
> > 
> >  MORE DETAILS:
> > 
> >  Currently the infrastructure required to enable any kind of fault 
> > tolerance development in Open MPI (with the exception of the 
> > checkpoint/restart) is missing. However, before developing any fault 
> > tolerant support at the application (MPI) level, we need to have a 
> > resilient runtime. The changes in this patch address this lack of support 
> > and would allow anyone to implement a fault tolerance protocol at the MPI 
> > layer without having to worry about the ORTE stabilization.
> > 
> >  This patch will allow the runtime to drop any dead daemons, and re-route 
> > all communications around the holes in order to __ALWAYS__ deliver a 
> > message as long as the destination process is alive. The application is 
> > informed (via a callback) about the loss of the processes with the same 
> > jobid. In this patch we do not address the MPI_ERROR_RETURN type of 
> > failures, we focused on the MPI_ERROR_ABORT ones. Moreover, we empowered 
> > the application level with the decision, instead of taking it down in the 
> > runtime.
> > 
> >  NEW STUFF:
> > 
> >  Epoch - A counter that tracks the number of times a process has been 
> > detected to have terminated, either from a failure or an expected 
> > termination. After the termination is detected, the HNP coordinates all 
> > other process’s knowledge of the new epoch. Each ORTED will know the epoch 
> > of the other processes in the job, but it will not actually store anything 
> > until the epochs change.
> > 
> >  Run-Through Stabilization - When an ORTED (or HNP) detects that another 
> > process has terminated, it repairs the routing layer and informs the HNP. 
> > The HNP tells all other processes about the failure so they can also repair 
> > their routing layers an update their internal bookkeeping. The processes do 
> > not abort after the termination is detected.
> > 
> >  Callback Function - When the HNP tells all the ORTEDs about the failures, 
> > they tell the ORTE layers within the applications. The application level 
> > ORTE layers have a callback function that they use to inform the OMPI layer 
> > about the error. Currently the OMPI errhandler code fills in this callback 
> > function so it is informed when there is an error and it aborts (to 
> > maintain the current default behavior of MPI). This callback function can 
> > also be used in an ORTE only application to perform application based fault 
> > tolerance (ABFT) and allow the application to continue.
> > 
> >  NECESSARY FOR IMPLEMENTATION:
> > 
> >  Epoch - The orte

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Josh Hursey
I briefly looked over the patch. Excluding the epochs (which we don't
need now, but will soon) it looks similar to what I have set up on my
MPI run-through stabilization branch - so it should support that work
nicely. I'll try to test it this week and send back any other
comments.

Good work.

Thanks,
Josh

On Tue, Jun 7, 2011 at 10:46 AM, Wesley Bland  wrote:
> This could certainly work alongside another ORCM or any other fault
> detection/prediction/recovery mechanism. Most of the code is just dedicated
> to keeping the epoch up to date and tracking the status of the processes.
> The underlying idea was to provide a way for the application to decide what
> its fault policy would be rather than trying to dictate one in the runtime.
> If any other layer wanted to register a callback function with this code, it
> could do anything it wanted to on top of it.
> Wesley
>
> On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
>
> I'm on travel this week, but will look this over when I return. From the
> description, it sounds nearly identical to what we did in ORCM, so I expect
> there won't be many issues. You do get some race conditions that the new
> state machine code should help resolve.
> Only difference I can quickly see is that we chose not to modify the process
> name structure, keeping the "epoch" (we called it "incarnation") as a
> separate value. Since we aren't terribly concerned about backward
> compatibility, I don't consider this a significant issue - but something the
> community should recognize.
> My main concern will be to ensure that the new code contains enough
> flexibility to allow integration with other layers such as ORCM without
> creating potential conflict over "double protection" - i.e., if the layer
> above ORTE wants to provide a certain level of fault protection, then ORTE
> needs to get out of the way.
>
> On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca  wrote:
>
> WHAT: Allow the runtime to handle fail-stop failures for both runtime
> (daemons) or application level processes. This patch extends the
> orte_process_name_t structure with a field to store the process epoch (the
> number of times it died so far), and add an application failure notification
> callback function to be registered in the runtime.
>
> WHY: Necessary to correctly implement the error handling in the MPI 2.2
> standard. In addition, such a resilient runtime is a cornerstone for any
> level of fault tolerance support we want to provide in the future (such as
> the MPI-3 Run-Through Stabilization or FT-MPI).
>
> WHEN:
>
> WHERE: Patch attached to this email, based on trunk r24747.
> TIMEOUT: 2 weeks from now, on Monday 20 June.
>
> --
>
> MORE DETAILS:
>
> Currently the infrastructure required to enable any kind of fault tolerance
> development in Open MPI (with the exception of the checkpoint/restart) is
> missing. However, before developing any fault tolerant support at the
> application (MPI) level, we need to have a resilient runtime. The changes in
> this patch address this lack of support and would allow anyone to implement
> a fault tolerance protocol at the MPI layer without having to worry about
> the ORTE stabilization.
>
> This patch will allow the runtime to drop any dead daemons, and re-route all
> communications around the holes in order to __ALWAYS__ deliver a message as
> long as the destination process is alive. The application is informed (via a
> callback) about the loss of the processes with the same jobid. In this patch
> we do not address the MPI_ERROR_RETURN type of failures, we focused on the
> MPI_ERROR_ABORT ones. Moreover, we empowered the application level with the
> decision, instead of taking it down in the runtime.
>
> NEW STUFF:
>
> Epoch - A counter that tracks the number of times a process has been
> detected to have terminated, either from a failure or an expected
> termination. After the termination is detected, the HNP coordinates all
> other process’s knowledge of the new epoch. Each ORTED will know the epoch
> of the other processes in the job, but it will not actually store anything
> until the epochs change.
>
> Run-Through Stabilization - When an ORTED (or HNP) detects that another
> process has terminated, it repairs the routing layer and informs the HNP.
> The HNP tells all other processes about the failure so they can also repair
> their routing layers an update their internal bookkeeping. The processes do
> not abort after the termination is detected.
>
> Callback Function - When the HNP tells all the ORTEDs about the failures,
> they tell the ORTE layers within the applications. The application level
> ORTE layers have a callback function that they use to inform the OMPI layer
> about the error. Currently the OMPI errhandler code fills in this callback
> function so it is informed when there is an error and it aborts (to maintain
> the current default behavior of MPI). This callback function can also be
> used in an ORTE only application to perform applica

Re: [OMPI devel] VT support for 1.5

2011-06-07 Thread Jeff Squyres
You might want to try a new checkout, just in case there's something in there 
that is svn:ignored...?

(yes, I'm grasping at straws here, but I'm able to build ok with a clean 
checkout...?)


On Jun 7, 2011, at 10:38 AM, George Bosilca wrote:

> My 'svn status' indicates no differences. I always build using a VPATH, and 
> in this case I did remove the build directory. However, the issue persisted.
> 
>  george.
> 
> On Jun 7, 2011, at 10:31 , Jeff Squyres wrote:
> 
>> I've seen VT builds get confused sometimes.  I'm not sure of the exact 
>> cause, but if I get a new checkout, all the problems seem to go away.  I've 
>> never had the time to track it down.
>> 
>> Can you get a clean / new checkout and see if that fixes the problem?
>> 
>> 
>> On Jun 7, 2011, at 10:27 AM, George Bosilca wrote:
>> 
>>> I can't compile the 1.5 is I do not disable VT. Using the following 
>>> configure line:
>>> 
>>> ../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug 
>>> --enable-mpirun-prefix-by-default --with-knem=/usr/local/knem 
>>> --with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug
>>> 
>>> I get:
>>> 
>>> ar: 
>>> /home/bosilca/unstable/1.5/debug/ompi/contrib/vt/vt/util/.libs/libutil.a: 
>>> No such file or directory
>>> 
>>> Any ideas?
>>> 
>>> george.
>>> 
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Thanks for the explanation - as I said, I won't have time to really review
the patch this week, but appreciate the info. I don't really expect to see a
conflict as George had discussed this with me previously.

I know I'll have merge conflicts with my state machine branch, which would
be ready for commit in the same time frame, but I'll hold off on that one
and deal with the merge issues on my side.



On Tue, Jun 7, 2011 at 8:46 AM, Wesley Bland  wrote:

>  This could certainly work alongside another ORCM or any other fault
> detection/prediction/recovery mechanism. Most of the code is just dedicated
> to keeping the epoch up to date and tracking the status of the processes.
> The underlying idea was to provide a way for the application to decide what
> its fault policy would be rather than trying to dictate one in the runtime.
> If any other layer wanted to register a callback function with this code, it
> could do anything it wanted to on top of it.
>
> Wesley
>
> On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
>
> I'm on travel this week, but will look this over when I return. From the
> description, it sounds nearly identical to what we did in ORCM, so I expect
> there won't be many issues. You do get some race conditions that the new
> state machine code should help resolve.
>
> Only difference I can quickly see is that we chose not to modify the
> process name structure, keeping the "epoch" (we called it "incarnation") as
> a separate value. Since we aren't terribly concerned about backward
> compatibility, I don't consider this a significant issue - but something the
> community should recognize.
>
> My main concern will be to ensure that the new code contains enough
> flexibility to allow integration with other layers such as ORCM without
> creating potential conflict over "double protection" - i.e., if the layer
> above ORTE wants to provide a certain level of fault protection, then ORTE
> needs to get out of the way.
>
>
> On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca wrote:
>
> WHAT: Allow the runtime to handle fail-stop failures for both runtime
> (daemons) or application level processes. This patch extends the
> orte_process_name_t structure with a field to store the process epoch (the
> number of times it died so far), and add an application failure notification
> callback function to be registered in the runtime.
>
> WHY: Necessary to correctly implement the error handling in the MPI 2.2
> standard. In addition, such a resilient runtime is a cornerstone for any
> level of fault tolerance support we want to provide in the future (such as
> the MPI-3 Run-Through Stabilization or FT-MPI).
>
> WHEN:
>
> WHERE: Patch attached to this email, based on trunk r24747.
> TIMEOUT: 2 weeks from now, on Monday 20 June.
>
> --
>
> MORE DETAILS:
>
> Currently the infrastructure required to enable any kind of fault tolerance
> development in Open MPI (with the exception of the checkpoint/restart) is
> missing. However, before developing any fault tolerant support at the
> application (MPI) level, we need to have a resilient runtime. The changes in
> this patch address this lack of support and would allow anyone to implement
> a fault tolerance protocol at the MPI layer without having to worry about
> the ORTE stabilization.
>
> This patch will allow the runtime to drop any dead daemons, and re-route
> all communications around the holes in order to __ALWAYS__ deliver a message
> as long as the destination process is alive. The application is informed
> (via a callback) about the loss of the processes with the same jobid. In
> this patch we do not address the MPI_ERROR_RETURN type of failures, we
> focused on the MPI_ERROR_ABORT ones. Moreover, we empowered the application
> level with the decision, instead of taking it down in the runtime.
>
> NEW STUFF:
>
> Epoch - A counter that tracks the number of times a process has been
> detected to have terminated, either from a failure or an expected
> termination. After the termination is detected, the HNP coordinates all
> other process’s knowledge of the new epoch. Each ORTED will know the epoch
> of the other processes in the job, but it will not actually store anything
> until the epochs change.
>
> Run-Through Stabilization - When an ORTED (or HNP) detects that another
> process has terminated, it repairs the routing layer and informs the HNP.
> The HNP tells all other processes about the failure so they can also repair
> their routing layers an update their internal bookkeeping. The processes do
> not abort after the termination is detected.
>
> Callback Function - When the HNP tells all the ORTEDs about the failures,
> they tell the ORTE layers within the applications. The application level
> ORTE layers have a callback function that they use to inform the OMPI layer
> about the error. Currently the OMPI errhandler code fills in this callback
> function so it is informed when there is an error and it aborts (to maintain
> the current default 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
To address your concerns about putting the epoch in the process name structure, 
putting it in there rather than in a separately maintained list simplifies 
things later. 

For example, during communication you need to attach the epoch to each of your 
messages so they can be tracked later. If a process dies while the message is 
in flight, or you need to cancel your communication, you need to be able to 
match the message to the correct epoch. If the epoch isn't in the 
process name, then you have to modify the message header for each type of 
message to include that information. Each process not only needs to know what 
the current version of the epoch is from its own perspective, but also from 
the perspective of whoever is sending the message.

This is also true for things like reporting failures. To prevent duplicate 
notifications you would need to include your epoch in all the notifications so 
no one marks a process as failing twice.

Really the point is that by changing the process name, you prevent the need to 
pack the epoch each time you have any sort of communication. All that work is 
done along with packing the rest of the structure. 

On Tuesday, June 7, 2011 at 11:21 AM, Ralph Castain wrote:

> Thanks for the explanation - as I said, I won't have time to really review 
> the patch this week, but appreciate the info. I don't really expect to see a 
> conflict as George had discussed this with me previously.
> 
> I know I'll have merge conflicts with my state machine branch, which would be 
> ready for commit in the same time frame, but I'll hold off on that one and 
> deal with the merge issues on my side.
> 
> 
> 
> On Tue, Jun 7, 2011 at 8:46 AM, Wesley Bland  (mailto:wbl...@eecs.utk.edu)> wrote:
> > This could certainly work alongside another ORCM or any other fault 
> > detection/prediction/recovery mechanism. Most of the code is just dedicated 
> > to keeping the epoch up to date and tracking the status of the processes. 
> > The underlying idea was to provide a way for the application to decide what 
> > its fault policy would be rather than trying to dictate one in the runtime. 
> > If any other layer wanted to register a callback function with this code, 
> > it could do anything it wanted to on top of it. 
> > 
> > Wesley
> > 
> > On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
> > 
> > > I'm on travel this week, but will look this over when I return. From the 
> > > description, it sounds nearly identical to what we did in ORCM, so I 
> > > expect there won't be many issues. You do get some race conditions that 
> > > the new state machine code should help resolve.
> > > 
> > > Only difference I can quickly see is that we chose not to modify the 
> > > process name structure, keeping the "epoch" (we called it "incarnation") 
> > > as a separate value. Since we aren't terribly concerned about backward 
> > > compatibility, I don't consider this a significant issue - but something 
> > > the community should recognize. 
> > > 
> > > My main concern will be to ensure that the new code contains enough 
> > > flexibility to allow integration with other layers such as ORCM without 
> > > creating potential conflict over "double protection" - i.e., if the layer 
> > > above ORTE wants to provide a certain level of fault protection, then 
> > > ORTE needs to get out of the way. 
> > > 
> > > 
> > > On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca  > > (mailto:bosi...@eecs.utk.edu)> wrote:
> > > >  WHAT: Allow the runtime to handle fail-stop failures for both runtime 
> > > > (daemons) or application level processes. This patch extends the 
> > > > orte_process_name_t structure with a field to store the process epoch 
> > > > (the number of times it died so far), and add an application failure 
> > > > notification callback function to be registered in the runtime.
> > > > 
> > > >  WHY: Necessary to correctly implement the error handling in the MPI 
> > > > 2.2 standard. In addition, such a resilient runtime is a cornerstone 
> > > > for any level of fault tolerance support we want to provide in the 
> > > > future (such as the MPI-3 Run-Through Stabilization or FT-MPI).
> > > > 
> > > >  WHEN:
> > > > 
> > > >  WHERE: Patch attached to this email, based on trunk r24747.
> > > >  TIMEOUT: 2 weeks from now, on Monday 20 June.
> > > > 
> > > >  --
> > > > 
> > > >  MORE DETAILS:
> > > > 
> > > >  Currently the infrastructure required to enable any kind of fault 
> > > > tolerance development in Open MPI (with the exception of the 
> > > > checkpoint/restart) is missing. However, before developing any fault 
> > > > tolerant support at the application (MPI) level, we need to have a 
> > > > resilient runtime. The changes in this patch address this lack of 
> > > > support and would allow anyone to implement a fault tolerance protocol 
> > > > at the MPI layer without having to worry about the ORTE stabilization.
> > > > 
> > > >  This patch will allow the runtime to 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland  wrote:

>  To adress your concerns about putting the epoch in the process name
> structure, putting it in there rather than in a separately maintained list
> simplifies things later.
>

Not really concerned - I was just noting we had done it a tad differently,
but nothing important.


>
> For example, during communication you need to attach the epoch to each of
> your messages so they can be tracked later. If a process dies while the
> message is in flight, or you need to cancel your communication, you need to
> be able to find the matching message to the matching epoch. If the epoch
> isn't in the process name, then you have to modify to the message header for
> each type of message to include that information. Each process not only
> needs to know what the current version of the epoch is from it's own
> perspective, but also from the perspective of whomever is sending the
> message.
>

But the epoch is process-unique - i.e., it is the number of times that this
specific process has been started, which differs per proc since we don't
restart all the procs every time one fails. So if I look at the epoch of the
proc sending me a message, I really can't check it against my own value as
the comparison is meaningless. All I really can do is check to see if it
changed from the last time I heard from that proc, which would tell me that
the proc has been restarted in the interim.


> This is also true for things like reporting failures. To prevent duplicate
> notifications you would need to include your epoch in all the notifications
> so no one marks a process as failing twice.
>

I'm not sure of the relevance here. We handle this without problem right now
(at least, within orcm - haven't looked inside orte yet to see what needs to
be brought back, if anything) without an epoch - and the state machine will
resolve the remaining race conditions, which really don't pertain to epoch
anyway.


>
> Really the point is that by changing the process name, you prevent the need
> to pack the epoch each time you have any sort of communication. All that
> work is done along with packing the rest of the structure.
>

No argument - I don't mind having the value in the name. Makes no difference
to me.


>  On Tuesday, June 7, 2011 at 11:21 AM, Ralph Castain wrote:
>
> Thanks for the explanation - as I said, I won't have time to really review
> the patch this week, but appreciate the info. I don't really expect to see a
> conflict as George had discussed this with me previously.
>
> I know I'll have merge conflicts with my state machine branch, which would
> be ready for commit in the same time frame, but I'll hold off on that one
> and deal with the merge issues on my side.
>
>
>
> On Tue, Jun 7, 2011 at 8:46 AM, Wesley Bland  wrote:
>
>  This could certainly work alongside another ORCM or any other fault
> detection/prediction/recovery mechanism. Most of the code is just dedicated
> to keeping the epoch up to date and tracking the status of the processes.
> The underlying idea was to provide a way for the application to decide what
> its fault policy would be rather than trying to dictate one in the runtime.
> If any other layer wanted to register a callback function with this code, it
> could do anything it wanted to on top of it.
>
> Wesley
>
> On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
>
> I'm on travel this week, but will look this over when I return. From the
> description, it sounds nearly identical to what we did in ORCM, so I expect
> there won't be many issues. You do get some race conditions that the new
> state machine code should help resolve.
>
> Only difference I can quickly see is that we chose not to modify the
> process name structure, keeping the "epoch" (we called it "incarnation") as
> a separate value. Since we aren't terribly concerned about backward
> compatibility, I don't consider this a significant issue - but something the
> community should recognize.
>
> My main concern will be to ensure that the new code contains enough
> flexibility to allow integration with other layers such as ORCM without
> creating potential conflict over "double protection" - i.e., if the layer
> above ORTE wants to provide a certain level of fault protection, then ORTE
> needs to get out of the way.
>
>
> On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca wrote:
>
> WHAT: Allow the runtime to handle fail-stop failures for both runtime
> (daemons) or application level processes. This patch extends the
> orte_process_name_t structure with a field to store the process epoch (the
> number of times it died so far), and add an application failure notification
> callback function to be registered in the runtime.
>
> WHY: Necessary to correctly implement the error handling in the MPI 2.2
> standard. In addition, such a resilient runtime is a cornerstone for any
> level of fault tolerance support we want to provide in the future (such as
> the MPI-3 Run-Through Stabilization o

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland


On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:

> 
> 
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland  (mailto:wbl...@eecs.utk.edu)> wrote:
> > To adress your concerns about putting the epoch in the process name 
> > structure, putting it in there rather than in a separately maintained list 
> > simplifies things later. 
> 
> Not really concerned - I was just noting we had done it a tad differently, 
> but nothing important.
> 
> > 
> > For example, during communication you need to attach the epoch to each of 
> > your messages so they can be tracked later. If a process dies while the 
> > message is in flight, or you need to cancel your communication, you need to 
> > be able to find the matching message to the matching epoch. If the epoch 
> > isn't in the process name, then you have to modify to the message header 
> > for each type of message to include that information. Each process not only 
> > needs to know what the current version of the epoch is from it's own 
> > perspective, but also from the perspective of whomever is sending the 
> > message. 
> 
> But the epoch is process-unique - i.e., it is the number of times that this 
> specific process has been started, which differs per proc since we don't 
> restart all the procs every time one fails. So if I look at the epoch of the 
> proc sending me a message, I really can't check it against my own value as 
> the comparison is meaningless. All I really can do is check to see if it 
> changed from the last time I heard from that proc, which would tell me that 
> the proc has been restarted in the interim.
But that is the point of the epoch. It prevents communication with a failed 
process. If the epoch is too low, you know you're communicating 
with an old process and you need to drop the message. If it is too high, you 
know that the process has been restarted and you need to update your known 
epoch.

Maybe I'm misunderstanding what you're saying?
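
To make that check concrete, here is a conceptual, self-contained sketch in C
(names invented for illustration, not taken from the patch):

/* Receive-side epoch check: drop stale messages, catch up on newer epochs. */
#include <stdio.h>

typedef enum { DELIVER, DROP_STALE, UPDATE_AND_DELIVER } msg_action_t;

static msg_action_t check_epoch(unsigned int msg_epoch, unsigned int *known_epoch)
{
    if (msg_epoch < *known_epoch) {
        return DROP_STALE;             /* message from an old incarnation */
    }
    if (msg_epoch > *known_epoch) {
        *known_epoch = msg_epoch;      /* the peer was restarted: catch up */
        return UPDATE_AND_DELIVER;
    }
    return DELIVER;
}

int main(void)
{
    unsigned int known = 1;
    printf("%d\n", check_epoch(0, &known));   /* DROP_STALE */
    printf("%d\n", check_epoch(2, &known));   /* UPDATE_AND_DELIVER, known = 2 */
    printf("%d\n", check_epoch(2, &known));   /* DELIVER */
    return 0;
}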
> 
> > 
> > This is also true for things like reporting failures. To prevent duplicate 
> > notifications you would need to include your epoch in all the notifications 
> > so no one marks a process as failing twice. 
> 
> I'm not sure of the relevance here. We handle this without problem right now 
> (at least, within orcm - haven't looked inside orte yet to see what needs to 
> be brought back, if anything) without an epoch - and the state machine will 
> resolve the remaining race conditions, which really don't pertain to epoch 
> anyway.
An example here might be if a process fails and two other processes detect it. 
By marking which version of the process failed, the HNP knows that it is one 
failure detected by two processes rather than two failures being detected in 
quick succession.

I'm not sure what ORCM does in this respect, but I don't know of anything in 
ORTE that would track this data other than the process state and that doesn't 
keep track of anything beyond one failure (which admittedly isn't an issue 
until we implement process recovery).
> 
> > 
> > Really the point is that by changing the process name, you prevent the need 
> > to pack the epoch each time you have any sort of communication. All that 
> > work is done along with packing the rest of the structure. 
> 
> No argument - I don't mind having the value in the name. Makes no difference 
> to me.
> 
> > 
> > On Tuesday, June 7, 2011 at 11:21 AM, Ralph Castain wrote:
> > 
> > > Thanks for the explanation - as I said, I won't have time to really 
> > > review the patch this week, but appreciate the info. I don't really 
> > > expect to see a conflict as George had discussed this with me previously.
> > > 
> > > I know I'll have merge conflicts with my state machine branch, which 
> > > would be ready for commit in the same time frame, but I'll hold off on 
> > > that one and deal with the merge issues on my side. 
> > > 
> > > 
> > > 
> > > On Tue, Jun 7, 2011 at 8:46 AM, Wesley Bland  > > (mailto:wbl...@eecs.utk.edu)> wrote:
> > > > This could certainly work alongside another ORCM or any other fault 
> > > > detection/prediction/recovery mechanism. Most of the code is just 
> > > > dedicated to keeping the epoch up to date and tracking the status of 
> > > > the processes. The underlying idea was to provide a way for the 
> > > > application to decide what its fault policy would be rather than trying 
> > > > to dictate one in the runtime. If any other layer wanted to register a 
> > > > callback function with this code, it could do anything it wanted to on 
> > > > top of it. 
> > > > 
> > > > Wesley
> > > > 
> > > > On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
> > > > 
> > > > > I'm on travel this week, but will look this over when I return. From 
> > > > > the description, it sounds nearly identical to what we did in ORCM, 
> > > > > so I expect there won't be many issues. You do get some race 
> > > > > conditions that the new state machine code should help resolve.
> > > > > 
> > > 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread George Bosilca

On Jun 7, 2011, at 12:14 , Ralph Castain wrote:

> But the epoch is process-unique - i.e., it is the number of times that this 
> specific process has been started, which differs per proc since we don't 
> restart all the procs every time one fails.

Yes the epoch is per process, but it is distributed among all participants. The 
difficulty here is to make sure the global view of the processes converges 
toward a common value of the epoch for each process. 

> So if I look at the epoch of the proc sending me a message, I really can't 
> check it against my own value as the comparison is meaningless. All I really 
> can do is check to see if it changed from the last time I heard from that 
> proc, which would tell me that the proc has been restarted in the interim.

I fail to understand your statement here. However, comparing message epochs is 
critical to ensure the correct behavior.  It ensures we do not react to old 
messages (that were floating in the system for some obscure reason), and that 
we have the right contact information for a specific peer (on the correct 
epoch).

  george.





Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 10:35 AM, Wesley Bland  wrote:

>
>  On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:
>
>
>
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland  wrote:
>
>  To adress your concerns about putting the epoch in the process name
> structure, putting it in there rather than in a separately maintained list
> simplifies things later.
>
>
> Not really concerned - I was just noting we had done it a tad differently,
> but nothing important.
>
>
>
> For example, during communication you need to attach the epoch to each of
> your messages so they can be tracked later. If a process dies while the
> message is in flight, or you need to cancel your communication, you need to
> be able to find the matching message to the matching epoch. If the epoch
> isn't in the process name, then you have to modify to the message header for
> each type of message to include that information. Each process not only
> needs to know what the current version of the epoch is from it's own
> perspective, but also from the perspective of whomever is sending the
> message.
>
>
> But the epoch is process-unique - i.e., it is the number of times that this
> specific process has been started, which differs per proc since we don't
> restart all the procs every time one fails. So if I look at the epoch of the
> proc sending me a message, I really can't check it against my own value as
> the comparison is meaningless. All I really can do is check to see if it
> changed from the last time I heard from that proc, which would tell me that
> the proc has been restarted in the interim.
>
> But that is the point of the epoch. It prevents communication with a failed
> process. If the epoch. If the epoch is too low, you know you're
> communicating with an old process and you need to drop the message. If it is
> too high, you know that the process has been restarted and you need to
> update your known epoch.
>
> Maybe I'm misunderstanding what you're saying?
>

Perhaps it would help if you folks could provide a little explanation about
how you use epoch? While the value sounds similar, your explanations are
beginning to sound very different from what we are doing and/or had
envisioned.

I'm not sure how you can talk about an epoch being too high or too low,
unless you are envisioning an overall system where procs try to maintain
some global notion of the value - which sounds like a race condition begging
to cause problems.


>
>
> This is also true for things like reporting failures. To prevent duplicate
> notifications you would need to include your epoch in all the notifications
> so no one marks a process as failing twice.
>
>
> I'm not sure of the relevance here. We handle this without problem right
> now (at least, within orcm - haven't looked inside orte yet to see what
> needs to be brought back, if anything) without an epoch - and the state
> machine will resolve the remaining race conditions, which really don't
> pertain to epoch anyway.
>
> An example here might be if a process fails and two other processes detect
> it. By marking which version of the process failed, the HNP knows that it is
> one failure detected by two processes rather than two failures being
> detected in quick succession.
>

Are you then thinking that MPI processes are going to detect failure instead
of local orteds?? Right now, no MPI process would ever report failure of a
peer - the orted detects failure using the sigchild and reports it. What
mechanism would the MPI procs use, and how would that be more reliable than
sigchild??

So right now the HNP can -never- receive more than one failure report at a
time for a process. The only issue we've been working is that there are
several pathways for reporting that error - e.g., if the orted detects the
process fails and reports it, and then the orted itself fails, we can get
multiple failure events back at the HNP before we respond to the first one.

Not the same issue as having MPI procs reporting failures...



>
> I'm not sure what ORCM does in the respect, but I don't know of anything in
> ORTE that would track this data other than the process state and that
> doesn't keep track of anything beyond one failure (which admittedly isn't an
> issue until we implement process recovery).
>

We aren't having any problems with process recovery and process state -
without tracking epochs. We only track "incarnations" so that we can pass it
down to the apps, which use that info to guide their restart.

Could you clarify why you are having a problem in this regard? Might help to
better understand your proposed changes.


>
>
> Really the point is that by changing the process name, you prevent the need
> to pack the epoch each time you have any sort of communication. All that
> work is done along with packing the rest of the structure.
>
>
> No argument - I don't mind having the value in the name. Makes no
> difference to me.
>
>
>  On Tuesday, June 7, 2011 at 11:21 AM, Ralph Castain wrote:
>
> Thanks for the exp

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca wrote:

>
> On Jun 7, 2011, at 12:14 , Ralph Castain wrote:
>
> > But the epoch is process-unique - i.e., it is the number of times that
> this specific process has been started, which differs per proc since we
> don't restart all the procs every time one fails.
>
> Yes the epoch is per process, but it is distributed among all participants.
> The difficulty here is to make sure the global view of the processes
> converges toward a common value of the epoch for each process.
>

Sounds racy...is it actually necessary to have a global agreement on epoch?
Per my other note, perhaps we really need a primer on this epoch concept.



>
> > So if I look at the epoch of the proc sending me a message, I really
> can't check it against my own value as the comparison is meaningless. All I
> really can do is check to see if it changed from the last time I heard from
> that proc, which would tell me that the proc has been restarted in the
> interim.
>
> I fail to understand your statement here. However, comparing message epoch
> is critical to ensure the correct behavior.  It ensures we do not react on
> old messages (that were floating in the system for some obscure reasons),
> and that we have the right contact information for a specific peer (on the
> correct epoch).
>

Again, maybe we need a better understanding of what you mean by epoch -
clearly, there is misunderstanding of what you are proposing to do.

I'm leery of anything that requires a general consensus as it creates a lot
of race conditions - might work under certain circumstances, but we've been
burned by that approach too many times.



>  george.
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
> 
> Perhaps it would help if you folks could provide a little explanation about 
> how you use epoch? While the value sounds similar, your explanations are 
> beginning to sound very different from what we are doing and/or had 
> envisioned. 
> 
> I'm not sure how you can talk about an epoch being too high or too low, 
> unless you are envisioning an overall system where procs try to maintain some 
> global notion of the value - which sounds like a race condition begging to 
> cause problems. 
> 
> 
> 
> 

When we say epoch we mean a value that is stored locally. When a failure is 
detected the detector notifies the HNP who notifies everyone else. Thus 
everyone will _eventually_ receive the notification that the process has 
failed. It may take a while for you to receive the notification, but in the 
meantime you will behave normally. When you do receive the notification that 
the failure occurred, you update your local copy of the epoch.

This is similar to the definition of the "perfect" failure detector that Josh 
references. It doesn't matter if you don't find out about the failure immediately, 
as long as you find out about it eventually. If you aren't actually in the same 
jobid as the failed process you might never find out about the failure because 
it does not apply to you.
> Are you then thinking that MPI processes are going to detect failure instead 
> of local orteds?? Right now, no MPI process would ever report failure of a 
> peer - the orted detects failure using the sigchild and reports it. What 
> mechanism would the MPI procs use, and how would that be more reliable than 
> sigchild??
> 
> 
> 

Definitely not. ORTEDs are the processes that detect and report the failures. 
They can detect the failure of other ORTEDs or of applications. Basically 
anything to which they have a connection.
> 
> So right now the HNP can -never- receive more than one failure report at a 
> time for a process. The only issue we've been working is that there are 
> several pathways for reporting that error - e.g., if the orted detects the 
> process fails and reports it, and then the orted itself fails, we can get 
> multiple failure events back at the HNP before we respond to the first one. 
> 
> Not the same issue as having MPI procs reporting failures...
This is where the epoch becomes necessary. When reporting a failure, you tell 
the HNP which process failed by name, including the epoch. Thus the HNP will 
not mark a process as having failed twice (thus incrementing the epoch twice 
and notifying everyone about the failure twice). The HNP might receive multiple 
notifications because more than one ORTED could (and often will) detect the 
failure. It is easier to have the HNP decide what is a failure and what is a 
duplicate rather than have the ORTEDs reach some consensus about the fact that 
a process has failed. Much less overhead this way.
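
A conceptual, self-contained sketch (plain C, invented names -- not the patch)
of that HNP-side filtering: the first report of a given (process, epoch) pair
is acted on, and later reports of the same pair are recognized as duplicates.

/* HNP-side duplicate filtering of failure reports; invented names. */
#include <stdbool.h>
#include <stdio.h>

#define NPROCS 4
static unsigned int hnp_epoch[NPROCS];   /* HNP's current epoch per process */

/* Returns true only for the first report of a given (vpid, epoch). */
static bool record_failure(unsigned int vpid, unsigned int reported_epoch)
{
    if (reported_epoch < hnp_epoch[vpid]) {
        return false;                       /* duplicate: already handled */
    }
    hnp_epoch[vpid] = reported_epoch + 1;   /* bump the epoch, notify everyone */
    return true;
}

int main(void)
{
    /* Two ORTEDs report the same failure of process 2 at epoch 0. */
    printf("first report handled:  %d\n", record_failure(2, 0));   /* 1 */
    printf("second report handled: %d\n", record_failure(2, 0));   /* 0 */
    return 0;
}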
> > 
> > I'm not sure what ORCM does in the respect, but I don't know of anything in 
> > ORTE that would track this data other than the process state and that 
> > doesn't keep track of anything beyond one failure (which admittedly isn't 
> > an issue until we implement process recovery). 
> 
> We aren't having any problems with process recovery and process state - 
> without tracking epochs. We only track "incarnations" so that we can pass it 
> down to the apps, which use that info to guide their restart. 
> 
> Could you clarify why you are having a problem in this regard? Might help to 
> better understand your proposed changes.
I think we're talking about the same thing here. The only difference is that 
I'm not looking at the ORCM code so I don't have the "incarnations".




Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Ah - thanks! That really helped clarify things. Much appreciated.

Will look at the patch in this light...

On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland  wrote:

>
> Perhaps it would help if you folks could provide a little explanation about
> how you use epoch? While the value sounds similar, your explanations are
> beginning to sound very different from what we are doing and/or had
> envisioned.
>
> I'm not sure how you can talk about an epoch being too high or too low,
> unless you are envisioning an overall system where procs try to maintain
> some global notion of the value - which sounds like a race condition begging
> to cause problems.
>
>
> When we say epoch we mean a value that is stored locally. When a failure is
> detected the detector notifies the HNP who notifies everyone else. Thus
> everyone will _eventually_ receive the notification that the process has
> failed. It may take a while for you to receive the notification, but in the
> meantime you will behave normally. When you do receive the notification that
> the failure occurred, you update your local copy of the epoch.
>
> This is similar to the definition of the "perfect" failure detector that
> Josh references. It doesn't matter if you don't find about the failure
> immediately, as long as you find out about it eventually. If you aren't
> actually in the same jobid as the failed process you might never find out
> about the failure because it does not apply to you.
>
> Are you then thinking that MPI processes are going to detect failure
> instead of local orteds?? Right now, no MPI process would ever report
> failure of a peer - the orted detects failure using the sigchild and reports
> it. What mechanism would the MPI procs use, and how would that be more
> reliable than sigchild??
>
> Definitely not. ORTEDs are the processes that detect and report the
> failures. They can detect the failure of other ORTEDs or of applications.
> Basically anything to which they have a connection.
>
>
> So right now the HNP can -never- receive more than one failure report at a
> time for a process. The only issue we've been working is that there are
> several pathways for reporting that error - e.g., if the orted detects the
> process fails and reports it, and then the orted itself fails, we can get
> multiple failure events back at the HNP before we respond to the first one.
>
> Not the same issue as having MPI procs reporting failures...
>
> This is where the epoch becomes necessary. When reporting a failure, you
> tell the HNP which process failed by name, including the epoch. Thus the HNP
> will not make a process as having failed twice (thus incrementing the epoch
> twice and notifying everyone about the failure twice). The HNP might receive
> multiple notifications because more than one ORTED could (and often will)
> detect the failure. It is easier to have the HNP decide what is a failure
> and what is a duplicate rather than have the ORTEDs reach some consensus
> about the fact that a process has failed. Much less overhead this way.
>
>
> I'm not sure what ORCM does in this respect, but I don't know of anything in
> ORTE that would track this data other than the process state and that
> doesn't keep track of anything beyond one failure (which admittedly isn't an
> issue until we implement process recovery).
>
>
> We aren't having any problems with process recovery and process state -
> without tracking epochs. We only track "incarnations" so that we can pass it
> down to the apps, which use that info to guide their restart.
>
> Could you clarify why you are having a problem in this regard? Might help
> to better understand your proposed changes.
>
> I think we're talking about the same thing here. The only difference is
> that I'm not looking at the ORCM code so I don't have the "incarnations".
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Quick question: could you please clarify this statement:

...because more than one ORTED could (and often will) detect the failure.
>

I don't understand how this can be true, except for detecting an ORTED
failure. Only one orted can detect an MPI process failure, unless you have
now involved orteds in MPI communications (and I don't believe you did). If
the HNP directs another orted to restart that proc, and then that
incarnation fails, then the epoch number -should- increment again, shouldn't
it?

So are you concerned (re having the HNP mark a proc down multiple times)
about orted failure detection? In that case, I agree that you can have
multiple failure detections - we dealt with it differently in orcm, but I
have no issue with doing it another way. Just helps to know what problem you
are trying to solve.


Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
Definitely we are targeting ORTED failures here. If an ORTED fails then any 
other ORTEDs connected to it will notice and report the failure. Of course if 
the failure is an application then the ORTED on that node will be the only one 
to detect it.

Also, if an ORTED is lost, all of the applications running underneath it are 
also lost because we have no way to communicate with them anymore.
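
To illustrate that second point with a rough sketch (illustrative types and
names only, not the actual code): when the HNP learns that a daemon died,
every application process hosted by that daemon is treated as failed as well,
since the daemon was its only communication path.

/* Illustrative sketch only (made-up types/names, not ORTE code):
 * when a daemon dies, every application process it was hosting is
 * unreachable and must be declared failed as well. */
#include <stdio.h>

typedef struct {
    unsigned int vpid;            /* rank of the application process        */
    unsigned int hosting_daemon;  /* vpid of the daemon running the process */
    unsigned int epoch;           /* times this process has been declared dead */
    int alive;
} proc_entry_t;

static void hnp_daemon_failed(proc_entry_t *procs, size_t nprocs,
                              unsigned int failed_daemon)
{
    for (size_t i = 0; i < nprocs; ++i) {
        if (procs[i].alive && procs[i].hosting_daemon == failed_daemon) {
            procs[i].alive = 0;
            procs[i].epoch++;   /* each detected loss bumps the epoch once */
            printf("proc %u lost with daemon %u (epoch now %u)\n",
                   procs[i].vpid, failed_daemon, procs[i].epoch);
        }
    }
}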

On Tuesday, June 7, 2011 at 3:14 PM, Ralph Castain wrote:

> Quick question: could you please clarify this statement:
> 
> > ...because more than one ORTED could (and often will) detect the failure. 
> 
> I don't understand how this can be true, except for detecting an ORTED 
> failure. Only one orted can detect an MPI process failure, unless you have 
> now involved orteds in MPI communications (and I don't believe you did). If 
> the HNP directs another orted to restart that proc, and then that incarnation 
> fails, then the epoch number -should- increment again, shouldn't it? 
> 
> So are you concerned (re having the HNP mark a proc down multiple times) 
> about orted failure detection? In that case, I agree that you can have 
> multiple failure detections - we dealt with it differently in orcm, but I 
> have no issue with doing it another way. Just helps to know what problem you 
> are trying to solve. 
> 
> 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org (mailto:de...@open-mpi.org)
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] MPI application hangs after a checkpoint

2011-06-07 Thread Kishor Kharbas
Hello,

I am trying to use the checkpoint-restart functionality of Open MPI. Most of
the time checkpointing an MPI application behaves correctly, but in some
situations the MPI application hangs indefinitely after the checkpoint is
taken. ompi-checkpoint terminates without error and I do get the snapshot
reference, but the application does not resume (it seems to be busy-waiting
somewhere in MPI code). I have not been able to reproduce this problem reliably
enough to find the exact scenario that leads to this issue.
But these things are common in all the scenarios that lead to the error:
1. The openib BTL is used (using the TCP BTL does not produce this error).
2. The communication is of the form Isends/Irecvs followed by Waitall(...).
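
For reference, here is a minimal sketch of that communication pattern. This is
not my actual application code (which is Fortran, as the backtraces below
show), just the shape of the exchange in C:

/* Minimal sketch of the Isend/Irecv + Waitall pattern, not the actual
 * application: each rank exchanges a buffer with its ring neighbors. */
#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, size;
    double sendbuf[N], recvbuf[N];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; ++i) {
        sendbuf[i] = (double)rank;
    }

    int to   = (rank + 1) % size;          /* right neighbor in the ring */
    int from = (rank - 1 + size) % size;   /* left neighbor in the ring  */

    for (int iter = 0; iter < 1000; ++iter) {
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, from, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, to,   0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}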

I saw a ticket (#2397) that shows some bug fixes targeted for v1.7; I went
through them, but I am not sure whether my problem is caused by those bugs. Are
there any known issues specifically when the openib BTL is used?

I am using Open MPI version 1.5.3.
Please find the output of ompi_info and config.log as attachments.


I am providing these backtraces of a single process taken at different times,
in case they help. All the MPI application processes are in the running state.
Please let me know if additional information is required.

Back trace 1
#0  mca_btl_sm_component_progress () at btl_sm_component.c:560
#1  0x2b09eb1d3105 in opal_progress () at runtime/opal_progress.c:207
#2  0x2b09eb0f9b3f in opal_condition_wait (count=2, requests=0x326bd3b0,
statuses=0x7fffbab81780) at ../opal/threads/condition.h:92
#3  ompi_request_default_wait_all (count=2, requests=0x326bd3b0,
statuses=0x7fffbab81780) at request/req_wait.c:263
#4  0x2b09eb126db6 in PMPI_Waitall (count=2, requests=0x326bd3b0,
statuses=0x7fffbab81780) at pwaitall.c:70
#5  0x2b09eac69c65 in MPI_Wait (request=0x7fffbab81838,
status=0x7fffbab81820) at mrmpi_p2p.c:3330
#6  0x2b09eac6a1aa in mpi_wait_ (request=0x7fffbab81948,
status=0x7fffbab81920, ierror=0x7fffbab81958) at mrmpi_p2p.c:3418
#7  0x0040476c in conj_grad (colidx=0x608a40, rowstr=0x41c1f7c,
x=0x1c4b6298, z=0x1c624608, a=0x14d43820, p=0x1c792978, q=0x1c900ce8,
r=0x1ca6f058,
w=0x1cbdd3c8, rnorm=@0x7fffbab81dd0, l2npcols=@0x7fffbab81e2c,
reduce_exch_proc=0x7fffbab81d50, reduce_send_starts=0x7fffbab81cf0,
reduce_send_lengths=0x7fffbab81d30, reduce_recv_starts=0x7fffbab81d10,
reduce_recv_lengths=0x7fffbab81d70) at cg.f:1295
#8  0x00402271 in cg_unit () at cg.f:502
#9  0x0040181b in MAIN__ () at cg.f:56
#10 0x00406e8e in main ()

Back trace 2
#0  0x2f710a8a in get_sw_cqe (cq=<value optimized out>, n=19) at src/cq.c:119
#1  0x2f710f01 in next_cqe_sw (ibcq=0x32a7cde0, ne=1, wc=<value optimized out>) at src/cq.c:125
#2  mlx4_poll_one (ibcq=0x32a7cde0, ne=1, wc=<value optimized out>) at src/cq.c:205
#3  mlx4_poll_cq (ibcq=0x32a7cde0, ne=1, wc=<value optimized out>) at src/cq.c:352
#4  0x2d9d7b53 in opal_pointer_array_get_item () at
../../../../opal/threads/mutex_unix.h:102
#5  btl_openib_component_progress () at btl_openib_component.c:3540
#6  0x2b09eb1d3105 in opal_progress () at runtime/opal_progress.c:207
#7  0x2b09eb0f9b3f in opal_condition_wait (count=2, requests=0x326bd3b0,
statuses=0x7fffbab81780) at ../opal/threads/condition.h:92
#8  ompi_request_default_wait_all (count=2, requests=0x326bd3b0,
statuses=0x7fffbab81780) at request/req_wait.c:263
#9  0x2b09eb126db6 in PMPI_Waitall (count=2, requests=0x326bd3b0,
statuses=0x7fffbab81780) at pwaitall.c:70
#10 0x2b09eac69c65 in MPI_Wait (request=0x7fffbab81838,
status=0x7fffbab81820) at mrmpi_p2p.c:3330
#11 0x2b09eac6a1aa in mpi_wait_ (request=0x7fffbab81948,
status=0x7fffbab81920, ierror=0x7fffbab81958) at mrmpi_p2p.c:3418
#12 0x0040476c in conj_grad (colidx=0x608a40, rowstr=0x41c1f7c,
x=0x1c4b6298, z=0x1c624608, a=0x14d43820, p=0x1c792978, q=0x1c900ce8,
r=0x1ca6f058,
w=0x1cbdd3c8, rnorm=@0x7fffbab81dd0, l2npcols=@0x7fffbab81e2c,
reduce_exch_proc=0x7fffbab81d50, reduce_send_starts=0x7fffbab81cf0,
reduce_send_lengths=0x7fffbab81d30, reduce_recv_starts=0x7fffbab81d10,
reduce_recv_lengths=0x7fffbab81d70) at cg.f:1295
#13 0x00402271 in cg_unit () at cg.f:502
#14 0x0040181b in MAIN__ () at cg.f:56
#15 0x00406e8e in main ()

Back trace 3
#0  mlx4_poll_cq (ibcq=0x32a7cc60, ne=1, wc=<value optimized out>) at src/cq.c:360
#1  0x2d9d7b53 in opal_pointer_array_get_item () at
../../../../opal/threads/mutex_unix.h:102
#2  btl_openib_component_progress () at btl_openib_component.c:3540
#3  0x2b09eb1d3105 in opal_progress () at runtime/opal_progress.c:207
#4  0x2b09eb0f9b3f in opal_condition_wait (count=2, requests=0x326bd3b0,
statuses=0x7fffbab81780) at ../opal/threads/condition.h:92
#5  ompi_request_default_wait_all (count=2, requests=0x326bd3b0,
statuses=0x7fffbab81780) at request/req_wait.c:263
#6  0x2b09eb126db6 in PMPI_Waitall (count=2, requests=0x326bd3b0,
statuses=0x7fffbab81780) at pwaitall.c:70
#7  0x2b09eac69c65 in MPI_Wait (request=0x7fffbab81838,
status=0x7fffbab81820) at mrmpi_p2p.c:3330
#8  0x2b09eac6a1

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Josh Hursey
I looked through the patch a bit more today and had a few notes/questions.
- orte_errmgr.post_startup() starts the persistent RML message. There
does not seem to be a shutdown version of this (to deregister the RML
message at orte_finalize time). Was this intentional, or just missed?
- in orte_errmgr.set_fault_callback: it would be nice if it
returned the previous callback, so you could layer more than one
'thing' on top of ORTE and have them chain in a sigaction-like manner
(see the sketch below).
- orte_process_info.max_procs: this seems to be only used in the
binomial routed, but I was a bit unclear about its purpose. Can you
describe what it does, and how it is used?
- in orted_comm.c: you process the ORTE_PROCESS_FAILED_NOTIFICATION
message here. Why not push all of that logic into the errmgr
components? It is not a big deal, just curious.
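
For the set_fault_callback point above, here is a rough sketch of the
sigaction-like chaining I have in mind. The callback signature and the
registration function below are stand-ins for whatever the patch actually
defines; the "return the previous callback" behavior is the part being
proposed:

/* Sketch only: failed_procs_t and the registration function are
 * stand-ins for whatever the patch actually defines.  The point is
 * that the setter returns the previously registered callback, so a
 * layer above ORTE can chain itself in sigaction-style. */
#include <stddef.h>

typedef struct failed_procs failed_procs_t;             /* opaque here */
typedef void (*fault_cb_fn_t)(failed_procs_t *failed);

/* stand-in for orte_errmgr.set_fault_callback with the proposed
 * "return the old callback" behavior */
static fault_cb_fn_t current_cb = NULL;

static fault_cb_fn_t set_fault_callback(fault_cb_fn_t cb)
{
    fault_cb_fn_t old = current_cb;
    current_cb = cb;
    return old;
}

/* a layer above ORTE chains itself in */
static fault_cb_fn_t previous_cb = NULL;

static void my_layer_fault_cb(failed_procs_t *failed)
{
    /* ... this layer's own handling of the failed processes ... */

    if (NULL != previous_cb) {
        previous_cb(failed);    /* pass the notification down the chain */
    }
}

static void my_layer_init(void)
{
    previous_cb = set_fault_callback(my_layer_fault_cb);
}

With that, each layer only needs to remember the callback it displaced, and
ORTE itself stays unaware of how many layers are stacked on top of it.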

I'll probably send more notes after some more digging and testing of
the code. But the patch is looking good. Good work!

-- Josh

On Tue, Jun 7, 2011 at 10:51 AM, Josh Hursey  wrote:
> I briefly looked over the patch. Excluding the epochs (which we don't
> need now, but will soon) it looks similar to what I have set up on my
> MPI run-through stabilization branch - so it should support that work
> nicely. I'll try to test it this week and send back any other
> comments.
>
> Good work.
>
> Thanks,
> Josh
>
> On Tue, Jun 7, 2011 at 10:46 AM, Wesley Bland  wrote:
>> This could certainly work alongside another ORCM or any other fault
>> detection/prediction/recovery mechanism. Most of the code is just dedicated
>> to keeping the epoch up to date and tracking the status of the processes.
>> The underlying idea was to provide a way for the application to decide what
>> its fault policy would be rather than trying to dictate one in the runtime.
>> If any other layer wanted to register a callback function with this code, it
>> could do anything it wanted to on top of it.
>> Wesley
>>
>> On Tuesday, June 7, 2011 at 7:41 AM, Ralph Castain wrote:
>>
>> I'm on travel this week, but will look this over when I return. From the
>> description, it sounds nearly identical to what we did in ORCM, so I expect
>> there won't be many issues. You do get some race conditions that the new
>> state machine code should help resolve.
>> Only difference I can quickly see is that we chose not to modify the process
>> name structure, keeping the "epoch" (we called it "incarnation") as a
>> separate value. Since we aren't terribly concerned about backward
>> compatibility, I don't consider this a significant issue - but something the
>> community should recognize.
>> My main concern will be to ensure that the new code contains enough
>> flexibility to allow integration with other layers such as ORCM without
>> creating potential conflict over "double protection" - i.e., if the layer
>> above ORTE wants to provide a certain level of fault protection, then ORTE
>> needs to get out of the way.
>>
>> On Mon, Jun 6, 2011 at 1:00 PM, George Bosilca  wrote:
>>
>> WHAT: Allow the runtime to handle fail-stop failures for both runtime
>> (daemons) or application level processes. This patch extends the
>> orte_process_name_t structure with a field to store the process epoch (the
>> number of times it died so far), and add an application failure notification
>> callback function to be registered in the runtime.
>>
>> WHY: Necessary to correctly implement the error handling in the MPI 2.2
>> standard. In addition, such a resilient runtime is a cornerstone for any
>> level of fault tolerance support we want to provide in the future (such as
>> the MPI-3 Run-Through Stabilization or FT-MPI).
>>
>> WHEN:
>>
>> WHERE: Patch attached to this email, based on trunk r24747.
>> TIMEOUT: 2 weeks from now, on Monday 20 June.
>>
>> --
>>
>> MORE DETAILS:
>>
>> Currently the infrastructure required to enable any kind of fault tolerance
>> development in Open MPI (with the exception of the checkpoint/restart) is
>> missing. However, before developing any fault tolerant support at the
>> application (MPI) level, we need to have a resilient runtime. The changes in
>> this patch address this lack of support and would allow anyone to implement
>> a fault tolerance protocol at the MPI layer without having to worry about
>> the ORTE stabilization.
>>
>> This patch will allow the runtime to drop any dead daemons, and re-route all
>> communications around the holes in order to __ALWAYS__ deliver a message as
>> long as the destination process is alive. The application is informed (via a
>> callback) about the loss of the processes with the same jobid. In this patch
>> we do not address the MPI_ERRORS_RETURN type of failures; we focused on the
>> MPI_ERRORS_ABORT ones. Moreover, we empowered the application level with the
>> decision, instead of taking it down in the runtime.
>>
>> NEW STUFF:
>>
>> Epoch - A counter that tracks the number of times a process has been
>> detected to have terminated, either from a failure or an expected
>> termination

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Thanks - that helps!


On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland  wrote:

>  Definitely we are targeting ORTED failures here. If an ORTED fails then
> any other ORTEDs connected to it will notice and report the failure. Of
> course if the failure is an application then the ORTED on that node will be
> the only one to detect it.
>
> Also, if an ORTED is lost, all of the applications running underneath it
> are also lost because we have no way to communicate with them anymore.
>
> On Tuesday, June 7, 2011 at 3:14 PM, Ralph Castain wrote:
>
> Quick question: could you please clarify this statement:
>
> ...because more than one ORTED could (and often will) detect the failure.
>
>
> I don't understand how this can be true, except for detecting an ORTED
> failure. Only one orted can detect an MPI process failure, unless you have
> now involved orteds in MPI communications (and I don't believe you did). If
> the HNP directs another orted to restart that proc, and then that
> incarnation fails, then the epoch number -should- increment again, shouldn't
> it?
>
> So are you concerned (re having the HNP mark a proc down multiple times)
> about orted failure detection? In that case, I agree that you can have
> multiple failure detections - we dealt with it differently in orcm, but I
> have no issue with doing it another way. Just helps to know what problem you
> are trying to solve.
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>