Yesterday evening George, Thomas and I discussed some of their concerns about 
this RFC at the MPI Forum meeting. After the discussion, we seemed to be in 
agreement that the RecoS framework is a good idea and the concepts and fixes in 
this RFC should move forward with a couple of notes:

 - They wanted to test the branch a bit more over the next couple of days. Some 
MCA parameters that you will need are at the bottom of this message.

 - Reiterate that this RFC only addresses ORTE stability, not OMPI stability. 
The OMPI stability extension is a second step for the line of work, and 
should/will fit in nicely with the RecoS framework being proposed in this RFC. 
The OMPI layer stability will require a significant amount of work, but the 
RecoS framework will provide the ORTE layer stability that is required as a 
foundation for OMPI layer stability in the future.

 - The purpose of the ErrMgr becomes slightly unclear with the addition of the 
RecoS framework, since both are focused on responding to faults in the system 
(and RecoS, when enabled, overrides most/all of the ErrMgr functionality). 
Should the RecoS framework be merged with the ErrMgr framework to create a new 
ErrMgr interface?

We are trying to decide if we should merge these frameworks, but at this point 
we are interested in hearing how other developers feel about merging the ErrMgr 
and RecoS frameworks, which would change the ErrMgr API. Are there any 
developers out there who are developing ErrMgr components, or are using any 
particular features of the existing ErrMgr framework that they would like to 
see preserved in the next revision? By default, the existing abort behavior 
of the ErrMgr framework will be preserved, so the user will have to 'opt-in' 
to any fault recovery capabilities.

So we are continuing the discussion a bit more off-list, and will return to the 
list with an updated RFC (and possibly a new branch) soon (hopefully end of the 
week/early next week). I would like to briefly discuss this RFC at the Open MPI 
teleconf next Tuesday.

-- Josh

On Feb 26, 2010, at 8:06 AM, Josh Hursey wrote:

> Sounds good to me.
> 
> For those casually following this RFC let me summarize its current state.
> 
> Josh and George (and anyone else attending the forum who wishes to 
> participate) will meet sometime at the next MPI Forum meeting (March 8-10). I will 
> post any relevant notes from this meeting back to the list afterwards. So the 
> RFC is on hold pending the outcome of that meeting. For those developers 
> interested in this RFC that will not be able to attend, feel free to continue 
> using this thread for discussion.
> 
> Thanks,
> Josh
> 
> On Feb 26, 2010, at 6:09 AM, George Bosilca wrote:
> 
>> 
>> On Feb 26, 2010, at 01:50 , Josh Hursey wrote:
>> 
>>> Any of those options are fine with me. I was thinking that if you wanted to 
>>> talk sooner, we might be able to help explain our intentions with this 
>>> framework a bit better. I figure that the framework interface will change a 
>>> bit as we all advance and incorporate our various techniques into it. I 
>>> think that the current interface is a good first step, but there are 
>>> certainly many more steps to come.
>>> 
>>> I am fine delaying this code a bit, just not too long. Meeting at the forum 
>>> for a while might be a good option (we could probably even arrange to call 
>>> in others if you wanted).
>> 
>> Sounds good, let's do this.
>> 
>> Thanks,
>>   george.
>> 
>>> 
>>> Cheers,
>>> Josh
>>> 
>>> On Feb 25, 2010, at 6:45 PM, Ralph Castain wrote:
>>> 
>>>> If Josh is going to be at the forum, perhaps you folks could chat there? 
>>>> Might as well take advantage of being colocated, if possible.
>>>> 
>>>> Otherwise, I'm available pretty much any time. I can't contribute much 
>>>> about the MPI recovery issues, but can contribute to the RTE issues if 
>>>> that helps.
>>>> 
>>>> 
>>>> On Thu, Feb 25, 2010 at 7:39 PM, George Bosilca <bosi...@eecs.utk.edu> 
>>>> wrote:
>>>> Josh,
>>>> 
>>>> Next week is a little bit too early, as we will need some time to figure 
>>>> out how to integrate with this new framework, and to what extent our code 
>>>> and requirements fit into it. Then the week after is the MPI Forum. How 
>>>> about on Thursday 11 March?
>>>> 
>>>> Thanks,
>>>> george.
>>>> 
>>>> On Feb 25, 2010, at 12:46 , Josh Hursey wrote:
>>>> 
>>>>> Per my previous suggestion, would it be useful to chat on the phone early 
>>>>> next week about our various strategies?
>>>> 
>>>> 
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

