On Feb 13, 2014, at 11:26 AM, Adrian Reber <adr...@lisas.de> wrote:

> On Thu, Feb 06, 2014 at 02:45:07PM -0800, Ralph Castain wrote:
>> On Feb 6, 2014, at 2:16 PM, Adrian Reber <adr...@lisas.de> wrote:
>> 
>>> Josh explained to me a few days ago that after a checkpoint request has
>>> been received, TCP should no longer be used so that no messages are lost.
>>> The communication happens over named pipes and therefore (I think) the OOB
>>> ft_event() is used to quiet everything besides the pipes. This all seems
>>> to work, but I was confused because the ft_event() functions
>>> in oob/tcp and oob/ud do not seem to contain any functionality.
>>> 
>>> So should I fix the ft_event() function in oob/base/ so that it calls the
>>> registered ft_event() functions (which currently do nothing), or should I
>>> just remove the call to the ORTE OOB ft_event()?
>> 
>> Sounds like you'll need to tell the OOB components to stop processing 
>> messages, so that will require that you insert an event into the system. You 
>> have to account for two things:
>> 
>> (a) the OOB base and OOB components are operating on the orte_event_base, but
>> 
>> (b) each OOB component can have multiple active modules (one per NIC) that 
>> are operating on their own event base/thread.
>> 
>> So you have to start by pushing an event that calls the OOB base, which then 
>> loops across the components calling their ft_event interface. Each component 
>> would then have to create an event for each active module, inserting that 
>> event into the module's event base/thread. When activated, each module would 
>> have to shut down its message engine, and activate another event to notify 
>> its component that all is quiet.
>> 
>> Once a component finds out that all its modules are quiet, it would then 
>> have to activate an event to the OOB base. Once the OOB base sees all 
>> components report quiet, then it would have to activate an event to take you 
>> to the next step in your process.
>> 
>> In other words, you need to turn the quieting process into its own set of 
>> states and run it through the state machine. This is the only way to 
>> guarantee that you'll keep things orderly, and is the major change needed in 
>> the C/R procedure as it flows thru ORTE. You can't just progress thru a set 
>> of function calls as you'll inevitably run into a roadblock requiring that 
>> you wait for an event-driven process to complete.
> 
> I tried to implement something like you described. It is not yet event
> driven, but before continuing I wanted to get some feedback if it is at
> least the right start:
> 
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706
> 
> I looked at the other ORTE_OOB_* macros and tried to model my
> functionality a bit after what I have seen there. Right now it is still
> a simple function which just tries to call ft_event() on all oob
> components. Does this look right so far?

Sorry for delay - yes, that looks like the right direction. I would suggest 
doing it via the current state machine, though, by simply defining another job 
or proc state in orte/mca/plm/plm_types.h, and then registering a callback 
function using the orte_state.add_job[proc]_state(state, function to be called, 
ORTE_ERR_PRI). Then you can activate it by calling 
ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in the proper 
order.
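
To make the flow concrete, here is a minimal, self-contained sketch of the quieting cascade run as a sequence of states. It is a toy model, not ORTE code: the state names, add_state(), activate_state(), and run_quiesce_demo() are all hypothetical stand-ins for new entries in plm_types.h, orte_state.add_job_state(), and ORTE_ACTIVATE_JOB_STATE(), and a single array-backed queue stands in for the event bases (real OOB modules may run on their own event base/thread).

```c
#include <assert.h>

/* Hypothetical states for illustration only; real ones would be defined next
 * to the existing job/proc states in orte/mca/plm/plm_types.h. */
enum { ST_QUIESCE_OOB, ST_MODULE_QUIET, ST_COMPONENT_QUIET, ST_OOB_QUIET, ST_MAX };

#define NUM_COMPONENTS 2        /* assumed: e.g. oob/tcp and oob/ud */
#define MODULES_PER_COMPONENT 3 /* assumed: e.g. one module per NIC */
#define QUEUE_LEN 32

typedef void (*state_cb_t)(int arg);
struct event { int state; int arg; };

static state_cb_t callbacks[ST_MAX];
static struct event queue[QUEUE_LEN];
static int q_head, q_tail;
static int modules_active[NUM_COMPONENTS];
static int components_active;
static int checkpoint_ready;

/* Toy stand-in for registering a callback via orte_state.add_job_state(). */
static void add_state(int state, state_cb_t cb) { callbacks[state] = cb; }

/* Toy stand-in for ORTE_ACTIVATE_JOB_STATE(): queue the state, never call
 * the handler directly, so everything is processed in order. */
static void activate_state(int state, int arg)
{
    assert(q_tail < QUEUE_LEN);
    queue[q_tail].state = state;
    queue[q_tail].arg = arg;
    q_tail++;
}

/* OOB base: ask every module of every component to go quiet. */
static void on_quiesce_oob(int unused)
{
    (void)unused;
    for (int c = 0; c < NUM_COMPONENTS; c++)
        for (int m = 0; m < MODULES_PER_COMPONENT; m++)
            activate_state(ST_MODULE_QUIET, c);
}

/* A module has shut down its message engine; notify its component. */
static void on_module_quiet(int comp)
{
    if (--modules_active[comp] == 0)
        activate_state(ST_COMPONENT_QUIET, comp);
}

/* A component has no active modules left; notify the OOB base. */
static void on_component_quiet(int comp)
{
    (void)comp;
    if (--components_active == 0)
        activate_state(ST_OOB_QUIET, 0);
}

/* All components are quiet; the next C/R step could be activated here. */
static void on_oob_quiet(int unused)
{
    (void)unused;
    checkpoint_ready = 1;
}

/* Runs the cascade; returns the number of events dispatched, or -1 if the
 * final "all quiet" state was never reached. */
int run_quiesce_demo(void)
{
    int dispatched = 0;
    q_head = q_tail = 0;
    checkpoint_ready = 0;
    components_active = NUM_COMPONENTS;
    for (int c = 0; c < NUM_COMPONENTS; c++)
        modules_active[c] = MODULES_PER_COMPONENT;

    add_state(ST_QUIESCE_OOB, on_quiesce_oob);
    add_state(ST_MODULE_QUIET, on_module_quiet);
    add_state(ST_COMPONENT_QUIET, on_component_quiet);
    add_state(ST_OOB_QUIET, on_oob_quiet);

    activate_state(ST_QUIESCE_OOB, 0);
    while (q_head < q_tail) {           /* toy event loop */
        struct event ev = queue[q_head++];
        callbacks[ev.state](ev.arg);
        dispatched++;
    }
    return checkpoint_ready ? dispatched : -1;
}
```

Calling run_quiesce_demo() from a main() walks through one quiesce request, six module notifications, two component notifications, and one final "all quiet" event. The point of the sketch is only that no handler waits on another: each one finishes its own work and activates the next state.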


> 
>               Adrian
