On Feb 13, 2014, at 11:26 AM, Adrian Reber <adr...@lisas.de> wrote:

> On Thu, Feb 06, 2014 at 02:45:07PM -0800, Ralph Castain wrote:
>> On Feb 6, 2014, at 2:16 PM, Adrian Reber <adr...@lisas.de> wrote:
>>
>>> Josh explained it to me a few days ago, that after a checkpoint has been
>>> received TCP should no longer be used to not lose any messages. The
>>> communication happens over named pipes and therefore (I think) OOB
>>> ft_event() is used to quiet anything besides the pipes. This all seems
>>> to work but I was just confused as the functions for ft_event()
>>> in oob/tcp and oob/ud do not seem to contain any functionality.
>>>
>>> So do I try to fix the ft_event() function in oob/base/ to call the
>>> registered ft_event() function which does nothing or do I just remove
>>> the call to orte oob ft_event().
>>
>> Sounds like you'll need to tell the OOB components to stop processing
>> messages, so that will require that you insert an event into the system. You
>> have to account for two things:
>>
>> (a) the OOB base and OOB components are operating on the orte_event_base, but
>>
>> (b) each OOB component can have multiple active modules (one per NIC) that
>> are operating on their own event base/thread.
>>
>> So you have to start by pushing an event that calls the OOB base, which then
>> loops across the components calling their ft_event interface. Each component
>> would then have to create an event for each active module, inserting that
>> event into the module's event base/thread. When activated, each module would
>> have to shut down its message engine, and activate another event to notify
>> its component that all is quiet.
>>
>> Once a component finds out that all its modules are quiet, it would then
>> have to activate an event to the OOB base. Once the OOB base sees all
>> components report quiet, then it would have to activate an event to take you
>> to the next step in your process.
>>
>> In other words, you need to turn the quieting process into its own set of
>> states and run it through the state machine. This is the only way to
>> guarantee that you'll keep things orderly, and is the major change needed in
>> the C/R procedure as it flows thru ORTE. You can't just progress thru a set
>> of function calls as you'll inevitably run into a roadblock requiring that
>> you wait for an event-driven process to complete.
>
> I tried to implement something like you described. It is not yet event
> driven, but before continuing I wanted to get some feedback if it is at
> least the right start:
>
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706
>
> I looked at the other ORTE_OOB_* macros and tried to model my
> functionality a bit after what I have seen there. Right now it is still
> a simple function which just tries to call ft_event() on all oob
> components. Does this look right so far?

Sorry for the delay - yes, that looks like the right direction. I would suggest doing it via the current state machine, though, by simply defining another job or proc state in orte/mca/plm/plm_types.h and then registering a callback function with orte_state.add_job_state() or orte_state.add_proc_state(), passing the state, the function to be called, and ORTE_ERR_PRI. You can then activate it by calling ORTE_ACTIVATE_JOB_STATE(NULL, state) or ORTE_ACTIVATE_PROC_STATE(NULL, state), and it will be handled in the proper order.

> Adrian
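
To make the register-then-activate pattern concrete, here is a minimal self-contained sketch in plain C. It is a mock, not ORTE code: every `sketch_*` and `SKETCH_*` name is hypothetical, the two states are invented for illustration, and each component is assumed to go quiet immediately, whereas the real OOB components would need their own per-module events. The mock stores the priority but runs events in FIFO order. It only shows how a quieting step can be driven by activating states instead of a chain of direct function calls.

```c
/* Mock state machine; none of the sketch_* names exist in ORTE. */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_MAX_STATES     8
#define SKETCH_MAX_EVENTS     16
#define SKETCH_NUM_COMPONENTS 2   /* stand-ins for e.g. oob/tcp and oob/ud */

/* Hypothetical states for the quieting sequence. */
typedef enum {
    SKETCH_STATE_OOB_QUIET = 1,      /* ask the OOB layer to go quiet */
    SKETCH_STATE_OOB_QUIET_DONE = 2  /* all components reported quiet */
} sketch_state_t;

typedef void (*sketch_cbfunc_t)(void *jdata, sketch_state_t state);

typedef struct { sketch_state_t state; sketch_cbfunc_t cbfunc; int priority; } sketch_entry_t;
typedef struct { void *jdata; sketch_state_t state; } sketch_event_t;

static sketch_entry_t sketch_table[SKETCH_MAX_STATES];
static size_t sketch_num_states;
static sketch_event_t sketch_queue[SKETCH_MAX_EVENTS];
static size_t sketch_head, sketch_tail;
static int sketch_components_quiet;
static int sketch_all_quiet;

/* Register a callback for a state (priority is stored but unused here). */
int sketch_add_job_state(sketch_state_t state, sketch_cbfunc_t cbfunc, int priority)
{
    if (sketch_num_states == SKETCH_MAX_STATES) return -1;
    sketch_table[sketch_num_states].state = state;
    sketch_table[sketch_num_states].cbfunc = cbfunc;
    sketch_table[sketch_num_states].priority = priority;
    sketch_num_states++;
    return 0;
}

/* Queue a state for later handling instead of calling its callback directly. */
int sketch_activate_job_state(void *jdata, sketch_state_t state)
{
    if (sketch_tail == SKETCH_MAX_EVENTS) return -1;
    sketch_queue[sketch_tail].jdata = jdata;
    sketch_queue[sketch_tail].state = state;
    sketch_tail++;
    return 0;
}

/* Drain the queue in FIFO order; returns the number of callbacks run. */
int sketch_progress(void)
{
    int handled = 0;
    while (sketch_head < sketch_tail) {
        sketch_event_t ev = sketch_queue[sketch_head++];
        for (size_t i = 0; i < sketch_num_states; i++) {
            if (sketch_table[i].state == ev.state) {
                sketch_table[i].cbfunc(ev.jdata, ev.state);
                handled++;
                break;
            }
        }
    }
    return handled;
}

/* Stand-in for the OOB base looping over components' ft_event(). */
static void sketch_oob_quiet_cb(void *jdata, sketch_state_t state)
{
    (void)state;
    for (int i = 0; i < SKETCH_NUM_COMPONENTS; i++) {
        sketch_components_quiet++;   /* assumed: component quiets at once */
    }
    if (sketch_components_quiet == SKETCH_NUM_COMPONENTS) {
        sketch_activate_job_state(jdata, SKETCH_STATE_OOB_QUIET_DONE);
    }
}

/* Runs once every component is quiet; the next C/R step would start here. */
static void sketch_oob_quiet_done_cb(void *jdata, sketch_state_t state)
{
    (void)jdata; (void)state;
    sketch_all_quiet = 1;
}

/* Register both states, activate the first, and run the queue. */
int sketch_run_demo(void)
{
    sketch_num_states = 0;
    sketch_head = sketch_tail = 0;
    sketch_components_quiet = 0;
    sketch_all_quiet = 0;
    sketch_add_job_state(SKETCH_STATE_OOB_QUIET, sketch_oob_quiet_cb, 0);
    sketch_add_job_state(SKETCH_STATE_OOB_QUIET_DONE, sketch_oob_quiet_done_cb, 0);
    sketch_activate_job_state(NULL, SKETCH_STATE_OOB_QUIET);
    return sketch_progress();
}

#ifdef SKETCH_MAIN
int main(void)
{
    int handled = sketch_run_demo();
    assert(sketch_all_quiet == 1);
    printf("callbacks run: %d\n", handled);
    return 0;
}
#endif
```

Compiling with `-DSKETCH_MAIN` gives a small standalone program. In ORTE itself the registration and activation would go through `orte_state.add_job_state()` and `ORTE_ACTIVATE_JOB_STATE()` as described above, and the "component is quiet" step would be reported by further events rather than set synchronously as in this mock.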