On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
> 
> On Jan 10, 2014, at 8:02 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > I am currently trying to understand how callbacks work. Right now
> > I am looking at orte_rml_base_comm_start() in
> > orte/mca/rml/base/rml_base_receive.c, which does
> > 
> >    orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
> >                            ORTE_RML_TAG_RML_INFO_UPDATE,
> >                            ORTE_RML_PERSISTENT,
> >                            orte_rml_base_recv,
> >                            NULL);
> > 
> > As far as I understand it orte_rml_base_recv() is the callback function.
> > At which point should this function run? When the data is actually
> > received?
> 
> Not precisely. When data is received by the OOB, it pushes the data into an 
> event. When that event gets serviced, it calls the orte_rml_base_receive 
> function which processes the data to find the matching tag, and then uses 
> that to execute the callback to the user code.
> 
> > 
> > The same goes for the send_buffer_nb() functions. I do not see the
> > callback functions actually running. How can I verify that the callback
> > functions are running? Especially for the send case it sounds pretty
> > obvious how it should work, but I never see the callback function
> > running, at least in my setup.
> 
> The data is not immediately sent. It gets pushed into an event. When that 
> event gets serviced, it calls the orte_oob_base_send function which then 
> passes the data to each active OOB component until one of them says it can 
> send it. The data is then pushed into another event to get it into the event 
> base for that component's active module - when that event gets serviced, the 
> data is sent. Once the data is sent, an event is created that, when serviced, 
> executes the callback to the user code.
> 
> If you aren't seeing callbacks, the most likely cause is that the orte 
> progress thread isn't running. Without it, none of this will work.

Thanks. When I run configure without '--with-ft=cr', I can run a program
and use orte-top. In orterun I can see that the callback is running, and
orte-top displays the retrieved information; the callbacks in orte-top
itself are working as well. When I configure with '--with-ft=cr', both
orte-top and orte-checkpoint crash. Neither tool seems to have working
callbacks any more, which is probably why they crash. So some code that
is enabled by '--with-ft=cr' seems to break callbacks in orte-top and in
orte-checkpoint. orterun handles callbacks whether or not it is
configured with '--with-ft=cr'.

                Adrian
