If it is the application, then there is probably a barrier in
app_coord_init() to make sure all the applications are up and running.
After that point, the global coordinator knows that the application can
be checkpointed.

From what I recall, I don't think orte-checkpoint should be calling a
barrier.


On Mon, Jan 20, 2014 at 4:46 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Is it orte-checkpoint that is hanging, or the app you are trying to
> checkpoint?
>
>
> On Jan 20, 2014, at 2:10 PM, Adrian Reber <adr...@lisas.de> wrote:
>
> Thanks for your help. I tried initializing the barrier correctly (see
> attached patch), but now, instead of crashing, it just hangs on the
> barrier while running orte-checkpoint:
>
> [dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
> [dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at
> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206
>
> #0  0x00007ffff69befa0 in __nanosleep_nocancel () at
> ../sysdeps/unix/syscall-template.S:81
> #1  0x00007ffff7b456ba in app_coord_init () at
> ../../../../../orte/mca/snapc/full/snapc_full_app.c:207
> #2  0x00007ffff7b3a582 in orte_snapc_full_module_init (seed=false,
> app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
>
> it hangs looping at ORTE_WAIT_FOR_COMPLETION(coll->active);
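
The hang can be pictured with a small model. This is a minimal sketch, not ORTE code: every demo_ name is hypothetical, and it assumes only what the backtrace suggests, namely that the wait loop keeps polling until something else clears the active flag. In the sketch the flag is cleared when the last expected participant checks in. If one participant never does, the loop never exits (the sketch gives up after max_polls so that it terminates).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a collective; not the ORTE structure. */
typedef struct {
    bool active;    /* true until the last participant has arrived */
    int expected;   /* participants that must check in */
    int arrived;    /* participants that have checked in so far */
} demo_coll_t;

/* Models "servicing events": may or may not deliver an arrival. */
typedef void (*demo_progress_fn)(demo_coll_t *coll);

void demo_coll_init(demo_coll_t *coll, int expected)
{
    assert(expected > 0);
    coll->active = true;
    coll->expected = expected;
    coll->arrived = 0;
}

/* One participant checks in; the last arrival releases the barrier. */
void demo_coll_arrive(demo_coll_t *coll)
{
    coll->arrived++;
    if (coll->arrived >= coll->expected) {
        coll->active = false;
    }
}

/* Bounded wait loop: polls until the flag clears or max_polls is hit.
 * Returns true if the barrier was released, false if it "hung". */
bool demo_wait(demo_coll_t *coll, demo_progress_fn progress, int max_polls)
{
    for (int i = 0; i < max_polls && coll->active; i++) {
        progress(coll);
    }
    return !coll->active;
}

/* Progress function where one more participant arrives per poll. */
void demo_progress_one_arrival(demo_coll_t *coll)
{
    demo_coll_arrive(coll);
}

/* Progress function where nobody else ever arrives. */
void demo_progress_nothing(demo_coll_t *coll)
{
    (void)coll;
}
```

With demo_progress_nothing, demo_wait() returns false, which is the bounded analogue of the hang. In this model the useful question is which participant never checked in, rather than what the waiting side is doing.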
>
> I do not understand what the barrier here is actually waiting for. Where
> do I need to look to find out what it is waiting on?
>
> I also tried initializing the collective id's in
> orte/mca/plm/base/plm_base_launch_support.c, but that code is never
> used when running the orte-checkpoint tool.
>
>  Adrian
>
> On Sat, Jan 11, 2014 at 11:46:42AM -0800, Ralph Castain wrote:
>
> I took a look at this, and I'm afraid you have some work to do in the
> orte/mca/snapc code base:
>
> 1. you must use dynamically allocated buffers for rml.send_buffer_nb. See
> r30261 for an example of the changes that need to be made - I did some, but
> can't swear to catching them all. It was enough to at least get a proc past
> the initial snapc registration
>
> 2. you are reusing collective id's to execute several orte_grpcomm.barrier
> calls - those ids are used elsewhere during MPI_Init. This is not allowed -
> a collective id can only be used *once*. What you need to do is go into
> orte/mca/plm/base/plm_base_launch_support.c and (when cr is configured) add
> cr-specific collective id's for this purpose. I don't know how many places
> in the cr code create their own barriers, but they each need a collective
> id.
>
> If you prefer and have the time, you are welcome to extend the collective
> code to allow id reuse. This would require that each daemon and app "reset"
> the collective fields when a collective is declared complete. It isn't that
> hard to do - just never had a reason to do it. I can take a shot at it when
> time permits (may have some time this weekend)
>
> 3. when you post the non-blocking recv in the snapc/full code, it looks to
> me like you need to block until you get the answer. I don't know where in
> the code flow this is occurring - if you are not in an event, then it is
> okay to block using ORTE_WAIT_FOR_COMPLETION. Look in
> orte/mca/routed/base/routed_base_fns.c starting at line 252 for an example.
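
The one-use rule in point 2, and the "reset" extension mentioned after it, can be sketched as a small state table. This is only an illustration under stated assumptions: the demo_ names, the fixed table size, and the return codes are all hypothetical and do not come from the ORTE code base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_COLL_IDS 8   /* arbitrary size for the sketch */

typedef enum {
    DEMO_ID_UNUSED,   /* id has never been used (or has been reset) */
    DEMO_ID_ACTIVE,   /* a barrier using this id is in flight */
    DEMO_ID_SPENT     /* barrier completed; id may not be used again */
} demo_id_state_t;

typedef struct {
    demo_id_state_t state[DEMO_MAX_COLL_IDS];
    bool reset_on_complete;   /* models the proposed id-reuse extension */
} demo_id_table_t;

void demo_id_table_init(demo_id_table_t *t, bool reset_on_complete)
{
    assert(t != NULL);
    for (int i = 0; i < DEMO_MAX_COLL_IDS; i++) {
        t->state[i] = DEMO_ID_UNUSED;
    }
    t->reset_on_complete = reset_on_complete;
}

static bool demo_id_valid(int id)
{
    return id >= 0 && id < DEMO_MAX_COLL_IDS;
}

/* Start a barrier on id. Returns 0 on success, -1 if the id is
 * out of range, already active, or already spent. */
int demo_barrier_start(demo_id_table_t *t, int id)
{
    if (!demo_id_valid(id) || t->state[id] != DEMO_ID_UNUSED) {
        return -1;
    }
    t->state[id] = DEMO_ID_ACTIVE;
    return 0;
}

/* Declare the barrier on id complete. Without the reset extension
 * the id is spent for good; with it, the id becomes usable again. */
int demo_barrier_complete(demo_id_table_t *t, int id)
{
    if (!demo_id_valid(id) || t->state[id] != DEMO_ID_ACTIVE) {
        return -1;
    }
    t->state[id] = t->reset_on_complete ? DEMO_ID_UNUSED : DEMO_ID_SPENT;
    return 0;
}
```

With reset_on_complete set to false, a second demo_barrier_start() on the same id fails, so each barrier in the cr code would need its own id. With it set to true, the same id can be started again once the previous barrier has completed.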
>
> HTH
> Ralph
>
> On Jan 10, 2014, at 12:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>
> On Jan 10, 2014, at 12:45 PM, Adrian Reber <adr...@lisas.de> wrote:
>
> On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
>
>
> On Jan 10, 2014, at 8:02 AM, Adrian Reber <adr...@lisas.de> wrote:
>
> I am currently trying to understand how callbacks are working. Right now
> I am looking at orte/mca/rml/base/rml_base_receive.c
> orte_rml_base_comm_start() which does
>
>  orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
>                          ORTE_RML_TAG_RML_INFO_UPDATE,
>                          ORTE_RML_PERSISTENT,
>                          orte_rml_base_recv,
>                          NULL);
>
> As far as I understand it orte_rml_base_recv() is the callback function.
> At which point should this function run? When the data is actually
> received?
>
>
> Not precisely. When data is received by the OOB, it pushes the data into
> an event. When that event gets serviced, it calls the orte_rml_base_receive
> function which processes the data to find the matching tag, and then uses
> that to execute the callback to the user code.
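
The receive path described above (data is queued as an event, and the callback runs only when that event is serviced and the tag matches) can be modeled in a few lines. This is a simplified sketch: the rx_ names are hypothetical, there is no real networking or libevent involved, and unmatched messages are simply dropped, which is an assumption of the sketch rather than a statement about ORTE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RX_MAX_POSTED 4
#define RX_MAX_QUEUED 8

typedef void (*rx_callback_fn)(int sender, int tag,
                               const char *payload, void *cbdata);

/* A posted (non-blocking) receive: tag to match plus user callback. */
typedef struct {
    bool in_use;
    int tag;
    bool persistent;   /* stays posted after a match */
    rx_callback_fn cb;
    void *cbdata;
} rx_posted_t;

/* A message that has arrived but has not been serviced yet. */
typedef struct {
    int sender;
    int tag;
    const char *payload;
} rx_msg_t;

typedef struct {
    rx_posted_t posted[RX_MAX_POSTED];
    rx_msg_t queue[RX_MAX_QUEUED];
    size_t queued;
} rx_state_t;

void rx_init(rx_state_t *s)
{
    memset(s, 0, sizeof(*s));
}

/* Post a receive; returns immediately without running the callback. */
int rx_post_recv(rx_state_t *s, int tag, bool persistent,
                 rx_callback_fn cb, void *cbdata)
{
    assert(cb != NULL);
    for (size_t i = 0; i < RX_MAX_POSTED; i++) {
        if (!s->posted[i].in_use) {
            s->posted[i] = (rx_posted_t){ true, tag, persistent, cb, cbdata };
            return 0;
        }
    }
    return -1;
}

/* Data arrives: it is only queued here, no callback runs yet. */
int rx_message_arrived(rx_state_t *s, int sender, int tag,
                       const char *payload)
{
    if (s->queued == RX_MAX_QUEUED) {
        return -1;
    }
    s->queue[s->queued++] = (rx_msg_t){ sender, tag, payload };
    return 0;
}

/* Service queued events: match each message by tag and run the
 * callback. Returns the number of callbacks executed. */
int rx_service_events(rx_state_t *s)
{
    int delivered = 0;
    for (size_t m = 0; m < s->queued; m++) {
        const rx_msg_t *msg = &s->queue[m];
        for (size_t i = 0; i < RX_MAX_POSTED; i++) {
            rx_posted_t *p = &s->posted[i];
            if (p->in_use && p->tag == msg->tag) {
                if (!p->persistent) {
                    p->in_use = false;
                }
                p->cb(msg->sender, msg->tag, msg->payload, p->cbdata);
                delivered++;
                break;
            }
        }
    }
    s->queued = 0;   /* unmatched messages are dropped in this sketch */
    return delivered;
}

/* Example callback: counts how many times it was invoked. */
void rx_count_cb(int sender, int tag, const char *payload, void *cbdata)
{
    (void)sender;
    (void)tag;
    (void)payload;
    (*(int *)cbdata)++;
}
```

The point of the model is that rx_message_arrived() alone never triggers the callback. Something has to call rx_service_events(), which is the role the event loop plays in the description above.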
>
>
> The same goes for the send_buffer_nb() functions. I do not see the callback
> functions actually running. How can I verify that the callback functions
> are running? Especially for the send case, it sounds pretty obvious how
> it should work, but I never see the callback function running, at least
> in my setup.
>
>
> The data is not immediately sent. It gets pushed into an event. When that
> event gets serviced, it calls the orte_oob_base_send function which then
> passes the data to each active OOB component until one of them says it can
> send it. The data is then pushed into another event to get it into the
> event base for that component's active module - when that event gets
> serviced, the data is sent. Once the data is sent, an event is created
> that, when serviced, executes the callback to the user code.
>
> If you aren't seeing callbacks, the most likely cause is that the orte
> progress thread isn't running. Without it, none of this will work.
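
The chain of events on the send side can be sketched the same way. The model below is hypothetical throughout (the tx_ names, the three stages, and the status codes are illustrative assumptions, not ORTE identifiers): a send request is queued, one serviced event picks a component that can send, the next "transmits", and a final event runs the user callback. If nothing services the queue, the callback never runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TX_MAX_EVENTS 8
#define TX_NUM_COMPONENTS 2

typedef enum { TX_SELECT, TX_TRANSMIT, TX_CALLBACK } tx_stage_t;
typedef void (*tx_callback_fn)(int status, void *cbdata);

typedef struct {
    tx_stage_t stage;
    int status;          /* 0 on success, -1 if no component could send */
    tx_callback_fn cb;
    void *cbdata;
} tx_event_t;

typedef struct {
    tx_event_t events[TX_MAX_EVENTS];   /* FIFO ring buffer */
    size_t head;
    size_t count;
    bool component_can_send[TX_NUM_COMPONENTS];
} tx_loop_t;

void tx_init(tx_loop_t *loop)
{
    memset(loop, 0, sizeof(*loop));
}

static int tx_push(tx_loop_t *loop, tx_event_t ev)
{
    if (loop->count == TX_MAX_EVENTS) {
        return -1;
    }
    loop->events[(loop->head + loop->count) % TX_MAX_EVENTS] = ev;
    loop->count++;
    return 0;
}

/* Non-blocking send: only queues the request and returns. */
int tx_send_nb(tx_loop_t *loop, tx_callback_fn cb, void *cbdata)
{
    assert(cb != NULL);
    tx_event_t ev = { TX_SELECT, 0, cb, cbdata };
    return tx_push(loop, ev);
}

/* Service exactly one queued event. Returns false if the queue was
 * empty. This is the work a progress thread would do in a loop. */
bool tx_service_one(tx_loop_t *loop)
{
    if (loop->count == 0) {
        return false;
    }
    tx_event_t ev = loop->events[loop->head];
    loop->head = (loop->head + 1) % TX_MAX_EVENTS;
    loop->count--;

    switch (ev.stage) {
    case TX_SELECT: {
        /* Offer the data to each component until one accepts it. */
        bool found = false;
        for (int i = 0; i < TX_NUM_COMPONENTS; i++) {
            if (loop->component_can_send[i]) {
                found = true;
                break;
            }
        }
        ev.stage = found ? TX_TRANSMIT : TX_CALLBACK;
        ev.status = found ? 0 : -1;
        (void)tx_push(loop, ev);   /* a slot was just freed above */
        break;
    }
    case TX_TRANSMIT:
        /* The actual send would happen here; then queue the callback. */
        ev.stage = TX_CALLBACK;
        (void)tx_push(loop, ev);
        break;
    case TX_CALLBACK:
        ev.cb(ev.status, ev.cbdata);
        break;
    }
    return true;
}

/* Example callback: records the status it was called with. */
void tx_record_status_cb(int status, void *cbdata)
{
    *(int *)cbdata = status;
}
```

In this model the callback fires only on the third call to tx_service_one() after tx_send_nb(). A tool that queues a send but never has its events serviced would therefore never see its callback, which matches the symptom of a progress thread that is not running.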
>
>
> Thanks. Running configure without '--with-ft=cr' I can run a program and
> use orte-top. In orterun I can see that the callback is running and
> orte-top displays the retrieved information. I can also see in orte-top
> that the callbacks are working.
>
>
> Actually, I'm rather impressed - I hadn't tested orte-top and didn't
> honestly know if it would work any more! Glad to hear it does :-)
>
> Doing the same with '--with-ft=cr' enabled, orte-top crashes, as does
> orte-checkpoint. Both (-top and -checkpoint) seem to no longer have
> working callbacks, which is probably why they are crashing. So some code
> that is enabled by '--with-ft=cr' seems to break callbacks in orte-top as
> well as in orte-checkpoint. orterun handles callbacks regardless of
> whether it is configured with or without '--with-ft=cr'.
>
>
> I can take a look this weekend - probably something silly
>
>
>  Adrian
>
> <grpcomm.txt>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey
