orte-checkpoint before communicating with orterun which runs the
processes I am trying to checkpoint. The full backtrace:

#0  0x00007ffff69befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007ffff7b45712 in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:208
#2  0x00007ffff7b3a5ce in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
#3  0x00007ffff7b375de in orte_snapc_base_select (seed=false, app=true) at ../../../../orte/mca/snapc/base/snapc_base_select.c:96
#4  0x00007ffff7a9884a in orte_ess_base_tool_setup () at ../../../../orte/mca/ess/base/ess_base_std_tool.c:192
#5  0x00007ffff7a9fe85 in rte_init () at ../../../../../orte/mca/ess/tool/ess_tool_module.c:83
#6  0x00007ffff7a4647f in orte_init (pargc=0x7fffffffd94c, pargv=0x7fffffffd940, flags=8) at ../../orte/runtime/orte_init.c:158
#7  0x0000000000402859 in ckpt_init (argc=51, argv=0x7fffffffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:610
#8  0x0000000000401d7a in main (argc=51, argv=0x7fffffffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:245
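
For readers following the backtrace: frame #0 (nanosleep) under app_coord_init fits a loop that polls a flag and sleeps briefly between checks, which is how the hang at ORTE_WAIT_FOR_COMPLETION(coll->active) is described further down the thread. The sketch below is not the Open MPI macro. It is a bounded variant, with hypothetical names (sketch_coll_t, sketch_wait_bounded, sketch_mark_complete) and an assumed 100 microsecond sleep, that shows why such a loop never returns unless something else clears the flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the collective tracked by the barrier;
 * only the flag that the wait loop polls is modeled here. */
typedef struct {
    volatile bool active;
} sketch_coll_t;

/* Poll coll->active with a short sleep between checks, but give up after
 * max_polls iterations instead of looping forever.
 * Returns true if the flag was cleared, false if the wait timed out. */
static bool sketch_wait_bounded(sketch_coll_t *coll, int max_polls)
{
    struct timespec quiet = { 0, 100000 };  /* assumed: 100 microseconds */

    assert(coll != NULL);
    for (int i = 0; i < max_polls; i++) {
        if (!coll->active) {
            return true;
        }
        nanosleep(&quiet, NULL);
    }
    return !coll->active;
}

/* Hypothetical completion handler. In a real runtime, some other piece of
 * code (for example a message handler run from the event loop) would have
 * to clear the flag; the waiting loop never does so itself. */
static void sketch_mark_complete(sketch_coll_t *coll)
{
    coll->active = false;
}
```

With a collective whose flag starts out true, sketch_wait_bounded() times out and returns false. After sketch_mark_complete() has been called on it, the same wait returns true at once. A bounded wait like this can help during debugging because a timeout can be logged instead of leaving the tool silently stuck.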


On Mon, Jan 20, 2014 at 02:46:04PM -0800, Ralph Castain wrote:
> Is it orte-checkpoint that is hanging, or the app you are trying to 
> checkpoint?
> 
> 
> On Jan 20, 2014, at 2:10 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > Thanks for your help. I tried initializing the barrier correctly (see
> > attached patch) but now, instead of crashing, it just hangs on the
> > barrier while running orte-checkpoint
> > 
> > [dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
> > [dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206
> > 
> > #0  0x00007ffff69befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
> > #1  0x00007ffff7b456ba in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:207
> > #2  0x00007ffff7b3a582 in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
> > 
> > It hangs, looping at ORTE_WAIT_FOR_COMPLETION(coll->active);
> > 
> > I do not understand what the barrier here is actually waiting for. Where
> > do I need to look to find what it is waiting on?
> > 
> > I also tried initializing the collective id's in
> > orte/mca/plm/base/plm_base_launch_support.c but that code is never
> > used when running the orte-checkpoint tool.
> > 
> >             Adrian
> > 
> > On Sat, Jan 11, 2014 at 11:46:42AM -0800, Ralph Castain wrote:
> >> I took a look at this, and I'm afraid you have some work to do in the 
> >> orte/mca/snapc code base:
> >> 
> >> 1. you must use dynamically allocated buffers for rml.send_buffer_nb. See 
> >> r30261 for an example of the changes that need to be made - I did some, 
> >> but can't swear to catching them all. It was enough to at least get a proc 
> >> past the initial snapc registration
> >> 
> >> 2. you are reusing collective id's to execute several orte_grpcomm.barrier 
> >> calls - those ids are used elsewhere during MPI_Init. This is not allowed 
> >> - a collective id can only be used *once*. What you need to do is go into 
> >> orte/mca/plm/base/plm_base_launch_support.c and (when cr is configured) 
> >> add cr-specific collective id's for this purpose. I don't know how many 
> >> places in the cr code create their own barriers, but they each need a 
> >> collective id.
> >> 
> >> If you prefer and have the time, you are welcome to extend the collective 
> >> code to allow id reuse. This would require that each daemon and app 
> >> "reset" the collective fields when a collective is declared complete. It 
> >> isn't that hard to do - just never had a reason to do it. I can take a 
> >> shot at it when time permits (may have some time this weekend)
> >> 
> >> 3. when you post the non-blocking recv in the snapc/full code, it looks to 
> >> me like you need to block until you get the answer. I don't know where in 
> >> the code flow this is occurring - if you are not in an event, then it is 
> >> okay to block using ORTE_WAIT_FOR_COMPLETION. Look in 
> >> orte/mca/routed/base/routed_base_fns.c starting at line 252 for an example.
> >> 
> >> HTH
> >> Ralph
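
As an illustration of point 2 above (a collective id can only be used once, unless the collective fields are reset when the collective is declared complete), here is a small sketch. It is not ORTE code: the table, the state names and every sketch_* function are hypothetical, and the real collective bookkeeping is certainly more involved.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_COLLS 8  /* hypothetical number of collective ids */

typedef enum { SKETCH_UNUSED = 0, SKETCH_ACTIVE, SKETCH_DONE } sketch_state_t;

/* Hypothetical per-process bookkeeping for collectives, indexed by id. */
typedef struct {
    sketch_state_t state[SKETCH_MAX_COLLS];
    int arrived[SKETCH_MAX_COLLS];  /* participants seen so far */
} sketch_coll_table_t;

static void sketch_table_init(sketch_coll_table_t *t)
{
    memset(t, 0, sizeof(*t));  /* every id starts as SKETCH_UNUSED */
}

static bool sketch_id_valid(int id)
{
    return id >= 0 && id < SKETCH_MAX_COLLS;
}

/* Start a collective on the given id. Fails if the id was already used,
 * which models the "only once" rule. */
static bool sketch_coll_start(sketch_coll_table_t *t, int id)
{
    if (!sketch_id_valid(id) || t->state[id] != SKETCH_UNUSED) {
        return false;
    }
    t->state[id] = SKETCH_ACTIVE;
    t->arrived[id] = 0;
    return true;
}

/* Declare a collective complete. With reset == true the fields are cleared
 * so the id can be started again; otherwise the id stays consumed. */
static bool sketch_coll_complete(sketch_coll_table_t *t, int id, bool reset)
{
    if (!sketch_id_valid(id) || t->state[id] != SKETCH_ACTIVE) {
        return false;
    }
    t->arrived[id] = 0;
    t->state[id] = reset ? SKETCH_UNUSED : SKETCH_DONE;
    return true;
}
```

Without the reset, a second sketch_coll_start() on the same id is rejected, which corresponds to needing a separate cr-specific id for every barrier. With the reset, the same id can be started again after completion, which corresponds to the id-reuse extension mentioned above.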
> >> 
> >> On Jan 10, 2014, at 12:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >> 
> >>> 
> >>> On Jan 10, 2014, at 12:45 PM, Adrian Reber <adr...@lisas.de> wrote:
> >>> 
> >>>> On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
> >>>>> 
> >>>>> On Jan 10, 2014, at 8:02 AM, Adrian Reber <adr...@lisas.de> wrote:
> >>>>> 
> > >>>>>> I am currently trying to understand how callbacks work. Right now
> > >>>>>> I am looking at orte_rml_base_comm_start() in
> > >>>>>> orte/mca/rml/base/rml_base_receive.c, which does
> >>>>>> 
> >>>>>>  orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
> >>>>>>                          ORTE_RML_TAG_RML_INFO_UPDATE,
> >>>>>>                          ORTE_RML_PERSISTENT,
> >>>>>>                          orte_rml_base_recv,
> >>>>>>                          NULL);
> >>>>>> 
> > >>>>>> As far as I understand it, orte_rml_base_recv() is the callback function.
> > >>>>>> At which point should this function run? When the data is actually
> > >>>>>> received?
> >>>>> 
> >>>>> Not precisely. When data is received by the OOB, it pushes the data 
> >>>>> into an event. When that event gets serviced, it calls the 
> >>>>> orte_rml_base_receive function which processes the data to find the 
> >>>>> matching tag, and then uses that to execute the callback to the user 
> >>>>> code.
> >>>>> 
> >>>>>> 
> > >>>>>> The same goes for the send_buffer_nb() functions. I do not see the callback
> > >>>>>> functions actually running. How can I verify that the callback functions
> > >>>>>> are running? Especially for the send case it sounds pretty obvious how
> > >>>>>> it should work, but I never see the callback function running, at least
> > >>>>>> in my setup.
> >>>>> 
> >>>>> The data is not immediately sent. It gets pushed into an event. When 
> >>>>> that event gets serviced, it calls the orte_oob_base_send function 
> >>>>> which then passes the data to each active OOB component until one of 
> >>>>> them says it can send it. The data is then pushed into another event to 
> >>>>> get it into the event base for that component's active module - when 
> >>>>> that event gets serviced, the data is sent. Once the data is sent, an 
> >>>>> event is created that, when serviced, executes the callback to the user 
> >>>>> code.
> >>>>> 
> >>>>> If you aren't seeing callbacks, the most likely cause is that the orte 
> >>>>> progress thread isn't running. Without it, none of this will work.
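
The flow described above (incoming data is pushed into an event, and the user callback only runs once that event is serviced) can be pictured with the following single-threaded sketch. It is not the RML/OOB implementation: all sketch_* names, the fixed-size queue and the persistent tag matching are assumptions made for illustration. The explicit sketch_progress() call stands in for the progress thread servicing the event base.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_QUEUE_LEN 16
#define SKETCH_MAX_RECVS 4

typedef void (*sketch_cb_t)(int tag, const char *data, void *cbdata);

typedef struct { int tag; const char *data; } sketch_event_t;
typedef struct { int tag; sketch_cb_t cb; void *cbdata; bool used; } sketch_recv_t;

/* Hypothetical messaging layer: a ring of pending events plus the posted
 * (persistent) receives. Zero-initialize before use. */
typedef struct {
    sketch_event_t queue[SKETCH_QUEUE_LEN];
    size_t head, count;
    sketch_recv_t recvs[SKETCH_MAX_RECVS];
} sketch_rml_t;

/* Post a persistent non-blocking receive for a tag. */
static bool sketch_post_recv(sketch_rml_t *rml, int tag, sketch_cb_t cb, void *cbdata)
{
    for (size_t i = 0; i < SKETCH_MAX_RECVS; i++) {
        if (!rml->recvs[i].used) {
            rml->recvs[i] = (sketch_recv_t){ tag, cb, cbdata, true };
            return true;
        }
    }
    return false;
}

/* Data arrives: it is only queued as an event, no callback runs here. */
static bool sketch_deliver(sketch_rml_t *rml, int tag, const char *data)
{
    if (rml->count == SKETCH_QUEUE_LEN) {
        return false;
    }
    rml->queue[(rml->head + rml->count) % SKETCH_QUEUE_LEN] = (sketch_event_t){ tag, data };
    rml->count++;
    return true;
}

/* Service all queued events: match each one against the posted receives by
 * tag and execute the callbacks. Returns the number of callbacks executed. */
static int sketch_progress(sketch_rml_t *rml)
{
    int fired = 0;
    while (rml->count > 0) {
        sketch_event_t ev = rml->queue[rml->head];
        rml->head = (rml->head + 1) % SKETCH_QUEUE_LEN;
        rml->count--;
        for (size_t i = 0; i < SKETCH_MAX_RECVS; i++) {
            if (rml->recvs[i].used && rml->recvs[i].tag == ev.tag) {
                rml->recvs[i].cb(ev.tag, ev.data, rml->recvs[i].cbdata);
                fired++;
            }
        }
    }
    return fired;
}

/* Example callback: counts invocations through an int passed as cbdata. */
static void sketch_count_cb(int tag, const char *data, void *cbdata)
{
    (void)tag;
    (void)data;
    (*(int *)cbdata)++;
}
```

Posting a receive with sketch_count_cb and then calling sketch_deliver() leaves the counter at zero. It only increases once sketch_progress() runs. If nothing ever calls sketch_progress(), the callback never fires, which is the symptom described above when the progress thread is not running.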
> >>>> 
> >>>> Thanks. Running configure without '--with-ft=cr' I can run a program and
> >>>> use orte-top. In orterun I can see that the callback is running and
> >>>> orte-top displays the retrieved information. I can also see in orte-top
> >>>> that the callbacks are working.
> >>> 
> >>> Actually, I'm rather impressed - I hadn't tested orte-top and didn't 
> >>> honestly know if it would work any more! Glad to hear it does :-)
> >>> 
> > >>>> Doing the same with '--with-ft=cr'
> > >>>> enabled, orte-top crashes, as does orte-checkpoint, and both (-top and
> > >>>> -checkpoint) seem to no longer have working callbacks, which is probably
> > >>>> why they are crashing. So some code that is enabled by '--with-ft=cr'
> > >>>> seems to break callbacks in orte-top as well as in orte-checkpoint.
> > >>>> orterun handles callbacks whether or not it is configured with
> > >>>> '--with-ft=cr'.
> >>> 
> >>> I can take a look this weekend - probably something silly
> >>> 
> >>>> 
> >>>>          Adrian
> > <grpcomm.txt>
