Thanks for your help. I tried initializing the barrier correctly (see
attached patch) but now, instead of crashing, it just hangs on the
barrier while running orte-checkpoint:

[dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
[dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206

#0  0x00007ffff69befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007ffff7b456ba in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:207
#2  0x00007ffff7b3a582 in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207

It hangs, looping in ORTE_WAIT_FOR_COMPLETION(coll->active).

I do not understand what the barrier is actually waiting for here. Where
do I need to look to find whatever it is waiting on?

I also tried initializing the collective ids in
orte/mca/plm/base/plm_base_launch_support.c, but that code is never
executed when running the orte-checkpoint tool.

                Adrian

On Sat, Jan 11, 2014 at 11:46:42AM -0800, Ralph Castain wrote:
> I took a look at this, and I'm afraid you have some work to do in the 
> orte/mca/snapc code base:
> 
> 1. you must use dynamically allocated buffers for rml.send_buffer_nb. See 
> r30261 for an example of the changes that need to be made - I did some, but 
> can't swear to catching them all. It was enough to at least get a proc past 
> the initial snapc registration
> 
> 2. you are reusing collective id's to execute several orte_grpcomm.barrier 
> calls - those ids are used elsewhere during MPI_Init. This is not allowed - a 
> collective id can only be used *once*. What you need to do is go into 
> orte/mca/plm/base/plm_base_launch_support.c and (when cr is configured) add 
> cr-specific collective id's for this purpose. I don't know how many places in 
> the cr code create their own barriers, but they each need a collective id.
> 
> If you prefer and have the time, you are welcome to extend the collective 
> code to allow id reuse. This would require that each daemon and app "reset" 
> the collective fields when a collective is declared complete. It isn't that 
> hard to do - just never had a reason to do it. I can take a shot at it when 
> time permits (may have some time this weekend)
> 
> 3. when you post the non-blocking recv in the snapc/full code, it looks to me 
> like you need to block until you get the answer. I don't know where in the 
> code flow this is occurring - if you are not in an event, then it is okay to 
> block using ORTE_WAIT_FOR_COMPLETION. Look in 
> orte/mca/routed/base/routed_base_fns.c starting at line 252 for an example.
> 
> HTH
> Ralph
> 
> On Jan 10, 2014, at 12:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> > 
> > On Jan 10, 2014, at 12:45 PM, Adrian Reber <adr...@lisas.de> wrote:
> > 
> >> On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
> >>> 
> >>> On Jan 10, 2014, at 8:02 AM, Adrian Reber <adr...@lisas.de> wrote:
> >>> 
> >>>> I am currently trying to understand how callbacks are working. Right now
> >>>> I am looking at orte/mca/rml/base/rml_base_receive.c
> >>>> orte_rml_base_comm_start() which does 
> >>>> 
> >>>>   orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
> >>>>                           ORTE_RML_TAG_RML_INFO_UPDATE,
> >>>>                           ORTE_RML_PERSISTENT,
> >>>>                           orte_rml_base_recv,
> >>>>                           NULL);
> >>>> 
> >>>> As far as I understand it orte_rml_base_recv() is the callback function.
> >>>> At which point should this function run? When the data is actually
> >>>> received?
> >>> 
> >>> Not precisely. When data is received by the OOB, it pushes the data into 
> >>> an event. When that event gets serviced, it calls the 
> >>> orte_rml_base_receive function which processes the data to find the 
> >>> matching tag, and then uses that to execute the callback to the user code.
> >>> 
> >>>> 
> >>>> The same for send_buffer_nb() functions. I do not see the callback
> >>>> functions actually running. How can I verify that the callback functions
> >>>> are running? Especially for the send case it sounds pretty obvious how
> >>>> it should work but I never see the callback function running. At least
> >>>> in my setup.
> >>> 
> >>> The data is not immediately sent. It gets pushed into an event. When that 
> >>> event gets serviced, it calls the orte_oob_base_send function which then 
> >>> passes the data to each active OOB component until one of them says it 
> >>> can send it. The data is then pushed into another event to get it into 
> >>> the event base for that component's active module - when that event gets 
> >>> serviced, the data is sent. Once the data is sent, an event is created 
> >>> that, when serviced, executes the callback to the user code.
> >>> 
> >>> If you aren't seeing callbacks, the most likely cause is that the orte 
> >>> progress thread isn't running. Without it, none of this will work.
> >> 
> >> Thanks. Running configure without '--with-ft=cr' I can run a program and
> >> use orte-top. In orterun I can see that the callback is running and
> >> orte-top displays the retrieved information. I can also see in orte-top
> >> that the callbacks are working.
> > 
> > Actually, I'm rather impressed - I hadn't tested orte-top and didn't 
> > honestly know if it would work any more! Glad to hear it does :-)
> > 
> >> Doing the same with '--with-ft=cr'
> >> enabled orte-top crashes as well as orte-checkpoint and both (-top and
> >> -checkpoint) seem to no longer have working callbacks and that is why
> >> they are probably crashing. So some code which is enabled by '--with-ft=cr'
> >> seems to break callbacks in orte-top as well as in orte-checkpoint.
> >> orterun handles callbacks no matter if configured with or without
> >> '--with-ft=cr'.
> > 
> > I can take a look this weekend - probably something silly
> > 
> >> 
> >>            Adrian
diff --git a/orte/mca/ess/base/ess_base_std_tool.c b/orte/mca/ess/base/ess_base_std_tool.c
index 357ea60..98c1685 100644
--- a/orte/mca/ess/base/ess_base_std_tool.c
+++ b/orte/mca/ess/base/ess_base_std_tool.c
@@ -42,6 +42,7 @@
 #include "orte/mca/iof/base/base.h"
 #include "orte/mca/state/base/base.h"
 #if OPAL_ENABLE_FT_CR == 1
+#include "orte/mca/grpcomm/base/base.h"
 #include "orte/mca/snapc/base/base.h"
 #endif
 #include "orte/util/proc_info.h"
@@ -168,6 +169,19 @@ int orte_ess_base_tool_setup(void)

 #if OPAL_ENABLE_FT_CR == 1
     /*
+     * Group communications
+     */
+    if (ORTE_SUCCESS != (ret = mca_base_framework_open(&orte_grpcomm_base_framework, 0))) {
+        ORTE_ERROR_LOG(ret);
+        error = "orte_grpcomm_base_open";
+        goto error;
+    }
+    if (ORTE_SUCCESS != (ret = orte_grpcomm_base_select())) {
+        ORTE_ERROR_LOG(ret);
+        error = "orte_grpcomm_base_select";
+        goto error;
+    }
+    /*
      * Setup the SnapC
      */
     if (ORTE_SUCCESS != (ret = mca_base_framework_open(&orte_snapc_base_framework, 0))) {
@@ -200,7 +214,8 @@ int orte_ess_base_tool_finalize(void)
     orte_wait_finalize();

 #if OPAL_ENABLE_FT_CR == 1
-    mca_base_framework_close(&orte_snapc_base_framework);
+    (void) mca_base_framework_close(&orte_snapc_base_framework);
+    (void) mca_base_framework_close(&orte_grpcomm_base_framework);
 #endif

     /* if I am a tool, then all I will have done is
diff --git a/orte/mca/snapc/full/snapc_full_app.c b/orte/mca/snapc/full/snapc_full_app.c
index 0f0f147..a3df3c7 100644
--- a/orte/mca/snapc/full/snapc_full_app.c
+++ b/orte/mca/snapc/full/snapc_full_app.c
@@ -53,6 +53,7 @@
 #include "orte/mca/snapc/snapc.h"
 #include "orte/mca/snapc/base/base.h"
 #include "orte/mca/errmgr/errmgr.h"
+#include "orte/mca/grpcomm/base/base.h"
 #include "orte/mca/grpcomm/grpcomm.h"
 #include "orte/mca/rml/rml.h"
 #include "orte/mca/rml/rml_types.h"
@@ -84,6 +85,9 @@ static char *app_comm_pipe_w = NULL;
 static int   app_comm_pipe_r_fd = -1;
 static int   app_comm_pipe_w_fd = -1;

+static int snapc_init_barrier = -1;
+static int snapc_fini_barrier = -1;
+
 static opal_crs_base_snapshot_t *local_snapshot = NULL;

 static bool app_notif_processed = false;
@@ -109,6 +113,36 @@ static void snapc_full_app_callback_recv(int status,
  * Function Definitions
  ************************/

+static void init_barriers()
+{
+    orte_grpcomm_collective_t *coll;
+    orte_namelist_t *nm;
+
+    if (-1 == snapc_init_barrier) {
+        snapc_init_barrier = orte_grpcomm_base_get_coll_id();
+    } else {
+        return;
+    }
+
+    if (-1 == snapc_fini_barrier) {
+        snapc_fini_barrier = orte_grpcomm_base_get_coll_id();
+    } else {
+        return;
+    }
+
+    coll = orte_grpcomm_base_setup_collective(snapc_init_barrier);
+    nm = OBJ_NEW(orte_namelist_t);
+    nm->name.vpid = ORTE_PROC_MY_NAME->vpid;
+    nm->name.jobid = ORTE_PROC_MY_NAME->jobid;
+    opal_list_append(&coll->participants, &nm->super);
+
+    coll = orte_grpcomm_base_setup_collective(snapc_fini_barrier);
+    nm = OBJ_NEW(orte_namelist_t);
+    nm->name.vpid = ORTE_PROC_MY_NAME->vpid;
+    nm->name.jobid = ORTE_PROC_MY_NAME->jobid;
+    opal_list_append(&coll->participants, &nm->super);
+}
+
 int app_coord_init()
 {
     int ret, exit_status  = ORTE_SUCCESS;
@@ -160,8 +194,10 @@ int app_coord_init()
                              "app) Startup Barrier..."));
     }

+    init_barriers();
+
     coll = OBJ_NEW(orte_grpcomm_collective_t);
-    coll->id = orte_process_info.peer_init_barrier;
+    coll->id = snapc_init_barrier;
     if( ORTE_SUCCESS != (ret = orte_grpcomm.barrier(coll)) ) {
            ORTE_ERROR_LOG(ret);
         exit_status = ret;
@@ -236,7 +272,7 @@ int app_coord_finalize()
     }

     coll = OBJ_NEW(orte_grpcomm_collective_t);
-    coll->id = orte_process_info.peer_init_barrier;
+    coll->id = snapc_init_barrier;
     if( ORTE_SUCCESS != (ret = orte_grpcomm.barrier(coll)) ) {
         ORTE_ERROR_LOG(ret);
         exit_status = ret;
@@ -318,7 +354,7 @@ int app_coord_finalize()
                              "app) Shutdown Barrier: Waiting on barrier...!"));
     }

-    coll->id = orte_process_info.peer_fini_barrier;
+    coll->id = snapc_fini_barrier;
     if( ORTE_SUCCESS != (ret = orte_grpcomm.barrier(coll)) ) {
         ORTE_ERROR_LOG(ret);
         exit_status = ret;
