Hi.

I am having trouble understanding the logic of ompi_mpi_init(). Could you please tell me if you have any guesses about the following behavior?

If I understand correctly, there is a block in the ompi_mpi_init() function for exchanging process information between processes (call this block 'modex'):

        coll = OBJ_NEW(orte_grpcomm_collective_t);
        coll->id = orte_process_info.peer_modex;
        if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(coll))) {
            error = "orte_grpcomm_modex failed";
            goto error;
        }
        /* wait for modex to complete - this may be moved anywhere in mpi_init
         * so long as it occurs prior to calling a function that needs
         * the modex info!
         */
        while (coll->active) {
            opal_progress();  /* block in progress pending events */
        }
        OBJ_RELEASE(coll);

and several instructions after this there is a block for process synchronization (call this block 'barrier'):

        coll = OBJ_NEW(orte_grpcomm_collective_t);
        coll->id = orte_process_info.peer_init_barrier;
        if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier(coll))) {
            error = "orte_grpcomm_barrier failed";
            goto error;
        }
        /* wait for barrier to complete */
        while (coll->active) {
            opal_progress();  /* block in progress pending events */
        }
        OBJ_RELEASE(coll);

So, initially, ompi_mpi_init() has the following structure:

   ...
   'modex' block;
   ...
   'barrier' block;
   ...

I ran several experiments with this code, and the following one is of interest: if I add a sequence of two additional blocks, 'barrier' and 'modex', right after the 'modex' block, then ompi_mpi_init() hangs in opal_progress() of the last 'modex' block.

   ...
   'modex' block;
   'barrier' block;
   'modex' block; <- hangs
   ...
   'barrier' block;
   ...
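
To make the block order concrete, here is a minimal, self-contained sketch of the sequence I tried. Everything prefixed with mock_ or MOCK_ is a hypothetical stand-in and not the real ORTE/OPAL API: the stub progress function completes the pending collective immediately, so this version does not hang. It only illustrates the order of the blocks and the shape of the wait loop, not the actual behavior of orte_grpcomm.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for orte_grpcomm_collective_t: only the two
 * fields used in the blocks above are modelled. */
typedef struct {
    int  id;
    bool active;
} mock_collective_t;

/* Hypothetical stand-ins for orte_process_info.peer_modex and
 * orte_process_info.peer_init_barrier. */
enum { MOCK_PEER_MODEX = 1, MOCK_PEER_INIT_BARRIER = 2 };

/* The collective currently waiting for completion, if any. */
static mock_collective_t *mock_pending = NULL;

/* Stub for orte_grpcomm.modex()/barrier(): marks the collective active. */
static int mock_start(mock_collective_t *coll)
{
    coll->active = true;
    mock_pending = coll;
    return 0; /* plays the role of ORTE_SUCCESS */
}

/* Stub for opal_progress(): completes the pending collective at once.
 * ASSUMPTION: the real function only completes it when the matching
 * messages arrive, which is where the hang shows up. */
static void mock_progress(void)
{
    if (mock_pending != NULL) {
        mock_pending->active = false;
        mock_pending = NULL;
    }
}

/* One 'modex' or 'barrier' block: start the collective, then spin in
 * the progress loop until it is no longer active. */
static int mock_run_block(int id, int *spins)
{
    mock_collective_t coll = { id, false };
    if (0 != mock_start(&coll)) {
        return -1;
    }
    while (coll.active) {
        mock_progress();
        ++*spins;
    }
    return 0;
}

/* The modified sequence: modex, barrier, modex, ..., barrier.
 * Returns the total number of progress-loop iterations, or -1 on error. */
int mock_run_experiment(void)
{
    static const int order[] = {
        MOCK_PEER_MODEX,        /* original 'modex' block */
        MOCK_PEER_INIT_BARRIER, /* added 'barrier' block */
        MOCK_PEER_MODEX,        /* added 'modex' block (hangs in the real code) */
        MOCK_PEER_INIT_BARRIER  /* original 'barrier' block */
    };
    int spins = 0;
    for (size_t i = 0; i < sizeof(order) / sizeof(order[0]); ++i) {
        if (0 != mock_run_block(order[i], &spins)) {
            return -1;
        }
    }
    return spins;
}
```

Note that, as in my experiment, the added blocks reuse the same two collective ids as the original ones.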

Thanks,
Victor Kocheganov.
