Hi,

I have a question regarding MPI_Comm_spawn and proc flags.

What I understand about procs and spawns in ompi:
Processes are identified by the proc structure.
The proc structure stores proc_name and proc_flags (among many other things).
proc_flags encodes the locality of the process relative to the local process.
proc_name is a unique pair (jobid, vpid) that identifies an ompi process.

proc_name.jobid is the generation id of the process: in the spawn case, origin
processes and spawned processes have different jobids.
(I saw this in ompi 4.x; I hope it is still the case in ompi 5.x.)

In the btl/sm add_procs function
(https://github.com/open-mpi/ompi/blob/main/opal/mca/btl/sm/btl_sm_module.c#L266),
consider this part:

    for (int32_t proc = 0; proc < (int32_t) nprocs; ++proc) {
        /* check to see if this proc can be reached via shmem (i.e.,
           if they're on my local host and in my job) */
        if (procs[proc]->proc_name.jobid != my_proc->proc_name.jobid
            || !OPAL_PROC_ON_LOCAL_NODE(procs[proc]->proc_flags)) {
            peers[proc] = NULL;
            continue;
        }

        if (my_proc != procs[proc] && NULL != reachability) {
            /* add this proc to shared memory accessibility list */
            rc = opal_bitmap_set_bit(reachability, proc);
            if (OPAL_SUCCESS != rc) {
                return rc;
            }
        }

        /* setup endpoint */
        rc = init_sm_endpoint(peers + proc, procs[proc]);
        if (OPAL_SUCCESS != rc) {
            break;
        }
    }
This check prevents btl/sm from being selected between processes that are not
in the same spawn generation
(procs[proc]->proc_name.jobid != my_proc->proc_name.jobid).
A simple spawn test then fails with this error (single-node test):

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[58931,2],20]) is on host: pm0-nod48
  Process 2 ([[58931,1],0]) is on host: unknown!
  BTLs attempted: vader self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
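The failing test is essentially the textbook spawn pattern. A sketch (the
parent respawns its own binary; this is not our exact test code):

```c
/* Minimal MPI_Comm_spawn reproducer sketch: the parent spawns children on
 * the same node and synchronizes over the merged intercommunicator, which
 * forces parent<->child communication across the two jobids. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Parent job: spawn 2 copies of ourselves. */
        MPI_Comm inter;
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        MPI_Barrier(inter); /* fails if no BTL reaches the children */
        MPI_Comm_disconnect(&inter);
    } else {
        /* Child job: synchronize with the parent job. */
        MPI_Barrier(parent);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
```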

It also seems that the proc_flags are not valid:
OPAL_PROC_ON_LOCAL_NODE(procs[proc]->proc_flags) returns true even for a
process spawned on another node.

The ompi tested is based on 4.1.7 (plus some of our code), configured with
pmix-5.0.3 and hwloc=internal, and run with salloc ... mpirun ...

And now the questions:

Is this behavior intended?
Should I try to reproduce it with ompi 5 and open an issue?

Thanks,


Florent GERMAIN

Ingénieur de développement – BDS-R&D
2 rue de la Piquetterie – Bruyères le Chatel – France
eviden.com<https://eviden.com/>



