Andrej, I can reproduce this behavior ... when running outside of a slurm allocation.
What does $ env | grep ^SLURM_ report?

Cheers,

Gilles

On Tue, Feb 2, 2021 at 9:06 AM Andrej Prsa via devel <devel@lists.open-mpi.org> wrote:
>
> Hi Ralph, Gilles,
>
> > I fail to understand why you continue to think that PMI has anything to do
> > with this problem. I see no indication of a PMIx-related issue in anything
> > you have provided to date.
>
> Oh, I went off the traceback that yelled about pmix, and slurm not being
> able to find it until I patched the latest version; I'm an
> astrophysicist pretending to be a sys admin for our research cluster, so
> while I can hold my ground with C, Python and technical computing, I'm
> out of my depth when it comes to MPI, PMIx, slurm and all that good
> stuff. So I appreciate your patience. I am trying, though. :)
>
> > In the output below, it is clear what the problem is - you locked it to the
> > "slurm" launcher (with -mca plm slurm) and the "slurm" launcher was not
> > found. Try adding "--mca plm_base_verbose 10" to your cmd line and let's
> > see why that launcher wasn't accepted.
>
> andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
> [terra:168998] mca: base: components_register: registering framework plm components
> [terra:168998] mca: base: components_register: found loaded component slurm
> [terra:168998] mca: base: components_register: component slurm register function successful
> [terra:168998] mca: base: components_open: opening plm components
> [terra:168998] mca: base: components_open: found loaded component slurm
> [terra:168998] mca: base: components_open: component slurm open function successful
> [terra:168998] mca:base:select: Auto-selecting plm components
> [terra:168998] mca:base:select:( plm) Querying component [slurm]
> [terra:168998] mca:base:select:( plm) No component selected!
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_plm_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
>
> Gilles, I did try all the suggestions from the previous email, but that
> led me to think that slurm is the culprit, and now I'm back to openmpi.
>
> Cheers,
> Andrej
>
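For context, the "slurm" plm launcher only selects itself when mpirun is started inside a Slurm allocation, which it detects from the environment. A quick way to check which situation you are in is the grep Gilles suggests above; the sketch below wraps it in a small test (treating SLURM_JOBID as the telltale variable is an assumption based on typical Slurm behavior, not something stated in this thread):

```shell
#!/bin/sh
# Sketch: check whether we appear to be inside a Slurm allocation.
# salloc/srun export SLURM_* variables into the job environment;
# SLURM_JOBID is assumed here to be the relevant marker.
if env | grep -q '^SLURM_JOBID='; then
    echo "inside a slurm allocation"
else
    echo "outside a slurm allocation"
fi
```

Run from a plain login shell (outside salloc/srun) this prints the "outside" message, which matches the "No component selected!" behavior seen when plm is forced to slurm on the command line.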