Hi Ralph, Gilles,
> I fail to understand why you continue to think that PMI has anything to do with
> this problem. I see no indication of a PMIx-related issue in anything you have
> provided to date.
Oh, I went off the traceback that yelled about PMIx, and Slurm not being
able to find it until I patched the latest version. I'm an
astrophysicist pretending to be a sysadmin for our research cluster, so
while I can hold my ground with C, Python and technical computing, I'm
out of my depth when it comes to MPI, PMIx, Slurm and all that good
stuff. So I appreciate your patience. I am trying though. :)
> In the output below, it is clear what the problem is - you locked it to the
> "slurm" launcher (with -mca plm slurm) and the "slurm" launcher was not found.
> Try adding "--mca plm_base_verbose 10" to your cmd line and let's see why that
> launcher wasn't accepted.
andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
[terra:168998] mca: base: components_register: registering framework plm components
[terra:168998] mca: base: components_register: found loaded component slurm
[terra:168998] mca: base: components_register: component slurm register function successful
[terra:168998] mca: base: components_open: opening plm components
[terra:168998] mca: base: components_open: found loaded component slurm
[terra:168998] mca: base: components_open: component slurm open function successful
[terra:168998] mca:base:select: Auto-selecting plm components
[terra:168998] mca:base:select:( plm) Querying component [slurm]
[terra:168998] mca:base:select:( plm) No component selected!
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
Gilles, I did try all the suggestions from the previous email, but that
led me to think that Slurm is the culprit, and now I'm back to Open MPI.
Cheers,
Andrej