Hi Ralph, Gilles,

I fail to understand why you continue to think that PMI has anything to do with 
this problem. I see no indication of a PMIx-related issue in anything you have 
provided to date.

Oh, I went off the traceback that yelled about PMIx, and Slurm not being able to find it until I patched the latest version. I'm an astrophysicist pretending to be a sysadmin for our research cluster, so while I can hold my ground with C, Python and technical computing, I'm out of my depth when it comes to MPI, PMIx, Slurm and all that good stuff. So I appreciate your patience. I am trying though. :)

In the output below, it is clear what the problem is - you locked it to the "slurm" launcher
(with -mca plm slurm) and the "slurm" launcher was not found. Try adding
"--mca plm_base_verbose 10" to your cmd line and let's see why that launcher wasn't accepted.

andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
[terra:168998] mca: base: components_register: registering framework plm components
[terra:168998] mca: base: components_register: found loaded component slurm
[terra:168998] mca: base: components_register: component slurm register function successful
[terra:168998] mca: base: components_open: opening plm components
[terra:168998] mca: base: components_open: found loaded component slurm
[terra:168998] mca: base: components_open: component slurm open function successful
[terra:168998] mca:base:select: Auto-selecting plm components
[terra:168998] mca:base:select:(  plm) Querying component [slurm]
[terra:168998] mca:base:select:(  plm) No component selected!
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

Gilles, I did try all the suggestions from the previous email, but that led me to think that Slurm is the culprit, and now I'm back to Open MPI.

Cheers,
Andrej
