The saga continues.
I managed to build Slurm with PMIx support by first applying the patch
from this bug report and then building the plugin manually:
https://bugs.schedmd.com/show_bug.cgi?id=10683
Now srun shows pmix as an option:
andrej@terra:~/system/tests/MPI$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v4
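As an aside, the check above could be scripted. This is only a minimal
sketch, assuming the output format is exactly what is pasted above; the
helper name parse_mpi_types and the SAMPLE string are hypothetical and
not part of Slurm or Open MPI.

```python
import re
from typing import List

# Sample text copied from the `srun --mpi=list` output shown above.
SAMPLE = """\
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v4
"""


def parse_mpi_types(output: str) -> List[str]:
    """Return the plugin names from `srun --mpi=list` style output.

    Hypothetical helper: it keeps only lines of the form
    "srun: <single-token>", so the "MPI types are..." header is skipped.
    """
    types = []
    for line in output.splitlines():
        match = re.match(r"^srun:\s+(\S+)$", line.strip())
        if match:
            types.append(match.group(1))
    return types


if __name__ == "__main__":
    available = parse_mpi_types(SAMPLE)
    print("pmix available:", "pmix" in available)
```

On a real system the text would come from running srun itself, for
example through subprocess.run with capture_output=True; it is probably
safest to scan both stdout and stderr, since I have not verified which
stream srun writes this list to.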
But when I try to run mpirun with the slurm plm plugin, it still fails:
andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca pmix_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
[terra:149214] mca: base: components_register: registering framework ess components
[terra:149214] mca: base: components_register: found loaded component slurm
[terra:149214] mca: base: components_register: component slurm has no register or open function
[terra:149214] mca: base: components_register: found loaded component env
[terra:149214] mca: base: components_register: component env has no register or open function
[terra:149214] mca: base: components_register: found loaded component pmi
[terra:149214] mca: base: components_register: component pmi has no register or open function
[terra:149214] mca: base: components_register: found loaded component tool
[terra:149214] mca: base: components_register: component tool register function successful
[terra:149214] mca: base: components_register: found loaded component hnp
[terra:149214] mca: base: components_register: component hnp has no register or open function
[terra:149214] mca: base: components_register: found loaded component singleton
[terra:149214] mca: base: components_register: component singleton register function successful
[terra:149214] mca: base: components_open: opening ess components
[terra:149214] mca: base: components_open: found loaded component slurm
[terra:149214] mca: base: components_open: component slurm open function successful
[terra:149214] mca: base: components_open: found loaded component env
[terra:149214] mca: base: components_open: component env open function successful
[terra:149214] mca: base: components_open: found loaded component pmi
[terra:149214] mca: base: components_open: component pmi open function successful
[terra:149214] mca: base: components_open: found loaded component tool
[terra:149214] mca: base: components_open: component tool open function successful
[terra:149214] mca: base: components_open: found loaded component hnp
[terra:149214] mca: base: components_open: component hnp open function successful
[terra:149214] mca: base: components_open: found loaded component singleton
[terra:149214] mca: base: components_open: component singleton open function successful
[terra:149214] mca:base:select: Auto-selecting ess components
[terra:149214] mca:base:select:( ess) Querying component [slurm]
[terra:149214] mca:base:select:( ess) Querying component [env]
[terra:149214] mca:base:select:( ess) Querying component [pmi]
[terra:149214] mca:base:select:( ess) Querying component [tool]
[terra:149214] mca:base:select:( ess) Querying component [hnp]
[terra:149214] mca:base:select:( ess) Query of component [hnp] set priority to 100
[terra:149214] mca:base:select:( ess) Querying component [singleton]
[terra:149214] mca:base:select:( ess) Selected component [hnp]
[terra:149214] mca: base: close: component slurm closed
[terra:149214] mca: base: close: unloading component slurm
[terra:149214] mca: base: close: component env closed
[terra:149214] mca: base: close: unloading component env
[terra:149214] mca: base: close: component pmi closed
[terra:149214] mca: base: close: unloading component pmi
[terra:149214] mca: base: close: component tool closed
[terra:149214] mca: base: close: unloading component tool
[terra:149214] mca: base: close: component singleton closed
[terra:149214] mca: base: close: unloading component singleton
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_plm_base_select failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
I'm at my wits' end as to what to try next, and I'm all ears if anyone
has any leads or suggestions.
Thanks,
Andrej