The saga continues.

I managed to build Slurm with PMIx support by first applying the patch from the bug report below and then building the plugin manually:

https://bugs.schedmd.com/show_bug.cgi?id=10683

Now srun shows pmix as an option:

andrej@terra:~/system/tests/MPI$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v4

But when I try to run mpirun with the slurm plugin, it still fails:

andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca pmix_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
[terra:149214] mca: base: components_register: registering framework ess components
[terra:149214] mca: base: components_register: found loaded component slurm
[terra:149214] mca: base: components_register: component slurm has no register or open function
[terra:149214] mca: base: components_register: found loaded component env
[terra:149214] mca: base: components_register: component env has no register or open function
[terra:149214] mca: base: components_register: found loaded component pmi
[terra:149214] mca: base: components_register: component pmi has no register or open function
[terra:149214] mca: base: components_register: found loaded component tool
[terra:149214] mca: base: components_register: component tool register function successful
[terra:149214] mca: base: components_register: found loaded component hnp
[terra:149214] mca: base: components_register: component hnp has no register or open function
[terra:149214] mca: base: components_register: found loaded component singleton
[terra:149214] mca: base: components_register: component singleton register function successful
[terra:149214] mca: base: components_open: opening ess components
[terra:149214] mca: base: components_open: found loaded component slurm
[terra:149214] mca: base: components_open: component slurm open function successful
[terra:149214] mca: base: components_open: found loaded component env
[terra:149214] mca: base: components_open: component env open function successful
[terra:149214] mca: base: components_open: found loaded component pmi
[terra:149214] mca: base: components_open: component pmi open function successful
[terra:149214] mca: base: components_open: found loaded component tool
[terra:149214] mca: base: components_open: component tool open function successful
[terra:149214] mca: base: components_open: found loaded component hnp
[terra:149214] mca: base: components_open: component hnp open function successful
[terra:149214] mca: base: components_open: found loaded component singleton
[terra:149214] mca: base: components_open: component singleton open function successful
[terra:149214] mca:base:select: Auto-selecting ess components
[terra:149214] mca:base:select:(  ess) Querying component [slurm]
[terra:149214] mca:base:select:(  ess) Querying component [env]
[terra:149214] mca:base:select:(  ess) Querying component [pmi]
[terra:149214] mca:base:select:(  ess) Querying component [tool]
[terra:149214] mca:base:select:(  ess) Querying component [hnp]
[terra:149214] mca:base:select:(  ess) Query of component [hnp] set priority to 100
[terra:149214] mca:base:select:(  ess) Querying component [singleton]
[terra:149214] mca:base:select:(  ess) Selected component [hnp]
[terra:149214] mca: base: close: component slurm closed
[terra:149214] mca: base: close: unloading component slurm
[terra:149214] mca: base: close: component env closed
[terra:149214] mca: base: close: unloading component env
[terra:149214] mca: base: close: component pmi closed
[terra:149214] mca: base: close: unloading component pmi
[terra:149214] mca: base: close: component tool closed
[terra:149214] mca: base: close: unloading component tool
[terra:149214] mca: base: close: component singleton closed
[terra:149214] mca: base: close: unloading component singleton
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

I'm at my wits' end as to what to try next, and I'm all ears if anyone has any leads or suggestions.

Thanks,
Andrej
