Andrej,

My previous email listed other things to try.

Cheers,

Gilles

Sent from my iPod

> On Feb 2, 2021, at 6:23, Andrej Prsa via devel <devel@lists.open-mpi.org> wrote:
>
> The saga continues.
>
> I managed to build slurm with pmix by first patching slurm using this patch
> and manually building the plugin:
>
> https://bugs.schedmd.com/show_bug.cgi?id=10683
>
> Now srun shows pmix as an option:
>
> andrej@terra:~/system/tests/MPI$ srun --mpi=list
> srun: MPI types are...
> srun: cray_shasta
> srun: none
> srun: pmi2
> srun: pmix
> srun: pmix_v4
>
> But when I try to run mpirun with slurm plugin, it still fails:
>
> andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca pmix_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
> [terra:149214] mca: base: components_register: registering framework ess components
> [terra:149214] mca: base: components_register: found loaded component slurm
> [terra:149214] mca: base: components_register: component slurm has no register or open function
> [terra:149214] mca: base: components_register: found loaded component env
> [terra:149214] mca: base: components_register: component env has no register or open function
> [terra:149214] mca: base: components_register: found loaded component pmi
> [terra:149214] mca: base: components_register: component pmi has no register or open function
> [terra:149214] mca: base: components_register: found loaded component tool
> [terra:149214] mca: base: components_register: component tool register function successful
> [terra:149214] mca: base: components_register: found loaded component hnp
> [terra:149214] mca: base: components_register: component hnp has no register or open function
> [terra:149214] mca: base: components_register: found loaded component singleton
> [terra:149214] mca: base: components_register: component singleton register function successful
> [terra:149214] mca: base: components_open: opening ess components
> [terra:149214] mca: base: components_open: found loaded component slurm
> [terra:149214] mca: base: components_open: component slurm open function successful
> [terra:149214] mca: base: components_open: found loaded component env
> [terra:149214] mca: base: components_open: component env open function successful
> [terra:149214] mca: base: components_open: found loaded component pmi
> [terra:149214] mca: base: components_open: component pmi open function successful
> [terra:149214] mca: base: components_open: found loaded component tool
> [terra:149214] mca: base: components_open: component tool open function successful
> [terra:149214] mca: base: components_open: found loaded component hnp
> [terra:149214] mca: base: components_open: component hnp open function successful
> [terra:149214] mca: base: components_open: found loaded component singleton
> [terra:149214] mca: base: components_open: component singleton open function successful
> [terra:149214] mca:base:select: Auto-selecting ess components
> [terra:149214] mca:base:select:( ess) Querying component [slurm]
> [terra:149214] mca:base:select:( ess) Querying component [env]
> [terra:149214] mca:base:select:( ess) Querying component [pmi]
> [terra:149214] mca:base:select:( ess) Querying component [tool]
> [terra:149214] mca:base:select:( ess) Querying component [hnp]
> [terra:149214] mca:base:select:( ess) Query of component [hnp] set priority to 100
> [terra:149214] mca:base:select:( ess) Querying component [singleton]
> [terra:149214] mca:base:select:( ess) Selected component [hnp]
> [terra:149214] mca: base: close: component slurm closed
> [terra:149214] mca: base: close: unloading component slurm
> [terra:149214] mca: base: close: component env closed
> [terra:149214] mca: base: close: unloading component env
> [terra:149214] mca: base: close: component pmi closed
> [terra:149214] mca: base: close: unloading component pmi
> [terra:149214] mca: base: close: component tool closed
> [terra:149214] mca: base: close: unloading component tool
> [terra:149214] mca: base: close: component singleton closed
> [terra:149214] mca: base: close: unloading component singleton
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_plm_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
>
> I'm at my wits' end what to try, and all ears if anyone has any leads or
> suggestions.
>
> Thanks,
> Andrej
>
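
For readers following the thread: the log above shows "orte_plm_base_select failed" with "Not found" after "-mca plm slurm" was forced, so one thing that may be worth ruling out is that the slurm launcher simply was not selectable. Below is a minimal sketch in Python of two checks, under assumptions the log does not confirm: that "ompi_info" on this install prints one "MCA plm: <name>" line per built launcher component, and that mpirun has to be started inside a Slurm allocation (a job ID variable in the environment) for the slurm launcher to be picked. The helper names are hypothetical and belong to neither Open MPI nor Slurm.

```python
import os
import re
import shutil
import subprocess

# Hypothetical helpers for illustration only; not part of Open MPI or Slurm.


def has_plm_slurm(ompi_info_text):
    """Return True if the given ompi_info output lists a plm component named slurm."""
    pattern = r"^\s*MCA\s+plm:\s+slurm\b"
    return re.search(pattern, ompi_info_text, re.MULTILINE) is not None


def in_slurm_allocation(env):
    """Return True if the environment mapping carries a non-empty Slurm job ID."""
    return bool(env.get("SLURM_JOB_ID") or env.get("SLURM_JOBID"))


def diagnose(ompi_info_text, env):
    """Collect possible (not confirmed) reasons the slurm launcher is not selectable."""
    problems = []
    if not has_plm_slurm(ompi_info_text):
        problems.append("ompi_info lists no plm slurm component")
    if not in_slurm_allocation(env):
        problems.append("no SLURM_JOB_ID/SLURM_JOBID in the environment")
    return problems


if __name__ == "__main__":
    # Assumption: the ompi_info on PATH belongs to the same install as mpirun.
    exe = shutil.which("ompi_info")
    text = ""
    if exe:
        text = subprocess.run([exe], capture_output=True, text=True).stdout
    for problem in diagnose(text, os.environ):
        print("possible cause:", problem)
```

If the script prints nothing, neither of these two guesses applies and the cause lies elsewhere.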