Andrej,

My previous email listed other things to try.
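
One more thing worth checking. The log only enables ess and pmix verbosity, but the failure is in plm selection ("orte_plm_base_select failed"). Below is a minimal Python sketch that assembles the same command with plm_base_verbose turned on. It assumes Open MPI 4.x's ORTE-based mpirun. It also assumes, as I understand that code, that the slurm plm component only offers itself when SLURM_JOBID is set, i.e. when mpirun is started inside an salloc/sbatch allocation rather than on a login node. The helper name build_mpirun_debug_cmd is hypothetical, and the script only prints the command; it does not launch anything.

```python
import os
import shlex


def build_mpirun_debug_cmd(np, hosts, program, env=None):
    """Build an mpirun argv that forces the slurm plm and logs its selection.

    Assumption (hypothetical helper): with ORTE-based mpirun (Open MPI 4.x),
    the slurm plm component only offers itself when SLURM_JOBID is set,
    i.e. when mpirun runs inside an salloc/sbatch allocation.
    """
    env = os.environ if env is None else env
    if "SLURM_JOBID" not in env:
        raise RuntimeError(
            "SLURM_JOBID is not set; run inside salloc/sbatch, otherwise "
            "'-mca plm slurm' is likely to fail plm selection")

    # -H uses host:slots entries; make sure -np fits in the requested slots.
    total_slots = sum(slots for _, slots in hosts)
    if np > total_slots:
        raise ValueError(f"-np {np} exceeds the {total_slots} slots in -H")
    host_spec = ",".join(f"{name}:{slots}" for name, slots in hosts)

    return [
        "mpirun",
        "--mca", "plm", "slurm",
        # plm verbosity shows why plm selection fails; the ess/pmix
        # verbosity in the original command does not cover that framework.
        "--mca", "plm_base_verbose", "10",
        "-np", str(np),
        "-H", host_spec,
    ] + list(program)


if __name__ == "__main__":
    nodes = [("node15", 96), ("node16", 96), ("node17", 96), ("node18", 96)]
    try:
        cmd = build_mpirun_debug_cmd(384, nodes, ["python", "testmpi.py"])
        print(shlex.join(cmd))
    except (RuntimeError, ValueError) as err:
        print(f"not launching: {err}")
```

Run on terra outside an allocation, it prints the "not launching" message. Run inside an allocation (e.g. after salloc -N 4), it prints the mpirun line to try.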

Cheers,

Gilles

> On Feb 2, 2021, at 6:23, Andrej Prsa via devel <devel@lists.open-mpi.org> wrote:
> 
> The saga continues.
> 
> I managed to build slurm with pmix by first patching slurm using this patch 
> and manually building the plugin:
> 
> https://bugs.schedmd.com/show_bug.cgi?id=10683
> 
> Now srun shows pmix as an option:
> 
> andrej@terra:~/system/tests/MPI$ srun --mpi=list
> srun: MPI types are...
> srun: cray_shasta
> srun: none
> srun: pmi2
> srun: pmix
> srun: pmix_v4
> 
> But when I try to run mpirun with the slurm plugin, it still fails:
> 
> andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca pmix_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
> [terra:149214] mca: base: components_register: registering framework ess components
> [terra:149214] mca: base: components_register: found loaded component slurm
> [terra:149214] mca: base: components_register: component slurm has no register or open function
> [terra:149214] mca: base: components_register: found loaded component env
> [terra:149214] mca: base: components_register: component env has no register or open function
> [terra:149214] mca: base: components_register: found loaded component pmi
> [terra:149214] mca: base: components_register: component pmi has no register or open function
> [terra:149214] mca: base: components_register: found loaded component tool
> [terra:149214] mca: base: components_register: component tool register function successful
> [terra:149214] mca: base: components_register: found loaded component hnp
> [terra:149214] mca: base: components_register: component hnp has no register or open function
> [terra:149214] mca: base: components_register: found loaded component singleton
> [terra:149214] mca: base: components_register: component singleton register function successful
> [terra:149214] mca: base: components_open: opening ess components
> [terra:149214] mca: base: components_open: found loaded component slurm
> [terra:149214] mca: base: components_open: component slurm open function successful
> [terra:149214] mca: base: components_open: found loaded component env
> [terra:149214] mca: base: components_open: component env open function successful
> [terra:149214] mca: base: components_open: found loaded component pmi
> [terra:149214] mca: base: components_open: component pmi open function successful
> [terra:149214] mca: base: components_open: found loaded component tool
> [terra:149214] mca: base: components_open: component tool open function successful
> [terra:149214] mca: base: components_open: found loaded component hnp
> [terra:149214] mca: base: components_open: component hnp open function successful
> [terra:149214] mca: base: components_open: found loaded component singleton
> [terra:149214] mca: base: components_open: component singleton open function successful
> [terra:149214] mca:base:select: Auto-selecting ess components
> [terra:149214] mca:base:select:(  ess) Querying component [slurm]
> [terra:149214] mca:base:select:(  ess) Querying component [env]
> [terra:149214] mca:base:select:(  ess) Querying component [pmi]
> [terra:149214] mca:base:select:(  ess) Querying component [tool]
> [terra:149214] mca:base:select:(  ess) Querying component [hnp]
> [terra:149214] mca:base:select:(  ess) Query of component [hnp] set priority to 100
> [terra:149214] mca:base:select:(  ess) Querying component [singleton]
> [terra:149214] mca:base:select:(  ess) Selected component [hnp]
> [terra:149214] mca: base: close: component slurm closed
> [terra:149214] mca: base: close: unloading component slurm
> [terra:149214] mca: base: close: component env closed
> [terra:149214] mca: base: close: unloading component env
> [terra:149214] mca: base: close: component pmi closed
> [terra:149214] mca: base: close: unloading component pmi
> [terra:149214] mca: base: close: component tool closed
> [terra:149214] mca: base: close: unloading component tool
> [terra:149214] mca: base: close: component singleton closed
> [terra:149214] mca: base: close: unloading component singleton
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_plm_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> 
> I'm at my wits' end as to what to try, and I'm all ears if anyone has any
> leads or suggestions.
> 
> Thanks,
> Andrej
> 