Andrej, you are now invoking mpirun from a slurm allocation, right?
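A minimal sketch of how that can be checked from the shell before calling mpirun. It assumes that Slurm exports SLURM_JOB_ID (or the older SLURM_JOBID) inside an salloc/sbatch allocation; the helper name in_slurm_allocation and the salloc hint are only illustrative.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when a Slurm job id is present in the
# environment, which is normally the case inside salloc/sbatch.
in_slurm_allocation() {
    [ -n "${SLURM_JOB_ID:-}${SLURM_JOBID:-}" ]
}

if in_slurm_allocation; then
    echo "inside Slurm allocation ${SLURM_JOB_ID:-$SLURM_JOBID}"
else
    # Example only: the node count matches the four hosts used in this thread.
    echo "not inside a Slurm allocation; start one first, e.g. salloc -N 4"
fi
```

If the second message is printed, the slurm launcher most likely has no allocation to work with.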
You can try this:

/usr/local/bin/mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py

If it does not work, you can collect more relevant logs with

mpirun -mca plm slurm -mca plm_base_verbose 10 -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py

Another test you can do is

srun -N 1 -n 1 orted

That is expected to fail, but it should at least find all its dependencies and start.

Cheers,

Gilles

On Tue, Feb 2, 2021 at 12:32 AM Andrej Prsa via devel <devel@lists.open-mpi.org> wrote:
>
> Alright, I rebuilt mpirun and it's working on a local machine. But now
> I'm back to my original problem: running this works:
>
> mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96
> python testmpi.py
>
> but running this doesn't:
>
> mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96
> python testmpi.py
>
> Here's the verbose output from the latter command:
>
> andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca
> pmix_base_verbose 10 -mca plm slurm -np 384 -H
> node15:96,node16:96,node17:96,node18:96 python testmpi.py
> [terra:387112] mca: base: components_register: registering framework ess
> components
> [terra:387112] mca: base: components_register: found loaded component slurm
> [terra:387112] mca: base: components_register: component slurm has no
> register or open function
> [terra:387112] mca: base: components_register: found loaded component env
> [terra:387112] mca: base: components_register: component env has no
> register or open function
> [terra:387112] mca: base: components_register: found loaded component pmi
> [terra:387112] mca: base: components_register: component pmi has no
> register or open function
> [terra:387112] mca: base: components_register: found loaded component tool
> [terra:387112] mca: base: components_register: component tool register
> function successful
> [terra:387112] mca: base: components_register: found loaded component hnp
> [terra:387112] mca: base: components_register: component hnp has no
> register or open function
> [terra:387112] mca: base: components_register: found loaded component
> singleton
> [terra:387112] mca: base: components_register: component singleton
> register function successful
> [terra:387112] mca: base: components_open: opening ess components
> [terra:387112] mca: base: components_open: found loaded component slurm
> [terra:387112] mca: base: components_open: component slurm open function
> successful
> [terra:387112] mca: base: components_open: found loaded component env
> [terra:387112] mca: base: components_open: component env open function
> successful
> [terra:387112] mca: base: components_open: found loaded component pmi
> [terra:387112] mca: base: components_open: component pmi open function
> successful
> [terra:387112] mca: base: components_open: found loaded component tool
> [terra:387112] mca: base: components_open: component tool open function
> successful
> [terra:387112] mca: base: components_open: found loaded component hnp
> [terra:387112] mca: base: components_open: component hnp open function
> successful
> [terra:387112] mca: base: components_open: found loaded component singleton
> [terra:387112] mca: base: components_open: component singleton open
> function successful
> [terra:387112] mca:base:select: Auto-selecting ess components
> [terra:387112] mca:base:select:( ess) Querying component [slurm]
> [terra:387112] mca:base:select:( ess) Querying component [env]
> [terra:387112] mca:base:select:( ess) Querying component [pmi]
> [terra:387112] mca:base:select:( ess) Querying component [tool]
> [terra:387112] mca:base:select:( ess) Querying component [hnp]
> [terra:387112] mca:base:select:( ess) Query of component [hnp] set
> priority to 100
> [terra:387112] mca:base:select:( ess) Querying component [singleton]
> [terra:387112] mca:base:select:( ess) Selected component [hnp]
> [terra:387112] mca: base: close: component slurm closed
> [terra:387112] mca: base: close: unloading component slurm
> [terra:387112] mca: base: close: component env closed
> [terra:387112] mca: base: close: unloading component env
> [terra:387112] mca: base: close: component pmi closed
> [terra:387112] mca: base: close: unloading component pmi
> [terra:387112] mca: base: close: component tool closed
> [terra:387112] mca: base: close: unloading component tool
> [terra:387112] mca: base: close: component singleton closed
> [terra:387112] mca: base: close: unloading component singleton
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_plm_base_select failed
> --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
>
> This was the exact problem that prompted me to try and upgrade from
> 4.0.3 to 4.1.0. Openmpi 4.1.0 (in debug mode, with internal pmix) is now
> installed on the head and on all compute nodes.
>
> I'd appreciate any ideas on what to try to overcome this.
>
> Cheers,
> Andrej
>
>
> On 2/1/21 9:57 AM, Andrej Prsa wrote:
> > Hi Gilles,
> >
> >> that's odd, there should be a mca_pmix_pmix3x.so (assuming you built
> >> with the internal pmix)
> >
> > Ah, I didn't -- I linked against the latest git pmix; here's the
> > configure line:
> >
> > ./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm
> > --without-tm --without-moab --without-singularity --without-fca
> > --without-hcoll --without-ime --without-lustre --without-psm
> > --without-psm2 --without-mxm --with-gnu-ld --enable-debug
> >
> > I'll try nuking the install again and configuring it to use internal
> > pmix.
> >
> > Cheers,
> > Andrej
> >
>