Bug#978022: libopenmpi3 Runtime failure opal_pmix_base_select failed

2020-12-26 Thread Lucas Nussbaum
Hi,

On 24/12/20 at 17:16 +0100, Michael Banck wrote:
> Package: libopenmpi3
> Version: 3.1.3-11
> Severity: serious
> 
> Even with the fixed libpmix2_4.0.0~rc1-2, I am getting runtime failures
> trying to run MPI programs, e.g. the nwchem autopkgtests all fail like
> this:

A simple way to reproduce is:

$ mpiexec -n 1 true
[groff:16932] [[40958,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 320
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
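
For checking this from a script (e.g. in a test harness), here is a minimal
sketch. The function names has_pmix_select_failure and check_mpiexec are
hypothetical helpers, not part of Open MPI or Debian; the only things taken
from the report are the "mpiexec -n 1 true" reproducer and the
"opal_pmix_base_select failed" message. It assumes that message appears
verbatim in the combined stdout/stderr output.

```shell
#!/bin/sh

# Hypothetical helper: succeeds (exit 0) when the text passed as $1
# contains the failure signature quoted in this report.
has_pmix_select_failure() {
    printf '%s\n' "$1" | grep -q 'opal_pmix_base_select failed'
}

# Hypothetical helper: runs the one-line reproducer and reports the result.
# Returns 0 if the failure signature is absent, 1 if it is present,
# and 2 if mpiexec is not installed at all.
check_mpiexec() {
    if ! command -v mpiexec >/dev/null 2>&1; then
        echo "mpiexec not found; cannot run the reproducer"
        return 2
    fi
    # Capture stdout and stderr together; the error text goes to stderr.
    output=$(mpiexec -n 1 true 2>&1)
    if has_pmix_select_failure "$output"; then
        echo "affected: opal_pmix_base_select failed"
        return 1
    fi
    echo "not affected"
    return 0
}
```

Calling check_mpiexec after sourcing the file runs the reproducer once; the
exit status can then be used directly by a calling script.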

It happens with these versions:

$ dpkg -l | grep -e openmpi -e pmi
ii  libopenmpi3:amd64  4.1.0-1      amd64  high performance message passing library -- shared library
ii  libpmix2:amd64     4.0.0~rc1-2  amd64  Process Management Interface (Exascale) library
ii  openmpi-bin        4.1.0-1      amd64  high performance message passing library -- binaries
ii  openmpi-common     4.1.0-1      all    high performance message passing library -- common files

It doesn't fail after downgrading openmpi to the version in testing
(4.0.5-7).

Lucas



Bug#978022: libopenmpi3 Runtime failure opal_pmix_base_select failed

2020-12-24 Thread Michael Banck
Package: libopenmpi3
Version: 3.1.3-11
Severity: serious

Even with the fixed libpmix2_4.0.0~rc1-2, I am getting runtime failures
trying to run MPI programs, e.g. the nwchem autopkgtests all fail like
this:

| Running tests/water/water_md 
| 
| cleaning scratch
| copying input and verified output files
| running nwchem (/usr/bin/nwchem)  with 1 processors 
| 
| NWChem execution failed
|[kohn:13218] [[5127,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 320
|--
|It looks like orte_init failed for some reason; your parallel process is
|likely to abort.  There are many reasons that a parallel process can
|fail during orte_init; some of which are due to configuration or
|environment problems.  This failure appears to be an internal failure;
|here's some additional information (which may only be relevant to an
|Open MPI developer):
|
|  opal_pmix_base_select failed
|  --> Returned value Not found (-13) instead of ORTE_SUCCESS
|--

Not sure whether this is libopenmpi3, openmpi-bin, libpmix2 or something
else, so please reassign as needed. But at least the openmpi excuses page
is full of ci.debian.net regressions:

https://qa.debian.org/excuses.php?package=openmpi

Or is there something needed on the application side, like a new
environment variable or library to be linked in?


Michael