Source: mpi4py
Followup-For: Bug #1131533
Control: tags -1 help

The str2bool error reported here appears to arise from a mismatch
between the old mpi4py build (against mpich 4) and the new mpich 5.
MPICH is supposed to maintain ABI compatibility, so the error may be a
bug in that compatibility. At the same time, MPI-5 (supported by mpich 5)
introduces a common MPI ABI (shared with openmpi), so perhaps mpi4py
was getting confused over which ABI it was working with. I'm not sure
exactly.
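
For anyone wanting to check for this kind of mismatch on their own
system, a minimal sketch follows. It assumes mpi4py is importable, and
the helper name describe_mpi is only illustrative. It prints the build
configuration recorded by mpi4py next to the library version string
reported by the MPI library that actually gets loaded:

```python
def describe_mpi():
    """Return MPI details seen by mpi4py, or None if mpi4py is unusable.

    describe_mpi is a hypothetical helper name, not part of mpi4py.
    """
    try:
        import mpi4py
        from mpi4py import MPI  # importing MPI initialises the MPI library
    except (ImportError, RuntimeError):
        return None
    return {
        "mpi4py_version": mpi4py.__version__,
        # Configuration recorded when mpi4py was built
        "build_config": mpi4py.get_config(),
        # MPI implementation name and version as reported by mpi4py
        "vendor": MPI.get_vendor(),
        # Version of the MPI standard supported, as (major, minor)
        "standard_version": MPI.Get_version(),
        # Free-form version string from the loaded MPI library
        "library_version": MPI.Get_library_version().strip(),
    }


if __name__ == "__main__":
    info = describe_mpi()
    if info is None:
        print("mpi4py could not be imported")
    else:
        for key, value in info.items():
            print(f"{key}: {value}")
```

If the vendor version and the library version string disagree, the
extension is probably running against a different MPI library than the
one it was built with.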

To keep the build consistent, I uploaded mpi4py 4.1.1-3 to rebuild
against mpich 5. With the rebuild, testGetFortranInfo is now passing
on i386, as tested on porterbox (barriere):

testGetFortranInfo (test_mpiabi.TestMPIABI.testGetFortranInfo) ... 
testGetFortranInfo (test_mpiabi.TestMPIABI.testGetFortranInfo) ... 
testGetFortranInfo (test_mpiabi.TestMPIABI.testGetFortranInfo) ... 
testGetFortranInfo (test_mpiabi.TestMPIABI.testGetFortranInfo) ... ok
ok
testGetInfo (test_mpiabi.TestMPIABI.testGetInfo) ... ok
testGetVersion (test_mpiabi.TestMPIABI.testGetVersion) ... ok

In fact all tests are passing successfully on barriere, with
autopkgtest ending with:

autopkgtest [16:29:40]: test mpi4py-test: -----------------------]
autopkgtest [16:29:40]: test mpi4py-test:  - - - - - - - - - - results - - - - - - - - - -
mpi4py-test          PASS
autopkgtest [16:29:40]: @@@@@@@@@@@@@@@@@@@@ summary
command1             SKIP unknown restriction hint-testsuite-triggers
command1             SKIP unknown restriction hint-testsuite-triggers
mpi4py-test          PASS

So in that sense, this bug is resolved by rebuilding mpi4py against mpich 5.

However, on debci the tests now hang at
test_cco_buf.TestCCOBufWorld.testAllreduce
after hitting an ERROR in test_cco_buf.TestCCOBufWorld.testAllgather,
so debci fails on timeout:
https://ci.debian.net/data/autopkgtest/testing/i386/m/mpi4py/69692323/log.gz

I suspect the timeout in testAllreduce is indirectly triggered by the
error in testAllgather.

I can't see what the substantial difference between the two test
environments is. Why are the same tests passing on barriere,
but hitting an error and failing with timeout on debci?
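
One possible first step is to dump the environment on both hosts (for
example with "env | sort") and compare the two dumps. A minimal sketch,
assuming one KEY=VALUE pair per line. The function names and the sample
values are hypothetical, not taken from barriere or debci:

```python
def parse_env(dump):
    """Parse `env` output (one KEY=VALUE per line) into a dict."""
    env = {}
    for line in dump.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            env[key] = value
    return env


def diff_env(dump_a, dump_b):
    """Return {key: (value_a, value_b)} for every key whose value differs.

    A value of None means the key is absent from that dump.
    """
    a, b = parse_env(dump_a), parse_env(dump_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Illustrative dumps only; real ones would come from the two hosts.
    host_a = "HOME=/home/user\nTMPDIR=/tmp\n"
    host_b = "HOME=/root\nTMPDIR=/tmp\nEXAMPLE_ONLY=1\n"
    for key, (va, vb) in diff_env(host_a, host_b).items():
        print(f"{key}: {va!r} != {vb!r}")
```

Values that span several lines are not handled by this sketch, and
environment variables are only one candidate: CPU count, available
memory and network configuration would need separate checks.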

Help wanted.
