Your message dated Thu, 31 May 2018 13:40:32 +0200
with message-id <[email protected]>
and subject line Re: Bug#896886: openmpi: upstream version 3.0.1 makes lots of
autopkgtests flaky
has caused the Debian Bug report #896886,
regarding openmpi: upstream version 3.0.1 makes lots of autopkgtests flaky
to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.
(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)
--
896886: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=896886
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Source: openmpi
Version: 3.0.1-1
Severity: normal
User: [email protected]
Usertags: breaks
Control: affects -1 src:lammps
Control: affects -1 src:esys-particle
Control: affects -1 src:liggghts
Control: affects -1 src:gerris
Control: affects -1 src:gmsh
Control: affects -1 src:ray
With the upload of upstream version 3.0.1 of openmpi to Debian, the
autopkgtest of lammps¹, esys-particle², liggghts³, gerris⁴, gmsh⁵, ray⁶
started to regularly fail with an error similar to the one copied below.
(ray is also seeing another issue)
Unfortunately, the transition and some issues with the CI
infrastructure were intermixed with this (those give different errors,
though), so not all failures are due to this issue. However, I also
cannot exclude that all these issues are due to packages mixing
different versions of libopenmpi* during the transition. If they can't
be mixed, I think openmpi should be blocked from migrating to testing
until the transition is finished (I thought that, as a library, it would
be allowed to migrate before all reverse dependencies are rebuilt if it
doesn't break installability, as the old library will stay in the
archive until all reverse dependencies are rebuilt and migrated).
It has been pointed out in a previous issue that autopkgtests may be
sensitive to the hardware they run on, so I tried to check which workers
pass and which workers fail with this error (for most, I also note the
version of openmpi that was involved). Unfortunately, there are workers
where tests both pass and fail (7, 8).
I hope you can investigate the issue.
Paul
¹ https://ci.debian.net/packages/l/lammps
² https://ci.debian.net/packages/e/esys-particle
³ https://ci.debian.net/packages/l/liggghts
⁴ https://ci.debian.net/packages/g/gerris
⁵ https://ci.debian.net/packages/g/gmsh
⁶ https://ci.debian.net/packages/r/ray
fail:
worker#5 -8
https://ci.debian.net/data/autopkgtest/testing/amd64/g/gmsh/204541/log.gz
worker#7 -8
https://ci.debian.net/data/autopkgtest/testing/amd64/g/gerris/204539/log.gz
worker#3 -8
https://ci.debian.net/data/autopkgtest/testing/amd64/r/ray/204537/log.gz
worker#3 -8
https://ci.debian.net/data/autopkgtest/testing/amd64/e/esys-particle/204527/log.gz
worker#2
https://ci.debian.net/data/autopkgtest/testing/amd64/l/lammps/201744/log.gz
worker#3 -8
https://ci.debian.net/data/autopkgtest/testing/amd64/g/gerris/201739/log.gz
worker#1 -8
https://ci.debian.net/data/autopkgtest/testing/amd64/e/esys-particle/201735/log.gz
worker#6
https://ci.debian.net/data/autopkgtest/testing/amd64/l/lammps/189185/log.gz
worker#7 -6
https://ci.debian.net/data/autopkgtest/testing/amd64/e/esys-particle/189181/log.gz
worker#2 -6
https://ci.debian.net/data/autopkgtest/unstable/amd64/l/liggghts/188523/log.gz
worker#8 -6
https://ci.debian.net/data/autopkgtest/testing/amd64/e/esys-particle/184333/log.gz
worker#3 -6
https://ci.debian.net/data/autopkgtest/unstable/amd64/e/esys-particle/180310/log.gz
worker#3
https://ci.debian.net/data/autopkgtest/unstable/amd64/l/lammps/173837/log.gz
pass:
worker#4 -8
https://ci.debian.net/data/autopkgtest/unstable/amd64/g/gerris/205247/log.gz
worker#4
https://ci.debian.net/data/autopkgtest/unstable/amd64/l/lammps/202788/log.gz
worker#9 -8
https://ci.debian.net/data/autopkgtest/unstable/amd64/g/gmsh/202759/log.gz
worker#9
https://ci.debian.net/data/autopkgtest/testing/amd64/l/lammps/195323/log.gz
worker#7 -6
https://ci.debian.net/data/autopkgtest/testing/amd64/g/gerris/195319/log.gz
worker#1 -6
https://ci.debian.net/data/autopkgtest/testing/amd64/e/esys-particle/190183/log.gz
worker#8 -6
https://ci.debian.net/data/autopkgtest/unstable/amd64/g/gmsh/189823/log.gz
worker#8
https://ci.debian.net/data/autopkgtest/unstable/amd64/l/lammps/185784/log.gz
worker#10
https://ci.debian.net/data/autopkgtest/unstable/amd64/l/lammps/180338/log.gz
worker#8
https://ci.debian.net/data/autopkgtest/unstable/amd64/l/lammps/177294/log.gz
transition issue:
worker#1
https://ci.debian.net/data/autopkgtest/testing/amd64/l/lammps/201064/log.gz
worker#7 -7
https://ci.debian.net/data/autopkgtest/testing/amd64/e/esys-particle/201044/log.gz
worker#7
https://ci.debian.net/data/autopkgtest/testing/amd64/l/lammps/196312/log.gz
worker#9 -7
https://ci.debian.net/data/autopkgtest/testing/amd64/e/esys-particle/196303/log.gz
worker#7
https://ci.debian.net/data/autopkgtest/unstable/amd64/l/lammps/168795/log.gz
worker#9
https://ci.debian.net/data/autopkgtest/unstable/amd64/l/lammps/156573/log.gz
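The fail and pass lists above can be cross-checked mechanically. The following is only a sketch: the (worker, tag) pairs are transcribed by hand from those two lists, the "-6"/"-8" suffixes are treated as opaque openmpi version tags, and the function names are made up for illustration.

```python
# (worker, openmpi tag or None) pairs transcribed from the lists above.
# Tags are kept as opaque strings; None means no tag was noted for the run.
FAIL = [(5, "-8"), (7, "-8"), (3, "-8"), (3, "-8"), (2, None), (3, "-8"),
        (1, "-8"), (6, None), (7, "-6"), (2, "-6"), (8, "-6"), (3, "-6"),
        (3, None)]
PASS = [(4, "-8"), (4, None), (9, "-8"), (9, None), (7, "-6"), (1, "-6"),
        (8, "-6"), (8, None), (10, None), (8, None)]


def mixed_workers(fail, passed):
    """Workers that appear in both lists, regardless of tag."""
    return sorted({w for w, _ in fail} & {w for w, _ in passed})


def mixed_same_tag(fail, passed):
    """(worker, tag) pairs seen in both lists, ignoring untagged runs."""
    tagged_fail = {(w, t) for w, t in fail if t is not None}
    tagged_pass = {(w, t) for w, t in passed if t is not None}
    return sorted(tagged_fail & tagged_pass)


if __name__ == "__main__":
    print("both pass and fail:", mixed_workers(FAIL, PASS))
    print("both, same tag:", mixed_same_tag(FAIL, PASS))
```

With the same tag, only workers 7 and 8 show both outcomes. Keyed by worker alone, worker 1 also appears in both lists, but with different tags.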
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[lammps-1523947718:1047] Local abort before MPI_INIT completed completed
successfully, but am not able to aggregate error messages, and not able
to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[lammps-1523947718:1048] Local abort before MPI_INIT completed completed
successfully, but am not able to aggregate error messages, and not able
to guarantee that all other processes were killed!
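To separate logs showing this failure from the transition and CI infrastructure failures, one could scan each log for the marker strings in the output above. This is a rough sketch: the marker strings are copied from the quoted error, while the helper names and the choice to require every marker are assumptions.

```python
import gzip

# Marker strings copied from the error output quoted above.
SIGNATURES = (
    "ompi_mpi_init: ompi_rte_init failed",
    "*** An error occurred in MPI_Init",
)


def has_mpi_init_failure(log_text):
    """Return True if every marker string is present in the log text."""
    return all(sig in log_text for sig in SIGNATURES)


def read_log(path):
    """Read a plain or gzip-compressed log file as text."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        return fh.read()
```

For a downloaded log, `has_mpi_init_failure(read_log("log.gz"))` would then flag whether it matches this error.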
--- End Message ---
--- Begin Message ---
Hi Alastair,
On Wed, 16 May 2018 15:06:07 +0100 Alastair McKinstry
<[email protected]> wrote:
> Yes, I'll package 3.1.0 ASAP. pmix 2.1.1 was uploaded today.
> It appears all the outstanding FTBFS are due to MPI hangs in tests. I'm
> trying to get builds on armhf hardware to debug.
I noticed that you did lots of openmpi uploads, thanks for working on
issues.
To my eye it looks like the flakiness of the packages has gone, so I am
closing the bug.
Graham would like me to remind you about the armhf issue, though. Should he
file a separate bug about that?
Paul
--- End Message ---