Some updates on this OpenMPI bug:
 1) It affects OpenMPI 2.1.x when configured with --enable-heterogeneous, 
which is not a default option and is not commonly used. The Ubuntu package, 
however, is built with it.
 2) OpenMPI fixed it in 3.x.
 3) It was reported to Ubuntu two years ago but is still unassigned: 
https://bugs.launchpad.net/ubuntu/+source/openmpi/+bug/1731938. A user's 
comment from last year: "We have just spent today hunting down a user bug 
report for Xyce (which uses Trilinos, and its Zoltan library) that turn out to 
be exactly this issue"
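
To check whether a given OpenMPI installation was built with 
--enable-heterogeneous, something like the following may help. This is only a 
rough sketch: it assumes ompi_info is on the PATH and reports a 
"Heterogeneous support: yes/no" line, and has_hetero_support is a helper name 
made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenMPI or PETSc): reads `ompi_info`
# output on stdin and succeeds if heterogeneous support is reported as enabled.
has_hetero_support() {
    grep -Eiq 'heterogeneous support:[[:space:]]*yes'
}

if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_hetero_support; then
        echo "OpenMPI reports heterogeneous support: yes"
    else
        echo "OpenMPI does not report heterogeneous support"
    fi
else
    echo "ompi_info not found; cannot check the OpenMPI build"
fi
```

If the check reports "yes" on a 2.1.x install, the build matches the 
configuration described in 1) above.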

--Junchao Zhang


On Wed, Jul 31, 2019 at 2:17 PM Junchao Zhang <[email protected]> wrote:
Hi, Fabian,
I found that this is an OpenMPI bug in self-to-self MPI_Send/MPI_Recv when 
MPI_ANY_SOURCE is used for message matching: OpenMPI does not put the correct 
value in the receive buffer.
I have a workaround in the branch 
jczhang/fix-ubuntu-openmpi-anysource<https://bitbucket.org/petsc/petsc/branch/jczhang/fix-ubuntu-openmpi-anysource>.
I tested it with your petsc_ex.F90 and $PETSC_DIR/src/dm/examples/tests/ex14. 
The majority of the valgrind errors disappeared. The few that remain are in 
ompi_mpi_init, and we can ignore them.
I filed a bug report with OpenMPI 
(https://www.mail-archive.com/[email protected]//msg33383.html) and hope 
they can get it fixed in Ubuntu.
Thanks.

--Junchao Zhang


On Tue, Jul 30, 2019 at 9:47 AM Fabian.Jakub via petsc-dev 
<[email protected]> wrote:
Dear PETSc Team,
Our cluster recently switched to Ubuntu 18.04, which ships gcc 7.4 and
(Open MPI) 2.1.1. With this setup I get segfaults and valgrind
errors in DMDAGlobalToNatural.

This is evident in a minimal Fortran example such as the attached
petsc_ex.F90

with the following error:

==22616== Conditional jump or move depends on uninitialised value(s)
==22616==    at 0x4FA5CDB: PetscTrMallocDefault (mtr.c:185)
==22616==    by 0x4FA4DAC: PetscMallocA (mal.c:413)
==22616==    by 0x5090E94: VecScatterSetUp_SF (vscatsf.c:652)
==22616==    by 0x50A1104: VecScatterSetUp (vscatfce.c:209)
==22616==    by 0x509EE3B: VecScatterCreate (vscreate.c:280)
==22616==    by 0x577B48B: DMDAGlobalToNatural_Create (dagtol.c:108)
==22616==    by 0x577BB6D: DMDAGlobalToNaturalBegin (dagtol.c:155)
==22616==    by 0x5798446: VecView_MPI_DA (gr2.c:720)
==22616==    by 0x51BC7D8: VecView (vector.c:574)
==22616==    by 0x4F4ECA1: PetscObjectView (destroy.c:90)
==22616==    by 0x4F4F05E: PetscObjectViewFromOptions (destroy.c:126)

and consequently wrong results in the natural vec


I checked the Fortran example to see whether I had forgotten something, but
I also see the same error, i.e. not being valgrind clean, in pure C PETSc:

cd $PETSC_DIR/src/dm/examples/tests && make ex14 && mpirun \
--allow-run-as-root -np 2 valgrind ./ex14

I then tried various Linux distribution images under docker/podman to make
sure that my setup is clean. It seems to me that this error is confined to
the particular combination of gcc 7.4 and (Open MPI) 2.1.1 from the
ubuntu:latest repo.

I tried other images from dockerhub including

gcc:7.4.0 <-- could install neither openmpi nor mpich through apt, but
works with --download-openmpi and --download-mpich

ubuntu:rolling(19.04) <-- works

debian:latest & :stable <-- works

ubuntu:latest(18.04) <-- fails in case of openmpi, but works with mpich
or with petsc-configure --download-openmpi or --download-mpich


Is this error with (Open MPI) 2.1.1 a known issue? In the meantime, I
guess I'll go with a custom MPI install, but given that ubuntu:latest is
widely used, do you think there is an easy solution to the error?

I guess you are not eager to delve into this issue with old mpi versions
but in case you find some spare time, maybe you find the root cause
and/or a workaround.

Many thanks,
Fabian
