On 2021-10-12 22:24, Drew Parsons wrote:
On 2021-10-12 17:46, Jeff Squyres (jsquyres) wrote:
...
Ok, so this is an MPI_Alltoall issue. Does it use MPI_IN_PLACE?
...
I'll apply PR1738 to the debian dolfinx build and see how it turns
out.
Looks like removing MPI_IN_PLACE is not enough; dolfinx is still crashing:
https://buildd.debian.org/status/fetch.php?pkg=fenics-dolfinx&arch=i386&ver=1%3A0.3.0-5&stamp=1634060713&raw=0
Debugging a bit further (with MPI_IN_PLACE removed), I can identify that the bug is in dolfinx, not openmpi (unless there are two bugs here). Comparing detailed debug output from 2 ranks (MPI processes), I find one rank skips the facet loop in compute_nonlocal_dual_graph() in dolfinx's mesh/graphbuild.cpp, while the other rank crashes at

  buffer[pos[dest] + max_num_vertices_per_facet] += cell_offset;

because pos[dest] is 0 but max_num_vertices_per_facet is -1, so the index evaluates to -1 and the write lands before the start of the buffer.
A value of max_num_vertices_per_facet = -1 seems wrong in principle, so this must be a dolfinx bug after all. It looks like I read the backtraces the wrong way around: one rank got ahead into mca_btl_vader.so, where it must be waiting for the second rank. The second rank crashes before it ever reaches MPI_Alltoall. So the fatal signal we saw in the trace after mca_btl_vader.so would be the kill signal triggered by the segfault in the other rank at compute_nonlocal_dual_graph().
I'll test again with MPI_IN_PLACE restored to confirm that the dolfinx max_num_vertices_per_facet = -1 is the true problem here.
Drew