On 2021-10-12 22:24, Drew Parsons wrote:
On 2021-10-12 17:46, Jeff Squyres (jsquyres) wrote:
...
Ok, so this is an MPI_Alltoall issue. Does it use MPI_IN_PLACE?
...
I'll apply PR1738 to the debian dolfinx build and see how it turns
out.
Looks like removing MPI_IN_PLACE is not enough; dolfinx is still crashing:
https://buildd.debian.org/status/fetch.php?pkg=fenics-dolfinx&arch=i386&ver=1%3A0.3.0-5&stamp=1634060713&raw=0
Debugging a bit further (with MPI_IN_PLACE removed), I can identify that the bug is in dolfinx, not openmpi (unless there are two bugs here). Comparing detailed debug output from 2 ranks (MPI processes), I find one rank skips the facet loop in compute_nonlocal_dual_graph() in dolfinx's mesh/graphbuild.cpp, while the other rank crashes at

  buffer[pos[dest] + max_num_vertices_per_facet] += cell_offset;

because pos[dest] is 0 but max_num_vertices_per_facet is -1, so the index evaluates to -1 and the write lands before the start of the buffer.
A value of max_num_vertices_per_facet = -1 seems wrong in principle, so this must be a dolfinx bug after all. It looks like I read the backtraces the wrong way around: one rank got ahead into mca_btl_vader.so, where it must be waiting for the second rank. The second rank crashes before it ever reaches MPI_Alltoall. So the fatal signal we saw in the trace after mca_btl_vader.so would be the kill signal triggered by the segfault in the other rank at compute_nonlocal_dual_graph().
I'll test again with MPI_IN_PLACE restored to confirm that the dolfinx max_num_vertices_per_facet = -1 is the true problem here.
Drew