On 2021-10-12 22:24, Drew Parsons wrote:
On 2021-10-12 17:46, Jeff Squyres (jsquyres) wrote:
...
Ok, so this is an MPI_Alltoall issue.  Does it use MPI_IN_PLACE?
...
I'll apply PR1738 to the debian dolfinx build and see how it turns out.

Looks like removing MPI_IN_PLACE is not enough. dolfinx is still crashing,
https://buildd.debian.org/status/fetch.php?pkg=fenics-dolfinx&arch=i386&ver=1%3A0.3.0-5&stamp=1634060713&raw=0
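
(For context, since the MPI_IN_PLACE question came up: the two call styles for MPI_Alltoall look like the sketch below. This is only a minimal illustration with made-up buffers and counts, not the dolfinx code or the actual change in PR1738.)

  #include <mpi.h>
  #include <vector>

  int main(int argc, char* argv[])
  {
    MPI_Init(&argc, &argv);
    int size = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // One int exchanged with every rank (illustrative data only).
    std::vector<int> buf(size, 42);

    // In-place form: sendbuf is MPI_IN_PLACE, the send count/type are ignored,
    // and each rank's outgoing data is read from (and overwritten in) buf.
    MPI_Alltoall(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                 buf.data(), 1, MPI_INT, MPI_COMM_WORLD);

    // Out-of-place form: explicit, separate send and receive buffers.
    std::vector<int> sendbuf(size, 42), recvbuf(size);
    MPI_Alltoall(sendbuf.data(), 1, MPI_INT,
                 recvbuf.data(), 1, MPI_INT, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
  }

The in-place variant exercises a different path inside Open MPI's collective components, which I assume is why it was worth ruling out first.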


Debugging a bit further (with MPI_IN_PLACE removed), I can now identify the bug as being in dolfinx, not openmpi (unless there are two bugs here).

Comparing detailed debug output from the 2 threads, I find one thread skips the facet loop in compute_nonlocal_dual_graph() in dolfinx's mesh/graphbuild.cpp, while the other thread crashes at
  buffer[pos[dest] + max_num_vertices_per_facet] += cell_offset;
because pos[dest] is 0 but max_num_vertices_per_facet=-1, so the computed index is -1 and the write lands outside the buffer.
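
A minimal self-contained sketch of just that arithmetic, to spell out the failure (the variable names are borrowed from graphbuild.cpp, the rest is invented for illustration):

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  int main()
  {
    std::vector<std::int64_t> buffer(8, 0);    // stand-in for the send buffer
    std::vector<int> pos = {0};                // per-destination insert positions
    const int dest = 0;
    const int max_num_vertices_per_facet = -1; // the bad value seen in the debug output
    const std::int64_t cell_offset = 100;

    const int index = pos[dest] + max_num_vertices_per_facet;
    std::printf("index = %d\n", index);        // prints -1

    // buffer[index] += cell_offset;  // out-of-bounds write -> undefined behaviour / segfault
    (void)buffer;
    (void)cell_offset;
    return 0;
  }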

A value of max_num_vertices_per_facet=-1 seems wrong in principle, so this must be a dolfinx bug after all. It looks like I read the backtraces the wrong way around: one thread got ahead into mca_btl_vader.so, where it must be waiting for the second thread. The second thread crashes before it ever reaches MPI_Alltoall. So the fatal signal we saw in the trace after mca_btl_vader.so would be the kill signal coming from the segfault in the other thread in compute_nonlocal_dual_graph().
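
For what it's worth, one way a -1 could arise in principle (this is only my guess, I haven't traced graphbuild.cpp to confirm it) is a maximum initialised to a -1 sentinel that is never updated because the local facet loop runs zero times on that rank:

  #include <algorithm>
  #include <cstdio>
  #include <vector>

  int main()
  {
    // Suppose this rank contributes no facets (empty local part of the mesh).
    std::vector<int> num_vertices_per_local_facet;  // empty

    int max_num_vertices_per_facet = -1;            // sentinel: "nothing seen yet"
    for (int n : num_vertices_per_local_facet)      // body never executes
      max_num_vertices_per_facet = std::max(max_num_vertices_per_facet, n);

    // If nothing corrects the empty case, the sentinel leaks into the
    // buffer-index arithmetic as -1.
    std::printf("max_num_vertices_per_facet = %d\n", max_num_vertices_per_facet);
    return 0;
  }

That is only one possibility; the actual cause in graphbuild.cpp may well be different.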

I'll test again with MPI_IN_PLACE restored to confirm that dolfinx's max_num_vertices_per_facet=-1 is the true problem here.

Drew
