On Mon, 27 Feb 2017, Barna Becsek wrote:

> I hope this does not come too late.

That depends entirely on your deadlines, I fear.  ;-)

> But I finally got the debugger to work with some help of support on
> the Supercomputer that we are using. This is the complete stack:
>
> To recap the problem:

> We were reading in a mesh that was created outside of libmesh
> (rectilinear) and wanted to use that in the context of the MOOSE
> framework. Before calling the function prepare_for_use we wanted to
> make sure that the ghost elements for the neighbouring processes
> were found. This seemed to work fine when compiled and run in opt
> mode but caused the code to get stuck when done so in dbg mode (no
> crash, just a running but hung-up executable). So this is the
> complete stack of what is happening. It seems like MPI_Probe is
> blocking on one of the processes because, for some reason, it never
> receives its message.

> Have you ever come across something like that?

I can't perfectly parse that stack (it appears to have been generated
using an older version of mesh_tools.C), but it appears that the max()
is what's blocking, and the probe() is further ahead in the code.

I've never seen this happen, but I can guess one way it might: if the
processors don't agree on mesh.max_elem_id(), then whichever
processor(s) don't see the largest ID would exit that loop too early
and leave the others hanging.
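
To make that failure mode concrete, here is a toy model of it.  It
uses no MPI and is not libMesh code: id_type is a stand-in for
dof_id_type, the function names are hypothetical, and it assumes the
loop body performs exactly one collective call (such as max()) per
iteration.  It only counts how many collectives each rank would enter;
a mismatch in those counts is what leaves some ranks waiting forever.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for libMesh's dof_id_type (assumption for this sketch).
using id_type = std::size_t;

// True if every rank enters the same number of collective calls.
// If this is false, the ranks with more calls block waiting for the rest.
bool ranks_agree(const std::vector<id_type>& calls_per_rank)
{
  assert(!calls_per_rank.empty());
  return std::all_of(calls_per_rank.begin(), calls_per_rank.end(),
                     [&](id_type n) { return n == calls_per_rank.front(); });
}

// Each rank loops up to its own local bound, one collective per iteration.
// local_max_id[r] models what rank r would see from a purely local query.
std::vector<id_type> calls_with_local_bound(const std::vector<id_type>& local_max_id)
{
  return local_max_id;
}

// Each rank first agrees on the global maximum (one extra collective),
// then loops up to that shared bound.
std::vector<id_type> calls_with_global_bound(const std::vector<id_type>& local_max_id)
{
  assert(!local_max_id.empty());
  const id_type global_max =
    *std::max_element(local_max_id.begin(), local_max_id.end());
  return std::vector<id_type>(local_max_id.size(), global_max + 1);
}
```

With local bounds of, say, {10, 12, 12, 9} on four ranks, the first
variant has the ranks disagreeing on the number of collectives, while
the second has all of them making the same number of calls.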

Your mesh having an incorrect max_elem_id() right before
prepare_for_use() isn't a bug and won't affect a real run, since we
call update_parallel_id_counts() from within prepare_for_use(), after
which point every processor should know the correct max_elem_id().  It
is theoretically possible that this failure is *masking* a real bug,
though, so it's definitely worth fixing.

My suggestion would be to replace

  for (dof_id_type i=0; i != mesh.max_elem_id(); ++i)

with

  const dof_id_type max_id = mesh.parallel_max_elem_id();
  for (dof_id_type i=0; i != max_id; ++i)

in libmesh_assert_valid_neighbors() in mesh_tools.C.

If that works for you, then you can put in a PR, or just let me know
and I'll do so.


Sorry about the hassle!  This wasn't an intended use case of
libmesh_assert_valid_neighbors(), but it *should* have been a
supported use case.  I hate finding bugs in our debugging code.  I
refuse to write debugging-code-debugging code.
---
Roy

_______________________________________________
Libmesh-users mailing list
Libmesh-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libmesh-users
