Thank you so much for your help. I still have some issues/questions: I have verified that the problem is exactly what you said by checking how far the loop over i (up to mesh.max_elem_id()) gets before that process (rank 0) runs ahead of the other processes. It goes exactly as far as the local maximum element id; after that it exits the loop and leaves the others hanging. The same thing would eventually happen on all the processes, because only the highest-ranking process knows the absolute maximum element id of the mesh; all the others would exit prematurely.
To remedy this I called prepare_for_use() once additionally before gather_neighboring_elements() (to make sure that the mesh is set up as a distributed one) and then once after. This actually got me past the original problem, i.e. the above loop now runs just fine. Now I am facing the following: from the output it looks like a -1 integer is being cast in a templated function. The problem is, I cannot find the call to that function in the stack (it is the same for every process). Do you know what could cause this error?

Sincerely,
Barna

> On 27 Feb 2017, at 13:51, Roy Stogner <royst...@ices.utexas.edu> wrote:
>
> On Mon, 27 Feb 2017, Barna Becsek wrote:
>
>> I hope this does not come too late.
>
> That depends entirely on your deadlines, I fear. ;-)
>
>> But I finally got the debugger to work with some help of support on
>> the Supercomputer that we are using. This is the complete stack:
>>
>> To recap the problem:
>>
>> We were reading in a mesh that was created outside of libmesh
>> (rectilinear) and wanted to use that in the context of the MOOSE
>> framework. Before calling the function prepare_for_use we wanted to
>> make sure that the ghost elements for the neighbouring processes
>> were found. This seemed to work fine when compiled and run in opt
>> mode but caused the code to get stuck when done so in dbg mode (no
>> crash, just a running but hung-up executable). So this is the
>> complete stack of what is happening. It seems like MPI_Probe is
>> blocking the code for some reason that it is not getting its message
>> on one of the processes.
>>
>> Have you ever come across something like that?
>
> I can't perfectly parse that stack (it appears to have been generated
> using an older version of mesh_tools.C), but it appears that the max()
> is what's blocking, and the probe() is further ahead in the code.
>
> I've never seen this happen, but I can guess one way it might: if the
> processors don't agree on mesh.max_elem_id(), then whichever
> processor(s) don't see the largest ID would exit that loop too early
> and leave the others hanging.
>
> Your mesh having an incorrect max_elem_id() right before
> prepare_for_use() isn't a bug and won't affect a real run, since we
> update_parallel_id_counts() from within prepare_for_use(), after which
> point every processor should know the correct max_elem_id(). It is
> theoretically possible that this failure is *masking* a real bug,
> though, so it's definitely worth fixing.
>
> My suggestion would be to replace
>
>   for (dof_id_type i=0; i != mesh.max_elem_id(); ++i)
>
> with
>
>   const dof_id_type max_id = mesh.parallel_max_elem_id();
>   for (dof_id_type i=0; i != max_id; ++i)
>
> in libmesh_assert_valid_neighbors in mesh_tools.C
>
> and if that works for you then you can put in a PR, or just let me
> know and I'll do so.
>
> Sorry about the hassle! This wasn't an intended use case of
> libmesh_assert_valid_neighbors(), but it *should* have been a
> supported use case. I hate finding bugs in our debugging code. I
> refuse to write debugging-code-debugging code.
> ---
> Roy