Hmm - I'll look through that section of code tomorrow morning and see if there could possibly be any mismatched send/receives or anything.
-Ben

On Apr 9, 2013, at 8:48 PM, "Derek Gaston" <fried...@gmail.com> wrote:

> Hey guys,
>
> I've got a fairly large job (>3500 procs) that is hanging while trying to
> setup the mesh. The procs are in 2 separate places. ~Half of them are here:
>
> #35 0x00002b1746aea7f0 in
> libMesh::Parallel::Communicator::send_receive<Hilbert::HilbertIndices> ()
> from /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
> #36 0x00002b1746afc17b in void
> libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::element_iterator>(libMesh::MeshTools::BoundingBox
> const&, libMesh::MeshBase::element_iterator const&,
> libMesh::MeshBase::element_iterator const&, std::vector<unsigned int,
> std::allocator<unsigned int> >&) const () from
> /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
> #37 0x00002b1746c78528 in
> libMesh::Partitioner::partition_unpartitioned_elements(libMesh::MeshBase&,
> unsigned int) () from
> /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
> #38 0x00002b1746c7b3e5 in libMesh::Partitioner::partition(libMesh::MeshBase&,
> unsigned int) () from
> /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
> #39 0x00002b1746ad1d32 in libMesh::MeshBase::prepare_for_use(bool) () from
> /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
>
>
> And the other ~half are here:
>
> #6 0x00002ba5c95c8b90 in
> libMesh::Parallel::Communicator::send_receive<unsigned int> () from
> /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
> #7 0x00002ba5c95da2e2 in void
> libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::element_iterator>(libMesh::MeshTools::BoundingBox
> const&, libMesh::MeshBase::element_iterator const&,
> libMesh::MeshBase::element_iterator const&, std::vector<unsigned int,
> std::allocator<unsigned int> >&) const () from
> /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
> #8 0x00002ba5c9756528 in
> libMesh::Partitioner::partition_unpartitioned_elements(libMesh::MeshBase&,
> unsigned int) () from
> /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
> #9 0x00002ba5c97593e5 in libMesh::Partitioner::partition(libMesh::MeshBase&,
> unsigned int) () from
> /home/gastdr/projects/fission/libmesh/lib/libmesh_oprof.so.0
>
>
>
> Obviously they are in slightly different spots...
>
>
> Any ideas on what's going on here or where to start looking?
>
> I was intermittently getting weird errors around this point from mvapich so
> I've tried to switch to OpenMPI.... and it's hanging up here.
>
> The mesh itself isn't enormous... it's only about 1 million nodes or so.
> We've definitely done more than this before.
>
> Thanks in advance for any advice!
>
> Derek
> ------------------------------------------------------------------------------
> Precog is a next-generation analytics platform capable of advanced
> analytics on semi-structured data. The platform includes APIs for building
> apps and a phenomenal toolset for data science. Developers can use
> our toolset for easy data analysis & visualization. Get a free account!
> http://www2.precog.com/precogplatform/slashdotnewsletter
> _______________________________________________
> Libmesh-devel mailing list
> Libmesh-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/libmesh-devel