On Thu, 8 Nov 2012, Roy Stogner wrote:

> I'm seeing hangs on our BuildBot servers when running the
> "-online_mode 1" step of reduced_basis_ex6 with --n_threads=2 in the
> LIBMESH_OPTIONS.  That may just be because our BuildBot server is
> ridiculously overloaded, and I'll see if I can verify it manually when
> I get time, but before I dig into things it occurs to me that if you
> run multithreaded yourself then I can be more confident that I'm
> seeing a false positive.

It's definitely *not* a false positive - I managed to hang
reduced_basis_ex6 today manually while testing out a new laptop.

It's also not a threading issue; this was a 4-MPI-tasks,
1-thread-each run.

I'm not sure what the issue is, maybe something with recent I/O
changes?  Stack traces (from processes interrupted in some kind of
busy-loop) include three processes caught here:

(gdb) where
#0  0x00002b7b3612e034 in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so
#1  0x00002b7b2b40d46a in opal_progress () from /usr/lib/libopen-pal.so.0
#2  0x00002b7b1d953595 in ?? () from /usr/lib/libmpi.so.0
#3  0x00002b7b371df33f in ?? () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4  0x00002b7b1d968260 in PMPI_Allreduce () from /usr/lib/libmpi.so.0
#5  0x00002b7b2469066b in VecAssemblyBegin_MPI(_p_Vec*) () from /usr/lib/libpetsc.so.3.2
#6  0x00002b7b247baa53 in VecAssemblyBegin () from /usr/lib/libpetsc.so.3.2
#7  0x00002b7b1a2604df in libMesh::PetscVector<double>::close (this=0x2a8c7b0)
    at /home/roystgnr/libmesh/svn/include/libmesh/petsc_vector.h:910
#8  0x00002b7b1a407dd3 in libMesh::System::read_serialized_vector (this=this@entry=0x2a82970,
    io=..., vec=...) at src/systems/system_io.C:1333
#9  0x00002b7b1a408af8 in libMesh::System::read_serialized_data (this=0x2a82970, io=...,
    read_additional_data=<optimized out>) at src/systems/system_io.C:719
#10 0x00002b7b1a31d63b in libMesh::RBEvaluation::read_in_basis_functions (this=0x7fff9b94c490,
    sys=..., directory_name=..., read_binary_basis_functions=true)
    at src/reduced_basis/rb_evaluation.C:960
#11 0x000000000042c736 in main (argc=13, argv=0x7fff9b94c978) at reduced_basis_ex6.C:227

and one caught here:

#0  0x00002ba8cf0ebfb9 in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so
#1  0x00002ba8c43cb46a in opal_progress () from /usr/lib/libopen-pal.so.0
#2  0x00002ba8b6911595 in ?? () from /usr/lib/libmpi.so.0
#3  0x00002ba8d019d33f in ?? () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4  0x00002ba8b6926260 in PMPI_Allreduce () from /usr/lib/libmpi.so.0
#5  0x00002ba8b2b9088a in libMesh::Parallel::Communicator::min<unsigned long> (
    this=this@entry=0x2ba8b383c520 <libMesh::CommWorld>, r=@0x7fffc214a840: 1)
    at /home/roystgnr/libmesh/svn/include/libmesh/parallel_implementation.h:1148
#6  0x00002ba8b32571e3 in libMesh::Parallel::Communicator::verify<unsigned long> (
    this=this@entry=0x2ba8b383c520 <libMesh::CommWorld>, r=@0x7fffc214a898: 1)
    at /home/roystgnr/libmesh/svn/include/libmesh/parallel_implementation.h:1114
#7  0x00002ba8b32581bb in verify<unsigned long> (r=@0x7fffc214a898: 1,
    this=0x2ba8b383c520 <libMesh::CommWorld>)
    at /home/roystgnr/libmesh/svn/include/libmesh/parallel_implementation.h:1111
#8  libMesh::Parallel::Communicator::sum<unsigned int> (
    this=0x2ba8b383c520 <libMesh::CommWorld>, r=...)
    at /home/roystgnr/libmesh/svn/include/libmesh/parallel_implementation.h:1606
#9  0x00002ba8b33cdfc9 in sum<std::vector<unsigned int> > (comm=..., r=...)
    at /home/roystgnr/libmesh/svn/include/libmesh/parallel_implementation.h:775
#10 libMesh::System::read_serialized_blocked_dof_objects<libMesh::MeshBase::node_iterator> (
    this=this@entry=0x1c62d00, n_objects=n_objects@entry=6171, begin=..., end=..., io=...,
    vec=..., var_to_read=var_to_read@entry=1) at src/systems/system_io.C:969
#11 0x00002ba8b33c5bd5 in libMesh::System::read_serialized_vector (this=this@entry=0x1c62d00,
    io=..., vec=...) at src/systems/system_io.C:1305
#12 0x00002ba8b33c6af8 in libMesh::System::read_serialized_data (this=0x1c62d00, io=...,
    read_additional_data=<optimized out>) at src/systems/system_io.C:719
#13 0x00002ba8b32db63b in libMesh::RBEvaluation::read_in_basis_functions (this=0x7fffc214bf10,
    sys=..., directory_name=..., read_binary_basis_functions=true)
    at src/reduced_basis/rb_evaluation.C:960
#14 0x000000000042c736 in main (argc=13, argv=0x7fffc214c3f8) at reduced_basis_ex6.C:227

I can tell that I need to make Parallel::verify() more robust if
possible, but I haven't had time yet to figure out the underlying
problem that verify() (presumably a parallel_only() call) should have
caught.
---
Roy

_______________________________________________
Libmesh-devel mailing list
Libmesh-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libmesh-devel
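
Illustration of the consistency check discussed in the message above. The second trace shows verify() calling a min reduction, which suggests a check that every rank passed the same value. The sketch below mimics that idea without MPI: it is not libMesh's implementation, and the name all_ranks_agree and the use of a std::vector to stand in for the per-rank values are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, MPI-free stand-in for a parallel consistency check.
// Each element of `per_rank` plays the role of the value one rank passes in.
// A real parallel check would obtain the min and max via collective
// reductions instead of scanning a local vector.
template <typename T>
bool all_ranks_agree(const std::vector<T>& per_rank)
{
  if (per_rank.empty())
    return true; // nothing to disagree about

  // Find the smallest and largest values in one pass.
  const auto mm = std::minmax_element(per_rank.begin(), per_rank.end());

  // All values are equal exactly when the minimum is not less than the maximum.
  return !(*mm.first < *mm.second);
}
```

A check of this shape only compares values on ranks that all reach it. In the traces above, three ranks are inside VecAssemblyBegin while one is inside verify(), so a value comparison alone would not flag that situation.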