On Thu, 8 Nov 2012, Roy Stogner wrote:

> I'm seeing hangs on our BuildBot servers when running the
> "-online_mode 1" step of reduced_basis_ex6 with --n_threads=2 in the
> LIBMESH_OPTIONS.  That may just be because our BuildBot server is
> ridiculously overloaded, and I'll see if I can verify it manually when
> I get time, but before I dig into things it occurs to me that if you
> run multithreaded yourself then I can be more confident that I'm
> seeing a false positive.

It's definitely *not* a false positive - I managed to hang
reduced_basis_ex6 today manually while testing out a new laptop.  It's
also not a threading issue; this was a 4-MPI-tasks, 1-thread-each run.

I'm not sure what the issue is; maybe something in the recent I/O
changes?  Stack traces from the four processes (each interrupted in some
kind of busy-loop) show three of them caught here:

(gdb) where
#0  0x00002b7b3612e034 in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so
#1  0x00002b7b2b40d46a in opal_progress () from /usr/lib/libopen-pal.so.0
#2  0x00002b7b1d953595 in ?? () from /usr/lib/libmpi.so.0
#3  0x00002b7b371df33f in ?? () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4  0x00002b7b1d968260 in PMPI_Allreduce () from /usr/lib/libmpi.so.0
#5  0x00002b7b2469066b in VecAssemblyBegin_MPI(_p_Vec*) () from /usr/lib/libpetsc.so.3.2
#6  0x00002b7b247baa53 in VecAssemblyBegin () from /usr/lib/libpetsc.so.3.2
#7  0x00002b7b1a2604df in libMesh::PetscVector<double>::close (this=0x2a8c7b0)
     at /home/roystgnr/libmesh/svn/include/libmesh/petsc_vector.h:910
#8  0x00002b7b1a407dd3 in libMesh::System::read_serialized_vector (this=this@entry=0x2a82970,
     io=..., vec=...) at src/systems/system_io.C:1333
#9  0x00002b7b1a408af8 in libMesh::System::read_serialized_data (this=0x2a82970, io=...,
     read_additional_data=<optimized out>) at src/systems/system_io.C:719
#10 0x00002b7b1a31d63b in libMesh::RBEvaluation::read_in_basis_functions (this=0x7fff9b94c490,
     sys=..., directory_name=..., read_binary_basis_functions=true)
     at src/reduced_basis/rb_evaluation.C:960
#11 0x000000000042c736 in main (argc=13, argv=0x7fff9b94c978) at reduced_basis_ex6.C:227

and one caught here:

#0  0x00002ba8cf0ebfb9 in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so
#1  0x00002ba8c43cb46a in opal_progress () from /usr/lib/libopen-pal.so.0
#2  0x00002ba8b6911595 in ?? () from /usr/lib/libmpi.so.0
#3  0x00002ba8d019d33f in ?? () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4  0x00002ba8b6926260 in PMPI_Allreduce () from /usr/lib/libmpi.so.0
#5  0x00002ba8b2b9088a in libMesh::Parallel::Communicator::min<unsigned long> (
     this=this@entry=0x2ba8b383c520 <libMesh::CommWorld>, r=@0x7fffc214a840: 1)
     at /home/roystgnr/libmesh/svn/include/libmesh/parallel_implementation.h:1148
#6  0x00002ba8b32571e3 in libMesh::Parallel::Communicator::verify<unsigned long> (
     this=this@entry=0x2ba8b383c520 <libMesh::CommWorld>, r=@0x7fffc214a898: 1)
     at /home/roystgnr/libmesh/svn/include/libmesh/parallel_implementation.h:1114
#7  0x00002ba8b32581bb in verify<unsigned long> (r=@0x7fffc214a898: 1,
     this=0x2ba8b383c520 <libMesh::CommWorld>)
     at /home/roystgnr/libmesh/svn/include/libmesh/parallel_implementation.h:1111
#8  libMesh::Parallel::Communicator::sum<unsigned int> (this=0x2ba8b383c520 <libMesh::CommWorld>,
     r=...) at /home/roystgnr/libmesh/svn/include/libmesh/parallel_implementation.h:1606
#9  0x00002ba8b33cdfc9 in sum<std::vector<unsigned int> > (comm=..., r=...)
     at /home/roystgnr/libmesh/svn/include/libmesh/parallel_implementation.h:775
#10 libMesh::System::read_serialized_blocked_dof_objects<libMesh::MeshBase::node_iterator> (
     this=this@entry=0x1c62d00, n_objects=n_objects@entry=6171, begin=..., end=..., io=...,
     vec=..., var_to_read=var_to_read@entry=1) at src/systems/system_io.C:969
#11 0x00002ba8b33c5bd5 in libMesh::System::read_serialized_vector (this=this@entry=0x1c62d00,
     io=..., vec=...) at src/systems/system_io.C:1305
#12 0x00002ba8b33c6af8 in libMesh::System::read_serialized_data (this=0x1c62d00, io=...,
     read_additional_data=<optimized out>) at src/systems/system_io.C:719
#13 0x00002ba8b32db63b in libMesh::RBEvaluation::read_in_basis_functions (this=0x7fffc214bf10,
     sys=..., directory_name=..., read_binary_basis_functions=true)
     at src/reduced_basis/rb_evaluation.C:960
#14 0x000000000042c736 in main (argc=13, argv=0x7fffc214c3f8) at reduced_basis_ex6.C:227


I can tell that I need to make Parallel::verify() more robust if possible, but
I haven't had time yet to figure out the underlying problem that verify()
(presumably a parallel_only() call) should have caught.
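
The two traces are at least consistent with the ranks disagreeing on how
many collective calls the blocked read makes: three ranks are already in the
Allreduce under PetscVector::close() (system_io.C:1333), while the fourth is
still in an Allreduce under read_serialized_blocked_dof_objects
(system_io.C:969).  That cause is only a guess.  The sketch below is a toy
model with no MPI and no libMesh code in it; the names first_mismatch,
CallSequence and traces_as_observed are hypothetical, and the call sequences
are assumed for illustration rather than taken from the real loop.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model, no MPI: each rank is represented by the ordered list of
// collective call sites it would enter.  All names are hypothetical.
using CallSequence = std::vector<std::string>;

// Returns the index of the first collective at which some rank disagrees
// with rank 0 (including a rank running out of calls early), or -1 if
// every rank makes the same sequence of collective calls.
long first_mismatch(const std::vector<CallSequence>& ranks)
{
  assert(!ranks.empty());

  std::size_t longest = 0;
  for (const CallSequence& seq : ranks)
    if (seq.size() > longest)
      longest = seq.size();

  const CallSequence& ref = ranks[0];
  for (std::size_t i = 0; i < longest; ++i)
    for (const CallSequence& seq : ranks)
      {
        const bool ref_has = i < ref.size();
        const bool seq_has = i < seq.size();
        if (ref_has != seq_has || (ref_has && seq[i] != ref[i]))
          return static_cast<long>(i);
      }
  return -1;
}

// Assumed scenario, loosely modelled on the traces: three ranks have left
// the blocked read and reached close(), while one rank performs one more
// sum() first.  The exact sequences are an assumption.
std::vector<CallSequence> traces_as_observed()
{
  const CallSequence done  = {"sum", "close"};
  const CallSequence extra = {"sum", "sum", "close"};
  return {done, done, done, extra};
}
```

Calling first_mismatch(traces_as_observed()) reports the position of the
first collective at which the ranks part ways.  In a real run nothing
reports it, and every rank simply waits in its own Allreduce, which is what
the backtraces show.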
---
Roy
