The threading of the sparsity pattern is a bigger deal with AMR, where the
pattern gets rebuilt many times, but in any case we could add a
--single-threaded-sparsity option or something like it as an immediate
stopgap (roughly the sketch below).
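Something like this inside DofMap::compute_sparsity(), say (a rough sketch,
not a patch: the flag name is made up, and "sp" / "elem_range" stand for the
existing SparsityPattern::Build functor and element range):

    // Hypothetical stopgap: the flag name is invented, and sp/elem_range
    // stand for the existing build functor and ConstElemRange.
    if (libMesh::on_command_line ("--single-threaded-sparsity"))
      sp (elem_range);                           // whole range, one thread
    else
      Threads::parallel_reduce (elem_range, sp); // current threaded path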
And the Hilbert indexing is used to derive a globally unique,
partition-agnostic node number. But as you say, it is not working out well?
It actually does the same thing with element centroids, which, judging from
the stack trace, is where I think you are.
There could be an issue with not enough elements per processor... This one
could be boiled down to a stand-alone test by taking your mesh and calling
find_global_indices() directly, maybe along the lines of the sketch below?
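(Rough sketch only; the mesh file name is a placeholder, and read() itself
may invoke the partitioner and hence this same code path, so the reproducer
may need partitioning disabled on read first.)

    // Stand-alone reproducer sketch.  "your_mesh.e" is a placeholder;
    // the find_global_indices() call mirrors what the Partitioner does.
    #include "libmesh.h"
    #include "mesh.h"
    #include "mesh_communication.h"
    #include "mesh_tools.h"

    using namespace libMesh;

    int main (int argc, char ** argv)
    {
      LibMeshInit init (argc, argv);

      Mesh mesh;
      mesh.read ("your_mesh.e"); // placeholder; may need partitioning
                                 // disabled so read() doesn't hang first

      MeshCommunication mc;
      std::vector<unsigned int> index_map;

      // The call from frame #9 of the trace, isolated from everything else.
      mc.find_global_indices (MeshTools::bounding_box (mesh),
                              mesh.elements_begin (),
                              mesh.elements_end (),
                              index_map);

      return 0;
    }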
On Feb 8, 2012, at 10:17 AM, "Derek Gaston" <fried...@gmail.com> wrote:
The symptom with the sparsity pattern was that it would hang forever in
DofMap::dof_indices(), i.e. all threads would be sitting in either operator
new or operator delete for a std::vector<int> inside DofMap::dof_indices().
Then, if it ever got past that, a few threads would still end up hung in a
weird state in this parallel loop. I'll see if I can reproduce some of the
backtraces for you.
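To illustrate the pattern (my sketch, not the actual libMesh source):

    // Sketch of the contention: every dof_indices() call allocates and
    // frees std::vector<int> scratch space, so all the threads running
    // this loop serialize on the allocator's lock.  'dof_map' stands for
    // the reference the build functor holds.
    void operator() (const ConstElemRange & range)
    {
      std::vector<unsigned int> di; // reused across elements on this thread

      for (ConstElemRange::const_iterator it = range.begin();
           it != range.end(); ++it)
        {
          // Even with 'di' hoisted out of the loop, dof_indices() still
          // builds internal temporaries on every call: that's the hot spot.
          dof_map.dof_indices (*it, di);
          // ... couple these dofs into the sparsity pattern ...
        }
    }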
BTW - I don't think that threading this part of the calculation is super
critical ;-) Even when running with 100 million DoFs, getting through this
section of code in serial never takes all that long (I mean, it takes
time... but overall it's not a big deal compared to the actual solve).
Also: I ran into another issue. When I spread the problem across 8000+ MPI
processes, the code gets stuck here:
#0  0x00002b90da86b9e9 in btl_openib_component_progress () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#1  0x00002b90daf22ef6 in opal_progress () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libopen-pal.so.0
#2  0x00002b90da8200c4 in ompi_request_default_wait_all () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#3  0x00002b90da8909ee in ompi_coll_tuned_sendrecv_actual () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#4  0x00002b90da898716 in ompi_coll_tuned_allgather_intra_bruck () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#5  0x00002b90da891439 in ompi_coll_tuned_allgather_intra_dec_fixed () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#6  0x00002b90da8387e6 in PMPI_Allgather () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#7  0x00002b90d4e02268 in libMesh::Parallel::Sort<Hilbert::HilbertIndices>::communicate_bins() () from /home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#8  0x00002b90d4e014ee in libMesh::Parallel::Sort<Hilbert::HilbertIndices>::sort() () from /home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#9  0x00002b90d4c8f798 in void libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::const_element_iterator>(libMesh::MeshTools::BoundingBox const&, libMesh::MeshBase::const_element_iterator const&, libMesh::MeshBase::const_element_iterator const&, std::vector<unsigned int, std::allocator<unsigned int> >&) const () from /home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#10 0x00002b90d4e0eaea in libMesh::Partitioner::partition_unpartitioned_elements(libMesh::MeshBase&, unsigned int) () from /home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#11 0x00002b90d4e0e38b in libMesh::Partitioner::partition(libMesh::MeshBase&, unsigned int) () from /home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#12 0x00002b90d4c76e2c in libMesh::MeshBase::partition(unsigned int) () from /home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#13 0x00002b90d4c76db8 in libMesh::MeshBase::prepare_for_use(bool) () from /home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#14 0x00002b90d4c9d5f6 in libMesh::MeshTools::Generation::build_cube(libMesh::UnstructuredMesh&, unsigned int, unsigned int, unsigned int, double, double, double, double, double, double, libMeshEnums::ElemType, bool) () from /home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
Looks like it's trying to find global node numbers... but it's not working
out well. The code never made it past here in 2 hours of runtime. One thing
that is interesting is that I only have ~90,000 nodes, so across 8000+
processes that's only about 11 nodes per process. Do you think it could just
be a problem with trying to spread the mesh out too thin? Also, I thought
this Hilbert stuff only ran for ParallelMesh... I'm using SerialMesh here.
Any input on this would be awesome.
Derek
On Wed, Feb 8, 2012 at 4:21 AM, Kirk, Benjamin (JSC-EG311)
<benjamin.kir...@nasa.gov> wrote:
Excellent. What was the symptom in the sparsity pattern?
I'll be flying most of the day, hopefully that'll provide me some time to stare
at this....
-Ben
On Feb 8, 2012, at 3:16 AM, "Derek Gaston" <fried...@gmail.com> wrote:
Win. Between fixing the localize() issue I brought up earlier and
de-threading SparsityPattern::Build(), my job now ran! Hopefully the others
will continue to run as well.
Derek
On Wed, Feb 8, 2012 at 1:07 AM, Derek Gaston <fried...@gmail.com> wrote:
Continuing my "huge run" witch hunt... I have really big runs that are hanging
at mutexes during threaded execution of SparsityPattern::Build().
The main issue seems to stem from DofMap::dof_indices(). Everything is
getting hung up allocating / deallocating std::vector<int> objects (those
memory operations take a mutex). I see that a few of these were added to
support SCALAR variables. I don't have any SCALAR variables in this
simulation... and in that situation there shouldn't be any overhead for
adding indices for SCALAR variables. I think all of the SCALAR variable
stuff could be moved into one small portion of that function and guarded by
the number of SCALAR variables in the system, roughly like the sketch below.
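(Rough sketch, not a patch; the function name is made up and the elided
parts just show the shape of the guard.)

    // Sketch: count the SCALAR variables once, and only enter the SCALAR
    // bookkeeping when that count is nonzero, so the common case pays
    // nothing extra.  dof_indices_sketch is a made-up name.
    void dof_indices_sketch (const System & system,
                             const Elem * elem,
                             std::vector<unsigned int> & di)
    {
      unsigned int n_scalar_vars = 0;
      for (unsigned int v = 0; v < system.n_vars (); ++v)
        if (system.variable_type (v).family == SCALAR)
          ++n_scalar_vars;

      // ... fill di for the ordinary (non-SCALAR) variables on elem ...

      if (n_scalar_vars)
        {
          // All the SCALAR-specific temporaries live only in this branch.
          // ... append the SCALAR dof indices to di ...
        }
    }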
I have to say: DofMap::dof_indices() has been showing up in my profiling
studies for a while (even on small, workstation-sized jobs) but I haven't
had a chance to look at it.
I'm going to take an intensive look at this function soon (maybe tomorrow),
but it's 1 AM right now, so I'm just going to turn off threading of this
section altogether and see if I can get these jobs to go through.
I just thought I would point this out in case anyone else wanted to check it
out or provide opinions...
Derek