Threading the sparsity pattern build is a bigger deal with AMR, when it is done
a lot, but in any case we could add a --single-threaded-sparsity option or
something as an immediate stopgap.

And the Hilbert indexing is used to derive a globally unique, partition-agnostic
node number. But as you say, it is not working out well?

It actually does the same thing with element centroids, which I think is where
you are, based on the stack trace.
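(For the archives, here's a toy illustration of the idea, in 2D only; it is not the actual 3D Hilbert library libMesh links against, just the classic rotate-and-fold algorithm, so treat it as a sketch of the concept rather than our implementation:)

```cpp
#include <cstdint>
#include <utility>

// Toy 2D Hilbert index on an n x n grid (n a power of two).  libMesh
// does the analogous thing in 3D with element centroids discretized
// into the mesh bounding box.
static void rot (uint32_t n, uint32_t & x, uint32_t & y,
                 uint32_t rx, uint32_t ry)
{
  if (ry == 0)
    {
      if (rx == 1)
        {
          // Reflect the quadrant before rotating.
          x = n - 1 - x;
          y = n - 1 - y;
        }
      std::swap (x, y);
    }
}

// Distance along the Hilbert curve of cell (x,y).  Nearby cells get
// nearby indices, and the result depends only on the coordinates --
// not on which processor computed it, which is what makes it a
// partition-agnostic sort key.
uint64_t hilbert_index (uint32_t n, uint32_t x, uint32_t y)
{
  uint64_t d = 0;
  for (uint32_t s = n / 2; s > 0; s /= 2)
    {
      const uint32_t rx = (x & s) ? 1 : 0;
      const uint32_t ry = (y & s) ? 1 : 0;
      d += static_cast<uint64_t>(s) * s * ((3 * rx) ^ ry);
      rot (n, x, y, rx, ry);
    }
  return d;
}
```

Sorting elements by that index, then splitting the sorted list evenly, is what hands every processor an index range independent of the initial partitioning.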

There could be an issue with not enough elements per processor...  This one
could be boiled down to a stand-alone test by taking your mesh and calling
find_global_indices directly?
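Something like this, maybe (untested; the find_global_indices and build_cube signatures are lifted from your backtrace, but the headers and the bounding-box helper are from memory, so adjust as needed):

```cpp
// Hypothetical stand-alone reproducer -- compile against libMesh.
#include "libmesh.h"
#include "serial_mesh.h"
#include "mesh_generation.h"
#include "mesh_communication.h"
#include "mesh_tools.h"

int main (int argc, char ** argv)
{
  libMesh::LibMeshInit init (argc, argv);

  libMesh::SerialMesh mesh;
  // ...or mesh.read(...) with your actual problem mesh.
  libMesh::MeshTools::Generation::build_cube (mesh, 45, 45, 45);

  // Call find_global_indices directly, bypassing the partitioner,
  // to see if it hangs at scale on its own.
  const libMesh::MeshBase & cmesh = mesh;
  std::vector<unsigned int> index_map;
  libMesh::MeshCommunication().find_global_indices
    (libMesh::MeshTools::bounding_box (mesh),
     cmesh.elements_begin(), cmesh.elements_end(),
     index_map);

  return 0;
}
```

Run that at 8000+ processes and we'd know whether the hang is in the parallel sort itself or in something the partitioner does around it.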


On Feb 8, 2012, at 10:17 AM, "Derek Gaston" <fried...@gmail.com> wrote:

The symptom with the sparsity pattern was that it would hang forever in
DofMap::dof_indices(), i.e. all threads would be sitting in either operator new
or operator delete for a std::vector<int> inside DofMap::dof_indices().  Then,
after taking forever on that, if it ever got past it, a few threads were still
hung in a weird state in this parallel loop.  I'll see if I can reproduce some
of the backtraces for you.

BTW - I don't think that threading this part of the calculation is super
critical ;-)  Even when running with 100 million DoFs, running through this
section of code in serial never takes all that long (I mean, it takes time...
but overall it's not a big deal compared to the actual solve).

Also: I ran into another issue.  When I've spread the problem to 8000+ MPI
processes, the code is getting stuck here:

#0  0x00002b90da86b9e9 in btl_openib_component_progress () from 
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#1  0x00002b90daf22ef6 in opal_progress () from 
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libopen-pal.so.0
#2  0x00002b90da8200c4 in ompi_request_default_wait_all () from 
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#3  0x00002b90da8909ee in ompi_coll_tuned_sendrecv_actual () from 
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#4  0x00002b90da898716 in ompi_coll_tuned_allgather_intra_bruck () from 
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#5  0x00002b90da891439 in ompi_coll_tuned_allgather_intra_dec_fixed () from 
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#6  0x00002b90da8387e6 in PMPI_Allgather () from 
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#7  0x00002b90d4e02268 in 
libMesh::Parallel::Sort<Hilbert::HilbertIndices>::communicate_bins() () from 
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#8  0x00002b90d4e014ee in 
libMesh::Parallel::Sort<Hilbert::HilbertIndices>::sort() () from 
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#9  0x00002b90d4c8f798 in void 
libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::const_element_iterator>(libMesh::MeshTools::BoundingBox
 const&, libMesh::MeshBase::const_element_iterator const&, 
libMesh::MeshBase::const_element_iterator const&, std::vector<unsigned int, 
std::allocator<unsigned int> >&) const () from 
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#10 0x00002b90d4e0eaea in 
libMesh::Partitioner::partition_unpartitioned_elements(libMesh::MeshBase&, 
unsigned int) () from 
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#11 0x00002b90d4e0e38b in libMesh::Partitioner::partition(libMesh::MeshBase&, 
unsigned int) () from 
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#12 0x00002b90d4c76e2c in libMesh::MeshBase::partition(unsigned int) () from 
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#13 0x00002b90d4c76db8 in libMesh::MeshBase::prepare_for_use(bool) () from 
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#14 0x00002b90d4c9d5f6 in 
libMesh::MeshTools::Generation::build_cube(libMesh::UnstructuredMesh&, unsigned 
int, unsigned int, unsigned int, double, double, double, double, double, 
double, libMeshEnums::ElemType, bool) ()
   from 
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so



Looks like it's trying to find global node numbers... but it's not working out
well.  The code never made it past here in 2 hours of runtime.  One thing that
is interesting is that I only have ~90,000 nodes.  Do you think it could just
be a problem with trying to spread the mesh out too much?  Also, I thought this
Hilbert stuff only ran for ParallelMesh... I'm using SerialMesh here.

Any input on this would be awesome.

Derek

On Wed, Feb 8, 2012 at 4:21 AM, Kirk, Benjamin (JSC-EG311)
<benjamin.kir...@nasa.gov> wrote:
Excellent. What was the symptom in the sparsity pattern?

I'll be flying most of the day, hopefully that'll provide me some time to stare 
at this....

-Ben


On Feb 8, 2012, at 3:16 AM, "Derek Gaston" <fried...@gmail.com> wrote:

Win.  Between fixing the localize() issue I brought up earlier and de-threading
SparsityPattern::Build(), my job now ran!  Hopefully the others will continue
to run as well.

Derek

On Wed, Feb 8, 2012 at 1:07 AM, Derek Gaston <fried...@gmail.com> wrote:
Continuing my "huge run" witch hunt... I have really big runs that are hanging 
at mutexes during threaded execution of SparsityPattern::Build().

The main issue seems to stem from DofMap::dof_indices().  Everything is getting
hung up around allocating / deallocating std::vector<int> objects (memory
operations require mutexes).  I see that a few of these were added for support
of SCALAR variables.  I don't have any SCALAR variables in this simulation...
and in that situation there shouldn't be any overhead for adding indices for
SCALAR variables.  I think all of the SCALAR variable stuff could be moved into
one small portion of that function and guarded by the number of SCALAR
variables in the system.
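Something like this is what I have in mind (toy stand-in types, not the real DofMap; the struct and names are made up just to show the allocation/guard pattern):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a system's variable bookkeeping.
struct System
{
  std::size_t n_scalar_vars = 0;   // zero when there are no SCALARs
  std::size_t n_local_dofs  = 4;
};

// Fill a caller-owned buffer instead of building temporaries inside.
// clear() keeps capacity, so once the buffer has grown to its working
// size a steady-state assembly loop does no heap allocation at all --
// and therefore takes no allocator lock under threading.
void dof_indices (const System & sys, std::size_t elem_id,
                  std::vector<unsigned int> & di)
{
  di.clear();
  for (std::size_t i = 0; i != sys.n_local_dofs; ++i)
    di.push_back (static_cast<unsigned int>(elem_id * sys.n_local_dofs + i));

  // Guard the SCALAR-variable bookkeeping so systems without any
  // SCALAR variables pay nothing for it.
  if (sys.n_scalar_vars)
    {
      // ... SCALAR index handling would go here ...
    }
}
```

With that shape, a caller keeps one vector per thread and reuses it across elements, so the mutex contention we're seeing around operator new / operator delete just goes away.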

I have to say: DofMap::dof_indices() has been showing up in my profiling
studies for a while (even on small, workstation-sized jobs) but I haven't had a
chance to look at it.

I'm going to take an intensive look at this function soon (maybe tomorrow), but
it's 1 AM right now, so I'm just going to turn off threading of this section
altogether and see if I can get these jobs to go through.

I just thought I would point this out in case anyone else wanted to check it 
out or provide opinions...

Derek

_______________________________________________
Libmesh-devel mailing list
Libmesh-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libmesh-devel

