The symptom with the sparsity pattern was that it would hang forever in
DofMap::dof_indices(), i.e. all threads would be sitting in either
operator new or operator delete for a std::vector<int> inside
DofMap::dof_indices().  Then, if it ever got past that, a few threads
were still hung in a weird state in this parallel loop.  I'll see if I can
reproduce some of the backtraces for you.
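
A minimal sketch of the kind of change suggested in the earlier message
quoted below (keep the SCALAR handling in one guarded spot, and avoid
per-call allocations).  This is NOT the libMesh code: ToyDofMap, its members
and the toy index numbering are all hypothetical, just to illustrate the
pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a DoF map; NOT the libMesh DofMap API.
struct ToyDofMap
{
  std::size_t n_field_dofs_per_elem;      // DoFs from ordinary field variables
  std::vector<unsigned int> scalar_dofs;  // SCALAR-variable DoFs (often empty)

  // Fill 'di' with the indices for element 'elem'.  'di' is owned by the
  // caller, so a thread that reuses it across elements stops hitting the
  // allocator once the vector's capacity is large enough.
  void dof_indices(std::size_t elem, std::vector<unsigned int>& di) const
  {
    di.clear();  // drops the contents but keeps the capacity

    // Toy numbering: each element owns a contiguous block of indices.
    for (std::size_t i = 0; i < n_field_dofs_per_elem; ++i)
      di.push_back(static_cast<unsigned int>(elem * n_field_dofs_per_elem + i));

    // Guard: all SCALAR handling lives here and is skipped entirely
    // (no temporaries created) when there are no SCALAR variables.
    if (!scalar_dofs.empty())
      di.insert(di.end(), scalar_dofs.begin(), scalar_dofs.end());
  }
};
```

Each thread would keep one std::vector<unsigned int> of its own and pass it
to every call, so in the common no-SCALAR case the loop body should not need
to touch operator new or operator delete after the first few elements.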

BTW - I don't think that threading this part of the calculation is super
critical ;-)  Even when running with 100 Million DoFs, running through this
section of code in serial never takes all that long (I mean, it takes
time... but overall not a big deal compared to the actual solve).

Also: I ran into another issue.  When I spread the problem across 8000+ MPI
processes, the code gets stuck here:

#0  0x00002b90da86b9e9 in btl_openib_component_progress () from
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#1  0x00002b90daf22ef6 in opal_progress () from
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libopen-pal.so.0
#2  0x00002b90da8200c4 in ompi_request_default_wait_all () from
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#3  0x00002b90da8909ee in ompi_coll_tuned_sendrecv_actual () from
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#4  0x00002b90da898716 in ompi_coll_tuned_allgather_intra_bruck () from
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#5  0x00002b90da891439 in ompi_coll_tuned_allgather_intra_dec_fixed () from
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#6  0x00002b90da8387e6 in PMPI_Allgather () from
/apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
#7  0x00002b90d4e02268 in
libMesh::Parallel::Sort<Hilbert::HilbertIndices>::communicate_bins() ()
from
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#8  0x00002b90d4e014ee in
libMesh::Parallel::Sort<Hilbert::HilbertIndices>::sort() () from
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#9  0x00002b90d4c8f798 in void
libMesh::MeshCommunication::find_global_indices<libMesh::MeshBase::const_element_iterator>(libMesh::MeshTools::BoundingBox
const&, libMesh::MeshBase::const_element_iterator const&,
libMesh::MeshBase::const_element_iterator const&, std::vector<unsigned int,
std::allocator<unsigned int> >&) const () from
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#10 0x00002b90d4e0eaea in
libMesh::Partitioner::partition_unpartitioned_elements(libMesh::MeshBase&,
unsigned int) () from
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#11 0x00002b90d4e0e38b in
libMesh::Partitioner::partition(libMesh::MeshBase&, unsigned int) () from
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#12 0x00002b90d4c76e2c in libMesh::MeshBase::partition(unsigned int) ()
from
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#13 0x00002b90d4c76db8 in libMesh::MeshBase::prepare_for_use(bool) () from
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so
#14 0x00002b90d4c9d5f6 in
libMesh::MeshTools::Generation::build_cube(libMesh::UnstructuredMesh&,
unsigned int, unsigned int, unsigned int, double, double, double, double,
double, double, libMeshEnums::ElemType, bool) ()
   from
/home/gastdr/projects/fission/herd_trunk/libmesh/lib/x86_64-unknown-linux-gnu_opt/libmesh.so



Looks like it's trying to find global node numbers... but it's not working
out well.  The code never made it past here in 2 hours of runtime.  One
thing that is interesting is that I only have ~90,000 nodes (about 11 per
MPI process).  Do you think it could just be a problem with trying to
spread the mesh too thinly?  Also, I thought this Hilbert stuff only ran
for ParallelMesh... I'm using SerialMesh here.

Any input on this would be awesome.

Derek

On Wed, Feb 8, 2012 at 4:21 AM, Kirk, Benjamin (JSC-EG311) <
benjamin.kir...@nasa.gov> wrote:

> Excellent. What was the symptom in the sparsity pattern?
>
> I'll be flying most of the day, hopefully that'll provide me some time to
> stare at this....
>
> -Ben
>
>
> On Feb 8, 2012, at 3:16 AM, "Derek Gaston" <fried...@gmail.com> wrote:
>
> Win.  Between fixing the localize() issue I brought up earlier and
> de-threading SparsityPattern::Build() my job now ran!  Hopefully the others
> will continue to run as well.
>
> Derek
>
> On Wed, Feb 8, 2012 at 1:07 AM, Derek Gaston <fried...@gmail.com> wrote:
>
>> Continuing my "huge run" witch hunt... I have really big runs that are
>> hanging at mutexes during threaded execution of SparsityPattern::Build().
>>
>> The main issue seems to stem from DofMap::dof_indices().  Everything is
>> getting hung around allocating / deallocating std::vector<int> objects
>> (memory operations require mutexes).  I see that there were a few added for
>> support of SCALAR variables.  I don't have any SCALAR variables in this
>> simulation... and in that situation there shouldn't be any overhead for
>> adding indices for SCALAR variables.  I think all of the scalar variable
>> stuff could be moved off into one small portion of that function and
>> guarded by the number of scalar variables in the system.
>>
>> I have to say: DofMap::dof_indices() has been showing up in my profiling
>> studies for a while (even on small workstation sized jobs) but I haven't
>> had a chance to look at it.
>>
>> I'm going to take an intensive look at this function soon (maybe
>> tomorrow) but it's 1AM right now and I'm just going to turn off threading
>> of this section altogether and see if I can get these jobs to go through.
>>
>> I just thought I would point this out in case anyone else wanted to check
>> it out or provide opinions...
>>
>> Derek
>>
>
>
> _______________________________________________
> Libmesh-devel mailing list
> Libmesh-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/libmesh-devel
>
>