Presumably, this overestimation happens only at the "boundary" nodes i that
are contained in elements living on other MPI ranks? Those foreign ranks
will count couplings (edges) i-j that their elements share with the
elements on rank p that owns i. Since only edge counts are communicated
back to p, there is no way to eliminate these duplicates. Could we build a
full sparsity pattern for such nodes _only_? That way the memory issue
stays under control, yet the duplicates would be eliminated. You would,
however, need to communicate the edges rather than their counts.
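
Roughly what I have in mind, as a sketch only (made-up container names,
not existing libMesh data structures): keep exact column sets for
boundary rows and plain counts everywhere else, and take a set union of
whatever edges the foreign ranks send for the rows we own.

  #include <cstddef>
  #include <map>
  #include <set>
  #include <vector>

  using dof_id = std::size_t;

  struct HybridSparsity
  {
    std::map<dof_id, std::set<dof_id>> boundary_cols; // exact pattern, boundary rows only
    std::map<dof_id, std::size_t>      interior_nnz;  // a count suffices for interior rows

    // Merge column indices (local or received from another rank) for a
    // boundary row we own; the set union drops duplicated edges for free.
    void merge_edges(dof_id row, const std::vector<dof_id> & cols)
    {
      boundary_cols[row].insert(cols.begin(), cols.end());
    }

    std::size_t nnz(dof_id row) const
    {
      auto bit = boundary_cols.find(row);
      if (bit != boundary_cols.end())
        return bit->second.size();
      auto iit = interior_nnz.find(row);
      return (iit != interior_nnz.end()) ? iit->second : 0;
    }
  };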

Dmitry.

On Wed, Nov 26, 2014, 22:34 Derek Gaston <fried...@gmail.com> wrote:

Ben Spencer (copied on this email) pointed me to a problem he was having
today with some of our sparsity pattern augmentation stuff.  It was causing
PETSc to error out saying that the number of nonzeros requested for a row
was more than the number of entries that row can hold on that processor.
The weird part is that this didn't happen if he just ran on one processor...

Thinking that the problem was in our code (I believed we might have been
double counting somewhere), I started tracing it tonight... and what I
found is that libMesh is grossly overestimating the number of nonzeros per
row when running in parallel.  And since our code is set up to believe that
libMesh is producing the "perfect" number of nonzeros per row, we are
blindly adding to an already inflated number, which pushes us past the size
of the row...

Here's what's happening when running on 8 processors (for this one DoF that
I'm tracing... which is #168):

1.  DofMap::operator() is computing the correct number of nonzeros (in this
case 60).

I'm taking 60 as the correct number because that's the final number for
this DoF when it's run in serial (I haven't actually dug in to see which
DoF this is and manually computed the sparsity pattern... yet).

Judging by this, it seems that all of the dofs connected to #168 must be
local (again, not completely verified... but the fact that
DofMap::operator() comes up with the same number as the final serial number
is a good indicator).

2.  SparsityPattern::Build::parallel_sync() totally screws up.

Putting print statements around line 3076 in dof_map.C, I can see that n_nz
for #168 goes up to 117! Even worse... n_oz ALSO goes up to 117!


Remember: the correct number for n_nz + n_oz should be _60_.  Instead we
are going to tell PETSc to set aside almost 4x as much memory for that row
as is necessary (117 + 117 = 234 versus 60).

3.  n_nz and n_oz get chopped down around line 3076 in dof_map.C...

n_nz gets chopped so that it's min(n_nz, n_dofs_on_proc).  In my case,
n_dofs_on_proc on that processor is 108, so n_nz gets set to that.

n_oz gets chopped so that it's min(n_oz, dofs_not_on_proc); you can see
that that could be _very_ bad!  (A toy illustration of this
overcount-then-clamp behavior is sketched after step 4.)

4.  Now the (overestimated) n_nz and n_oz get passed to MOOSE for
modification, and we start adding to n_nz/n_oz for dof couplings that
libMesh definitely didn't know about... but since n_nz is already sitting
at the max possible, we blow past the number of dofs on this proc and then
PETSc errors out (like it should).
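
To make the failure mode concrete, here is a toy, self-contained
illustration (made-up column sets and names, not the actual dof_map.C
code) of how summing per-rank counts for a shared row overcounts, and why
the min() clamp doesn't really fix it:

  #include <algorithm>
  #include <cstddef>
  #include <iostream>
  #include <set>
  #include <vector>

  int main()
  {
    // Made-up column indices contributed by three ranks for one shared row.
    std::vector<std::set<std::size_t>> per_rank_cols =
      {{1, 2, 3, 4, 5}, {3, 4, 5, 6, 7}, {5, 6, 7, 8, 9}};

    std::size_t summed_counts = 0;
    std::set<std::size_t> true_union;
    for (const auto & cols : per_rank_cols)
      {
        summed_counts += cols.size();                // summing counts (my reading of what happens now)
        true_union.insert(cols.begin(), cols.end()); // unioning edges gives the true row length
      }

    const std::size_t n_dofs_on_proc = 108; // the value from the 8-processor run above

    std::cout << "summed counts = " << summed_counts << "\n"     // 15: shared edges counted repeatedly
              << "true row nnz  = " << true_union.size() << "\n" // 9
              << "after min()   = " << std::min(summed_counts, n_dofs_on_proc)
              << std::endl;                // still 15: the clamp only kicks in once the sum is huge
    return 0;
  }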

So... my question is this: is this really the best "estimate" we can do in
this case?

This is a tiny problem in 3D with only 3 variables.  It will be MUCH worse
if you have, say, 2000 variables... you could be telling PETSc to allocate
ENORMOUS chunks of memory that are unnecessary.  I know that PETSc could
throw a bunch of that memory away after the first fill... but we don't
allow that in MOOSE because we are often pre-allocating for future
connections.  But even if you were to let it do that, there could be a HUGE
memory spike at the beginning until PETSc frees up a bunch of memory.

It seems like this code is currently a worst-case estimate of what could
happen.  It _does_ look like it might be better if we built the full
sparsity pattern... but that has its own memory problems.


Also... it looks like there is a lot more parallel communication going on
here than necessary.  We're sending large vectors of information from proc
to proc... even in the case where we're not building a full sparsity
pattern.  It seems like each processor could just send a minimal message of
"hey, I have this many dofs that are connected to these rows you own"...
i.e. one scalar instead of a bunch of entries.
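
Something like this is the message shape I'm picturing (purely a sketch
with invented names, no real libMesh or MPI calls): one (row, count) pair
per remote-owned row a rank touches.

  #include <cstddef>
  #include <map>
  #include <utility>
  #include <vector>

  using dof_id = std::size_t;

  // What a rank would accumulate locally for rows owned by someone else:
  // row id -> how many couplings this rank's elements see for that row.
  using foreign_row_counts = std::map<dof_id, std::size_t>;

  // Flatten into a buffer of (row, count) pairs destined for the owning
  // rank: two integers per touched row instead of a vector of column ids.
  std::vector<std::pair<dof_id, std::size_t>>
  pack_for_owner(const foreign_row_counts & counts)
  {
    std::vector<std::pair<dof_id, std::size_t>> buf;
    buf.reserve(counts.size());
    for (const auto & rc : counts)
      buf.emplace_back(rc.first, rc.second);
    return buf;
  }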

So... should I take a stab at redoing some of this code?  I think it's
possible to get a much better estimate, and to do so with much less
parallel communication.  I probably wouldn't mess with the code that builds
the full sparsity pattern... I would just remove the "non" full sparsity
pattern code and make a different function that gets called if you're not
building a full sparsity pattern.  That probably should be done either way
(look at the huge "if" with duplicated code for each case in
DofMap::operator()).

Or do one of you guys see a quick fix that does something better?

(Oh - BTW, I'm going to implement the same use of min() in MOOSE's sparsity
pattern augmentation stuff to get us through for now - so this isn't
necessarily time-sensitive.)
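
In case it's useful, roughly the stopgap I mean on our side (a
hypothetical helper, not actual MOOSE code): after augmenting the per-row
estimates, clamp them so PETSc is never asked for more nonzeros than a row
can actually hold.

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  void clamp_row_estimates(std::vector<std::size_t> & n_nz,
                           std::vector<std::size_t> & n_oz,
                           std::size_t n_dofs_on_proc,
                           std::size_t n_dofs_total)
  {
    const std::size_t max_off_proc = n_dofs_total - n_dofs_on_proc;
    for (std::size_t i = 0; i < n_nz.size(); ++i)
      {
        n_nz[i] = std::min(n_nz[i], n_dofs_on_proc); // on-diagonal block width
        n_oz[i] = std::min(n_oz[i], max_off_proc);   // off-diagonal block width
      }
  }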

Derek
