I've mentioned this a few times in passing - but now I have concrete
evidence: Outputting the solution takes a TON of memory... way more than it
should.  This is something we've seen for a very long time - and it's one
of the main reasons why many of our parallel jobs die...

Firstly, here's the graph:

https://drive.google.com/file/d/0B1YTFvojHetuaXlFd29RdkNqTUE/edit?usp=sharing

This is a run with 2 systems in it - the first one has 40 variables and
totals about 25 million DoFs... the second one has two variables and
comes out to a little over 1 million DoFs.  This job is spread out
across 160 MPI processes (and we're going to be looking at the aggregate
memory across all of them).

The two lines you're seeing are for the exact same run - but the green one
is doing output (Exodus in this case - although it doesn't matter what
type) and the blue one has output completely turned off.  Thanks to our
awesome memory logger I can tell you that those huge green spikes are
occurring in EquationSystems::build_solution_vector().

The problem in there is two-fold:

1.  System::update_global_solution() does a localization (to all
processors!) of the entire solution vector.  That's a really terrible
idea - especially since build_solution_vector() only ever accesses local
entries of the solution vector!  The normal current_local_solution
should suffice, without any of this localization at all (see the sketch
after this list).

2.  The global solution vector in build_solution_vector() (which is
called "soln" in the function) is number_of_nodes*number_of_variables
entries long - AND it gets allocated on every processor... AND at the
end we do an awesome all-to-all global sum of that guy - even though
it's only going to get used on processor zero for serialized output
formats.
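
To make point 1 concrete, here's a minimal sketch of the change -
hedged, since the real call sites in build_solution_vector() may look
different, and solution_entry() is just a made-up helper for
illustration, not existing libMesh API:

    #include "libmesh/id_types.h"
    #include "libmesh/numeric_vector.h"
    #include "libmesh/system.h"

    using namespace libMesh;

    // Illustrative helper (not existing libMesh API): read one solution
    // entry the way build_solution_vector() needs to.
    Number solution_entry (const System & system,
                           dof_id_type dof_index)
    {
      // Old pattern: system.update_global_solution(sys_soln) copies the
      // ENTIRE global solution onto every processor, even though only
      // local entries ever get read.

      // New pattern: current_local_solution already holds the local
      // (and ghosted) entries this processor is allowed to read - no
      // extra localization required.
      return (*system.current_local_solution)(dof_index);
    }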

When doing serialized output (like Exodus) that solution vector should
only be allocated on processor 0.  Every other processor should have a
much shorter vector that is num_local_nodes*num_vars long, plus another
vector holding the mapping into the global one (or something along those
lines).  Then, at the end, each processor should sum its entries into
the correct positions on processor 0 (with some sort of gather to rank 0,
I would suspect - see the sketch below).
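
A rough sketch of that gather, written against raw MPI for clarity -
gather_to_rank0, local_soln, and global_index are all illustrative
names, not existing libMesh code, I've used plain double/int in place
of Number/dof_id_type, and a real patch would presumably go through
libMesh's own parallel communication wrappers instead:

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // Each rank keeps only its num_local_nodes*num_vars values plus the
    // mapping of each value into the global vector; only rank 0 ever
    // allocates the full nodes*vars vector.
    void gather_to_rank0 (const std::vector<double> & local_soln,
                          const std::vector<int> & global_index, // same length
                          std::size_t global_len,
                          MPI_Comm comm)
    {
      int rank, nranks;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nranks);

      // Rank 0 learns how many entries each rank will contribute.
      int nlocal = static_cast<int>(local_soln.size());
      std::vector<int> counts(rank == 0 ? nranks : 0), displs;
      MPI_Gather(&nlocal, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, comm);

      std::vector<double> all_vals;
      std::vector<int> all_idx;
      if (rank == 0)
        {
          displs.assign(nranks, 0);
          for (int p = 1; p < nranks; ++p)
            displs[p] = displs[p-1] + counts[p-1];
          const int total = displs[nranks-1] + counts[nranks-1];
          all_vals.resize(total);
          all_idx.resize(total);
        }

      // Gather the (value, global position) pairs onto rank 0 only.
      MPI_Gatherv(local_soln.data(), nlocal, MPI_DOUBLE,
                  all_vals.data(), counts.data(), displs.data(),
                  MPI_DOUBLE, 0, comm);
      MPI_Gatherv(global_index.data(), nlocal, MPI_INT,
                  all_idx.data(), counts.data(), displs.data(),
                  MPI_INT, 0, comm);

      // Only rank 0 builds the full global vector and sums entries in.
      if (rank == 0)
        {
          std::vector<double> global_soln(global_len, 0.0);
          for (std::size_t i = 0; i < all_vals.size(); ++i)
            global_soln[all_idx[i]] += all_vals[i];
          // ... hand global_soln to the serialized (Exodus) writer ...
        }
    }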

When doing parallelized output (like Nemesis) that nodes*vars length vector
should _never_ be allocated!  Instead, each processor should build only
the pieces it is going to output and pass those along.

Yes - right now we are going through the process of building a
global_nodes*vars length solution vector on every processor, even for
our parallel output formats.
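
For the parallel path the sketch is even simpler - again just an
illustration, where value_at is a stand-in for the per-node lookup
build_solution_vector() already does against the local solution:

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Build only the num_local_nodes*num_vars slice this processor will
    // write; the global nodes*vars vector is never allocated.
    std::vector<double>
    build_local_piece (std::size_t num_local_nodes,
                       std::size_t num_vars,
                       const std::function<double(std::size_t,
                                                  std::size_t)> & value_at)
    {
      std::vector<double> piece(num_local_nodes * num_vars);
      for (std::size_t n = 0; n < num_local_nodes; ++n)
        for (std::size_t v = 0; v < num_vars; ++v)
          piece[n*num_vars + v] = value_at(n, v);
      return piece; // handed straight to the parallel writer - no global sum
    }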

As a first cut we're definitely going to try removing the call to
update_global_solution() and just using current_local_solution instead.
We'll report back with another memory graph of that tomorrow.

To do the rest we might need a bit of brainstorming - but if anyone
feels like getting in there and fixing this stuff - please do!

Derek