On Mon, Nov 11, 2013 at 3:51 PM, Derek Gaston <fried...@gmail.com> wrote:
> I've mentioned this a few times in passing - but now I have concrete
> evidence: Outputting the solution takes a TON of memory... way more than it
> should. This is something we've seen for a very long time - and it's one
> of the main reasons why many of our parallel jobs die...
>
> Firstly, here's the graph:
>
>
> https://drive.google.com/file/d/0B1YTFvojHetuaXlFd29RdkNqTUE/edit?usp=sharing
>
> This is a run with 2 systems in it - the first one has 40 variables and
> totals about 25 million DoFs... the second one has two variables and comes
> out to a little over 1 million DoFs. This job is spread out across 160 MPI
> processes (and we're going to be looking at the aggregate memory across all
> of those).
>
Quick correction: We are only looking at the memory of the rank 0 MPI
process in this graph. The memory profiles of all the other ranks pretty
much match this one, though.
>
> The two lines you're seeing are for the exact same run - but the green one
> is doing output (Exodus in this case - although it doesn't matter what
> type) and the blue one has output completely turned off. Due to our
> awesome memory logger I can tell you that those huge green spikes are
> occurring in EquationSystems::build_solution_vector()
>
> The problem in there is two-fold:
>
> 1. System::update_global_solution() does a localization (to all
> processors!) of the entire solution vector! That's a really terrible idea
> - especially since we're only going to access local entries in the solution
> vector in build_solution_vector()! The normal current_local_solution
> should suffice - without any of this localization at all....
>
> 2. The global solution vector in build_solution_vector() (which is
> called "soln" in the function) is of number_of_nodes*number_of_variables in
> length - AND it gets allocated on every processor... AND at the end we do
> an awesome all-to-all global sum of that guy... even though it's only going
> to get used on processor zero for serialized output formats....
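For a sense of scale (a back-of-the-envelope, assuming 8-byte doubles and
counting only the 25-million-DoF system):

  25e6 entries * 8 bytes  ~= 200 MB for one serialized copy of the solution
  200 MB * 160 ranks      ~= 32 GB across the job, just in redundant copies

and that's before counting the nodes*vars "soln" vector and its global sum.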
>
> When doing serialized output (like Exodus) that solution vector should
> only be allocated on processor 0. Every other processor should have a much
> shorter vector that is num_local_nodes*num_vars long... and store another
> vector that is the mapping into the global one (or something along those
> lines). Then, at the end, each processor should sum its entries into the
> correct positions on processor 0 (with some sort of AllGather I would
> suspect).
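Something along these lines would do it (a rough sketch in plain MPI - the
names gather_for_output, local_values and global_indices are made up for
illustration, this is not proposed libMesh API). Each rank ships only its
(global index, value) pairs to rank 0, and only rank 0 ever allocates
anything proportional to the global problem size:

  #include <mpi.h>
  #include <vector>

  std::vector<double> gather_for_output(const std::vector<double> & local_values,
                                        const std::vector<int> & global_indices,
                                        int global_size, MPI_Comm comm)
  {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int local_n = static_cast<int>(local_values.size());

    // Rank 0 learns how many entries each rank will send.
    std::vector<int> counts(rank == 0 ? nprocs : 0);
    std::vector<int> displs(rank == 0 ? nprocs : 0);
    MPI_Gather(&local_n, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, comm);

    int total = 0;
    if (rank == 0)
      for (int p = 0; p < nprocs; ++p)
        { displs[p] = total; total += counts[p]; }

    // Ship the values and their destination positions to rank 0 only.
    std::vector<double> all_values(rank == 0 ? total : 0);
    std::vector<int>    all_indices(rank == 0 ? total : 0);
    MPI_Gatherv(local_values.data(), local_n, MPI_DOUBLE,
                all_values.data(), counts.data(), displs.data(), MPI_DOUBLE,
                0, comm);
    MPI_Gatherv(global_indices.data(), local_n, MPI_INT,
                all_indices.data(), counts.data(), displs.data(), MPI_INT,
                0, comm);

    // Assemble the serialized output vector on rank 0 only.
    std::vector<double> soln(rank == 0 ? global_size : 0);
    if (rank == 0)
      for (int i = 0; i < total; ++i)
        soln[all_indices[i]] = all_values[i];

    return soln;
  }

Note it's a Gatherv to rank 0 rather than an Allgather - for serialized
formats nobody but rank 0 needs the result.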
>
> When doing parallelized output (like Nemesis) that nodes*vars length
> vector should _never_ be allocated! Instead, each processor should simply
> build the pieces it is going to output and pass those along.
>
> Yes - right now we are going through the process of building a
> global_nodes*vars length solution vector on every processor even for our
> parallel output formats.
>
> As a first cut we're definitely going to try removing the call to
> update_global_solution() and just use current_local_solution instead.
> We'll report back with another memory graph of that tomorrow.
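Roughly, that first cut amounts to something like this inside
build_solution_vector() (a paraphrase to show the idea, not an actual
patch):

  // before: serializes the entire solution vector onto every processor
  std::vector<Number> global_soln;
  system.update_global_solution(global_soln);

  // after: the local (plus ghosted) entries are already sitting in
  // current_local_solution, so read from that instead
  const NumericVector<Number> & local_soln = *system.current_local_solution;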
>
> To do the rest we might need a bit of brainstorming - but if anyone is
> feeling like they want to get in there and fix this stuff - please do!
>
> Derek
>
>