On 3/20/19 2:13 PM, Stogner, Roy H wrote:
On Mon, 18 Mar 2019, Boris Boutkov wrote:
Out of curiosity, I recently rebased my GMG implementation onto the
upcoming NBX changes in PR #1965 to do some weak scaling analysis.
I ran an np 256 Poisson problem using GMG with ~10k dofs/proc, and in short,
it seems like the NBX changes provide a solid improvement, bringing my
total runtime from roughly ~19s down to ~16s, so a nice step on the road
to weak GMG scaling. Pre-NBX, a good chunk (30% total w/sub) of my
time was spent in alltoall(), which post-NBX is down to 2%! This came
with a fairly large number of calls to possibly_receive(), and now 15% of my
total time is spent there, but the overall timing seems to be a win, so
thanks much for this work!
Thanks for the update!
Greedy question: could you try the same timings at, say, np 16? I was
pretty confident np 256 would be a big win, since the asymptotic
scaling is improved, but it'd be nice to have data points at lower
processor counts too.
Sure. I've updated to include the np16 results, which can be found at:
https://drive.google.com/file/d/1X8U1XcZNNEAOK-z33jFFfKuM6zYRjsji/view?usp=sharing
The short of it is that the overall timing is nearly indistinguishable
at np16. As before, the 10% of time spent in alltoall() got
offloaded to possibly_receive(), and the heavy performance
hits are still the same culprits - but it's worth noting that they are
slightly 'heavier' at np256 than at np16, which eventually manifests as
the increase in total time. Anyway, I'd say that at np16 the changes are
neutral for this use case.
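For context on where the possibly_receive() time goes, here is a minimal
sketch of the NBX-style nonblocking consensus pattern that replaces a dense
alltoall() for sparse communication. This is a generic illustration under
assumptions, not libMesh's actual implementation: the function name
sparse_exchange, the plain int payloads, and the fixed tag 0 are all
hypothetical choices for the example.

#include <mpi.h>
#include <map>
#include <vector>

// Each rank sends to a sparse set of destinations and receives from an
// unknown set of sources, discovered via MPI_Iprobe; termination is decided
// collectively with MPI_Ibarrier once all synchronous sends have matched.
void sparse_exchange(MPI_Comm comm,
                     const std::map<int, std::vector<int>> & sends,
                     std::map<int, std::vector<int>> & recvs)
{
  std::vector<MPI_Request> send_reqs;
  for (const auto & [dest, buf] : sends)
    {
      send_reqs.emplace_back();
      // Synchronous send: completes only once the receiver has matched it,
      // which is what makes the consensus test below safe.
      MPI_Issend(buf.data(), static_cast<int>(buf.size()), MPI_INT, dest,
                 /*tag*/ 0, comm, &send_reqs.back());
    }

  MPI_Request barrier_req = MPI_REQUEST_NULL;
  int barrier_started = 0, done = 0;
  while (!done)
    {
      // Probe for any incoming message ("possibly receive").
      int flag = 0;
      MPI_Status status;
      MPI_Iprobe(MPI_ANY_SOURCE, 0, comm, &flag, &status);
      if (flag)
        {
          int count = 0;
          MPI_Get_count(&status, MPI_INT, &count);
          std::vector<int> buf(count);
          MPI_Recv(buf.data(), count, MPI_INT, status.MPI_SOURCE, 0,
                   comm, MPI_STATUS_IGNORE);
          recvs[status.MPI_SOURCE] = std::move(buf);
        }

      if (!barrier_started)
        {
          // Once all local sends have completed, enter the nonblocking
          // barrier; the exchange is finished when the barrier completes.
          int all_sent = 0;
          MPI_Testall(static_cast<int>(send_reqs.size()), send_reqs.data(),
                      &all_sent, MPI_STATUSES_IGNORE);
          if (all_sent)
            {
              MPI_Ibarrier(comm, &barrier_req);
              barrier_started = 1;
            }
        }
      else
        MPI_Test(&barrier_req, &done, MPI_STATUS_IGNORE);
    }
}

The point of the pattern is that each rank only touches the ranks it actually
talks to, so the many cheap probe calls replace one collective whose cost
grows with the communicator size - consistent with the alltoall() time
dropping while possibly_receive() call counts go up.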
Despite these improvements, the weak scaling for the GMG implementation is
unfortunately still a bit lacking, as the np1 run takes only ~1s. I ran these
tests through gperf to gain some more insight, and it looks to me like the
major components slowing down the setup time are still
refining/coarsening/distributing dofs, which in turn do a lot of nodal
parallel consistency adjustment and set_nonlocal_dof_objects work, and I am
wondering whether there is some low-hanging fruit to improve on around those
calls.
There almost certainly is. Could I get comparable results from your
new fem_system_ex1 settings (with more coarse refinements, I mean) to
test with?
I ran these studies on a Poisson problem with quad4s, so I think that, outside
of the increased cost of projecting and refining the second-order
information, and ignoring the increase in solve time, the
relatively expensive functions in init_and_attach_petscdm() will
show up similarly for fem_system_ex1 under increasing MG levels. The
other option would be a direct comparison using the soon-to-be-merged
multigrid examples in GRINS, which is basically what is presented in the
attachment.
Either way, I'd certainly be interested to learn how this all behaves on
other machines, because in the past I've seen situations where MPI-related
optimizations looked more pessimistic on my local cluster than on
other systems.
- Boris