On 3/20/19 2:13 PM, Stogner, Roy H wrote:
On Mon, 18 Mar 2019, Boris Boutkov wrote:

Out of some curiosity I recently rebased my GMG implementation on to the
upcoming NBX changes in PR #1965 to do some weak scaling analysis.

I ran an np 256 Poisson problem using GMG with ~10k dofs/proc, and in
short, it seems like the NBX changes provide a solid improvement,
bringing my total runtime from something like ~19s down to ~16s, so a
nice step on the road to weak GMG scaling.  Pre-NBX a good chunk (30%
total w/sub) of my time was being spent in the alltoall(), which
post-NBX is down to 2%!  This came with a fairly large number of calls
to possibly_receive(), which now account for 15% of my total time, but
the overall timing seems to be a win, so thanks much for this work!
Thanks for the update!

Greedy question: could you try the same timings at, say, np 16?  I was
pretty confident np 256 would be a big win, since the asymptotic
scaling is improved, but it'd be nice to have data points at lower
processor counts too.


Sure. I've updated to include the np 16 results, which can be found at:

https://drive.google.com/file/d/1X8U1XcZNNEAOK-z33jFFfKuM6zYRjsji/view?usp=sharing

The short of it is that the overall timing is nearly indistinguishable at np 16. Similar to before, the 10% of time spent in alltoall() got offloaded to possibly_receive(), and the heavy performance hits are still the same culprits - but it's worth noting that they are slightly 'heavier' at np 256 than at np 16, which eventually manifests as the increase in total time. Anyways, I'd say that at np 16 the changes are neutral for this use case.
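
For anyone curious where all those possibly_receive() calls come from, the pattern behind the NBX approach is roughly the sketch below (just the general algorithm from the Hoefler et al. "scalable dynamic sparse data exchange" paper, not the actual TIMPI/libMesh code, and the names are mine): the dense alltoall that used to size the incoming buffers gets replaced by synchronous nonblocking sends plus repeated probing, so the cost shows up as many small probe/receive calls instead of one big collective.

// Rough sketch of an NBX-style sparse data exchange -- illustrative
// only, not the actual TIMPI/libMesh code.  Each rank knows only its
// outgoing messages; incoming ones are discovered by probing, which is
// the possibly_receive()-like part of the profile.
#include <mpi.h>
#include <map>
#include <vector>

void nbx_exchange(MPI_Comm comm,
                  const std::map<int, std::vector<double>> & sends,
                  std::map<int, std::vector<double>> & recvs)
{
  const int tag = 42;  // any tag reserved for this exchange

  // Post a synchronous nonblocking send for every destination we know.
  std::vector<MPI_Request> send_reqs(sends.size());
  std::size_t i = 0;
  for (const auto & dest_data : sends)
    MPI_Issend(dest_data.second.data(), (int)dest_data.second.size(),
               MPI_DOUBLE, dest_data.first, tag, comm, &send_reqs[i++]);

  MPI_Request barrier_req;
  int in_barrier = 0, done = 0;

  while (!done)
    {
      // The "possibly receive" step: probe for any incoming message
      // and, if one is pending, receive it into a right-sized buffer.
      int flag = 0;
      MPI_Status status;
      MPI_Iprobe(MPI_ANY_SOURCE, tag, comm, &flag, &status);
      if (flag)
        {
          int count = 0;
          MPI_Get_count(&status, MPI_DOUBLE, &count);
          std::vector<double> buf(count);
          MPI_Recv(buf.data(), count, MPI_DOUBLE, status.MPI_SOURCE,
                   tag, comm, MPI_STATUS_IGNORE);
          recvs[status.MPI_SOURCE] = std::move(buf);
        }

      if (!in_barrier)
        {
          // Once all our sends have been matched by receivers, join
          // the nonblocking barrier.
          int sends_done = 0;
          MPI_Testall((int)send_reqs.size(), send_reqs.data(),
                      &sends_done, MPI_STATUSES_IGNORE);
          if (sends_done)
            {
              MPI_Ibarrier(comm, &barrier_req);
              in_barrier = 1;
            }
        }
      else
        // When the barrier completes, every rank's sends have been
        // received, so no more messages can be in flight.
        MPI_Test(&barrier_req, &done, MPI_STATUS_IGNORE);
    }
}

The nonblocking-barrier termination is what avoids the dense collective while keeping every rank busy probing, which matches the shift of time from alltoall() into many cheap possibly_receive() calls in the profiles.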


Despite these improvements, the weak scaling of the GMG implementation
is unfortunately still a bit lacking, as the np 1 runtime is ~1s.  I ran
these tests through gperf to gain some more insight, and it looks to me
like the major components slowing down the setup time are still
refining/coarsening/distributing dofs, which in turn do a lot of nodal
parallel consistency adjusting and setting of nonlocal_dof_objects.  I'm
wondering if there is maybe some low-hanging fruit to improve on around
those calls.
There almost certainly is.  Could I get comparable results from your
new fem_system_ex1 settings (with more coarse refinements, I mean) to
test with?

I ran these studies on a Poisson problem with quad4s, so I think that, outside of the increased cost of the projections and refinements of the second-order information, and ignoring the increase in solve time, the relatively expensive functions in init_and_attach_petscdm() will similarly show up for fem_system_ex1 under an increasing number of mg levels. The other option would be a direct comparison using the soon-to-be-merged multigrid examples in GRINS, which is basically what is presented in the attachment.
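
To make that a bit more concrete, the per-level setup cost I'm talking about comes from a loop shaped roughly like the sketch below (purely illustrative libMesh-style code, not the actual init_and_attach_petscdm() / GRINS implementation): every additional mg level triggers another uniform refinement plus a dof redistribution, so the nodal consistency work and nonlocal dof object setup get paid once per level.

// Illustrative only: the general shape of a GMG level-setup loop in
// libMesh terms, not the actual init_and_attach_petscdm() code.
#include "libmesh/equation_systems.h"
#include "libmesh/mesh_base.h"
#include "libmesh/mesh_refinement.h"

void build_hierarchy(libMesh::MeshBase & mesh,
                     libMesh::EquationSystems & es,
                     unsigned int n_levels)
{
  libMesh::MeshRefinement refinement(mesh);

  for (unsigned int level = 0; level != n_levels; ++level)
    {
      // Each uniform refinement creates new nodes/elements, and the
      // subsequent reinit() redistributes dofs -- this is where
      // distribute_dofs(), the nodal parallel consistency fixups and
      // set_nonlocal_dof_objects() show up in the profile, once per
      // level.
      refinement.uniformly_refine(1);
      es.reinit();

      // (per-level interpolation/restriction setup for the PETSc DM
      // hierarchy would go here)
    }
}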

Either way, I'd certainly be interested to learn how this all behaves on other machines, because in the past I've seen situations where MPI-related optimizations looked worse on my local cluster than they did on other systems.


- Boris