On 3/20/19 2:13 PM, Stogner, Roy H wrote:
On Mon, 18 Mar 2019, Boris Boutkov wrote:
Out of curiosity, I recently rebased my GMG implementation onto the
upcoming NBX changes in PR #1965 to do some weak scaling analysis.
I ran an np 256 Poisson problem using GMG with ~10k dofs/proc, and in short,
it seems like the NBX changes provide a solid improvement, bringing my
total runtime from roughly ~19s down to ~16s, so a nice step on the road
to weak GMG scaling. Pre-NBX, a good chunk (30% total w/sub) of my
time was spent in alltoall(), which post-NBX is down to 2%! This came
with a fairly large number of calls to possibly_receive(), and now 15% of my
total time is spent there, but the overall timing seems to be a win, so
thanks much for this work!
Thanks for the update!
Greedy question: could you try the same timings at, say, np 16? I was
pretty confident np 256 would be a big win, since the asymptotic
scaling is improved, but it'd be nice to have data points at lower
processor counts too.
Sure. I've updated to include the np16 results, which can be found at:
https://drive.google.com/file/d/1X8U1XcZNNEAOK-z33jFFfKuM6zYRjsji/view?usp=sharing
The short of it is that the overall timing is nearly indistinguishable
at np16. As before, the 10% of time spent in alltoall() got
offloaded to possibly_receive(), and the heavy performance
hits are still the same culprits - but it's worth noting that they are
slightly 'heavier' at np256 than at np16, which eventually manifests as
the increase in total time. Anyway, I'd say that at np16 the changes are
neutral for this use case.
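For context on where the possibly_receive() time goes, here is a minimal
sketch of the NBX-style nonblocking consensus pattern that replaces a dense
alltoall() for sparse communication. This is a generic illustration under
assumptions, not libMesh's actual implementation: the function name
sparse_exchange, the plain int payloads, and the fixed tag 0 are all
hypothetical choices for the example.

#include <mpi.h>
#include <map>
#include <vector>

// Each rank sends to a sparse set of destinations and receives from an
// unknown set of sources, discovered via MPI_Iprobe; termination is decided
// collectively with MPI_Ibarrier once all synchronous sends have matched.
void sparse_exchange(MPI_Comm comm,
                     const std::map<int, std::vector<int>> & sends,
                     std::map<int, std::vector<int>> & recvs)
{
  std::vector<MPI_Request> send_reqs;
  for (const auto & [dest, buf] : sends)
    {
      send_reqs.emplace_back();
      // Synchronous send: completes only once the receiver has matched it,
      // which is what makes the consensus test below safe.
      MPI_Issend(buf.data(), static_cast<int>(buf.size()), MPI_INT, dest,
                 /*tag*/ 0, comm, &send_reqs.back());
    }

  MPI_Request barrier_req = MPI_REQUEST_NULL;
  int barrier_started = 0, done = 0;
  while (!done)
    {
      // Probe for any incoming message ("possibly receive").
      int flag = 0;
      MPI_Status status;
      MPI_Iprobe(MPI_ANY_SOURCE, 0, comm, &flag, &status);
      if (flag)
        {
          int count = 0;
          MPI_Get_count(&status, MPI_INT, &count);
          std::vector<int> buf(count);
          MPI_Recv(buf.data(), count, MPI_INT, status.MPI_SOURCE, 0,
                   comm, MPI_STATUS_IGNORE);
          recvs[status.MPI_SOURCE] = std::move(buf);
        }

      if (!barrier_started)
        {
          // Once all local sends have completed, enter the nonblocking
          // barrier; the exchange is finished when the barrier completes.
          int all_sent = 0;
          MPI_Testall(static_cast<int>(send_reqs.size()), send_reqs.data(),
                      &all_sent, MPI_STATUSES_IGNORE);
          if (all_sent)
            {
              MPI_Ibarrier(comm, &barrier_req);
              barrier_started = 1;
            }
        }
      else
        MPI_Test(&barrier_req, &done, MPI_STATUS_IGNORE);
    }
}

The point of the pattern is that each rank only touches the ranks it actually
talks to, so the many cheap probe calls replace one collective whose cost
grows with the communicator size - consistent with the alltoall() time
dropping while possibly_receive() call counts go up.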
Despite these improvements, the weak scaling for the GMG implementation is
unfortunately still a bit lacking, as the np1 run takes only ~1s. I ran these
tests through gperf to gain some more insight, and it looks to me like the
major components slowing down the setup time are still
refining/coarsening/distributing dofs, which in turn do a lot of nodal
parallel consistency adjustment and set_nonlocal_dof_objects work, and I am
wondering whether there is some low-hanging fruit to improve on around those
calls.
There almost certainly is. Could I get comparable results from your
new fem_system_ex1 settings (with more coarse refinements, I mean) to
test with?
I ran these studies on a Poisson problem with quad4s, so I think that, outside
of the increased cost of projecting and refining the second-order
information, and ignoring the increase in solve time, the
relatively expensive functions in init_and_attach_petscdm() will
show up similarly for fem_system_ex1 under increasing MG levels. The
other option would be a direct comparison using the soon-to-be-merged
multigrid examples in GRINS, which is basically what is presented in the
attachment.
Either way, I'd certainly be interested to learn how this all behaves on
other machines, because in the past I've seen situations where MPI-related
optimizations looked more pessimistic on my local cluster than on
other systems.
- Boris