On Tue, Jan 5, 2016 at 2:31 AM, Derek Gaston <fried...@gmail.com> wrote:
> Maybe I never discussed these results anywhere... because I can't find any
> discussion of them :-)
>
> Here are the timing results:
> https://docs.google.com/spreadsheets/d/1dS0MjyTYRQ_o8pBsuGncJtSYxEYqKw42Chr8ougD1ek/edit?usp=docslist_api

Thanks for sharing.

> I'll attempt to explain a little...
>
> All I was timing was variable value evaluation at quadrature points. No
> residual, no RHS, no Jacobian. Just purely looping through the elements and
> trying to compute the value of each variable at each quadrature point. This
> is pretty important to us in our multiphysics runs. Many of our
> simulations have dozens of variables... some have thousands. This is
> definitely ripe for vectorization...

Precisely my thinking.

> I tried a few basic things:
>
> 1. The "LibMesh Way" is how it's coded in the Examples (like here:
> https://github.com/libMesh/libmesh/blob/master/examples/systems_of_equations/systems_of_equations_ex2/systems_of_equations_ex2.C#L490
> )
>
> 2. Invert the loop so you only need to pull each dof value out once. This
> is how we do it in MOOSE (like here:
> https://github.com/idaholab/moose/blob/devel/framework/src/base/MooseVariable.C#L859
> ). As you can see in the spreadsheet... just doing that is 3x faster!

How much of this is the benefit of striding correctly and getting cache
reuse? That was the major impetus, in the first pass anyway, of trying to
encapsulate the shape function values.

> That's because, if you do the timing, the global-to-local mapping that
> happens in that PETSc vector indexing to get a dof value is insanely
> expensive compared to everything else. That's the kind of speedup I was
> hoping for with vectorization.

Interesting. Is that lookup timing broken out in your table, or is it
buried within the other times?

> 3. Variations on #2 where I pulled out ALL of the dof values for variables
> of the same type at the same time. As you can see in the timing, this can
> yield another 4x speedup... but only for huge numbers of variables!
>
> 4. Loading up dense matrices of shape function evaluations so that all
> variables' values and gradients (of the same FEType) can be computed
> simultaneously with one single mat*vec. This creates perfectly vectorizable
> operations. I tried using Eigen for this as well

This is exactly where I was wanting to go for larger variable counts.

> as the stuff in libMesh and even our own ColumnMajorMatrix implementation.
> It never made much difference though... At most a 2x speedup at the extreme
> end of things.
>
> HOWEVER: the reason #4 didn't pay off is because of the storage of phi and
> dphi. Having to unpack the shape function evaluations and put them in dense
> matrices takes time. Same with getting DoF values... that has to go through
> an expensive global-to-local lookup. Finally, the results of the
> computation had to be unpacked back into our own vector classes that we use
> to hold values and gradients of variables. All of that overhead swamped any
> gains in more efficient computation.

And that, in a nutshell, is the crux of the problem. All of this is exactly
why I had Tim focus on a container for shape functions. If we encapsulate
the shape function container better, we can set up compile-time (runtime?)
specification of the container so we could natively use proper matrix data
structures (i.e. bypass the copy overhead you were describing, for the
shape function values anyway) to get the benefits where they count, e.g.
for the right number of variables in the System.
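For concreteness, here is roughly the kind of thing I have in mind. This is
purely a sketch with made-up names (not a proposed interface, and not an
existing libMesh/MOOSE API): if the container stored phi for an element
natively as a dense matrix, then the #4-style evaluation of every variable
of the same FEType becomes a single dense product with no repacking.

  // Sketch only; assumes Eigen and illustrative names.
  #include <Eigen/Dense>

  // phi:        n_qp x n_dofs_per_elem   (shape function values, stored
  //                                        natively in this layout)
  // local_dofs: n_dofs_per_elem x n_vars (element-local dof values, one
  //                                        column per variable)
  // returns:    n_qp x n_vars            (every variable's value at every qp)
  Eigen::MatrixXd evaluate_values(const Eigen::MatrixXd & phi,
                                  const Eigen::MatrixXd & local_dofs)
  {
    // values(qp, var) = sum_i phi(qp, i) * local_dofs(i, var)
    return phi * local_dofs;
  }

The global-to-local extraction into local_dofs still has to happen once per
element, but everything downstream of it is one dense kernel that Eigen/BLAS
(or the compiler) can vectorize; storing phi in that layout from the start
is what would let us skip the repacking overhead you described.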
Hopefully such an encapsulation could make it easier for users to
experiment as well. It's even something we can introduce gradually by
making it opt-in at configure time so we don't break any existing codes.

> Trying to do micro-optimization

Just to be clear, this was only the first step. I'd be thrilled if even
this relatively simple change (one that hopefully will be easy to roll out
in the library with a search-and-replace and a configure option) gave a 10%
benefit. Again, when doing *many* of these solves, 10% would save a lot of
CPU hours. And certainly real-time applications would be grateful as well.

> on this stuff RARELY pays off in any significant way because there is SO
> MUCH other stuff that is already inefficient that we can't do anything
> about. That's why 10% isn't interesting to me. 10% is, in reality, 10% of
> 15% (best case) because the other 85% of assembly time is spent doing other
> things (like evaluating material properties by interpolating tables, etc).
> Anyway, I'm glad someone is looking into this. It's important to check
> these things every once in a while... but let's not make architectural
> changes to the core systems in libMesh unless there are large (huge) gains
> across a wide range of applications.

This is not a major architectural change, IMHO. This is just doing a better
job of encapsulation so we can make changes that could see major benefits
while minimizing the changes to downstream apps. And we can easily do it
without breaking existing codes in the beginning (updating existing codes
would just be a search-and-replace on the object name if we preserve the
existing API). Of course, this is all once we've got a decent first pass at
a container object to build around, which is what Tim is striving to get.

Best,
Paul