Maybe I never discussed these results anywhere... because I can't find any discussion of them :-)
Here are the timing results:
https://docs.google.com/spreadsheets/d/1dS0MjyTYRQ_o8pBsuGncJtSYxEYqKw42Chr8ougD1ek/edit?usp=docslist_api

I'll attempt to explain a little... All I was timing was variable value evaluation at quadrature points. No residual, no RHS, no Jacobian. Just purely looping through the elements and computing the value of each variable at each quadrature point. This is pretty important to us in our multiphysics runs: many of our simulations have dozens of variables... some have thousands. This is definitely ripe for vectorization...

I tried a few basic things:

1. The "LibMesh Way" is how it's coded in the examples (like here: https://github.com/libMesh/libmesh/blob/master/examples/systems_of_equations/systems_of_equations_ex2/systems_of_equations_ex2.C#L490 ).

2. Invert the loop so you only need to pull each dof value out of the global vector once. This is how we do it in MOOSE (like here: https://github.com/idaholab/moose/blob/devel/framework/src/base/MooseVariable.C#L859 ). As you can see in the spreadsheet, just doing that is 3x faster! That's because, if you do the timing, the global-to-local mapping that happens in the PETSc vector indexing to get a dof value is insanely expensive compared to everything else. That's the kind of speedup I was hoping for with vectorization. (There's a rough sketch contrasting #1 and #2 at the end of this message.)

3. Variations on #2 where I pulled out ALL of the dof values for variables of the same type at the same time. As you can see in the timing, this can yield another 4x speedup... but only for huge numbers of variables!

4. Loading up dense matrices of shape function evaluations so that all variables' values and gradients (of the same FEType) can be computed simultaneously with one single mat*vec. This creates perfectly vectorizable operations. I tried using Eigen for this, as well as the stuff in libMesh and even our own ColumnMajorMatrix implementation. It never made much difference though... at most a 2x speedup at the extreme end of things. (This one is also sketched at the end of this message.)

HOWEVER: the reason #4 didn't pay off is the storage of phi and dphi. Having to unpack the shape function evaluations and put them in dense matrices takes time. Same with getting dof values... that has to go through an expensive global-to-local lookup. Finally, the results of the computation had to be unpacked back into our own vector classes that we use to hold values and gradients of variables. All of that overhead swamped any gains from more efficient computation.

And that, in a nutshell, is the crux of the problem. Trying to do micro-optimization on this stuff RARELY pays off in any significant way because there is SO MUCH other stuff that is already inefficient that we can't do anything about. That's why 10% isn't interesting to me. 10% is, in reality, 10% of 15% (best case), because the other 85% of assembly time is spent doing other things (like evaluating material properties by interpolating tables, etc.).

Anyway, I'm glad someone is looking into this. It's important to check these things every once in a while... but let's not make architectural changes to the core systems in libMesh unless there are large (huge) gains across a wide range of applications.
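Since the difference between #1 and #2 is hard to see in prose, here is the minimal sketch I promised above. This is NOT the actual example or MooseVariable code: the free functions and their names are invented for illustration, and it assumes phi came from fe->get_phi() and dof_indices from the DofMap, as in the examples.

#include <vector>
#include "libmesh/libmesh_common.h"   // Number, Real
#include "libmesh/id_types.h"         // dof_id_type
#include "libmesh/system.h"

using namespace libMesh;

// Sketch only. #1 "LibMesh Way": one global-to-local lookup for every
// (dof, qp) pair, i.e. n_dofs * n_qp trips through the global vector.
std::vector<Number> value_per_qp_naive(const System & system,
                                       const std::vector<dof_id_type> & dof_indices,
                                       const std::vector<std::vector<Real>> & phi) // phi[i][qp]
{
  const std::size_t n_dofs = dof_indices.size();
  const std::size_t n_qp   = phi[0].size();

  std::vector<Number> u(n_qp, 0.);
  for (std::size_t qp = 0; qp < n_qp; ++qp)
    for (std::size_t i = 0; i < n_dofs; ++i)
      u[qp] += phi[i][qp] * system.current_solution(dof_indices[i]);
  return u;
}

// Sketch only. #2 inverted loop: pull each dof value out of the global
// vector exactly once, then accumulate against plain local storage.
std::vector<Number> value_per_qp_inverted(const System & system,
                                          const std::vector<dof_id_type> & dof_indices,
                                          const std::vector<std::vector<Real>> & phi)
{
  const std::size_t n_dofs = dof_indices.size();
  const std::size_t n_qp   = phi[0].size();

  std::vector<Number> dof_values(n_dofs);
  for (std::size_t i = 0; i < n_dofs; ++i)
    dof_values[i] = system.current_solution(dof_indices[i]);

  std::vector<Number> u(n_qp, 0.);
  for (std::size_t i = 0; i < n_dofs; ++i)
    for (std::size_t qp = 0; qp < n_qp; ++qp)
      u[qp] += phi[i][qp] * dof_values[i];
  return u;
}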
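And here's the flavor of #4 (which also folds in the "pull all same-type dof values at once" packing from #3), written with Eigen only because it's compact. The helper and its names are made up for illustration; the real experiments also went through libMesh's dense matrix classes and our own ColumnMajorMatrix.

#include <vector>
#include <Eigen/Dense>

// Sketch only. #4: evaluate every variable of a given FEType at every qp
// with one dense product.  Phi(qp, i) is the shape function table for this
// FEType, U(i, v) holds the local dof values of each same-type variable,
// and the result R(qp, v) is the value of variable v at quadrature point qp.
Eigen::MatrixXd all_values_at_qps(const std::vector<std::vector<double>> & phi, // phi[i][qp]
                                  const Eigen::MatrixXd & U)                    // n_dofs x n_vars
{
  const Eigen::Index n_dofs = static_cast<Eigen::Index>(phi.size());
  const Eigen::Index n_qp   = static_cast<Eigen::Index>(phi[0].size());

  // Repacking phi into a dense matrix on every element is exactly the kind
  // of overhead described above; a real implementation would want to cache
  // this per quadrature rule / FEType.
  Eigen::MatrixXd Phi(n_qp, n_dofs);
  for (Eigen::Index i = 0; i < n_dofs; ++i)
    for (Eigen::Index qp = 0; qp < n_qp; ++qp)
      Phi(qp, i) = phi[i][qp];

  return Phi * U;   // n_qp x n_vars
}

The Phi * U product itself vectorizes just fine... it's the packing of Phi and U, and the unpacking of the result back into our value/gradient containers, that swamped the gains, as described above.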
Derek

On Tue, Jan 5, 2016 at 1:50 AM Derek Gaston <fried...@gmail.com> wrote:

> Yes, try to do the vector indexing yourself first to see if it's the
> operator calls that are throwing things off.
>
> I did a bunch of work on this myself a few years ago... all I was
> attempting to speed up was just variable value evaluation... not Re/Ke
> evaluation as a whole. Let me see if I can dig up what I did... (I'll do
> some searching and send another email). I eventually dropped it because,
> while it gave some speedup in some extreme cases (like with thousands of
> variables to evaluate), it was marginal (or even slower) for the more
> common cases (1-10 variables).
>
> Honestly, 10% is not worth it to me. Any real application (that isn't
> example 3) is going to have WAY more going on that can't be vectorized
> anyway. If 10% is our best case then I don't think this extra complexity
> is worth it. Further, non-vectorizable work is typically perfectly
> parallel, which means I can just use 10% more cores and get the same
> effect now... which is easy to do.
>
> Hopefully a bit more work will yield larger gains.
>
> Derek
>
> On Tue, Jan 5, 2016 at 12:30 AM Roy Stogner <royst...@ices.utexas.edu>
> wrote:
>
>> On Mon, 4 Jan 2016, Tim Adowski wrote:
>>
>> > However, all versions of GCC were unable to vectorize the Ke loop
>> > due to "bad data ref", and both Intel versions required "#pragma
>> > ivdep" in order to vectorize the Ke loop.
>>
>> One last thought: is it possible that what is confusing gcc isn't your
>> class, but rather the DenseMatrix class? Try replacing "Ke(i,j)" with
>> "my_vector[i*M+j]" or whatever and see if gcc can handle that?
>> ---
>> Roy
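P.S. Roy's "my_vector[i*M+j]" suggestion in the quoted thread above, spelled out as a stripped-down sketch: scalar shape-function derivatives and a flat buffer, with every name invented here, just to show the kind of unit-stride inner loop the vectorizer can see through.

#include <vector>

// Sketch only: accumulate element "stiffness" contributions into a flat,
// contiguous buffer indexed as ke_flat[i*n_dofs + j] instead of going
// through DenseMatrix::operator(), so the inner loop is a plain array walk.
void accumulate_ke_flat(std::vector<double> & ke_flat,                   // size n_dofs * n_dofs
                        const std::vector<std::vector<double>> & dphi_x, // dphi_x[i][qp]
                        const std::vector<double> & JxW,
                        const std::size_t n_dofs)
{
  const std::size_t n_qp = JxW.size();
  for (std::size_t qp = 0; qp < n_qp; ++qp)
    for (std::size_t i = 0; i < n_dofs; ++i)
      for (std::size_t j = 0; j < n_dofs; ++j)
        ke_flat[i*n_dofs + j] += JxW[qp] * dphi_x[i][qp] * dphi_x[j][qp];
}

// Afterwards the buffer can be copied back into the DenseMatrix Ke if the
// rest of the assembly path needs it there.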