Maybe I never discussed these results anywhere... because I can't find any
discussion of them :-)

Here are the timing results:
https://docs.google.com/spreadsheets/d/1dS0MjyTYRQ_o8pBsuGncJtSYxEYqKw42Chr8ougD1ek/edit?usp=docslist_api

I'll attempt to explain a little...

All I was timing was variable value evaluation at quadrature points. No
residual, no RHS, no Jacobian: just looping through the elements and
computing the value of each variable at each quadrature point. This is
pretty important to us in our multiphysics runs. Many of our simulations
have dozens of variables... some have thousands. This is definitely ripe
for vectorization...
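
To be concrete, here's a minimal sketch of the arithmetic being timed
(made-up names, not the actual benchmark code). Where the dof values come
from is exactly what the approaches below vary:

#include <vector>

// Minimal sketch (illustrative names) of the per-element, per-variable work:
// u_h(x_qp) = sum_i phi_i(x_qp) * u_i for every quadrature point.
void value_at_qps(const std::vector<std::vector<double>> & phi, // phi[i][qp]
                  const std::vector<double> & dof_values,       // u_i for one variable
                  std::vector<double> & u)                      // one value per qp
{
  const std::size_t n_dofs = dof_values.size();
  const std::size_t n_qp   = phi.empty() ? 0 : phi[0].size();

  u.assign(n_qp, 0.0);
  for (std::size_t qp = 0; qp < n_qp; ++qp)
    for (std::size_t i = 0; i < n_dofs; ++i)
      u[qp] += phi[i][qp] * dof_values[i];
}

Multiply that by the number of variables, quadrature points, and elements
and that's the whole benchmark.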

I tried a few basic things:

1. The "LibMesh Way" is how it's coded in the Examples (like here:
https://github.com/libMesh/libmesh/blob/master/examples/systems_of_equations/systems_of_equations_ex2/systems_of_equations_ex2.C#L490
)

2. Invert the loop so you only need to pull each dof value out once. This
is how we do it in MOOSE (like here:
https://github.com/idaholab/moose/blob/devel/framework/src/base/MooseVariable.C#L859
). As you can see in the spreadsheet... just doing that is 3x faster!
That's because, if you time it, the global-to-local mapping that happens in
the PETSc vector indexing to get a dof value is insanely expensive compared
to everything else (there's a sketch contrasting the two loop orderings
right after this list). That's the kind of speedup I was hoping for with
vectorization.

3. Variations on #2 where I pulled out ALL of the dof values for variables
of the same type at the same time. As you can see in the timing, this can
yield another 4x speedup... but only for huge numbers of variables!

4. Loading up dense matrices of shape function evaluations so that all
variables' values and gradients (of the same FEType) can be computed
simultaneously with a single mat*vec. This creates perfectly vectorizable
operations. I tried using Eigen for this as well as the stuff in libMesh,
and even our own ColumnMajorMatrix implementation. It never made much
difference though... at most a 2x speedup at the extreme end of things.
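
Here's the sketch promised in #2, contrasting the two loop orderings. Names
and signatures are made up; GlobalVector just stands in for the ghosted
PETSc vector whose indexing does the expensive global-to-local lookup:

#include <vector>

// Stand-in for the ghosted solution vector; operator() is where the
// expensive global-to-local lookup happens in the real code.
struct GlobalVector
{
  std::vector<double> data;
  double operator()(long global_dof) const
  { return data[static_cast<std::size_t>(global_dof)]; }
};

// #1 the "LibMesh Way": same arithmetic as the first sketch, but indexing
// the global vector inside the inner loop, i.e. n_dofs * n_qp lookups per
// variable per element.
void libmesh_way(const std::vector<std::vector<double>> & phi,  // phi[i][qp]
                 const std::vector<long> & dof_indices,
                 const GlobalVector & solution,
                 std::vector<double> & u)
{
  const std::size_t n_qp = phi.empty() ? 0 : phi[0].size();
  u.assign(n_qp, 0.0);
  for (std::size_t qp = 0; qp < n_qp; ++qp)
    for (std::size_t i = 0; i < dof_indices.size(); ++i)
      u[qp] += phi[i][qp] * solution(dof_indices[i]);
}

// #2 the inverted loop: pull each dof value out of the global vector once
// (n_dofs lookups total), then work on a small local array.  #3 is the same
// idea, but batches the pull across all variables of the same type.
void inverted_loop(const std::vector<std::vector<double>> & phi,
                   const std::vector<long> & dof_indices,
                   const GlobalVector & solution,
                   std::vector<double> & u)
{
  std::vector<double> local_dofs(dof_indices.size());
  for (std::size_t i = 0; i < dof_indices.size(); ++i)
    local_dofs[i] = solution(dof_indices[i]);

  const std::size_t n_qp = phi.empty() ? 0 : phi[0].size();
  u.assign(n_qp, 0.0);
  for (std::size_t qp = 0; qp < n_qp; ++qp)
    for (std::size_t i = 0; i < local_dofs.size(); ++i)
      u[qp] += phi[i][qp] * local_dofs[i];
}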

HOWEVER: the reason #4 didn't pay off is the storage of phi and dphi.
Having to unpack the shape function evaluations and repack them into dense
matrices takes time. Same with getting dof values... each one has to go
through an expensive global-to-local lookup. Finally, the results of the
computation had to be unpacked back into our own vector classes that we use
to hold variable values and gradients. All of that overhead swamped any
gains from the more efficient computation (the sketch below marks where the
packing and unpacking happens).
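
For reference, here's a rough sketch of the idea in #4 (values only, using
Eigen since that was one of the things I tried; names are made up and this
isn't the code I timed). With more than one variable the "vec" in mat*vec
becomes a matrix with one column per variable. The dense product in the
middle is the only part that vectorizes nicely; everything around it is the
packing and unpacking described above:

#include <Eigen/Dense>
#include <vector>

void all_values_one_product(const std::vector<std::vector<double>> & phi,   // phi[i][qp]
                            const std::vector<std::vector<double>> & dofs,  // dofs[var][i]
                            std::vector<std::vector<double>> & values)      // values[var][qp]
{
  const std::size_t n_dofs = phi.size();
  const std::size_t n_qp   = n_dofs ? phi[0].size() : 0;
  const std::size_t n_vars = dofs.size();

  // Pack: copy phi and the dof values into contiguous dense matrices.
  // Filling `dofs` in the first place already cost one global-to-local
  // lookup per dof, and this copy is pure overhead on top of that.
  Eigen::MatrixXd B(n_qp, n_dofs);
  for (std::size_t qp = 0; qp < n_qp; ++qp)
    for (std::size_t i = 0; i < n_dofs; ++i)
      B(qp, i) = phi[i][qp];

  Eigen::MatrixXd U(n_dofs, n_vars);
  for (std::size_t v = 0; v < n_vars; ++v)
    for (std::size_t i = 0; i < n_dofs; ++i)
      U(i, v) = dofs[v][i];

  // The one perfectly vectorizable operation: every variable's value at
  // every quadrature point in a single dense product.
  const Eigen::MatrixXd V = B * U;

  // Unpack: copy the results back into the per-variable storage the rest
  // of the code expects, which is more overhead.
  values.assign(n_vars, std::vector<double>(n_qp));
  for (std::size_t v = 0; v < n_vars; ++v)
    for (std::size_t qp = 0; qp < n_qp; ++qp)
      values[v][qp] = V(qp, v);
}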

And that, in a nutshell, is the crux of the problem. Micro-optimization of
this stuff RARELY pays off in any significant way because there is SO MUCH
other stuff that is already inefficient that we can't do anything about.
That's why 10% isn't interesting to me. That 10% is really 10% of 15% (best
case), i.e. about a 1.5% overall gain, because the other 85% of assembly
time is spent doing other things (like evaluating material properties by
interpolating tables, etc.).

Anyway, I'm glad someone is looking into this. It's important to check
these things every once in a while... but let's not make architectural
changes to the core systems in libMesh unless there are large (huge) gains
across a wide range of applications.

Derek
On Tue, Jan 5, 2016 at 1:50 AM Derek Gaston <fried...@gmail.com> wrote:

> Yes, try to do the vector indexing yourself first to see if it's the
> operator calls that are throwing things off.
>
> I did a bunch of work on this myself a few years ago... all I was
> attempting to speed up was just variable value evaluation... not Re/Ke
> evaluation as a whole. Let me see if I can dig up what I did... (I'll do
> some searching and send another email). I eventually dropped it because,
> while it gave some speedup in some extreme cases (like with thousands of
> variables to evaluate), it was marginal (or even slower) for the more
> common cases (1-10 variables).
>
> Honestly, 10% is not worth it to me. Any real application (that isn't
> example 3) is going to have WAY more going on that can't be vectorized
> anyway. If 10% is our best case then I don't think this extra complexity is
> worth it. Further, non-vectorizable work is typically perfectly parallel,
> which means I can just use 10% more cores and get the same effect now...
> which is easy to do.
>
> Hopefully a bit more work will yield larger gains.
>
> Derek
> On Tue, Jan 5, 2016 at 12:30 AM Roy Stogner <royst...@ices.utexas.edu>
> wrote:
>
>>
>> On Mon, 4 Jan 2016, Tim Adowski wrote:
>>
>> > However, all versions of GCC were unable to vectorize the Ke loop
>> > due to "bad data ref", and both Intel versions required "#pragma
>> > ivdep" in order to vectorize the Ke loop.
>>
>> One last thought: is it possible that what is confusing gcc isn't your
>> class, but rather the DenseMatrix class?  Try replacing "Ke(i,j)" with
>> "my_vector[i*M+j]" or whatever and see if gcc can handle that?
>> ---
>> Roy
>>
>>
>>
>