Hello libMesh-devel:

I have been working to replace the current std::vector<std::vector<T> >
implementation of the shape functions, with the goal of achieving
vectorization when the shape function values are used.
​However, I have only had partial success, and I am all out of ideas. I
would appreciate any feedback or suggestions on my implementation, or any
insight from personal experience.

I have written a wrapper class that mimics the current interface (resize(),
size(), [][]), but uses a 1D std::vector<T> as the underlying data
structure.
The overload of the bracket operator converts the 2 indices [i][qp] into a
1D index with the proper stride [ (qp * num_sf) + i ]

As a test case, I have been using Introduction Example 3.
Rather than fully integrating my new class into libMesh, it is manually
inserted into the example code.
Referring to "introduction_ex3.C", after the call to reinit() (line 296), I
create 2 instances of my class, one for phi and one for dphi.
I then copy over all values from the normal (d)phi vectors to my new
fe_(d)phi instances, and then use my class in the QP loop below.

Please see:
https://github.com/tradowsk/libmesh/tree/fe_vect/examples/vectorization_test/poisson_modified
.
It contains the 1D vector class and the modified example code.​

There are 2 loops that should be vectorized: the Ke inner j-loop on line
322, and the Fe i-loop at 357.
Using my 1D class, I was able to achieve vectorization on the Fe loop in
both GCC(4.8.4, 4.9.2, 5.3.0) and Intel(14.0, 16.0).
However, all versions of GCC were unable to vectorize the Ke loop due to
"bad data ref", and both Intel versions required "#pragma ivdep" in order
to vectorize the Ke loop. In addition, it seems that both compilers were
unsure of the Fe loop, and versioned them for runtime analysis.
Vec reports can be provided if someone can decipher them.

Using the START_LOG/STOP_LOG macros to time the QP loop, I got the
following results for the 1D shape function implementation:
Original: 1028ms (average)
Modified with Fe loop vectorized: 920ms (average)
Difference: ~10%
So there is a significant speedup to be gained by switching up the shape
function implementation, even with only one of the loops vectorized.

An interesting and highly problematic note is that there is an open bug in
GCC wherein auto-vectorization is not possible if the loop iteration
variable is of type "unsigned int".
Bug reports: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57206 and
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48052
Referring to the Fe loop, I was only able to achieve vectorization in GCC
(all versions) by removing the "unsigned" from the iteration variable.

After changing the Ke loop to "int" I tried to following tricks to get
vectorization:
making various variables and class functions const
breaking the Ke computation into multiple steps
pulling the fe_dphi[i][qp] out of the j-loop (line 381) and making it const
manually unrolling the inner j-loop
Nothing was able to convince GCC to vectorize the loop.

To enable GCC auto-vectorization, I tried -ftree-vectorize and -O3, and
both yielded the same results. I also tried -mavx (and -mavx2 on a Haswell
machine).
In addition, I experimented with various combinations of default flags and
tricks from above, but GCC always spat out the same vectorization refusal.

Moving forward, it would seem that refactoring the shape function
implementation would yield cleaner code.
For example, fe_base calls vector.resize() many times in order to fully
initialize the 2D vector. However, with the new class, these calls are
unnecessary since for the 1D vector, vector.resize() only needs to be
called once from within the class.
A cleaner and more meaningful interface would be to have a constructor that
takes the initial row and column sizes as arguments and handles all the
allocation within the class itself. The class.resize() function could then
be used only when the shape function vector changes size, making it more
semantically meaningful.
The [][] operator could also be replaced with parentheses (i,j) or a
get_val(i,j) function, since the code to implement the double brackets adds
quite a bit of complexity to the class.

I tried implementing the simplified interface version of the class​
(see GitHub link for fe_shape_function_simplified.h), but GCC still
refused...

~Tim
------------------------------------------------------------------------------
_______________________________________________
Libmesh-devel mailing list
Libmesh-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libmesh-devel

Reply via email to