Hello libMesh-devel: I have been working to replace the current std::vector<std::vector<T> > implementation of the shape functions, with the goal of achieving vectorization when the shape function values are used. However, I have only had partial success, and I am all out of ideas. I would appreciate any feedback or suggestions on my implementation, or any insight from personal experience.
I have written a wrapper class that mimics the current interface (resize(), size(), [][]), but uses a 1D std::vector<T> as the underlying data structure. The overload of the bracket operator converts the 2 indices [i][qp] into a 1D index with the proper stride [ (qp * num_sf) + i ] As a test case, I have been using Introduction Example 3. Rather than fully integrating my new class into libMesh, it is manually inserted into the example code. Referring to "introduction_ex3.C", after the call to reinit() (line 296), I create 2 instances of my class, one for phi and one for dphi. I then copy over all values from the normal (d)phi vectors to my new fe_(d)phi instances, and then use my class in the QP loop below. Please see: https://github.com/tradowsk/libmesh/tree/fe_vect/examples/vectorization_test/poisson_modified . It contains the 1D vector class and the modified example code. There are 2 loops that should be vectorized: the Ke inner j-loop on line 322, and the Fe i-loop at 357. Using my 1D class, I was able to achieve vectorization on the Fe loop in both GCC(4.8.4, 4.9.2, 5.3.0) and Intel(14.0, 16.0). However, all versions of GCC were unable to vectorize the Ke loop due to "bad data ref", and both Intel versions required "#pragma ivdep" in order to vectorize the Ke loop. In addition, it seems that both compilers were unsure of the Fe loop, and versioned them for runtime analysis. Vec reports can be provided if someone can decipher them. Using the START_LOG/STOP_LOG macros to time the QP loop, I got the following results for the 1D shape function implementation: Original: 1028ms (average) Modified with Fe loop vectorized: 920ms (average) Difference: ~10% So there is a significant speedup to be gained by switching up the shape function implementation, even with only one of the loops vectorized. An interesting and highly problematic note is that there is an open bug in GCC wherein auto-vectorization is not possible if the loop iteration variable is of type "unsigned int". Bug reports: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57206 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48052 Referring to the Fe loop, I was only able to achieve vectorization in GCC (all versions) by removing the "unsigned" from the iteration variable. After changing the Ke loop to "int" I tried to following tricks to get vectorization: making various variables and class functions const breaking the Ke computation into multiple steps pulling the fe_dphi[i][qp] out of the j-loop (line 381) and making it const manually unrolling the inner j-loop Nothing was able to convince GCC to vectorize the loop. To enable GCC auto-vectorization, I tried -ftree-vectorize and -O3, and both yielded the same results. I also tried -mavx (and -mavx2 on a Haswell machine). In addition, I experimented with various combinations of default flags and tricks from above, but GCC always spat out the same vectorization refusal. Moving forward, it would seem that refactoring the shape function implementation would yield cleaner code. For example, fe_base calls vector.resize() many times in order to fully initialize the 2D vector. However, with the new class, these calls are unnecessary since for the 1D vector, vector.resize() only needs to be called once from within the class. A cleaner and more meaningful interface would be to have a constructor that takes the initial row and column sizes as arguments and handles all the allocation within the class itself. The class.resize() function could then be used only when the shape function vector changes size, making it more semantically meaningful. The [][] operator could also be replaced with parentheses (i,j) or a get_val(i,j) function, since the code to implement the double brackets adds quite a bit of complexity to the class. I tried implementing the simplified interface version of the class (see GitHub link for fe_shape_function_simplified.h), but GCC still refused... ~Tim
------------------------------------------------------------------------------
_______________________________________________ Libmesh-devel mailing list Libmesh-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libmesh-devel