Hi Kay,

On 09/01/19 08:29, Kay F. Jahnke wrote:
Hi there!

I am developing software which tries to deliberately exploit the
compiler's autovectorization facilities by feeding data in
autovectorization-friendly loops. I'm currently using both g++ and
clang++ to see how well this approach works. Using simple arithmetic, I
often get good results. To widen the scope of my work, I was looking for
documentation on which constructs would be recognized by the
autovectorization stage, and found

https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html


Yeah, that page hasn't been updated in ages AFAIK.

By the looks of it, this document has not seen any changes for several
years. Has development on the autovectorization stage stopped, or is
there simply no documentation?


There's plenty of work being done on auto-vectorisation in GCC.
Auto-vectorisation is a performance optimisation and as such is not really
a user-visible feature that absolutely requires user documentation.

In my experience, vectorization is essential to speed up arithmetic on
the CPU, and reliable recognition of vectorization opportunities by the
compiler can provide vectorization to programs which don't bother to
code it explicitly. I feel the topic is being neglected - at least the
documentation I found suggests this. To demonstrate what I mean, I have
two concrete scenarios which I'd like to be handled by the
autovectorization stage:

- gather/scatter with arbitrary indexes

In C, this would be loops like

// gather from B to A using gather indexes

for ( int i = 0 ; i < vsz ; i++ )
   A [ i ] = B [ indexes [ i ] ] ;

From the AVX2 ISA onwards, there are hardware gather/scatter
operations, which can speed things up a good deal.

- repeated use of vectorizable functions

for ( int i = 0 ; i < vsz ; i++ )
   A [ i ] = sqrt ( B [ i ] ) ;

Here, replacing the repeated call of sqrt with the vectorized equivalent
gives a dramatic speedup (ca. 4X)


I believe GCC will do some of that already, given a high enough optimisation
level and sufficiently relaxed floating-point constraints.
Do you have examples where it doesn't? Testcases with self-contained source code
and compiler flags would be useful to analyse.
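
For illustration, a self-contained testcase for the gather loop could look
roughly like the sketch below. The function name, the restrict qualifiers and
the command line in the comment are assumptions made for this example, not
something taken from your code. -fopt-info-vec and -fopt-info-vec-missed ask
GCC to report which loops it vectorised and which it gave up on.

```c
/* Sketch of a self-contained vectorisation testcase.
   The function name and the flags below are assumptions for illustration.
   Possible invocation (adjust -march for your target):
     gcc -O3 -march=haswell -fopt-info-vec -fopt-info-vec-missed -c gather.c  */

/* Gather from B into A through an index array.
   restrict tells the compiler the three arrays do not overlap.  */
void gather (float *restrict A, const float *restrict B,
             const int *restrict indexes, int vsz)
{
  for (int i = 0; i < vsz; i++)
    A[i] = B[indexes[i]];
}
```

The sqrt loop can go into the same file in the same way. I would expect
-fno-math-errno or -ffast-math to matter for that one, but please treat that
as an assumption to check against the -fopt-info output rather than a promise.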

If the compiler were to provide the autovectorization facilities, and if
the patterns it recognizes were well-documented, users could rely on
certain code patterns being recognized and autovectorized - sort of a
contract between the user and the compiler. With a well-chosen spectrum
of patterns, this would make it unnecessary to have to rely on explicit
vectorization in many cases. My hope is that such an interface would
help vectorization to become more frequently used - as I understand the
status quo, this is still a niche topic, even though many processors
provide suitable hardware nowadays.


I wouldn't say it's a niche topic :)
From my monitoring of GCC development over the last few years, there have been
lots of improvements in auto-vectorisation in compilers (at least in GCC).

The thing is, auto-vectorisation is not always profitable for performance.
Sometimes the runtime loop iteration count is so low that setting up the
vectorised loop (alignment checks, loads/permutes) is slower than just doing
the scalar form, especially since SIMD performance varies from CPU to CPU.
So we would want the compiler to have the freedom to make its own judgement on
when to auto-vectorise rather than enforce a "contract". If the user really
only wants vector code, they should use one of the explicit programming
paradigms.
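
Explicit options include target intrinsics, OpenMP's "#pragma omp simd", and
the generic vector extension supported by GCC and Clang. As a rough sketch of
the last one (the names v4f and scale_add are made up for this example, and
the 4-lane width is an arbitrary choice):

```c
/* Sketch using the GCC/Clang generic vector extension.
   The names v4f and scale_add are hypothetical.  */
#include <string.h>

typedef float v4f __attribute__ ((vector_size (16)));  /* 4 floats */

/* Computes y[i] = a * x[i] + y[i], four lanes at a time.  */
void scale_add (float *y, const float *x, float a, int n)
{
  const v4f va = { a, a, a, a };
  int i = 0;

  for (; i + 4 <= n; i += 4)
    {
      v4f vx, vy;
      memcpy (&vx, x + i, sizeof vx);   /* load without alignment assumptions */
      memcpy (&vy, y + i, sizeof vy);
      vy = va * vx + vy;                /* element-wise vector arithmetic */
      memcpy (y + i, &vy, sizeof vy);
    }

  for (; i < n; i++)                    /* scalar tail for leftover elements */
    y[i] = a * x[i] + y[i];
}
```

Here the vector width is fixed by the programmer instead of being chosen by
the compiler's cost model, which is exactly the trade-off described above.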

HTH,
Kyrill

Can you point me to where 'the action is' in this regard?

With regards

Kay F. Jahnke


