Re: autovectorization in gcc

2019-01-10 Thread Kay F. Jahnke

On 09.01.19 10:50, Andrew Haley wrote:

On 1/9/19 9:45 AM, Kyrill Tkachov wrote:

There's plenty of work being done on auto-vectorisation in GCC.
Auto-vectorisation is a performance optimisation and as such is not really
a user-visible feature that absolutely requires user documentation.


I don't agree. Sometimes vectorization is critical. It would be nice
to have a warning which would fire if vectorization failed. That would
surely help the OP. 


Further down this thread, some g++ flags were used which produced 
meaningful information about vectorization failures, so the facility is 
there - maybe it's not very prominent.
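
The flags are not named at this point in the thread. As an assumption about what was meant, GCC's -fopt-info family can report vectorizer decisions. A minimal sketch follows; the file name and both functions are hypothetical and serve only as illustration:

```cpp
// diag_demo.cc (hypothetical file name, used only for this illustration)
//
// Possible invocations, using GCC's -fopt-info options:
//   g++ -O3 -mavx2 -fopt-info-vec-optimized -c diag_demo.cc   // report vectorized loops
//   g++ -O3 -mavx2 -fopt-info-vec-missed    -c diag_demo.cc   // report loops that were not
#include <cstddef>

// Element-wise loop without dependencies between iterations:
// a plausible candidate for autovectorization.
void scale ( float * out , const float * in , float factor , std::size_t n )
{
  for ( std::size_t i = 0 ; i < n ; i++ )
    out [ i ] = in [ i ] * factor ;
}

// Each iteration reads the result of the previous one (loop-carried
// dependency), so a 'missed' note would plausibly be reported here.
void prefix_sum ( float * data , std::size_t n )
{
  for ( std::size_t i = 1 ; i < n ; i++ )
    data [ i ] += data [ i - 1 ] ;
}
```

The exact wording of the notes differs between GCC versions, so the output is best read as a hint about which loop was rejected and why, not as a stable interface.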


When it comes to user visibility, I'd like to add that there are great 
differences between different users. I spend most of my time writing 
library code, using template metaprogramming in C++. It's essential for 
my code to perform well (real-time visualization), but I don't have 
intimate compiler knowledge - I'm aiming at writing portable, 
standard-compliant code. I'd like the compilers I use to provide 
extensive documentation if I need to track down a problem, and I dislike 
it if I have to use 'special' commands to get things done. Other users 
may produce target-specific code with one specific compiler, and they 
have different needs. It's better to have documentation and not need it 
than the other way round.


So my idea of a 'contract' regarding vectorization is like this:

- the documentation states the scope of vectorization
- the use of a feature can be forced or disallowed
- or left up to a cost model
- the compiler can be made to produce diagnostic output
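
To make the list above more concrete, here is a rough sketch of how the three modes (requested, disallowed, left to the cost model) might look with facilities GCC already offers. The function names and the file name are hypothetical; the pragma, the attribute and the flags are GCC/OpenMP features, but whether they amount to the 'contract' described here is an open question:

```cpp
// contract_demo.cc (hypothetical file name)
// Possible invocation:
//   g++ -O3 -mavx2 -fopenmp-simd -fopt-info-vec -c contract_demo.cc
#include <cstddef>

// Requested: '#pragma omp simd' is honoured when compiling with
// -fopenmp-simd (or -fopenmp); without these flags it is ignored.
void add_requested ( float * a , const float * b , std::size_t n )
{
  #pragma omp simd
  for ( std::size_t i = 0 ; i < n ; i++ )
    a [ i ] += b [ i ] ;
}

// Disallowed: GCC-specific attribute, equivalent to compiling this one
// function with -fno-tree-vectorize. Other compilers may ignore it.
__attribute__ (( optimize ( "no-tree-vectorize" ) ))
void add_scalar_only ( float * a , const float * b , std::size_t n )
{
  for ( std::size_t i = 0 ; i < n ; i++ )
    a [ i ] += b [ i ] ;
}

// Left to the cost model: plain loop; the decision can be influenced
// globally with -fvect-cost-model=cheap|dynamic|unlimited.
void add_cost_model ( float * a , const float * b , std::size_t n )
{
  for ( std::size_t i = 0 ; i < n ; i++ )
    a [ i ] += b [ i ] ;
}
```

All three functions compute the same result; only the hints given to the compiler differ, so the effect has to be checked in the diagnostic output or the generated assembler.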

Documentation is absolutely essential. If there is lots of development 
in autovectorization, not documenting this work in a way users can 
easily find is - in my eyes - a grave omission. The text 
'Auto-vectorization in GCC' looks like it was last updated in 2011 
(according to the 'Latest News' section). I'm curious to know what new 
capabilities have been added since then. It makes my life much easier if 
I can write loops to follow a given pattern, relying on the 
autovectorizer, rather than having to use explicit vector code or having 
to rely on a library.

There is also another aspect to being dependent on external libraries. 
When a new architecture comes around, chances are the compiler writers 
will be first to support it. It may take years for an external library 
to add a new target ISA, more time until this runs smoothly, and then 
more time until it has trickled down to the package repos of most 
distributions - if this happens at all. Plus you have the danger of 
betting on the wrong horse: when the very promising library you've used 
to code your stuff goes offline or commercial, you've wasted your 
precious time. Relying only on the compiler brings innovation out most 
reliably and quickly, and is a good strategy to avoid wasting resources.


Now I may be missing things here because I haven't dug deeply enough to 
find documentation about autovectorization in gcc. This is why I asked 
to be pointed to 'where the action is'. I was hoping to maybe get 
some helpful hints. My main objective is, after all, to 'deliberately 
exploit the compiler's autovectorization facilities by feeding data in
autovectorization-friendly loops'. The code will run, vectorized or not, 
but it would be great to have good guidelines on what will or will not 
be autovectorized with a given compiler, rather than having to look at 
the assembler output.


Kay






Re: autovectorization in gcc

2019-01-09 Thread Kay F. Jahnke

On 09.01.19 10:45, Kyrill Tkachov wrote:


There's plenty of work being done on auto-vectorisation in GCC.
Auto-vectorisation is a performance optimisation and as such is not really
a user-visible feature that absolutely requires user documentation.


Since I'm trying to deliberately exploit it, a more user-visible guise 
would help ;)



- repeated use of vectorizable functions

for ( int i = 0 ; i < vsz ; i++ )
   A [ i ] = sqrt ( B [ i ] ) ;

Here, replacing the repeated call of sqrt with the vectorized equivalent
gives a dramatic speedup (ca. 4X)


The above is a typical example. So, to give a complete source 'vec_sqrt.cc':

#include <cmath>

extern float data [ 32768 ] ;

extern void vf1()
{
  #pragma vectorize enable
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::sqrt ( data [ i ] ) ;
}

This has a large trip count, the loop is trivial. It's an ideal 
candidate for autovectorization. When I compile this source, using


g++ -O3 -mavx2 -S -o sqrt.s sqrt_gcc.cc

the inner loop translates to:

.L2:
        vmovss  (%rbx), %xmm0
        vucomiss        %xmm0, %xmm2
        vsqrtss %xmm0, %xmm1, %xmm1
        jbe     .L3
        vmovss  %xmm2, 12(%rsp)
        addq    $4, %rbx
        vmovss  %xmm1, 8(%rsp)
        call    sqrtf@PLT
        vmovss  8(%rsp), %xmm1
        vmovss  %xmm1, -4(%rbx)
        cmpq    %rbp, %rbx
        vmovss  12(%rsp), %xmm2
        jne     .L2

AFAICT this is not vectorized, it only uses a single float at a time.
In vector code, I'd expect the vsqrtps mnemonic to show up.
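
One thing that may be worth checking here (a guess, not something established in this message): the 'call sqrtf@PLT' guarded by the compare and branch looks like a fallback to the library function, which has to set errno for invalid inputs. GCC's -fno-math-errno tells the compiler that math functions need not set errno, which may allow a plain vector square root. A minimal sketch for comparing the two outputs, with a hypothetical file and function name:

```cpp
// sqrt_errno.cc (hypothetical file name)
// Suggested comparison of the generated assembler:
//   g++ -O3 -mavx2 -S sqrt_errno.cc                   // default: errno semantics kept
//   g++ -O3 -mavx2 -fno-math-errno -S sqrt_errno.cc   // sqrt need not set errno
#include <cmath>
#include <cstddef>

// Same loop shape as vf1() above, but operating on a buffer passed in
// by the caller so that the function can be tested in isolation.
void sqrt_in_place ( float * data , std::size_t n )
{
  for ( std::size_t i = 0 ; i < n ; i++ )
    data [ i ] = std::sqrt ( data [ i ] ) ;
}
```

If the second invocation shows vsqrtps in the loop body and the first does not, the errno requirement is the likely obstacle; if both look the same, the cause lies elsewhere.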

I believe GCC will do some of that already given a high-enough optimisation level
and floating-point constraints.
Do you have examples where it doesn't? Testcases with self-contained source code
and compiler flags would be useful to analyse.


So, see above. With -Ofast the output is similar, except that the inner 
loop is unrolled. But maybe I'm missing something? Any hints for 
additional flags?



If the compiler were to provide the autovectorization facilities, and if
the patterns it recognizes were well-documented, users could rely on
certain code patterns being recognized and autovectorized - sort of a
contract between the user and the compiler. With a well-chosen spectrum
of patterns, this would make it unnecessary to have to rely on explicit
vectorization in many cases. My hope is that such an interface would
help vectorization to become more frequently used - as I understand the
status quo, this is still a niche topic, even though many processors
provide suitable hardware nowadays.



I wouldn't say it's a niche topic :)
From my monitoring of the GCC development over the last few years there's been lots
of improvements in auto-vectorisation in compilers (at least in GCC).


Okay, I'll take your word for it.


The thing is, auto-vectorisation is not always profitable for performance.
Sometimes the runtime loop iteration count is so low that setting up the vectorised loop
(alignment checks, loads/permutes) is slower than just doing the scalar form,
especially since SIMD performance varies from CPU to CPU.
So we would want the compiler to have the freedom to make its own judgement on when
to auto-vectorise rather than enforce a "contract". If the user really only wants
vector code, they should use one of the explicit programming paradigms.


I know that these issues are important. I am using Vc for explicit 
vectorization, so I can easily code to produce vector code for common 
targets. And I can compare the performance. I have tried the example 
given above on my AVX2 machine, linking with a main program which calls 
'vf1' 32768 times, to get one gigaroot (giggle). The vectorized version 
takes about half a second, the unvectorized takes about three. With 
functions like sqrt, trigonometric functions, exp and pow, vectorization 
is very profitable. Some further details:


Here's the main program 'memaxs.cc':

float data [ 32768 ] ;
extern void vf1() ;

int main ( int argc , char * argv[] )
{
  for ( int k = 0 ; k < 32768 ; k++ )
  {
    vf1() ;
  }
}

And the compiler call to get a binary:

g++ -O3 -mavx2 -o memaxs sqrt.s memaxs.cc

Here's the performance:

$ time ./memaxs

real    0m3,205s
user    0m3,200s
sys     0m0,004s
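
The figure above comes from the shell's 'time', which includes process startup. As a possible refinement (not what was done for the numbers in this message), the repeated calls could be timed inside the program. The helper name 'time_calls' is hypothetical:

```cpp
#include <chrono>

// Returns the elapsed wall-clock time in seconds for 'repeats'
// consecutive calls of 'fn', measured with a monotonic clock.
double time_calls ( void ( * fn ) () , int repeats )
{
  auto start = std::chrono::steady_clock::now() ;
  for ( int k = 0 ; k < repeats ; k++ )
    fn() ;
  auto stop = std::chrono::steady_clock::now() ;
  return std::chrono::duration < double > ( stop - start ) . count() ;
}
```

With the main program above, 'time_calls ( vf1 , 32768 )' would cover the same work as the loop in main(). Since vf1() lives in a separate translation unit, the compiler cannot see into it from the timing loop unless link-time optimization is enabled.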

This variant of vec_sqrt.cc uses Vc ('vc_vec_sqrt.cc')

#include <Vc/Vc>

extern float data [ 32768 ] ;

extern void vf1()
{
  for ( int k = 0 ; k < 32768 ; k += 8 )
  {
    Vc::float_v fv ( data + k ) ;
    fv = sqrt ( fv ) ;
    fv.store ( data + k ) ;
  }
}

Translated to assembler, I get the inner loop

.L2:
        vmovups (%rax), %xmm0
        addq    $32, %rax
        vinsertf128     $0x1, -16(%rax), %ymm0, %ymm0
        vsqrtps %ymm0, %ymm0
        vmovups %xmm0, -32(%rax)
        vextractf128    $0x1, %ymm0, -16(%rax)
        cmpq    %rax, %rdx
        jne     .L2
        vzeroupper
        ret
        .cfi_endproc

Note how the data are read 32 bytes at a time and processed with vsqrtps.

Creating the corresponding binary and executing it:

$ g++ -O3 -mavx2 -o memaxs sqr

autovectorization in gcc

2019-01-09 Thread Kay F. Jahnke

Hi there!

I am developing software which tries to deliberately exploit the 
compiler's autovectorization facilities by feeding data in 
autovectorization-friendly loops. I'm currently using both g++ and 
clang++ to see how well this approach works. Using simple arithmetic, I 
often get good results. To widen the scope of my work, I was looking for 
documentation on which constructs would be recognized by the 
autovectorization stage, and found


https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html

By the looks of it, this document has not seen any changes for several 
years. Has development on the autovectorization stage stopped, or is 
there simply no documentation?


In my experience, vectorization is essential to speed up arithmetic on 
the CPU, and reliable recognition of vectorization opportunities by the 
compiler can provide vectorization to programs which don't bother to 
code it explicitly. I feel the topic is being neglected - at least the 
documentation I found suggests this. To demonstrate what I mean, I have 
two concrete scenarios which I'd like to be handled by the 
autovectorization stage:


- gather/scatter with arbitrary indexes

In C, this would be loops like

// gather from B to A using gather indexes

for ( int i = 0 ; i < vsz ; i++ )
  A [ i ] = B [ indexes [ i ] ] ;

From the AVX2 ISA onwards, there are hardware gather operations (scatter 
operations arrive with AVX-512), which can speed things up a good deal.
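
A self-contained version of the gather loop, together with the corresponding scatter, might look like the sketch below. The function names are made up for illustration, and '__restrict__' is a GCC/clang extension used here on the assumption that the buffers do not overlap; whether a given compiler emits hardware gather instructions for this remains to be checked in the assembler output:

```cpp
#include <cstddef>

// Gather: read B at arbitrary positions, write A contiguously.
void gather ( float * __restrict__ A ,
              const float * __restrict__ B ,
              const int * __restrict__ indexes ,
              std::size_t vsz )
{
  for ( std::size_t i = 0 ; i < vsz ; i++ )
    A [ i ] = B [ indexes [ i ] ] ;
}

// Scatter: read B contiguously, write A at arbitrary positions.
// If an index occurs more than once, the last write wins.
void scatter ( float * __restrict__ A ,
               const float * __restrict__ B ,
               const int * __restrict__ indexes ,
               std::size_t vsz )
{
  for ( std::size_t i = 0 ; i < vsz ; i++ )
    A [ indexes [ i ] ] = B [ i ] ;
}
```

The indexes are assumed to be valid positions in the indexed array; neither function performs bounds checks.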


- repeated use of vectorizable functions

for ( int i = 0 ; i < vsz ; i++ )
  A [ i ] = sqrt ( B [ i ] ) ;

Here, replacing the repeated call of sqrt with the vectorized equivalent 
gives a dramatic speedup (ca. 4X)


If the compiler were to provide the autovectorization facilities, and if 
the patterns it recognizes were well-documented, users could rely on 
certain code patterns being recognized and autovectorized - sort of a 
contract between the user and the compiler. With a well-chosen spectrum 
of patterns, this would make it unnecessary to have to rely on explicit 
vectorization in many cases. My hope is that such an interface would 
help vectorization to become more frequently used - as I understand the 
status quo, this is still a niche topic, even though many processors 
provide suitable hardware nowadays.


Can you point me to where 'the action is' in this regard?

With regards

Kay F. Jahnke