> 
> On 07/20/2011 10:27 AM, Maurizio De Cecco wrote:
>> I am playing around with GCC and Clang vector extensions, on Linux and
>> Mac OS X, and I am getting some strange behaviour.
>> 
>> I am working on jMax Phoenix, and its dsp engine, in its current state,
>> is very memory bound; it is based on the aggregation of very small
>> granularity operations, like vector sum or multiply, each of them
>> executed independently from and to memory.
>> 
>> I tried to implement all these 'primitive' operations using the vector
>> types.
>> 
>> On clang/MacOSX I get an impressive improvement in performance,
>> around 4x on the operations, even just using the vector types for
>> copying data; my impression is that the compiler uses some kind of vector
>> load/store instructions that properly use the available memory bandwidth,
>> but unfortunately I do not know more about the x86 architecture.
>> 
>> On gcc/Linux (gcc 4.5.2), the same code produces a *slowdown* of around
>> 2.5x.
>> 
>> Well, does anybody have an idea why?
>> 
>> I am actually running Linux (Ubuntu 11.04) under a VMware virtual
>> machine; I do not know if this has any implications.
> 
> Maybe. A better comparison would be: clang/Linux vs. gcc/Linux and
> clang/MacOSX vs gcc/MacOSX compiled binaries.
> 
> Also, as Dan already pointed out: gcc has a whole lot of optimization
> flags which are not enabled by default. Try '-O3 -msse2 -ffast-math'.
> '-ftree-vectorizer-verbose=2' is handy while optimizing code.
> 
> have fun,
> robin

Or you can use LLVM to *directly* generate vector code, as in the following 
example, the result of some experiments done with Faust and its LLVM backend:

block_code8:                                      ; preds = %block_code8.block_code8_crit_edge, %block_code3
  %20 = phi float* [ %15, %block_code3 ], [ %.pre11, %block_code8.block_code8_crit_edge ]
  %21 = phi float* [ %14, %block_code3 ], [ %.pre10, %block_code8.block_code8_crit_edge ]
  %22 = phi float* [ %16, %block_code3 ], [ %.pre9, %block_code8.block_code8_crit_edge ]
  %indvar = phi i32 [ 0, %block_code3 ], [ %indvar.next, %block_code8.block_code8_crit_edge ]
  %nextindex1 = shl i32 %indvar, 2
  %nextindex = add i32 %nextindex1, 4
  %23 = sext i32 %nextindex1 to i64
  %24 = getelementptr float* %22, i64 %23
  %25 = getelementptr float* %21, i64 %23
  %26 = bitcast float* %25 to <4 x float>*
  %27 = load <4 x float>* %26, align 1
  %28 = getelementptr float* %20, i64 %23
  %29 = bitcast float* %28 to <4 x float>*
  %30 = load <4 x float>* %29, align 1
  %31 = fadd <4 x float> %27, %30
  %32 = bitcast float* %24 to <4 x float>*
  store <4 x float> %31, <4 x float>* %32, align 1
  %33 = icmp ult i32 %nextindex, %18
  br i1 %33, label %block_code8.block_code8_crit_edge, label %exit_block6

In this block, the float* pointers are computed, "bitcast" to pointers to 
vectors of 4 floats, then the 4-float vectors are loaded, manipulated with the 
LLVM IR vector versions of add, mul, etc., and stored back.

The LLVM IR is still generated with the "conservative" "align 1" option, since 
the compiler cannot yet be sure the data is always aligned. The resulting SSE 
code will then use MOVUPS (Move Unaligned Packed Single-Precision 
Floating-Point Values). The next step is to generate stuff like:

 %27 = load <4 x float>* %26, align 4

so that MOVAPS (Move Aligned Packed Single-Precision Floating-Point Values) is 
used instead.

We already see some nice speed improvements, but the Faust vector LLVM backend 
version is still not complete...

Stéphane 

_______________________________________________
Linux-audio-dev mailing list
[email protected]
http://lists.linuxaudio.org/listinfo/linux-audio-dev
