Compared with SSE, NEON instructions seem harder to generate in some cases

勝余傅 Sat, 01 Mar 2014 00:02:26 -0800

Hi,
Consider the following sample code:

#define N 4
float a[N] __attribute__ ((aligned (16)));
float b[N] __attribute__ ((aligned (16)));
float c[N] __attribute__ ((aligned (16)));
double d[N] __attribute__ ((aligned (16)));


void f(float *pa, float *pb);

int main(){

        int i;
        f(a,b);
        for(i=0; i<N; i++){
                c[i] = b[i] + a[i] + d[i];
        }
        return 0;
}

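In the above code, a, b, and c are float arrays and d is a double
array. Thus, the result of b[i]+a[i] has to be converted from float to
double before d[i] is added, and the sum is converted back to float
when it is stored in c[i].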
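
To make the conversions concrete, here is a minimal sketch of the same
loop body with the implicit conversions written out as casts. The
function name add_mixed_explicit and its parameters are hypothetical and
not part of the sample above; it assumes float expressions are evaluated
in float precision (FLT_EVAL_METHOD of 0).

```c
#include <stddef.h>

/* Hypothetical helper, not in the original sample: the loop body
   c[i] = b[i] + a[i] + d[i] with each implicit conversion spelled out. */
void add_mixed_explicit(const float *a, const float *b, const double *d,
                        float *c, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        float  sum_f = b[i] + a[i];           /* float + float */
        double sum_d = (double)sum_f + d[i];  /* widen, then add in double */
        c[i] = (float)sum_d;                  /* narrow on store */
    }
}
```

The widening and narrowing steps correspond to the cvtps2pd and cvtpd2ps
instructions in the SSE output.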

I use gcc-4.6 with the flags "gcc-4.6 -O3 -msse2 -msse3 -mfpmath=sse
-ftree-vectorize  -mmmx -msse4.1 -msse -S". As expected, it generates
plenty of SSE instructions, as follows:

.....
movaps  b(%rip), %xmm0
xorl    %eax, %eax
xorps   %xmm1, %xmm1
addps   a(%rip), %xmm0
movhlps %xmm0, %xmm1
cvtps2pd        %xmm0, %xmm2
cvtps2pd        %xmm1, %xmm1
addpd   d(%rip), %xmm2
addpd   d+16(%rip), %xmm1
cvtpd2ps        %xmm2, %xmm0
cvtpd2ps        %xmm1, %xmm1
movlhps %xmm1, %xmm0
movaps  %xmm0, c(%rip)
addq    $8, %rsp
.....


However, when I use arm-linux-gnueabihf-gcc-4.6 with the flags "-static
-mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize
-mvectorize-with-neon-quad  -ftree-slp-vectorize -march=armv7-a
-mtune=cortex-a15 -O3 -Ofast -S", it generates only scalar
instructions, without any NEON instructions.

Although NEON doesn't support double-precision floating point, the
compiler could still generate "b  add_neon a" first, and then use
scalar instructions for the rest of the computation.
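
The split I have in mind can be sketched by hand as two loops. This is
only an illustration: the function name add_split and the temporary
array tmp are hypothetical, and I have not verified that gcc-4.6
actually vectorizes the first loop with NEON.

```c
#include <stddef.h>

#define N 4

/* Hypothetical restructuring, not in the original sample: isolate the
   float-only add so it no longer shares a loop with the double add. */
void add_split(const float *a, const float *b, const double *d, float *c)
{
    float tmp[N];
    size_t i;

    /* Stage 1: float-only add, a candidate for a 4-lane NEON vadd.f32. */
    for (i = 0; i < N; i++)
        tmp[i] = b[i] + a[i];

    /* Stage 2: widen to double, add d[i], narrow back to float.
       ARMv7 NEON has no double-precision lanes, so this part is
       expected to remain scalar. */
    for (i = 0; i < N; i++)
        c[i] = (float)((double)tmp[i] + d[i]);
}
```

Calling add_split(a, b, d, c) should compute the same values as the
original loop, since b[i]+a[i] is evaluated in float in both versions.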

Is there any reason why arm-linux-gnueabihf-gcc-4.6 doesn't generate
a binary containing NEON instructions?

Any help is appreciated.

Thank you very much,
Sheng Yu
