For reference: we already know that the NEON intrinsics in GCC have
issues, and this is one more concrete example.

I came across this page:
 http://hilbert-space.de/?p=22

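which has a colour to greyscale conversion done using intrinsics.
gcc-linaro-4.5-2011.03-0 does poorly because it saves intermediate
values on the stack: the result of the vld3 is written out to memory,
copied through core registers, and reloaded before each multiply.  The
core of the loop is: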

.L3:
        mov     ip, r4
        vld3.8  {d16-d18}, [r6]
        vstmia  r4, {d16-d18}
        ldmia   ip!, {r0, r1, r2, r3}
        mov     sl, r9
        adds    r7, r7, #1
        adds    r6, r6, #24
        stmia   sl!, {r0, r1, r2, r3}
        fldd    d16, [sp, #24]
        fldd    d18, [sp, #32]
        ldmia   ip, {r0, r1}
        vmull.u8        q8, d16, d19
        stmia   sl, {r0, r1}
        vmlal.u8        q8, d18, d20
        fldd    d18, [sp, #40]
        vmlal.u8        q8, d18, d21
        vshrn.i16       d16, q8, #8
        vst1.8  {d16}, [r5]
        adds    r5, r5, #8
        cmp     r8, r7
        bgt     .L3

llvm-2.9~svn128540 does much better:

        vld3.8  {d20, d21, d22}, [r1]!
        add     r3, r3, #1
        cmp     r3, r2
        vmull.u8        q12, d21, d16
        vmlal.u8        q12, d20, d17
        vmlal.u8        q12, d22, d18
        vshrn.i16       d19, q12, #8
        vst1.8  {d19}, [r0]!
        blt     .LBB0_1

and may actually be better than the hand-written assembler on Nils's
page, because it schedules the loop comparison earlier.
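
For anyone who doesn't want to follow the link, the kind of kernel both
compilers are being given looks roughly like the sketch below.  It is
reconstructed from the instruction sequence in the listings (vld3.8,
vmull.u8, two vmlal.u8, vshrn by 8, vst1.8), not copied from the page.
The function name rgb_to_grey, the weights 77/151/28 and the scalar
fallback for non-NEON builds are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Assumed 8-bit fixed-point weights (they sum to 256); not taken from this mail. */
enum { R_WEIGHT = 77, G_WEIGHT = 151, B_WEIGHT = 28 };

/* Convert n interleaved RGB pixels to 8-bit grey.
   n must be a multiple of 8 so the NEON path can work on whole vectors. */
void rgb_to_grey(uint8_t *dest, const uint8_t *src, int n)
{
    assert(n % 8 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t rfac = vdup_n_u8(R_WEIGHT);
    uint8x8_t gfac = vdup_n_u8(G_WEIGHT);
    uint8x8_t bfac = vdup_n_u8(B_WEIGHT);

    for (int i = 0; i < n / 8; i++) {
        /* vld3.8: load 8 pixels and de-interleave into R, G and B vectors */
        uint8x8x3_t rgb = vld3_u8(src);

        /* vmull.u8 / vmlal.u8: widening multiply-accumulate into 16 bits */
        uint16x8_t acc = vmull_u8(rgb.val[0], rfac);
        acc = vmlal_u8(acc, rgb.val[1], gfac);
        acc = vmlal_u8(acc, rgb.val[2], bfac);

        /* vshrn.i16 #8 then vst1.8: narrow back to 8 bits and store */
        vst1_u8(dest, vshrn_n_u16(acc, 8));

        src += 8 * 3;
        dest += 8;
    }
#else
    /* Scalar fallback computing the same value per pixel */
    for (int i = 0; i < n; i++) {
        unsigned acc = R_WEIGHT * src[3 * i]
                     + G_WEIGHT * src[3 * i + 1]
                     + B_WEIGHT * src[3 * i + 2];
        dest[i] = (uint8_t)(acc >> 8);
    }
#endif
}
```

Ideally the three vectors from the vld3 stay in registers for the whole
loop body, which is what the LLVM output does.  Compiling something like
this with -O2 -mfpu=neon -mfloat-abi=softfp -S should be enough to
compare the generated loops.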

Richard S, were you looking into this?

-- Michael

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
