Michael Hope <michael.h...@linaro.org> writes:
> For reference.  We know that the NEON intrinsics in GCC have issues.
> I came across this page:
>  http://hilbert-space.de/?p=22
> which has a colour to greyscale conversion done using intrinsics.
> gcc-linaro-4.5-2011.03-0 does poorly through saving intermediate
> values on the stack.  The core of the loop is:
> .L3:
>       mov     ip, r4
>       vld3.8  {d16-d18}, [r6]
>       vstmia  r4, {d16-d18}
>       ldmia   ip!, {r0, r1, r2, r3}
>       mov     sl, r9
>       adds    r7, r7, #1
>       adds    r6, r6, #24
>       stmia   sl!, {r0, r1, r2, r3}
>       fldd    d16, [sp, #24]
>       fldd    d18, [sp, #32]
>       ldmia   ip, {r0, r1}
>       vmull.u8        q8, d16, d19
>       stmia   sl, {r0, r1}
>       vmlal.u8        q8, d18, d20
>       fldd    d18, [sp, #40]
>       vmlal.u8        q8, d18, d21
>       vshrn.i16       d16, q8, #8
>       vst1.8  {d16}, [r5]
>       adds    r5, r5, #8
>       cmp     r8, r7
>       bgt     .L3
> llvm-2.9~svn128540 does much better:
>       vld3.8  {d20, d21, d22}, [r1]!
>       add     r3, r3, #1
>       cmp     r3, r2
>       vmull.u8        q12, d21, d16
>       vmlal.u8        q12, d20, d17
>       vmlal.u8        q12, d22, d18
>       vshrn.i16       d19, q12, #8
>       vst1.8  {d19}, [r0]!
>       blt     .LBB0_1
> and may actually be better than the hand-written assembler on Nils's
> page due to scheduling the loop comparison earlier.
> Richard S, were you looking into this?

Yeah, this is actually the first thing I had to tackle as part of the
auto-vectorisation support for vld/vst.  If it wasn't fixed, the same
problems would affect the auto-vectoriser.  The main patches are:

  - http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01631.html
    When we're dealing with multiple vectors (vld2-vld4, etc.),
    allow GCC to access the individual vectors in-place, rather
    than forcing all the vectors to the stack and loading
    individual ones from there.

    Now upstream.

  - http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01996.html
    Changes the intrinsic patterns to use memory operands
    instead of register pointer operands.  Among other things,
    this allows the vld3.8 and vst1.8 to take the post-incremented
    addresses, as it does in the LLVM code.

    Only posted on Tuesday, awaiting review.

  - Allow types like uint32x4x3_t to be stored in registers.
    I started a discussion related to this:


    and I think the outcome means that the implementation I wanted
    should be OK.  I tested the patch I've been using last night,
    and plan to submit it today.

The build I tested last night gives:

        cmp     r2, #0
        add     r3, r2, #7
        movlt   r2, r3
        mov     r2, r2, asr #3
        cmp     r2, #0
        vmov.i8 d21, #77  @ v8qi
        vmov.i8 d22, #151  @ v8qi
        vmov.i8 d23, #28  @ v8qi
        bxle    lr
        mov     r3, #0
.L3:
        vld3.8  {d18-d20}, [r1]!
        vmull.u8        q8, d18, d21
        vmlal.u8        q8, d19, d22
        vmlal.u8        q8, d20, d23
        add     r3, r3, #1
        vshrn.i16       d16, q8, #8
        cmp     r3, r2
        vst1.8  {d16}, [r0]!
        bne     .L3
        bx      lr

which seems to be the same as the LLVM code, but scheduled differently.
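
For anyone reading this without the linked page to hand, here is a rough
sketch of the kind of C source that produces a loop like the one above.
It is not the code from Nils's page: the function name to_grey and the
helper grey_pixel are made up for illustration, the weights 77, 151 and
28 are simply read off the vmov.i8 constants in the listing, and the
scalar fallback is only there so the sketch builds on non-ARM hosts.
Which colour channel each weight belongs to depends on the input layout,
so the comments just say first, second and third channel.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Weights for the first, second and third channel, as in the listing. */
enum { W0 = 77, W1 = 151, W2 = 28 };

/* Scalar model of one output byte: widen to 16 bits, multiply-accumulate,
   then keep the high byte.  The largest possible sum is 255 * 256, which
   still fits in 16 bits. */
static uint8_t grey_pixel(uint8_t c0, uint8_t c1, uint8_t c2)
{
    uint16_t acc = (uint16_t)(c0 * W0 + c1 * W1 + c2 * W2);
    return (uint8_t)(acc >> 8);
}

/* Hypothetical entry point.  src holds n pixels of 3 interleaved bytes,
   dst receives one byte per pixel.  Only whole groups of 8 pixels are
   converted, matching the n / 8 trip count computed before the loop. */
void to_grey(uint8_t *dst, const uint8_t *src, int n)
{
    assert(dst != 0 && src != 0);
    int groups = n / 8;

#if HAVE_NEON
    const uint8x8_t w0 = vdup_n_u8(W0);
    const uint8x8_t w1 = vdup_n_u8(W1);
    const uint8x8_t w2 = vdup_n_u8(W2);

    for (int i = 0; i < groups; i++) {
        /* vld3.8: load 24 bytes and de-interleave into three vectors. */
        uint8x8x3_t px = vld3_u8(src);

        uint16x8_t acc = vmull_u8(px.val[0], w0);   /* vmull.u8 */
        acc = vmlal_u8(acc, px.val[1], w1);         /* vmlal.u8 */
        acc = vmlal_u8(acc, px.val[2], w2);         /* vmlal.u8 */

        /* vshrn.i16 #8 then vst1.8: narrow and store 8 results. */
        vst1_u8(dst, vshrn_n_u16(acc, 8));

        src += 8 * 3;
        dst += 8;
    }
#else
    for (int i = 0; i < groups; i++) {
        for (int k = 0; k < 8; k++)
            dst[k] = grey_pixel(src[3 * k], src[3 * k + 1], src[3 * k + 2]);
        src += 8 * 3;
        dst += 8;
    }
#endif
}
```

The uint8x8x3_t value returned by vld3_u8 is the multi-vector type the
patches above are concerned with: the goal is for px.val[0] to px.val[2]
to be used straight from d18-d20 rather than going via the stack, and for
the two pointer increments to fold into the post-incremented vld3.8 and
vst1.8 addresses.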

