Michael Hope <michael.h...@linaro.org> writes:
> For reference.  We know that the NEON intrinsics in GCC have issues.
>
> I came across this page:
>  http://hilbert-space.de/?p=22
>
> which has a colour to greyscale conversion done using intrinsics.
> gcc-linaro-4.5-2011.03-0 does poorly through saving intermediate
> values on the stack.  The core of the loop is:
>
> .L3:
>         mov     ip, r4
>         vld3.8  {d16-d18}, [r6]
>         vstmia  r4, {d16-d18}
>         ldmia   ip!, {r0, r1, r2, r3}
>         mov     sl, r9
>         adds    r7, r7, #1
>         adds    r6, r6, #24
>         stmia   sl!, {r0, r1, r2, r3}
>         fldd    d16, [sp, #24]
>         fldd    d18, [sp, #32]
>         ldmia   ip, {r0, r1}
>         vmull.u8        q8, d16, d19
>         stmia   sl, {r0, r1}
>         vmlal.u8        q8, d18, d20
>         fldd    d18, [sp, #40]
>         vmlal.u8        q8, d18, d21
>         vshrn.i16       d16, q8, #8
>         vst1.8  {d16}, [r5]
>         adds    r5, r5, #8
>         cmp     r8, r7
>         bgt     .L3
>
> llvm-2.9~svn128540 does much better:
>
>         vld3.8  {d20, d21, d22}, [r1]!
>         add     r3, r3, #1
>         cmp     r3, r2
>         vmull.u8        q12, d21, d16
>         vmlal.u8        q12, d20, d17
>         vmlal.u8        q12, d22, d18
>         vshrn.i16       d19, q12, #8
>         vst1.8  {d19}, [r0]!
>         blt     .LBB0_1
>
> and may actually be better than the hand-written assembler on Nils's
> page due to scheduling the loop comparison earlier.
>
> Richard S, were you looking into this?

Yeah, this is actually the first thing I had to tackle as part of the
auto-vectorisation support for vld/vst.  If it wasn't fixed, the same
problems would affect the auto-vectoriser.  The main patches are:

  - http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01631.html

    When we're dealing with multiple vectors (vld2-vld4, etc.),
    allow GCC to access the individual vectors in-place, rather than
    forcing all the vectors to the stack and loading individual ones
    from there.  Now upstream.

  - http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01996.html

    Changes the intrinsic patterns to use memory operands instead of
    register pointer operands.  Among other things, this allows the
    vld3.8 and vst1.8 to take the post-incremented addresses, as it
    does in the LLVM code.  Only posted on Tuesday, awaiting review.

  - Allow types like uint32x4x3_t to be stored in registers.
    I started a discussion related to this:

      http://gcc.gnu.org/ml/gcc/2011-03/msg00342.html

    and I think the outcome means that the implementation I wanted
    should be OK.  I tested the patch I've been using last night,
    and plan to submit it today.

The build I tested last night gives:

        cmp     r2, #0
        add     r3, r2, #7
        movlt   r2, r3
        mov     r2, r2, asr #3
        cmp     r2, #0
        vmov.i8 d21, #77  @ v8qi
        vmov.i8 d22, #151  @ v8qi
        vmov.i8 d23, #28  @ v8qi
        bxle    lr
        mov     r3, #0
.L3:
        vld3.8  {d18-d20}, [r1]!
        vmull.u8        q8, d18, d21
        vmlal.u8        q8, d19, d22
        vmlal.u8        q8, d20, d23
        add     r3, r3, #1
        vshrn.i16       d16, q8, #8
        cmp     r3, r2
        vst1.8  {d16}, [r0]!
        bne     .L3
        bx      lr

which seems to be the same as LLVM code, but scheduled differently.

Richard
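
For readers without the linked page to hand, the loop being compiled is
roughly of the following shape.  This is only a sketch reconstructed from
the assembly in this thread, not the exact source from Nils's page: the
function names grey_scalar and grey_neon are made up here, the R, G, B
channel order is an assumption, and the weights 77, 151 and 28 are the
constants visible in the vmov.i8 instructions of the GCC output.  The
scalar version does the same arithmetic one pixel at a time; the
intrinsics version is only built when NEON is available and handles
whole groups of 8 pixels, as the assembly does.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar reference: grey = (77*R + 151*G + 28*B) >> 8.
   The weights sum to 256, so the sum always fits in 16 bits. */
void grey_scalar(uint8_t *dest, const uint8_t *src, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        unsigned r = src[3 * i];
        unsigned g = src[3 * i + 1];
        unsigned b = src[3 * i + 2];
        dest[i] = (uint8_t)((r * 77u + g * 151u + b * 28u) >> 8);
    }
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
/* Intrinsics version: 8 pixels per iteration.  Any pixels left over
   after the last full group of 8 are not converted. */
void grey_neon(uint8_t *dest, const uint8_t *src, size_t pixels)
{
    const uint8x8_t wr = vdup_n_u8(77);
    const uint8x8_t wg = vdup_n_u8(151);
    const uint8x8_t wb = vdup_n_u8(28);

    for (size_t i = 0; i + 8 <= pixels; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);      /* vld3.8  */
        uint16x8_t acc = vmull_u8(rgb.val[0], wr);   /* vmull.u8 */
        acc = vmlal_u8(acc, rgb.val[1], wg);         /* vmlal.u8 */
        acc = vmlal_u8(acc, rgb.val[2], wb);         /* vmlal.u8 */
        vst1_u8(dest + i, vshrn_n_u16(acc, 8));      /* vshrn.i16, vst1.8 */
    }
}
#endif
```

The uint8x8x3_t value returned by vld3_u8 is the kind of multi-vector
type the patches above are about: the poor code comes from spilling it
to the stack in order to get at val[0], val[1] and val[2].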