On Monday, 24 January 2011, at 12:57:32, Carsten Munk wrote:
> Hi,
>
> Do we have a sane and performing NEON memcpy that would be suitable
> for MeeGo glibc version anywhere? Would be useful for glibc armv7nhl
> variant.

Shouldn't be very hard to implement one:
NOTE: NOT TESTED!
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void *my_memcpy(void *dest, const void *src, size_t n)
{
    const int stride_bytes = 16;
    uint8_t *d = dest;
    const uint8_t *s = src;

    {
        /* main copy, Neon vectorised */
        size_t vector_len = n / stride_bytes;
        const uint8_t *end = s + vector_len * stride_bytes;
        n -= vector_len * stride_bytes;
        while (s != end) {
#ifdef __CC_ARM
            vst1q_u8(d, vld1q_u8(s));
            d += stride_bytes;
            s += stride_bytes;
#else
            /*
             * Assembly equivalent. The "memory" clobber tells GCC that
             * the asm reads and writes memory through s and d.
             */
            asm volatile ("vld1.8 {d0, d1}, [%[s]]!\n"
                          "vst1.8 {d0, d1}, [%[d]]!\n"
                          : [s] "+r" (s), [d] "+r" (d)
                          : /* no inputs */
                          : "d0", "d1", "memory");
#endif
        }
    }

    if (stride_bytes > 8 && n >= 8) {
        /* one last 8-byte step */
        n -= 8;
#ifdef __CC_ARM
        vst1_u8(d, vld1_u8(s));
        d += 8;
        s += 8;
#else
        /*
         * Assembly equivalent:
         */
        asm volatile ("vld1.8 {d0}, [%[s]]!\n"
                      "vst1.8 {d0}, [%[d]]!\n"
                      : [s] "+r" (s), [d] "+r" (d)
                      : /* no inputs */
                      : "d0", "memory");
#endif
    }

    /* residue: every case deliberately falls through to the next */
    switch (n) {
    case 7: *d++ = *s++;
    case 6: *d++ = *s++;
    case 5: *d++ = *s++;
    case 4: *d++ = *s++;
    case 3: *d++ = *s++;
    case 2: *d++ = *s++;
    case 1: *d++ = *s++;
    }
    return dest;
}

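
Since the function above is untested, a host-side check of the tail handling may be useful before trying it on hardware. The following is only a sketch under stated assumptions: my_memcpy_portable and check_copy are hypothetical names, not part of the code above, and fixed-size memcpy() calls stand in for the Neon loads and stores so that it builds on any machine. It exercises the same three phases (16-byte main loop, one 8-byte step, byte residue) and says nothing about Neon performance.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Portable stand-in for my_memcpy (hypothetical name): same three
 * phases, with memcpy() of a fixed size replacing each Neon
 * load/store pair.
 */
void *my_memcpy_portable(void *dest, const void *src, size_t n)
{
    uint8_t *d = dest;
    const uint8_t *s = src;
    size_t blocks = n / 16;

    n -= blocks * 16;
    while (blocks--) {          /* main copy, 16 bytes per step */
        memcpy(d, s, 16);
        d += 16;
        s += 16;
    }
    if (n >= 8) {               /* one last 8-byte step */
        memcpy(d, s, 8);
        d += 8;
        s += 8;
        n -= 8;
    }
    while (n--)                 /* residue, 0 to 7 bytes */
        *d++ = *s++;
    return dest;
}

/*
 * Returns 1 if copying n bytes (n <= 64) gives the same buffer as
 * memcpy(), including the untouched guard bytes after the copy.
 */
int check_copy(size_t n)
{
    uint8_t src[64], got[65], want[65];
    size_t i;

    if (n > 64)
        return 0;
    for (i = 0; i < 64; i++)
        src[i] = (uint8_t)(i * 7 + 1);
    memset(got, 0xAA, sizeof got);
    memset(want, 0xAA, sizeof want);
    my_memcpy_portable(got, src, n);
    memcpy(want, src, n);
    return memcmp(got, want, sizeof got) == 0;
}
```

Calling check_copy() for every length from 0 to 64 covers all combinations of full blocks, the optional 8-byte step and the residue. The same loop could be pointed at my_memcpy itself on an ARM target.
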
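You can modify the above code to use a different stride by changing the
load/store pair (the GCC inline assembly needs the matching change):

    stride (bytes)    load / store
          8           vld1_u8  / vst1_u8
         16           vld1q_u8 / vst1q_u8
         24           vld3_u8  / vst3_u8
         48           vld3q_u8 / vst3q_u8
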
Given that the 24- and 48-byte strides require a division by 3, I recommend
sticking to the 8- or 16-byte stride versions.
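
A further point in favour of the 8- and 16-byte strides is the tail. As a rough illustration (plan_copy and struct copy_plan are hypothetical helpers, not part of the function above), the sketch below splits a length into full strides, 8-byte steps and residue bytes. With a 16-byte stride at most one 8-byte step is ever needed, which is what the single "if" in the function assumes. With 24 or 48 bytes the tail can be as long as 23 or 47 bytes, so that "if" would presumably have to become a loop.

```c
#include <assert.h>
#include <stddef.h>

struct copy_plan {
    size_t blocks;   /* full stride-sized steps in the main loop */
    size_t steps8;   /* 8-byte steps needed after the main loop */
    size_t residue;  /* bytes left for the final switch, 0 to 7 */
};

/* Split n bytes for a given stride; stride must be a multiple of 8. */
struct copy_plan plan_copy(size_t n, size_t stride)
{
    struct copy_plan p;
    size_t tail;

    assert(stride != 0 && stride % 8 == 0);
    p.blocks = n / stride;   /* a shift for 8 or 16, a real division for 24 or 48 */
    tail = n % stride;
    p.steps8 = tail / 8;
    p.residue = tail % 8;
    return p;
}
```

For n = 47, for example, a 16-byte stride gives 2 blocks, 1 eight-byte step and 7 residue bytes, a 24-byte stride gives 1 block and 2 eight-byte steps, and a 48-byte stride gives 0 blocks and 5 eight-byte steps.
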
The GCC versions are written in inline assembly because all current versions
of GCC spill the Neon registers to memory with the intrinsics.
Comparing the assembly generated by both GCC and RVCT indicates that the math
portion of the function and the transition from 16-byte to 8-byte stride seem
to be better with RVCT, but the handling of the switch at the function
epilogue seems better with GCC 4.5.
Each of the case statements in GCC is:

    ldrb r3, [r4], #1 @ zero_extendqisi2
    strb r3, [ip], #1

whereas RVCT produces:

    LDRB r4,[r1],#1
    ADD r2,r3,#1
    STRB r4,[r3,#0]
    MOV r3,r2

That is, the same first instruction, but it uses two additional instructions to
update the "d" variable, instead of doing the inline post-update.
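
To look at this part of the code generation in isolation, the residue switch can be pulled out into a small function of its own and compiled to assembly (for GCC, with -S). This is only a sketch: copy_residue is a hypothetical name, and the instructions a given compiler version emits for it may differ from the listings above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy the last n bytes (n must be 0 to 7) and return the advanced
 * destination pointer. Each "*d++ = *s++" is a byte load and a byte
 * store with a post-increment of both pointers.
 */
uint8_t *copy_residue(uint8_t *d, const uint8_t *s, size_t n)
{
    switch (n) {
    case 7: *d++ = *s++; /* fall through */
    case 6: *d++ = *s++; /* fall through */
    case 5: *d++ = *s++; /* fall through */
    case 4: *d++ = *s++; /* fall through */
    case 3: *d++ = *s++; /* fall through */
    case 2: *d++ = *s++; /* fall through */
    case 1: *d++ = *s++;
    }
    return d;
}
```

Returning d keeps the pointer update observable, so the compiler cannot drop it as dead code.
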
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Senior Product Manager - Nokia, Qt Development Frameworks
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358
