From: Yu-Jen Chang > Sent: 01 June 2022 06:59 > > David Laight <david.lai...@aculab.com> 於 2022年5月30日 週一 下午4:10寫道: > > > > From: Yu-Jen Chang > > > Sent: 28 May 2022 09:13 > > > > > > The original assembly version of memchr() is implemented with > > > the byte-wise comparing technique, which does not fully > > > use 64-bits registers in x86_64 CPU. We use word-wide > > > comparing so that 8 characters can be compared at the same time > > > on x86_64 CPU. First we align the input and then use word-wise > > > comparing to find the first 64-bit word that contain the target. > > > Secondly, we compare every byte in the word and get the output. > > > > > > We create two files to measure the performance. The first file > > > contains on average 10 characters ahead the target character. > > > The second file contains at least 1000 characters ahead the > > > target character. Our implementation of “memchr()” is slightly > > > better in the first test and nearly 4x faster than the orginal > > > implementation in the second test. > > > > > > Signed-off-by: Yu-Jen Chang <arthurchan...@gmail.com> > > > Signed-off-by: Ching-Chun (Jim) Huang <js...@ccns.ncku.edu.tw> > > > --- > > > arch/x86/include/asm/string_64.h | 3 ++ > > > arch/x86/lib/Makefile | 1 + > > > arch/x86/lib/string_64.c | 78 ++++++++++++++++++++++++++++++++ > > > 3 files changed, 82 insertions(+) > > > create mode 100644 arch/x86/lib/string_64.c > > > > > ... > > > diff --git a/arch/x86/lib/string_64.c b/arch/x86/lib/string_64.c > > > new file mode 100644 > > > index 000000000..4e067d5be > > > --- /dev/null > > > +++ b/arch/x86/lib/string_64.c > > > @@ -0,0 +1,78 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +#include <linux/string.h> > > > +#include <linux/export.h> > > > +#include <linux/align.h> > > > + > > > +/* How many bytes are loaded each iteration of the word copy loop */ > > > +#define LBLOCKSIZE (sizeof(long)) > > > + > > > +#ifdef __HAVE_ARCH_MEMCHR > > > + > > > +void *memchr(const void *cs, int c, size_t length) > > > +{ > > > + const unsigned char *src = (const unsigned char *)cs, d = c; > > > > You don't need the cast. > > > > > + > > > + while (!IS_ALIGNED((long)src, sizeof(long))) { > > > + if (!length--) > > > + return NULL; > > > + if (*src == d) > > > + return (void *)src; > > > + src++; > > > + } > > > > There is no point aligning the address. > > On tests I've done misaligned reads don't even take an extra > > clock - even if you get the cpu doing two reads/clock. > > Even if they did the code isn't memory limited. > > > > > + if (length >= LBLOCKSIZE) { > > > + unsigned long mask = d << 8 | d; > > > + unsigned int i = 32; > > > + long xor, data; > > > + const long consta = 0xFEFEFEFEFEFEFEFF, > > > + constb = 0x8080808080808080; > > > + > > > + /* > > > + * Create a 8-bytes mask for word-wise comparing. > > > + * For example, a mask for 'a' is 0x6161616161616161. > > > + */ > > > + > > > + mask |= mask << 16; > > > + for (i = 32; i < LBLOCKSIZE * 8; i <<= 1) > > > + mask |= mask << i; > > > > Given that consta/b only support 64 bit why the loop. > > Just do mask |= mask << 32. > > I'd also put all 3 calculations together - not hide one > > in the initialiser. > > > > > + /* > > > + * We perform word-wise comparing with following operation: > > > + * 1. Perform xor on the long word @src and @mask > > > + * and put into @xor. > > > + * 2. Add @xor with @consta. > > > + * 3. ~@xor & @constb. > > > + * 4. Perform & with the result of step 2 and 3. > > > + * > > > + * Step 1 creates a byte which is 0 in the long word if > > > + * there is at least one target byte in it. > > > + * > > > + * Step 2 to Step 4 find if there is a byte with 0 in > > > + * the long word. > > > + */ > > > + asm volatile("1:\n\t" > > > + "movq (%0),%1\n\t" > > > + "xorq %6,%1\n\t" > > > + "lea (%1,%4), %2\n\t" > > > + "notq %1\n\t" > > > + "andq %5,%1\n\t" > > > + "testq %1,%2\n\t" > > > + "jne 2f\n\t" > > > + "add $8,%0\n\t" > > > + "sub $8,%3\n\t" > > > + "cmp $7,%3\n\t" > > > + "ja 1b\n\t" > > > + "2:\n\t" > > > + : "=D"(src), "=r"(xor), "=r"(data), > > > "=r"(length) > > > > Why constrain src to %rdi? > > At first I try to use some instructions related to %rdi, but I realize > that I won't use these instructions. It is unnecessary to constrain > src to %rdi. > > > > > > + : "r"(consta), "r"(constb), "r"(mask), > > > "0"(src), > > > + "1"(xor), "2"(data), "3"(length) > > > > Use "+r" in the outputs instead of respecifying the args. > > I'd also suggest using named arguments - much easier to read. > > > > > + : "memory", "cc"); > > > > Doesn't the compiler generate much the same code? > > You should also be able to code without needing add, sub and cmp > > at the end of the loop. > > If you use negative offsets from the end of the buffer > > the loop can be a single add and jnz. > > > > David > > > > > + } > > > + > > > + while (length--) { > > > + if (*src == d) > > > + return (void *)src; > > > + src++; > > > + } > > > + return NULL; > > > +} > > > +EXPORT_SYMBOL(memchr); > > > +#endif > > > -- > > > 2.25.1 > > > > - > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 > > 1PT, UK > > Registration No: 1397386 (Wales) > > I remove the aligning address part. On my tests the performance are similar. > Here I rewrite the assembly using named arguments and I reduce one instruction > in the loop by adding two parameters, which are 'end' and 'dst'. > 'end' stores the > address of the end of the string. 'dst' stores the address of the end > of word-wise > comparison. As a result, when 'src' is equal to 'dst', the number of remaining > characters is less than 8. The following while loop will find if the > target character is > in these remaining characters. > > On my test the performance is similar with the my original implementation. > Only > a little bit fast when going through a very long string, which contains > 128*1024 > characters and the target character is near the end of the string. > > I also explain how consta and constb work clearly in the comments. Hope that > it > helps understanding. > > The following code is what I change. > > void *memchr(const void *cs, int c, size_t length) > { > const unsigned char *src = (const unsigned char *)cs; > const unsigned char *end = src + length; > > if (length >= LBLOCKSIZE) { > unsigned long mask = c << 8 | c;
That is wrong if 'c' is outside 0..255. I suspect it is best to at least allow -128..-1. > long xor, data; > const long consta = 0xFEFEFEFEFEFEFEFF, > constb = 0x8080808080808080; > const unsigned char *dst = (const unsigned char *)src + > (length & 0xFFFFFFFFFFFFFFF8); > > /* > * Create a 8-bytes mask for word-wise comparing. > * For example, a mask for 'a' is 0x6161616161616161. > */ > > mask |= mask << 16; > mask |= mask << 32; > /* > * We perform word-wise comparing with following operation: > * 1. Perform xor on the long word @src and @mask > * and put into @xor. > * 2. Add @xor with @consta. > * 3. ~@xor & @constb. > * 4. Perform & with the result of step 2 and 3. > * > * If there is a zero byte in @xor, step 2 turns it into > * 0xFF. Then step 3 and 4 turn it into 0x80. > * > * If there is a none-zero byte in @xor, let k > * (0 <= k <= 7) be the lowest 1 in this byte. The lowest > * k bits are 0. After step 2, the byte ends in a single > * bit of value 0. Step 3 and 4 turns this byte into k > * bits of 1, which is 2^k - 1, at first. Then & @constb > * makes it into 0. > * > * Step 2 to Step 4 find if there is a byte with 0 in > * the long word. > */ > asm volatile("1:\n\t" > "movq (%[src]),%[xor]\n\t" > "xorq %[mask],%[xor]\n\t" > "lea (%[xor],%[const_a]), %[tmp]\n\t" > "notq %[xor]\n\t" > "andq %[const_b],%[xor]\n\t" > "testq %[xor],%[tmp]\n\t" > "jnz 2f\n\t" > "add $8,%[src]\n\t" > "cmp %[src], %[dst]\n\t" > "ja 1b\n\t" > "2:\n\t" > : > [src] "+r"(src), [xor] "+r"(xor), [tmp] "+r"(data) > : [const_a] "r"(consta), [const_b] "r"(constb), > [mask] "r"(mask), [dst] "r"(dst) > : "memory", "cc"); > } > > while (src <= end) { > if (*src == d) I think you mean 'c'. > return (void *)src; > src++; > } > return NULL; > } > > Thanks, > Yu-Jen Chang Gcc compiles this C to the same loop and is easier to read. Valid on all LE 64bit systems. void *memchr(const void *p, int c, unsigned long length) { unsigned long mask, val; const void *end = p + length; c &= 0xff; if (p <= end - 8) { mask = c | c << 8; mask |= mask << 16; mask |= mask << 32; for (; p <= end - 8; p += 8) { val = *(unsigned long *)p ^ mask; if ((val + 0xfefefefefefefeffu) & (~val & 0x8080808080808080u)) break; } } for (; p < end; p++) if (*(unsigned char *)p == c) return p; return NULL; } See https://godbolt.org/z/6rqTqfEsx David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) _______________________________________________ linux-um mailing list linux-um@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-um