On 2026/1/21 18:57, David Laight wrote:
> On Wed, 21 Jan 2026 09:01:29 +0200
> Andy Shevchenko <[email protected]> wrote:
>
> ...
>> I understand that. My point is if we move the generic implementation
>> to use word-at-a-time technique the difference should not go 4x,
>> right? Perhaps 1.5x or so. I believe this will be a very useful
>> exercise.
>
> I posted a version earlier.
>
> After the initial setup (aligning the base address and loading
> some constants) the loop on x86-64 is 7 instructions (should be similar
> for other architectures).
> I think it will execute in 4 clocks.
> You then need to find the byte in the word, easy enough on LE with
> a fast ffs() - but harder otherwise.
> The real problem is the cost for short strings.
> Like memcpy() you need a hint from the source of the 'expected' length
> (as a compile-time constant) to compile-time select the algorithm.
>
> OTOH:
> 	for (;;) {
> 		if (!ptr[0]) return ptr - start;
> 		ptr += 2;
> 		if (ptr[-1]) continue;
> 		return ptr - start - 1;
> 	}
> has two 'load+compare+branch' and one add per loop.
> On x86 that might all overlap and give you a two-clock loop
> that checks one byte every clock - faster than 'rep scasb'.
> (You can get a two clock loop, but not a 1 clock loop.)
> I think unrolling further will make little/no difference.
>
> The break-even for the word-at-a-time version is probably at least 64
> characters.
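For my own understanding, here is a rough sketch of the word-at-a-time scan you describe. This is not your posted version, just the classic `(w - 0x01..01) & ~w & 0x80..80` zero-byte mask on little-endian, using `__builtin_ctzll()` in place of a fast ffs(). The alignment setup is omitted, so the unaligned `memcpy()` load can read past the terminator; a real implementation aligns the base pointer first so any over-read stays within the current word/page:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative little-endian sketch only: no alignment handling,
 * and the 8-byte load may read past the NUL terminator. */
static size_t wat_strlen(const char *s)
{
	const uint64_t ones  = 0x0101010101010101ULL;
	const uint64_t highs = 0x8080808080808080ULL;
	const char *p = s;
	uint64_t w, zmask;

	for (;;) {
		memcpy(&w, p, sizeof(w));	/* unaligned load for simplicity */
		/* A bit is set in zmask iff the corresponding byte of w is 0. */
		zmask = (w - ones) & ~w & highs;
		if (zmask) {
			/* On LE, the lowest set bit marks the first NUL byte. */
			return (size_t)(p - s) + (__builtin_ctzll(zmask) >> 3);
		}
		p += sizeof(w);
	}
}
```

If I follow your point correctly, it is the setup before this loop (and the ffs-style extraction after it) that dominates for short strings, which is where the plain byte loop wins.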
Thanks for the thorough analysis and the detailed suggestions. I am still
exploring some of the finer points of these low-level performance
trade-offs, so your input is very helpful. I will definitely spend some
time studying this further and experiment with the approaches you
mentioned.

Thanks again for your help!

-- 
With Best Regards,
Feng Jiang
