On Fri, 27 Mar 2026 13:37:29 -0700 Linus Torvalds <[email protected]> wrote:
> On Fri, 27 Mar 2026 at 12:57, <[email protected]> wrote:
> >
> > Using 'byte masking' is faster for longer strings - the break-even point
> > is around 56 bytes on the same Zen-5 (there is much larger overhead, then
> > it runs at 16 bytes in 3 clocks).
>
> What byte masking approach did you actually use?

This is the code I was testing.
It does aligned accesses - I did measure it without the alignment code,
it made little/no difference.
The OPTIMIZER_HIDE_VAR() is needed to stop gcc generating different 64bit
constants and to make it generate the constant in a sane way (especially on
architectures with only 16bit immediates).

size_t strlen_longs(const char *s)
{
	unsigned int off = (unsigned long)s % sizeof (long);
	const unsigned long *p = (void *)(s - off);
	unsigned long ones = 0x01010101ul;
	unsigned long val;
	unsigned long mask;
	int first = 1;

	OPTIMIZER_HIDE_VAR(ones);
	ones |= ones << 16 << 16;
	mask = (~0ul >> 8) >> 8 * (sizeof (long) - 1 - off);
	// I've just realised that might be better as:
	//	mask = ones >> (1 + 8 * (sizeof (long) - 1 - off));
	// which has the right properties and stops the compiler generating
	// 0x00ffffffffffffff
	val = *p | mask;
	do {
		if (!first)
			val = *++p;
		first = 0;
		mask = (val - ones) & ~val & (ones << 7);
	} while (!mask);

	off = (__builtin_ffsl(mask) - 1)/8;
	return (const char *)p + off - s;
}

That loop is the one that compiled best, ISTR it has a 'spare' register
move in it ('first' gets optimised out).
On many BE systems doing a byteswapping memory read may be best.

> We have 'lib/strnlen_user.c', which is actually the only strlen() in
> the kernel that I've really ever seen in profiles (it shows up for
> execve() with lots of arguments).
>
> That has tons of extra overhead due to the whole user access setup,
> but the core loop should be pretty good with that has_zero() thing.

I've not measured strnlen(), but it wouldn't surprise me if argv[]
processing were faster with something like the strlen() in this patch.
After all, arguments are usually relatively short.
If you were going to use the above then both 'ones' and 'ones << 7'
need to be calculated once and kept in registers.

> I do agree that we shouldn't use 'rep scas'. It goes back to the
> *very* original linux kernel sources, though, and I've never seen it
> in profiles because very few things in the kernel actually use strings
> a lot.

True, and most are short.
strscpy() is next on the list...

And the arm64 strlen() has special code to optimise crossing page boundaries.
God knows how slow it is on your typical 10 character string.

	David

>
>              Linus
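
For anyone wanting to poke at the zero-byte test outside the kernel, here is a
minimal userspace sketch of the same '(val - ones) & ~val & (ones << 7)' check,
written as a bounded scan. It is not the code from the patch: the name
strnlen_words() is made up for illustration, it assumes gcc or clang (for
__builtin_ctzll) on a little-endian host, and it loads words with memcpy()
instead of doing the aligned over-read, so it never touches bytes outside the
buffer it is given.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__
#error "sketch assumes little-endian: the lowest set mask bit marks the first NUL"
#endif

/*
 * Hypothetical helper (not a kernel API): offset of the first NUL byte in
 * the first maxlen bytes of s, or maxlen if there is none.
 */
size_t strnlen_words(const char *s, size_t maxlen)
{
	/* Both constants are computed once, outside the loop. */
	const uint64_t ones = 0x0101010101010101ull;
	const uint64_t highs = ones << 7;	/* 0x8080808080808080 */
	size_t i;

	for (i = 0; i + sizeof(uint64_t) <= maxlen; i += sizeof(uint64_t)) {
		uint64_t val, mask;

		/* memcpy() keeps the load legal for any alignment. */
		memcpy(&val, s + i, sizeof(val));

		/*
		 * Bit 7 is set in the lowest zero byte of val. Bytes above it
		 * can be flagged spuriously by the borrow, so only the lowest
		 * set bit is trusted.
		 */
		mask = (val - ones) & ~val & highs;
		if (mask)
			return i + (size_t)__builtin_ctzll(mask) / 8;
	}

	/* Fewer than 8 bytes left: finish a byte at a time. */
	for (; i < maxlen; i++)
		if (!s[i])
			return i;

	return maxlen;
}
```

Calling strnlen_words(buf, sizeof(buf)) on a NUL-terminated buffer should give
the same answer as strlen(buf). Whether the compiler really keeps 'ones' and
'highs' in registers across the loop needs checking in the generated code.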

