A quick benchmark shows it's faster up to about 10 bytes, but after that it
becomes extremely slow. At 16 bytes it's already 2.5 times slower, and for
larger sizes it's over 13 times slower than the GLIBC implementation...
> The implementation falls back to the library call if the
> string is not aligned.
If it did that for larger sizes then it would be fine. However, a byte loop
is unacceptably slow.
Also, given the large amount of inlined code, it would make sense to handle
larger sizes than 8. It may be worth comparing a loop doing 8 bytes per
iteration with the GLIBC strlen, or just inlining the first 16 bytes and then
falling back to strlen.
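
To make the two alternatives concrete, here is a rough C sketch. It is not
the code from the patch: the names strlen_words, strlen_first16 and
has_zero_byte are made up for illustration, and it assumes 8-byte alignment is
the condition for the inline path (unaligned strings go straight to the
library call, as in the quoted description). Reading a whole aligned word past
the terminator is the usual strlen trick; it cannot cross a page boundary, but
it is strictly outside ISO C, so treat this as a model of what the expansion
might emit rather than portable library code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Nonzero if and only if at least one byte of w is zero. */
static inline uint64_t has_zero_byte(uint64_t w)
{
    return (w - ONES) & ~w & HIGHS;
}

/* Variant 1 (hypothetical): loop checking 8 bytes per iteration. */
static inline size_t strlen_words(const char *s)
{
    if ((uintptr_t)s % 8 != 0)
        return strlen(s);           /* unaligned: use the library call */

    size_t off = 0;
    for (;;) {
        uint64_t w;
        memcpy(&w, s + off, sizeof w);   /* aligned 8-byte load */
        if (has_zero_byte(w))
            break;
        off += 8;
    }
    /* The terminator is within this word; locate it bytewise so the
       result does not depend on endianness. */
    while (s[off] != '\0')
        off++;
    return off;
}

/* Variant 2 (hypothetical): inline only the first 16 bytes, then
   fall back to the library strlen for anything longer. */
static inline size_t strlen_first16(const char *s)
{
    if ((uintptr_t)s % 8 != 0)
        return strlen(s);

    for (size_t off = 0; off < 16; off += 8) {
        uint64_t w;
        memcpy(&w, s + off, sizeof w);
        if (has_zero_byte(w)) {
            size_t i = off;
            while (s[i] != '\0')
                i++;
            return i;
        }
    }
    return 16 + strlen(s + 16);     /* long string: library call */
}
```

Either variant keeps the short-string case inline while avoiding a long byte
loop: the first bounds the per-iteration cost, the second bounds the amount of
inlined work and leaves large sizes to GLIBC.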
Also, if you have statistics that show tiny strlen sizes are much more common,
then the strlen implementation could be further tuned for that.