Thank you! It looks like it is important to have shrx on x86, which appears only when -march=x86-64-v3 is used (see https://github.com/golang/go/issues/47120#issuecomment-877629712 ). Just in case: I know x86 would not use the fallback implementation; however, the sole purpose of a shift-based DFA is to fold all the data-dependent ops into a single instruction.
An alternative idea: should we optimize for validation of **valid** inputs rather than for the worst case? In other words, what if the implementation always processes all characters and falls back to a slower method only when validation fails? I would guess it is more important to accept valid input quickly than to reject invalid input quickly. In the shift-DFA approach, that would make the validation loop simpler, with fewer branches (see https://godbolt.org/z/hhMxhT6cf ):

static inline int
pg_is_valid_utf8(const unsigned char *s, const unsigned char *end)
{
	uint64		class;
	uint64		state = BGN;

	while (s < end)				/* clang unrolls this loop */
	{
		class = ByteCategory[*s++];
		/* note that the AND is fused into the shift operation */
		state = class >> (state & DFA_MASK);
	}
	return (state & DFA_MASK) != ERR;
}

Note: GCC does not seem to unroll the "while (s < end)" loop by default, so a manual unroll might be worth trying:

static inline int
pg_is_valid_utf8(const unsigned char *s, const unsigned char *end)
{
	uint64		class;
	uint64		state = BGN;

	while (s + 4 <= end)		/* process four bytes per iteration */
	{
		for (int i = 0; i < 4; i++)
		{
			class = ByteCategory[*s++];
			state = class >> (state & DFA_MASK);
		}
	}
	while (s < end)				/* handle the remaining 0-3 bytes */
	{
		class = ByteCategory[*s++];
		state = class >> (state & DFA_MASK);
	}
	return (state & DFA_MASK) != ERR;
}

----

static int
pg_utf8_verifystr2(const unsigned char *s, int len)
{
	if (pg_is_valid_utf8(s, s + len))
	{
		/* fast path: the string is valid, just accept it */
		return len;
	}
	/* slow path: the string is not valid, perform a slower analysis */
	return ....;
}

Vladimir