On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?
>
> >+ utf8_advance(s, state, len);
> >+
> >+ /*
> >+  * If we saw an error during the loop, let the caller handle it. We treat
> >+  * all other states as success.
> >+  */
> >+ if (state == ERR)
> >+     return 0;
>
> Did you mean state = utf8_advance(s, state, len); there? (reassign state variable)

Yep, that's a bug, thanks for catching!

> >I wanted to try different strides for the DFA
>
> Does that (and "len >= 32" condition) mean the patch does not improve validation of the shorter strings (the ones less than 32 bytes)?

Right. Also, the 32 byte threshold was just a temporary need for testing 32-byte stride -- testing different thresholds wouldn't hurt. I'm not terribly concerned about short strings, though, as long as we don't regress. That said, Heikki had something in his v14 [1] that we could use:

+/*
+ * Subroutine of pg_utf8_verifystr() to check on char. Returns the length of the
+ * character at *s in bytes, or 0 on invalid input or premature end of input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)

It would be easy to replace pg_utf8_verifychar with this. It might even speed up the SQL function length_in_encoding() -- that would be a better reason to do it.

[1] https://www.postgresql.org/message-id/2f95e70d-4623-87d4-9f24-ca534155f179%40iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com
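
As a rough illustration of the per-character check described above (return the character's length in bytes, or 0 on invalid input or premature end of input), here is a minimal, self-contained sketch. It is not the code from [1]: the names verify_one_sketch and verify_str_sketch are hypothetical, and the sketch applies only the plain UTF-8 well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (10xxxxxx). */
static int
is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical sketch: return the length in bytes of the character at *s,
 * or 0 on invalid input or premature end of input.
 */
static int
verify_one_sketch(const unsigned char *s, int len)
{
    unsigned char b1, b2;

    if (len < 1)
        return 0;
    b1 = s[0];
    if (b1 < 0x80)
        return 1;               /* ASCII */
    if (b1 < 0xC2 || b1 > 0xF4)
        return 0;               /* stray continuation, overlong lead, or too large */
    if (len < 2)
        return 0;
    b2 = s[1];
    if (!is_cont(b2))
        return 0;
    if (b1 < 0xE0)
        return 2;               /* two-byte sequence */
    if (b1 < 0xF0)
    {
        if (len < 3 || !is_cont(s[2]))
            return 0;
        if (b1 == 0xE0 && b2 < 0xA0)
            return 0;           /* overlong three-byte form */
        if (b1 == 0xED && b2 > 0x9F)
            return 0;           /* UTF-16 surrogate range */
        return 3;
    }
    if (len < 4 || !is_cont(s[2]) || !is_cont(s[3]))
        return 0;
    if (b1 == 0xF0 && b2 < 0x90)
        return 0;               /* overlong four-byte form */
    if (b1 == 0xF4 && b2 > 0x8F)
        return 0;               /* above U+10FFFF */
    return 4;
}

/*
 * Hypothetical caller for short inputs: walk the string one character at a
 * time and return the number of leading bytes that are valid.
 */
static int
verify_str_sketch(const unsigned char *s, int len)
{
    const unsigned char *start = s;

    while (len > 0)
    {
        int l = verify_one_sketch(s, len);

        if (l == 0)
            break;
        s += l;
        len -= l;
    }
    return (int) (s - start);
}
```

A string shorter than whatever stride threshold is chosen could go through a loop like verify_str_sketch, so short inputs behave the same as before while longer ones take the DFA path.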