Re: speed up verifying UTF-8

John Naylor Wed, 30 Jun 2021 09:54:47 -0700

On Wed, Jun 30, 2021 at 7:18 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:


> Hmm, there's one more simple trick we can do: We can have a separate
> fast-path version of the loop when there are at least 8 bytes of input
> left, skipping all the length checks. With that:

Good idea, and the numbers look good on Power8 / gcc 4.8 as well:

master:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2951 |  1521 |   871 |    1473 |   1508

v13:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     949 |   642 |   203 |    1046 |   1818

v14:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     887 |   607 |   179 |     776 |   1325
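
For illustration, the fast-path idea quoted above could be sketched roughly as follows. This is not the code from the patch: the names verify_one_unchecked, verify_one_checked and verify_str are hypothetical, NUL bytes are accepted as ordinary ASCII for simplicity, and the threshold of 8 simply follows the quoted suggestion (in this sketch any value of at least 4, the longest UTF-8 sequence, would be safe).

```c
#include <stddef.h>
#include <string.h>

#define IS_CONT(b) (((b) & 0xC0) == 0x80)

/*
 * Hypothetical unchecked variant: the caller guarantees that at least 4
 * bytes are readable at s, so no length checks are needed. Returns the
 * sequence length (1-4), or 0 on invalid input.
 */
static inline int
verify_one_unchecked(const unsigned char *s)
{
    unsigned char b1 = s[0];

    if (b1 < 0x80)
        return 1;
    if (b1 >= 0xC2 && b1 <= 0xDF)
        return IS_CONT(s[1]) ? 2 : 0;
    if (b1 >= 0xE0 && b1 <= 0xEF)
    {
        if (!IS_CONT(s[1]) || !IS_CONT(s[2]))
            return 0;
        if (b1 == 0xE0 && s[1] < 0xA0)  /* overlong encoding */
            return 0;
        if (b1 == 0xED && s[1] > 0x9F)  /* UTF-16 surrogate range */
            return 0;
        return 3;
    }
    if (b1 >= 0xF0 && b1 <= 0xF4)
    {
        if (!IS_CONT(s[1]) || !IS_CONT(s[2]) || !IS_CONT(s[3]))
            return 0;
        if (b1 == 0xF0 && s[1] < 0x90)  /* overlong encoding */
            return 0;
        if (b1 == 0xF4 && s[1] > 0x8F)  /* above U+10FFFF */
            return 0;
        return 4;
    }
    return 0;                           /* stray continuation or bad lead byte */
}

/*
 * Checked variant for the tail: copy what is left into a zero-padded
 * buffer. A truncated sequence then fails, because the 0x00 padding is
 * never a valid continuation byte.
 */
static inline int
verify_one_checked(const unsigned char *s, size_t len)
{
    unsigned char buf[4] = {0, 0, 0, 0};

    if (len == 0)
        return 0;
    memcpy(buf, s, len < 4 ? len : 4);
    return verify_one_unchecked(buf);
}

/* Returns the number of leading bytes of s that form valid UTF-8. */
static size_t
verify_str(const unsigned char *s, size_t len)
{
    const unsigned char *start = s;
    int         l;

    /* Fast path: with 8 or more bytes left, no sequence can overrun. */
    while (len >= 8)
    {
        l = verify_one_unchecked(s);
        if (l == 0)
            return (size_t) (s - start);
        s += l;
        len -= (size_t) l;
    }

    /* Slow path: fewer than 8 bytes left, so check lengths. */
    while (len > 0)
    {
        l = verify_one_checked(s, len);
        if (l == 0)
            break;
        s += l;
        len -= (size_t) l;
    }
    return (size_t) (s - start);
}
```

The point of the split is that the per-character length checks are paid only for the last few bytes of the input, while the bulk of the string goes through the loop that has none.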


I don't think the new structuring will pose any challenges for rebasing
0002, either. This might need some experimentation, though:

+ * Subroutine of pg_utf8_verifystr() to check on char. Returns the length of the
+ * character at *s in bytes, or 0 on invalid input or premature end of input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)

It seems like it would be easy to have pg_utf8_verify_one in my proposed
pg_utf8.h header and replace the body of pg_utf8_verifychar with it.
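
A minimal sketch of that arrangement, under stated assumptions: the names utf8_verify_one_sketch and utf8_verifychar_sketch are hypothetical stand-ins, the wrapper is assumed to report invalid input as -1 while the shared helper reports it as 0, and NUL handling is left out for brevity.

```c
/* ---- would live in the shared header ---- */

/*
 * Hypothetical stand-in for the shared helper: returns the length in bytes
 * of the character at *s, or 0 on invalid input or premature end of input.
 */
static inline int
utf8_verify_one_sketch(const unsigned char *s, int len)
{
    int         l;
    int         i;

    if (len < 1)
        return 0;

    /* Determine the expected length from the lead byte. */
    if (s[0] < 0x80)
        l = 1;
    else if (s[0] >= 0xC2 && s[0] <= 0xDF)
        l = 2;
    else if (s[0] >= 0xE0 && s[0] <= 0xEF)
        l = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4)
        l = 4;
    else
        return 0;               /* stray continuation or bad lead byte */

    if (l > len)
        return 0;               /* premature end of input */

    /* All trailing bytes must be continuation bytes (10xxxxxx). */
    for (i = 1; i < l; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;

    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if ((s[0] == 0xE0 && s[1] < 0xA0) ||
        (s[0] == 0xED && s[1] > 0x9F) ||
        (s[0] == 0xF0 && s[1] < 0x90) ||
        (s[0] == 0xF4 && s[1] > 0x8F))
        return 0;

    return l;
}

/* ---- would live in the .c file ---- */

/*
 * Hypothetical single-character entry point whose body is just a call to
 * the shared helper, translating the 0 error value into -1 (assumed
 * convention for this wrapper).
 */
static int
utf8_verifychar_sketch(const unsigned char *s, int len)
{
    int         l = utf8_verify_one_sketch(s, len);

    return (l == 0) ? -1 : l;
}
```

With this shape the only thing the wrapper has to do is map between the two error conventions, so the validation rules exist in one place.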

--
John Naylor
EDB: http://www.enterprisedb.com
