Attached is v20, which has a number of improvements:

1. Cleaned up and explained DFA coding.
2. Adjusted check_ascii to return bool (now called is_valid_ascii) and to produce an optimized loop using branch-free accumulators. That way, it doesn't need to be rewritten for different input lengths. I also think it's a bit easier to understand this way.
3. Put SSE helper functions in their own file.
4. Mostly-cosmetic edits to the configure detection.
5. Draft commit message.
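To give a flavor of #2, the heart of is_valid_ascii() is now just a pair of accumulators over 8-byte chunks, with the error checks deferred until after the loop. Roughly, this is a simplified sketch of what's in the attached pg_utf8.h:

    uint64      chunk,
                highbit_cum = UINT64CONST(0),
                zero_cum = UINT64CONST(0x8080808080808080);

    while (len >= sizeof(chunk))
    {
        memcpy(&chunk, s, sizeof(chunk));

        /* capture any set high bits in this chunk */
        highbit_cum |= chunk;

        /* adding 0x7f sets each byte's high bit unless the byte was zero */
        zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f));

        s += sizeof(chunk);
        len -= sizeof(chunk);
    }

    /* valid only if no high bits were seen and every byte kept its marker bit */
    return (highbit_cum & UINT64CONST(0x8080808080808080)) == 0 &&
           zero_cum == UINT64CONST(0x8080808080808080);

Since the loop body never branches on the data, the same code handles any multiple of eight bytes.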
With #2 above in place, I wanted to try different strides for the DFA, so more measurements (hopefully not many more of these):

Power8, gcc 4.8

HEAD:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v20, 8-byte stride:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1189 |   550 |   246 |     600 |    936

v20, 16-byte stride (in the actual patch):

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     981 |   440 |   134 |     791 |    820

v20, 32-byte stride:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     857 |   481 |   141 |     834 |    839

Based on the above, I decided that 16 bytes had the best overall balance. Other platforms may differ, but I don't expect it to make a huge amount of difference.

Just for fun, I was also a bit curious about what Vladimir mentioned upthread about x86-64-v3 offering a different shift instruction. Somehow, clang 12 refused to build with that target, even though the release notes say it can, but gcc 11 was fine:

x86 Macbook, gcc 11, USE_FALLBACK_UTF8=1:

HEAD:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1200 |   728 |   370 |     544 |    637

v20:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     459 |   243 |    77 |     424 |    440

v20, CFLAGS="-march=x86-64-v3 -O2":

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     390 |   215 |    77 |     303 |    323

And gcc does generate the desired shift here:

objdump -S src/port/pg_utf8_fallback.o | grep shrx
  53:   c4 e2 eb f7 d1          shrxq  %rdx, %rcx, %rdx

While that looks good, clang can do about as well by simply unrolling all 16 shifts in the loop, which gcc won't do. To be clear, this is irrelevant in practice, since x86-64-v3 includes AVX2, and if we had that we would just use it with the SIMD algorithm.

Macbook x86, clang 12:

HEAD:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v20, USE_FALLBACK_UTF8=1:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     351 |   172 |    88 |     349 |    350

v20, with SSE4:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     142 |    92 |    59 |     141 |    141

I'm pretty happy with the patch at this point.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From c82cbcf342f986396152a743a552626757b0a2b3 Mon Sep 17 00:00:00 2001 From: John Naylor <john.naylor@2ndquadrant.com> Date: Sun, 25 Jul 2021 20:41:41 -0400 Subject: [PATCH v20] Add a fast path for validating UTF-8 text Our previous validator is a traditional one that performs comparisons and branching on one byte at a time. It's useful in that we always know exactly how many bytes we have validated, but that precision comes at a cost. Input validation can show up prominently in profiles of COPY FROM, and future improvements to COPY FROM such as parallelism or line and field parsing will put more pressure on input validation. Hence, supplement with two fast path implementations: On machines that support SSE4, use an algorithm described in the paper "Validating UTF-8 In Less Than One Instruction Per Byte" by John Keiser and Daniel Lemire. The authors have made available an open source implementation within the simdjson library (Apache 2.0 license). The lookup tables and naming conventions were adopted from this library, but the code was written from scratch. On other hardware, use a "shift-based" DFA. Both implementations are heavily optimized for blocks of ASCII text, are relatively free of branching and thus robust against many kinds of byte patterns, and delay error checking to the very end. With these algorithms, UTF-8 validation is from anywhere from two to seven times faster, depending on platform and the distribution of byte sequences in the input. The previous coding in pg_utf8_verifystr() is retained for short strings and for when the fast path returns an error. Review, performance testing, and additional hacking by: Heikki Linakangas, Vladimir Sitnikov, Amit Khandekar, Thomas Munro, and Greg Stark Discussion: https://www.postgresql.org/message-id/CAFBsxsEV_SzH%2BOLyCiyon%3DiwggSyMh_eF6A3LU2tiWf3Cy2ZQg%40mail.gmail.com --- config/c-compiler.m4 | 28 +- configure | 112 ++++++-- configure.ac | 61 +++- src/Makefile.global.in | 3 + src/common/wchar.c | 36 +++ src/include/mb/pg_wchar.h | 7 + src/include/pg_config.h.in | 9 + src/include/port/pg_sse42_utils.h | 163 +++++++++++ src/include/port/pg_utf8.h | 98 +++++++ src/port/Makefile | 6 + src/port/pg_utf8_fallback.c | 250 ++++++++++++++++ src/port/pg_utf8_sse42.c | 347 +++++++++++++++++++++++ src/port/pg_utf8_sse42_choose.c | 68 +++++ src/test/regress/expected/conversion.out | 112 ++++++++ src/test/regress/sql/conversion.sql | 81 ++++++ src/tools/msvc/Mkvcbuild.pm | 4 + src/tools/msvc/Solution.pm | 3 + 17 files changed, 1344 insertions(+), 44 deletions(-) create mode 100644 src/include/port/pg_sse42_utils.h create mode 100644 src/include/port/pg_utf8.h create mode 100644 src/port/pg_utf8_fallback.c create mode 100644 src/port/pg_utf8_sse42.c create mode 100644 src/port/pg_utf8_sse42_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 780e906ecc..49d592a53c 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -591,36 +591,46 @@ if test x"$pgac_cv_gcc_atomic_int64_cas" = x"yes"; then AC_DEFINE(HAVE_GCC__ATOMIC_INT64_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int64 *, int64 *, int64).]) fi])# PGAC_HAVE_GCC__ATOMIC_INT64_CAS -# PGAC_SSE42_CRC32_INTRINSICS +# PGAC_SSE42_INTRINSICS # --------------------------- # Check if the compiler supports the x86 CRC instructions added in SSE 4.2, # using the _mm_crc32_u8 and _mm_crc32_u32 intrinsic functions. 
(We don't # test the 8-byte variant, _mm_crc32_u64, but it is assumed to be present if # the other ones are, on x86-64 platforms) # +# Also, check for support of x86 instructions added in SSSE3 and SSE4.1, +# in particular _mm_alignr_epi8, _mm_shuffle_epi8, and _mm_testz_si128. +# We might be able to assume these are understood by the compiler if CRC +# intrinsics are, but it's better to document our reliance on them here. +# +# We don't test for SSE2 intrinsics, as they are assumed to be present if +# SSE 4.2 intrinsics are. +# # An optional compiler flag can be passed as argument (e.g. -msse4.2). If the -# intrinsics are supported, sets pgac_sse42_crc32_intrinsics, and CFLAGS_SSE42. -AC_DEFUN([PGAC_SSE42_CRC32_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=$1], [Ac_cachevar], +# intrinsics are supported, sets pgac_sse42_intrinsics, and CFLAGS_SSE42. +AC_DEFUN([PGAC_SSE42_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_intrinsics_$1])])dnl +AC_CACHE_CHECK([for for _mm_crc32_u8, _mm_crc32_u32, _mm_alignr_epi8, _mm_shuffle_epi8, and _mm_testz_si128 with CFLAGS=$1], [Ac_cachevar], [pgac_save_CFLAGS=$CFLAGS CFLAGS="$pgac_save_CFLAGS $1" AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <nmmintrin.h>], [unsigned int crc = 0; crc = _mm_crc32_u8(crc, 0); crc = _mm_crc32_u32(crc, 0); + __m128i vec = _mm_set1_epi8(crc); + vec = _mm_shuffle_epi8(vec, + _mm_alignr_epi8(vec, vec, 1)); /* return computed value, to prevent the above being optimized away */ - return crc == 0;])], + return _mm_testz_si128(vec, vec);])], [Ac_cachevar=yes], [Ac_cachevar=no]) CFLAGS="$pgac_save_CFLAGS"]) if test x"$Ac_cachevar" = x"yes"; then CFLAGS_SSE42="$1" - pgac_sse42_crc32_intrinsics=yes + pgac_sse42_intrinsics=yes fi undefine([Ac_cachevar])dnl -])# PGAC_SSE42_CRC32_INTRINSICS - +])# PGAC_SSE42_INTRINSICS # PGAC_ARMV8_CRC32C_INTRINSICS # ---------------------------- diff --git a/configure b/configure index e468def49e..bb5e15ce41 100755 --- a/configure +++ b/configure @@ -645,6 +645,7 @@ XGETTEXT MSGMERGE MSGFMT_FLAGS MSGFMT +PG_UTF8_OBJS PG_CRC32C_OBJS CFLAGS_ARMV8_CRC32C CFLAGS_SSE42 @@ -17963,14 +17964,14 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h fi -# Check for Intel SSE 4.2 intrinsics to do CRC calculations. +# Check for Intel SSE 4.2 intrinsics. # -# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used +# First check if these intrinsics can be used # with the default compiler flags. If not, check if adding the -msse4.2 # flag helps. CFLAGS_SSE42 is set to -msse4.2 if that's required. -{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=" >&5 -$as_echo_n "checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=... " >&6; } -if ${pgac_cv_sse42_crc32_intrinsics_+:} false; then : +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for for _mm_crc32_u8, _mm_crc32_u32, _mm_alignr_epi8, _mm_shuffle_epi8, and _mm_testz_si128 with CFLAGS=" >&5 +$as_echo_n "checking for for _mm_crc32_u8, _mm_crc32_u32, _mm_alignr_epi8, _mm_shuffle_epi8, and _mm_testz_si128 with CFLAGS=... 
" >&6; } +if ${pgac_cv_sse42_intrinsics_+:} false; then : $as_echo_n "(cached) " >&6 else pgac_save_CFLAGS=$CFLAGS @@ -17984,32 +17985,35 @@ main () unsigned int crc = 0; crc = _mm_crc32_u8(crc, 0); crc = _mm_crc32_u32(crc, 0); + __m128i vec = _mm_set1_epi8(crc); + vec = _mm_shuffle_epi8(vec, + _mm_alignr_epi8(vec, vec, 1)); /* return computed value, to prevent the above being optimized away */ - return crc == 0; + return _mm_testz_si128(vec, vec); ; return 0; } _ACEOF if ac_fn_c_try_link "$LINENO"; then : - pgac_cv_sse42_crc32_intrinsics_=yes + pgac_cv_sse42_intrinsics_=yes else - pgac_cv_sse42_crc32_intrinsics_=no + pgac_cv_sse42_intrinsics_=no fi rm -f core conftest.err conftest.$ac_objext \ conftest$ac_exeext conftest.$ac_ext CFLAGS="$pgac_save_CFLAGS" fi -{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sse42_crc32_intrinsics_" >&5 -$as_echo "$pgac_cv_sse42_crc32_intrinsics_" >&6; } -if test x"$pgac_cv_sse42_crc32_intrinsics_" = x"yes"; then +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sse42_intrinsics_" >&5 +$as_echo "$pgac_cv_sse42_intrinsics_" >&6; } +if test x"$pgac_cv_sse42_intrinsics_" = x"yes"; then CFLAGS_SSE42="" - pgac_sse42_crc32_intrinsics=yes + pgac_sse42_intrinsics=yes fi -if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then - { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=-msse4.2" >&5 -$as_echo_n "checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=-msse4.2... " >&6; } -if ${pgac_cv_sse42_crc32_intrinsics__msse4_2+:} false; then : +if test x"$pgac_sse42_intrinsics" != x"yes"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for for _mm_crc32_u8, _mm_crc32_u32, _mm_alignr_epi8, _mm_shuffle_epi8, and _mm_testz_si128 with CFLAGS=-msse4.2" >&5 +$as_echo_n "checking for for _mm_crc32_u8, _mm_crc32_u32, _mm_alignr_epi8, _mm_shuffle_epi8, and _mm_testz_si128 with CFLAGS=-msse4.2... " >&6; } +if ${pgac_cv_sse42_intrinsics__msse4_2+:} false; then : $as_echo_n "(cached) " >&6 else pgac_save_CFLAGS=$CFLAGS @@ -18023,26 +18027,29 @@ main () unsigned int crc = 0; crc = _mm_crc32_u8(crc, 0); crc = _mm_crc32_u32(crc, 0); + __m128i vec = _mm_set1_epi8(crc); + vec = _mm_shuffle_epi8(vec, + _mm_alignr_epi8(vec, vec, 1)); /* return computed value, to prevent the above being optimized away */ - return crc == 0; + return _mm_testz_si128(vec, vec); ; return 0; } _ACEOF if ac_fn_c_try_link "$LINENO"; then : - pgac_cv_sse42_crc32_intrinsics__msse4_2=yes + pgac_cv_sse42_intrinsics__msse4_2=yes else - pgac_cv_sse42_crc32_intrinsics__msse4_2=no + pgac_cv_sse42_intrinsics__msse4_2=no fi rm -f core conftest.err conftest.$ac_objext \ conftest$ac_exeext conftest.$ac_ext CFLAGS="$pgac_save_CFLAGS" fi -{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sse42_crc32_intrinsics__msse4_2" >&5 -$as_echo "$pgac_cv_sse42_crc32_intrinsics__msse4_2" >&6; } -if test x"$pgac_cv_sse42_crc32_intrinsics__msse4_2" = x"yes"; then +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sse42_intrinsics__msse4_2" >&5 +$as_echo "$pgac_cv_sse42_intrinsics__msse4_2" >&6; } +if test x"$pgac_cv_sse42_intrinsics__msse4_2" = x"yes"; then CFLAGS_SSE42="-msse4.2" - pgac_sse42_crc32_intrinsics=yes + pgac_sse42_intrinsics=yes fi fi @@ -18177,12 +18184,12 @@ fi # in the template or configure command line. 
if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x""; then # Use Intel SSE 4.2 if available. - if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then + if test x"$pgac_sse42_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then USE_SSE42_CRC32C=1 else # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for # the runtime check. - if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then + if test x"$pgac_sse42_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1 else # Use ARM CRC Extension if available. @@ -18196,7 +18203,7 @@ if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && # fall back to slicing-by-8 algorithm, which doesn't require any # special CPU support. USE_SLICING_BY_8_CRC32C=1 - fi + fi fi fi fi @@ -18249,6 +18256,59 @@ $as_echo "slicing-by-8" >&6; } fi +# Select UTF-8 validator implementation. +# +# If we are targeting a processor that has SSE 4.2 instructions, we can use +# those to validate UTF-8 characters. If we're not targeting such +# a processor, but we can nevertheless produce code that uses the SSE +# intrinsics, perhaps with some extra CFLAGS, compile both implementations and +# select which one to use at runtime, depending on whether SSE 4.2 is supported +# by the processor we're running on. +# +# You can override this logic by setting the appropriate USE_*_UTF8 flag to 1 +# in the template or configure command line. +if test x"$USE_SSE42_UTF8" = x"" && test x"$USE_SSE42_UTF8_WITH_RUNTIME_CHECK" = x"" && test x"$USE_FALLBACK_UTF8" = x""; then + if test x"$pgac_sse42_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then + USE_SSE42_UTF8=1 + else + if test x"$pgac_sse42_intrinsics" = x"yes"; then + USE_SSE42_UTF8_WITH_RUNTIME_CHECK=1 + else + # fall back to algorithm which doesn't require any special + # CPU support. + USE_FALLBACK_UTF8=1 + fi + fi +fi + +# Set PG_UTF8_OBJS appropriately depending on the selected implementation. +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking which UTF-8 validator to use" >&5 +$as_echo_n "checking which UTF-8 validator to use... " >&6; } +if test x"$USE_SSE42_UTF8" = x"1"; then + +$as_echo "#define USE_SSE42_UTF8 1" >>confdefs.h + + PG_UTF8_OBJS="pg_utf8_sse42.o" + { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5 +$as_echo "SSE 4.2" >&6; } +else + if test x"$USE_SSE42_UTF8_WITH_RUNTIME_CHECK" = x"1"; then + +$as_echo "#define USE_SSE42_UTF8_WITH_RUNTIME_CHECK 1" >>confdefs.h + + PG_UTF8_OBJS="pg_utf8_sse42.o pg_utf8_fallback.o pg_utf8_sse42_choose.o" + { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime check" >&5 +$as_echo "SSE 4.2 with runtime check" >&6; } + else + +$as_echo "#define USE_FALLBACK_UTF8 1" >>confdefs.h + + PG_UTF8_OBJS="pg_utf8_fallback.o" + { $as_echo "$as_me:${as_lineno-$LINENO}: result: fallback" >&5 +$as_echo "fallback" >&6; } + fi +fi + # Select semaphore implementation type. 
if test "$PORTNAME" != "win32"; then diff --git a/configure.ac b/configure.ac index 39666f9727..2431565760 100644 --- a/configure.ac +++ b/configure.ac @@ -2059,14 +2059,14 @@ if test x"$pgac_cv__cpuid" = x"yes"; then AC_DEFINE(HAVE__CPUID, 1, [Define to 1 if you have __cpuid.]) fi -# Check for Intel SSE 4.2 intrinsics to do CRC calculations. +# Check for Intel SSE 4.2 intrinsics. # -# First check if the _mm_crc32_u8 and _mm_crc32_u64 intrinsics can be used -# with the default compiler flags. If not, check if adding the -msse4.2 +# First check if these intrinsics can be used with the default +# compiler flags. If not, check if adding the -msse4.2 # flag helps. CFLAGS_SSE42 is set to -msse4.2 if that's required. -PGAC_SSE42_CRC32_INTRINSICS([]) -if test x"$pgac_sse42_crc32_intrinsics" != x"yes"; then - PGAC_SSE42_CRC32_INTRINSICS([-msse4.2]) +PGAC_SSE42_INTRINSICS([]) +if test x"$pgac_sse42_intrinsics" != x"yes"; then + PGAC_SSE42_INTRINSICS([-msse4.2]) fi AC_SUBST(CFLAGS_SSE42) @@ -2107,12 +2107,12 @@ AC_SUBST(CFLAGS_ARMV8_CRC32C) # in the template or configure command line. if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x""; then # Use Intel SSE 4.2 if available. - if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then + if test x"$pgac_sse42_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then USE_SSE42_CRC32C=1 else # Intel SSE 4.2, with runtime check? The CPUID instruction is needed for # the runtime check. - if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then + if test x"$pgac_sse42_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" = x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then USE_SSE42_CRC32C_WITH_RUNTIME_CHECK=1 else # Use ARM CRC Extension if available. @@ -2126,7 +2126,7 @@ if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && # fall back to slicing-by-8 algorithm, which doesn't require any # special CPU support. USE_SLICING_BY_8_CRC32C=1 - fi + fi fi fi fi @@ -2163,6 +2163,49 @@ else fi AC_SUBST(PG_CRC32C_OBJS) +# Select UTF-8 validator implementation. +# +# If we are targeting a processor that has SSE 4.2 instructions, we can use +# those to validate UTF-8 characters. If we're not targeting such +# a processor, but we can nevertheless produce code that uses the SSE +# intrinsics, perhaps with some extra CFLAGS, compile both implementations and +# select which one to use at runtime, depending on whether SSE 4.2 is supported +# by the processor we're running on. +# +# You can override this logic by setting the appropriate USE_*_UTF8 flag to 1 +# in the template or configure command line. +if test x"$USE_SSE42_UTF8" = x"" && test x"$USE_SSE42_UTF8_WITH_RUNTIME_CHECK" = x"" && test x"$USE_FALLBACK_UTF8" = x""; then + if test x"$pgac_sse42_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then + USE_SSE42_UTF8=1 + else + if test x"$pgac_sse42_intrinsics" = x"yes"; then + USE_SSE42_UTF8_WITH_RUNTIME_CHECK=1 + else + # fall back to algorithm which doesn't require any special + # CPU support. + USE_FALLBACK_UTF8=1 + fi + fi +fi + +# Set PG_UTF8_OBJS appropriately depending on the selected implementation. 
+AC_MSG_CHECKING([which UTF-8 validator to use]) +if test x"$USE_SSE42_UTF8" = x"1"; then + AC_DEFINE(USE_SSE42_UTF8, 1, [Define to 1 use Intel SSE 4.2 instructions.]) + PG_UTF8_OBJS="pg_utf8_sse42.o" + AC_MSG_RESULT(SSE 4.2) +else + if test x"$USE_SSE42_UTF8_WITH_RUNTIME_CHECK" = x"1"; then + AC_DEFINE(USE_SSE42_UTF8_WITH_RUNTIME_CHECK, 1, [Define to 1 to use Intel SSE 4.2 instructions with a runtime check.]) + PG_UTF8_OBJS="pg_utf8_sse42.o pg_utf8_fallback.o pg_utf8_sse42_choose.o" + AC_MSG_RESULT(SSE 4.2 with runtime check) + else + AC_DEFINE(USE_FALLBACK_UTF8, 1, [Define to 1 to use the fallback.]) + PG_UTF8_OBJS="pg_utf8_fallback.o" + AC_MSG_RESULT(fallback) + fi +fi +AC_SUBST(PG_UTF8_OBJS) # Select semaphore implementation type. if test "$PORTNAME" != "win32"; then diff --git a/src/Makefile.global.in b/src/Makefile.global.in index 8f05840821..f54433933b 100644 --- a/src/Makefile.global.in +++ b/src/Makefile.global.in @@ -721,6 +721,9 @@ LIBOBJS = @LIBOBJS@ # files needed for the chosen CRC-32C implementation PG_CRC32C_OBJS = @PG_CRC32C_OBJS@ +# files needed for the chosen UTF-8 validation implementation +PG_UTF8_OBJS = @PG_UTF8_OBJS@ + LIBS := -lpgcommon -lpgport $(LIBS) # to make ws2_32.lib the last library diff --git a/src/common/wchar.c b/src/common/wchar.c index 0636b8765b..bf13fa6515 100644 --- a/src/common/wchar.c +++ b/src/common/wchar.c @@ -13,6 +13,7 @@ #include "c.h" #include "mb/pg_wchar.h" +#include "port/pg_utf8.h" /* @@ -1761,7 +1762,42 @@ static int pg_utf8_verifystr(const unsigned char *s, int len) { const unsigned char *start = s; + int valid_bytes = 0; + /* + * For longer strings, dispatch to an optimized implementation. + * + * The threshold is somewhat arbitrary. XXX: If you change this, you must + * change the tests in conversion.sql to match! + * WIP: test different thesholds? + */ + if (len >= 32) + { + /* platform-specific implementation in src/port */ + valid_bytes = UTF8_VERIFYSTR_FAST(s, len); + s += valid_bytes; + len -= valid_bytes; + + /* + * When checking multiple bytes at a time, it's possible to end within + * a multibyte sequence, which wouldn't have raised an error above. + * Before checking the remaining bytes, first walk backwards to find + * the last byte that could have been the start of a valid sequence. + */ + while (s > start) + { + s--; + len++; + + if (!IS_HIGHBIT_SET(*s) || + IS_UTF8_2B_LEAD(*s) || + IS_UTF8_3B_LEAD(*s) || + IS_UTF8_4B_LEAD(*s)) + break; + } + } + + /* check remaining bytes */ while (len > 0) { int l; diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h index d93ccac263..045bbbcb7e 100644 --- a/src/include/mb/pg_wchar.h +++ b/src/include/mb/pg_wchar.h @@ -29,6 +29,13 @@ typedef unsigned int pg_wchar; */ #define MAX_MULTIBYTE_CHAR_LEN 4 +/* + * UTF-8 macros + */ +#define IS_UTF8_2B_LEAD(c) (((c) & 0xe0) == 0xc0) +#define IS_UTF8_3B_LEAD(c) (((c) & 0xf0) == 0xe0) +#define IS_UTF8_4B_LEAD(c) (((c) & 0xf8) == 0xf0) + /* * various definitions for EUC */ diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 783b8fc1ba..6d759145a8 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -898,6 +898,9 @@ /* Define to 1 to build with BSD Authentication support. (--with-bsd-auth) */ #undef USE_BSD_AUTH +/* Define to 1 to use the fallback. */ +#undef USE_FALLBACK_UTF8 + /* Define to build with ICU support. (--with-icu) */ #undef USE_ICU @@ -935,6 +938,12 @@ /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. 
*/ #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK +/* Define to 1 use Intel SSE 4.2 instructions. */ +#undef USE_SSE42_UTF8 + +/* Define to 1 to use Intel SSE 4.2 instructions with a runtime check. */ +#undef USE_SSE42_UTF8_WITH_RUNTIME_CHECK + /* Define to build with systemd support. (--with-systemd) */ #undef USE_SYSTEMD diff --git a/src/include/port/pg_sse42_utils.h b/src/include/port/pg_sse42_utils.h new file mode 100644 index 0000000000..deafb3e5f8 --- /dev/null +++ b/src/include/port/pg_sse42_utils.h @@ -0,0 +1,163 @@ +/*------------------------------------------------------------------------- + * + * pg_sse42_utils.h + * Convenience functions to wrap SSE 4.2 intrinsics. + * + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/port/pg_sse42_utils.h + * + *------------------------------------------------------------------------- + */ +#ifndef PG_SSE42_UTILS +#define PG_SSE42_UTILS + +#include <nmmintrin.h> + + +/* assign the arguments to the lanes in the register */ +#define vset(...) _mm_setr_epi8(__VA_ARGS__) + +/* return a zeroed register */ +static inline const __m128i +vzero() +{ + return _mm_setzero_si128(); +} + +/* perform an unaligned load from memory into a register */ +static inline const __m128i +vload(const unsigned char *raw_input) +{ + return _mm_loadu_si128((const __m128i *) raw_input); +} + +/* return a vector with each 8-bit lane populated with the input scalar */ +static inline __m128i +splat(char byte) +{ + return _mm_set1_epi8(byte); +} + +/* return false if a register is zero, true otherwise */ +static inline bool +to_bool(const __m128i v) +{ + /* + * _mm_testz_si128 returns 1 if the bitwise AND of the two arguments is + * zero. Zero is the only value whose bitwise AND with itself is zero. + */ + return !_mm_testz_si128(v, v); +} + +/* vector version of IS_HIGHBIT_SET() */ +static inline bool +is_highbit_set(const __m128i v) +{ + return _mm_movemask_epi8(v) != 0; +} + +/* bitwise vector operations */ + +static inline __m128i +bitwise_and(const __m128i v1, const __m128i v2) +{ + return _mm_and_si128(v1, v2); +} + +static inline __m128i +bitwise_or(const __m128i v1, const __m128i v2) +{ + return _mm_or_si128(v1, v2); +} + +static inline __m128i +bitwise_xor(const __m128i v1, const __m128i v2) +{ + return _mm_xor_si128(v1, v2); +} + +/* perform signed greater-than on all 8-bit lanes */ +static inline __m128i +greater_than(const __m128i v1, const __m128i v2) +{ + return _mm_cmpgt_epi8(v1, v2); +} + +/* set bits in the error vector where bytes in the input are zero */ +static inline void +check_for_zeros(const __m128i v, __m128i * error) +{ + const __m128i cmp = _mm_cmpeq_epi8(v, vzero()); + + *error = bitwise_or(*error, cmp); +} + +/* + * Do unsigned subtraction, but instead of wrapping around + * on overflow, stop at zero. Useful for emulating unsigned + * comparison. + */ +static inline __m128i +saturating_sub(const __m128i v1, const __m128i v2) +{ + return _mm_subs_epu8(v1, v2); +} + +/* + * Shift right each 8-bit lane + * + * There is no intrinsic to do this on 8-bit lanes, so shift + * right in each 16-bit lane then apply a mask in each 8-bit + * lane shifted the same amount. 
+ */ +static inline __m128i +shift_right(const __m128i v, const int n) +{ + const __m128i shift16 = _mm_srli_epi16(v, n); + const __m128i mask = splat(0xFF >> n); + + return bitwise_and(shift16, mask); +} + +/* + * Shift entire 'input' register right by N 8-bit lanes, and + * replace the first N lanes with the last N lanes from the + * 'prev' register. Could be stated in C thusly: + * + * ((prev << 128) | input) >> (N * 8) + * + * The third argument to the intrinsic must be a numeric constant, so + * we must have separate functions for different shift amounts. + */ +static inline __m128i +prev1(__m128i prev, __m128i input) +{ + return _mm_alignr_epi8(input, prev, sizeof(__m128i) - 1); +} + +static inline __m128i +prev2(__m128i prev, __m128i input) +{ + return _mm_alignr_epi8(input, prev, sizeof(__m128i) - 2); +} + +static inline __m128i +prev3(__m128i prev, __m128i input) +{ + return _mm_alignr_epi8(input, prev, sizeof(__m128i) - 3); +} + +/* + * For each 8-bit lane in the input, use that value as an index + * into the lookup vector as if it were a 16-element byte array. + */ +static inline __m128i +lookup(const __m128i input, const __m128i lookup) +{ + return _mm_shuffle_epi8(lookup, input); +} + +#endif /* PG_SSE42_UTILS */ diff --git a/src/include/port/pg_utf8.h b/src/include/port/pg_utf8.h new file mode 100644 index 0000000000..dcecfed9e2 --- /dev/null +++ b/src/include/port/pg_utf8.h @@ -0,0 +1,98 @@ +/*------------------------------------------------------------------------- + * + * pg_utf8.h + * Routines for fast validation of UTF-8 text. + * + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/port/pg_utf8.h + * + *------------------------------------------------------------------------- + */ +#ifndef PG_UTF8_H +#define PG_UTF8_H + + +#if defined(USE_SSE42_UTF8) +/* Use SSE 4.2 instructions. */ +#define UTF8_VERIFYSTR_FAST(s, len) \ + pg_validate_utf8_sse42((s), (len)) + +extern int pg_validate_utf8_sse42(const unsigned char *s, int len); + +#elif defined(USE_SSE42_UTF8_WITH_RUNTIME_CHECK) +/* Use SSE 4.2 instructions, but perform a runtime check first. */ +#define UTF8_VERIFYSTR_FAST(s, len) \ + pg_validate_utf8((s), (len)) + +extern int pg_validate_utf8_fallback(const unsigned char *s, int len); +extern int (*pg_validate_utf8) (const unsigned char *s, int len); +extern int pg_validate_utf8_sse42(const unsigned char *s, int len); + +#else +/* Use a portable implementation */ +#define UTF8_VERIFYSTR_FAST(s, len) \ + pg_validate_utf8_fallback((s), (len)) + +extern int pg_validate_utf8_fallback(const unsigned char *s, int len); + +#endif /* USE_SSE42_UTF8 */ + +/* The following are visible everywhere. */ + +/* + * Verify a chunk of bytes for valid ASCII including a zero-byte check. + * This is here in case non-UTF8 encodings want to use it. + */ +static inline bool +is_valid_ascii(const unsigned char *s, int len) +{ + uint64 chunk, + highbit_cum = UINT64CONST(0), + zero_cum = UINT64CONST(0x8080808080808080); + + Assert(len % sizeof(chunk) == 0); + + while (len >= sizeof(chunk)) + { + memcpy(&chunk, s, sizeof(chunk)); + + /* Capture any set bits in this chunk. */ + highbit_cum |= chunk; + + /* + * Capture any zero bytes in this chunk. + * + * First, add 0x7f to each byte. This sets the high bit in each byte, + * unless it was a zero. 
We will check later that none of the bytes in + * the chunk had the high bit set, in which case the max value each + * byte can have after the addition is 0x7f + 0x7f = 0xfe, and we + * don't need to worry about carrying over to the next byte. + * + * If any resulting high bits are zero, the corresponding high bits in + * the zero accumulator will be cleared. + */ + zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f)); + + s += sizeof(chunk); + len -= sizeof(chunk); + } + + /* Check for any set high bits in the high bit accumulator. */ + if (highbit_cum & UINT64CONST(0x8080808080808080)) + return false; + + /* + * Check if all bytes in the zero accumulator still have the high bit set. + * XXX: This check is only valid after checking the high bit accumulator, + * as noted above. + */ + if (zero_cum == UINT64CONST(0x8080808080808080)) + return true; + else + return false; +} + +#endif /* PG_UTF8_H */ diff --git a/src/port/Makefile b/src/port/Makefile index 52dbf5783f..04838b0ab2 100644 --- a/src/port/Makefile +++ b/src/port/Makefile @@ -40,6 +40,7 @@ LIBS += $(PTHREAD_LIBS) OBJS = \ $(LIBOBJS) \ $(PG_CRC32C_OBJS) \ + $(PG_UTF8_OBJS) \ bsearch_arg.o \ chklocale.o \ erand48.o \ @@ -89,6 +90,11 @@ libpgport.a: $(OBJS) thread.o: CFLAGS+=$(PTHREAD_CFLAGS) thread_shlib.o: CFLAGS+=$(PTHREAD_CFLAGS) +# all versions of pg_utf8_sse42.o need CFLAGS_SSE42 +pg_utf8_sse42.o: CFLAGS+=$(CFLAGS_SSE42) +pg_utf8_sse42_shlib.o: CFLAGS+=$(CFLAGS_SSE42) +pg_utf8_sse42_srv.o: CFLAGS+=$(CFLAGS_SSE42) + # all versions of pg_crc32c_sse42.o need CFLAGS_SSE42 pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_SSE42) pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_SSE42) diff --git a/src/port/pg_utf8_fallback.c b/src/port/pg_utf8_fallback.c new file mode 100644 index 0000000000..579b67a288 --- /dev/null +++ b/src/port/pg_utf8_fallback.c @@ -0,0 +1,250 @@ +/*------------------------------------------------------------------------- + * + * pg_utf8_fallback.c + * Validate UTF-8 using plain C. + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/port/pg_utf8_fallback.c + * + *------------------------------------------------------------------------- + */ + +#include "c.h" + +#include "port/pg_utf8.h" + +/* + * The fallback UTF-8 validator uses a "shift-based" DFA as described by Per + * Vognsen: + * + * https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 + * + * In a traditional table-driven DFA, the input byte and current state are + * used to compute the index into an array of state transitions. Since the + * load address is dependent on earlier work, the CPU is not kept busy. + * + * Now, in a shift-based DFA, the input byte is an index into array of + * integers that encode the state transitions. To retrieve the current state, + * you simply shift the integer by the current state and apply a mask. In + * this scheme, loads only depend on the input byte, so there is better + * piplining. + * + * The naming conventions, but not code, in this file are adopted from an + * implementation (not shift-based) of a UTF-8 to UTF-16/32 transcoder, whose + * table follows: + * + * https://github.com/BobSteagall/utf_utils/blob/master/src/utf_utils.cpp + * + * Compared to the orginal, ERR and BGN/END are switched to make the shift + * encodings simpler. ERR is lower case in the table for readability. 
+ * + * ILL NZA CR1 CR2 CR3 L2A L3A L3B L3C L4A L4B L4C CLASS / STATE + * ========================================================================= + * err, err, err, err, err, err, err, err, err, err, err, err, | ERR + * err, END, err, err, err, CS1, P3A, CS2, P3B, P4A, CS3, P4B, | BGN|END + * + * err, err, END, END, END, err, err, err, err, err, err, err, | CS1 + * err, err, CS1, CS1, CS1, err, err, err, err, err, err, err, | CS2 + * err, err, CS2, CS2, CS2, err, err, err, err, err, err, err, | CS3 + * + * err, err, err, err, CS1, err, err, err, err, err, err, err, | P3A + * err, err, CS1, CS1, err, err, err, err, err, err, err, err, | P3B + * + * err, err, err, CS2, CS2, err, err, err, err, err, err, err, | P4A + * err, err, CS2, err, err, err, err, err, err, err, err, err, | P4B + * + * The states and categories are spelled out below. + */ + +/* + * With six bits per state, the mask is 63, whose importance is described + * later. + */ +#define DFA_BITS_PER_STATE 6 +#define DFA_MASK ((1 << DFA_BITS_PER_STATE) - 1) + +/* + * This determines how much to advance the string pointer each time per loop. + * Sixteen seems to give the best balance of performance across different + * byte distributions. + */ +#define STRIDE_LENGTH 16 + +/* possible transition states for the DFA */ + +/* Invalid state */ +#define ERR UINT64CONST(0) +/* Begin */ +#define BGN (UINT64CONST(1) * DFA_BITS_PER_STATE) +/* Continuation states */ +#define CS1 (UINT64CONST(2) * DFA_BITS_PER_STATE) +#define CS2 (UINT64CONST(3) * DFA_BITS_PER_STATE) +#define CS3 (UINT64CONST(4) * DFA_BITS_PER_STATE) +/* Partial 3-byte sequence states */ +#define P3A (UINT64CONST(5) * DFA_BITS_PER_STATE) +#define P3B (UINT64CONST(6) * DFA_BITS_PER_STATE) +/* Partial 4-byte sequence states */ +#define P4A (UINT64CONST(7) * DFA_BITS_PER_STATE) +#define P4B (UINT64CONST(8) * DFA_BITS_PER_STATE) +/* Begin and End are the same state */ +#define END BGN + +/* + * The byte categories are 64-bit integers that encode within them the state + * transitions. Shifting by the current state gives the next state. 
+ */ + +/* invalid byte */ +#define ILL ERR + +/* non-zero ASCII */ +#define NZA (END << BGN) + +/* continuation byte */ +#define CR1 (END << CS1) | (CS1 << CS2) | (CS2 << CS3) | (CS1 << P3B) | (CS2 << P4B) +#define CR2 (END << CS1) | (CS1 << CS2) | (CS2 << CS3) | (CS1 << P3B) | (CS2 << P4A) +#define CR3 (END << CS1) | (CS1 << CS2) | (CS2 << CS3) | (CS1 << P3A) | (CS2 << P4A) + +/* 2-byte lead */ +#define L2A (CS1 << BGN) + +/* 3-byte lead */ +#define L3A (P3A << BGN) +#define L3B (CS2 << BGN) +#define L3C (P3B << BGN) + +/* 4-byte lead */ +#define L4A (P4A << BGN) +#define L4B (CS3 << BGN) +#define L4C (P4B << BGN) + +/* map an input byte to its byte category */ +const uint64 ByteCategory[256] = +{ + /* ASCII */ + + ILL, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + NZA, NZA, NZA, NZA, NZA, NZA, NZA, NZA, + + /* continuation bytes */ + + /* 80..8F */ + CR1, CR1, CR1, CR1, CR1, CR1, CR1, CR1, + CR1, CR1, CR1, CR1, CR1, CR1, CR1, CR1, + + /* 90..9F */ + CR2, CR2, CR2, CR2, CR2, CR2, CR2, CR2, + CR2, CR2, CR2, CR2, CR2, CR2, CR2, CR2, + + /* A0..BF */ + CR3, CR3, CR3, CR3, CR3, CR3, CR3, CR3, + CR3, CR3, CR3, CR3, CR3, CR3, CR3, CR3, + CR3, CR3, CR3, CR3, CR3, CR3, CR3, CR3, + CR3, CR3, CR3, CR3, CR3, CR3, CR3, CR3, + + /* leading bytes */ + + /* C0..DF */ + ILL, ILL, L2A, L2A, L2A, L2A, L2A, L2A, + L2A, L2A, L2A, L2A, L2A, L2A, L2A, L2A, + L2A, L2A, L2A, L2A, L2A, L2A, L2A, L2A, + L2A, L2A, L2A, L2A, L2A, L2A, L2A, L2A, + + /* E0..EF */ + L3A, L3B, L3B, L3B, L3B, L3B, L3B, L3B, + L3B, L3B, L3B, L3B, L3B, L3C, L3B, L3B, + + /* F0..FF */ + L4A, L4B, L4B, L4B, L4C, ILL, ILL, ILL, + ILL, ILL, ILL, ILL, ILL, ILL, ILL, ILL, +}; + +static inline uint64 +utf8_advance(const unsigned char *s, uint64 state, int len) +{ + /* Note: We deliberately don't track the state within the loop. */ + while (len > 0) + { + /* + * It's important that the mask value is 63: In most instruction sets, + * a shift by a 64-bit operand is understood to be a shift by its mod + * 64, so the compiler should elide the mask operation. + */ + state = ByteCategory[*s++] >> (state & DFA_MASK); + len--; + } + + return state & DFA_MASK; +} + +/* + * Returns the string length if valid, or zero on error. + * + * If valid, it's still possible we ended within an incomplete multibyte + * sequence, so the caller is responsible for adjusting the returned result + * to make sure it represents the end of the last valid byte sequence. + * + * In the error case, the caller must start over at the beginning and verify + * one byte at a time. + * + * See also the comment in common/wchar.c under "multibyte sequence + * validators". 
+ */ +int +pg_validate_utf8_fallback(const unsigned char *s, int len) +{ + const int orig_len = len; + uint64 state = BGN; + + while (len >= STRIDE_LENGTH) + { + /* + * If the chunk is all ASCII, we can skip the full UTF-8 check, but we + * must still check for a non-END state, which means the previous + * chunk ended in the middle of a multibyte sequence. + */ + if (state != END || !is_valid_ascii(s, STRIDE_LENGTH)) + state = utf8_advance(s, state, STRIDE_LENGTH); + + s += STRIDE_LENGTH; + len -= STRIDE_LENGTH; + } + + /* + * Check remaining bytes. + * + * XXX: Since we pass s and len by value, they are no longer meaningful + * after this point, but that's okay, because we know we're at the end. + */ + utf8_advance(s, state, len); + + /* + * If we saw an error during the loop, let the caller handle it. We treat + * all other states as success. + */ + if (state == ERR) + return 0; + else + return orig_len; +} diff --git a/src/port/pg_utf8_sse42.c b/src/port/pg_utf8_sse42.c new file mode 100644 index 0000000000..78afbfe6ac --- /dev/null +++ b/src/port/pg_utf8_sse42.c @@ -0,0 +1,347 @@ +/*------------------------------------------------------------------------- + * + * pg_utf8_sse42.c + * Validate UTF-8 using SSE 4.2 instructions. + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/port/pg_utf8_sse42.c + * + *------------------------------------------------------------------------- + */ + +#include "c.h" + +#include "port/pg_sse42_utils.h" +#include "port/pg_utf8.h" + +/* + * This module is based on the paper "Validating UTF-8 In Less Than One + * Instruction Per Byte" by John Keiser and Daniel Lemire: + * + * https://arxiv.org/pdf/2010.03090.pdf + * + * The authors provide an implementation of this algorithm in the simdjson + * library (Apache 2.0 license): + * + * https://github.com/simdjson/simdjson. + * + * The PG code was written from scratch, but with some naming conventions + * adapted from the Westmere implementation of simdjson. The constants and + * lookup tables were taken directly from simdjson with some cosmetic + * rearrangements. + * + * The core of the lookup algorithm is a two-part process: + * + * 1. Classify 2-byte sequences. All 2-byte errors can be found by looking at + * the first three nibbles of each overlapping 2-byte sequence, using three + * separate lookup tables. The interesting bytes are either definite errors + * or two continuation bytes in a row. The latter may be valid depending on + * what came before. + * + * 2. Find starts of possible 3- and 4-byte sequences. + * + * Combining the above results allows us to verify any UTF-8 sequence. + */ + +/* constants for comparing bytes at the end of a vector */ +#define MAX_CONTINUATION 0xBF +#define MAX_TWO_BYTE_LEAD 0xDF +#define MAX_THREE_BYTE_LEAD 0xEF + +/* lookup tables for classifying two-byte sequences */ + +/* + * 11______ 0_______ + * 11______ 11______ + */ +#define TOO_SHORT (1 << 0) + +/* 0_______ 10______ */ +#define TOO_LONG (1 << 1) + +/* 1100000_ 10______ */ +#define OVERLONG_2 (1 << 2) + +/* 11100000 100_____ */ +#define OVERLONG_3 (1 << 3) + +/* The following two symbols intentionally share the same value. 
*/ + +/* 11110000 1000____ */ +#define OVERLONG_4 (1 << 4) + +/* + * 11110101 1000____ + * 1111011_ 1000____ + * 11111___ 1000____ + */ +#define TOO_LARGE_1000 (1 << 4) + +/* + * 11110100 1001____ + * 11110100 101_____ + * 11110101 1001____ + * 11110101 101_____ + * 1111011_ 1001____ + * 1111011_ 101_____ + * 11111___ 1001____ + * 11111___ 101_____ + */ +#define TOO_LARGE (1 << 5) + +/* 11101101 101_____ */ +#define SURROGATE (1 << 6) + +/* + * 10______ 10______ + * + * The cast here is to silence warnings about implicit conversion + * from 'int' to 'char'. It's fine that this is a negative value, + * because we only care about the pattern of bits. + */ +#define TWO_CONTS ((char) (1 << 7)) + +/* These all have ____ in byte 1 */ +#define CARRY (TOO_SHORT | TOO_LONG | TWO_CONTS) + +/* + * table for categorizing bits in the high nibble of + * the first byte of a 2-byte sequence + */ +#define BYTE_1_HIGH_TABLE \ + /* 0_______ ________ <ASCII in byte 1> */ \ + TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG, \ + TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG, \ + /* 10______ ________ <continuation in byte 1> */ \ + TWO_CONTS, TWO_CONTS, TWO_CONTS, TWO_CONTS, \ + /* 1100____ ________ <two byte lead in byte 1> */ \ + TOO_SHORT | OVERLONG_2, \ + /* 1101____ ________ <two byte lead in byte 1> */ \ + TOO_SHORT, \ + /* 1110____ ________ <three byte lead in byte 1> */ \ + TOO_SHORT | OVERLONG_3 | SURROGATE, \ + /* 1111____ ________ <four+ byte lead in byte 1> */ \ + TOO_SHORT | TOO_LARGE | TOO_LARGE_1000 | OVERLONG_4 + +/* + * table for categorizing bits in the low nibble of + * the first byte of a 2-byte sequence + */ +#define BYTE_1_LOW_TABLE \ + /* ____0000 ________ */ \ + CARRY | OVERLONG_2 | OVERLONG_3 | OVERLONG_4, \ + /* ____0001 ________ */ \ + CARRY | OVERLONG_2, \ + /* ____001_ ________ */ \ + CARRY, \ + CARRY, \ + /* ____0100 ________ */ \ + CARRY | TOO_LARGE, \ + /* ____0101 ________ */ \ + CARRY | TOO_LARGE | TOO_LARGE_1000, \ + /* ____011_ ________ */ \ + CARRY | TOO_LARGE | TOO_LARGE_1000, \ + CARRY | TOO_LARGE | TOO_LARGE_1000, \ + /* ____1___ ________ */ \ + CARRY | TOO_LARGE | TOO_LARGE_1000, \ + CARRY | TOO_LARGE | TOO_LARGE_1000, \ + CARRY | TOO_LARGE | TOO_LARGE_1000, \ + CARRY | TOO_LARGE | TOO_LARGE_1000, \ + CARRY | TOO_LARGE | TOO_LARGE_1000, \ + /* ____1101 ________ */ \ + CARRY | TOO_LARGE | TOO_LARGE_1000 | SURROGATE, \ + CARRY | TOO_LARGE | TOO_LARGE_1000, \ + CARRY | TOO_LARGE | TOO_LARGE_1000 + +/* + * table for categorizing bits in the high nibble of + * the second byte of a 2-byte sequence + */ +#define BYTE_2_HIGH_TABLE \ + /* ________ 0_______ <ASCII in byte 2> */ \ + TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT, \ + TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT, \ + /* ________ 1000____ */ \ + TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE_1000 | OVERLONG_4, \ + /* ________ 1001____ */ \ + TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE, \ + /* ________ 101_____ */ \ + TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE, \ + TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE, \ + /* ________ 11______ */ \ + TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT \ + + +/* + * Return a vector with lanes non-zero where we have either errors, or + * two or more continuations in a row. 
+ */ +static inline __m128i +check_special_cases(const __m128i prev, const __m128i input) +{ + const __m128i byte_1_high_table = vset(BYTE_1_HIGH_TABLE); + const __m128i byte_1_low_table = vset(BYTE_1_LOW_TABLE); + const __m128i byte_2_high_table = vset(BYTE_2_HIGH_TABLE); + + /* + * To classify the first byte in each chunk we need to have the last byte + * from the previous chunk. + */ + const __m128i input_shift1 = prev1(prev, input); + + /* put the relevant nibbles into their own bytes in their own registers */ + const __m128i byte_1_high = shift_right(input_shift1, 4); + const __m128i byte_1_low = bitwise_and(input_shift1, splat(0x0F)); + const __m128i byte_2_high = shift_right(input, 4); + + /* lookup the possible errors for each set of nibbles */ + const __m128i lookup_1_high = lookup(byte_1_high, byte_1_high_table); + const __m128i lookup_1_low = lookup(byte_1_low, byte_1_low_table); + const __m128i lookup_2_high = lookup(byte_2_high, byte_2_high_table); + + /* + * AND all the lookups together. At this point, non-zero lanes in the + * returned vector represent: + * + * 1. invalid 2-byte sequences + * + * 2. the second continuation byte of a 3- or 4-byte character + * + * 3. the third continuation byte of a 4-byte character + */ + const __m128i temp = bitwise_and(lookup_1_high, lookup_1_low); + + return bitwise_and(temp, lookup_2_high); +} + +/* + * Return a vector with lanes set to TWO_CONTS where we expect to find two + * continuations in a row. These are valid only within 3- and 4-byte sequences. + */ +static inline __m128i +check_multibyte_lengths(const __m128i prev, const __m128i input) +{ + /* + * Populate registers that contain the input shifted right by 2 and 3 + * bytes, filling in the left lanes with the previous input. + */ + const __m128i input_shift2 = prev2(prev, input); + const __m128i input_shift3 = prev3(prev, input); + + /* + * Constants for comparison. Any 3-byte lead is greater than + * MAX_TWO_BYTE_LEAD, etc. + */ + const __m128i max_lead2 = splat(MAX_TWO_BYTE_LEAD); + const __m128i max_lead3 = splat(MAX_THREE_BYTE_LEAD); + + /* + * Look in the shifted registers for 3- or 4-byte leads. There is no + * unsigned comparison, so we use saturating subtraction followed by + * signed comparison with zero. Any non-zero bytes in the result represent + * valid leads. + */ + const __m128i is_third_byte = saturating_sub(input_shift2, max_lead2); + const __m128i is_fourth_byte = saturating_sub(input_shift3, max_lead3); + + /* OR them together for easier comparison */ + const __m128i temp = bitwise_or(is_third_byte, is_fourth_byte); + + /* + * Set all bits in each 8-bit lane if the result is greater than zero. + * Signed arithmetic is okay because the values are small. + */ + const __m128i must23 = greater_than(temp, vzero()); + + /* + * We want to compare with the result of check_special_cases() so apply a + * mask to return only the set bits corresponding to the "two + * continuations" case. + */ + return bitwise_and(must23, splat(TWO_CONTS)); +} + +/* set bits in the error vector where we find invalid UTF-8 input */ +static inline void +check_utf8_bytes(const __m128i prev, const __m128i input, __m128i * error) +{ + const __m128i special_cases = check_special_cases(prev, input); + const __m128i expect_two_conts = check_multibyte_lengths(prev, input); + + /* If the two cases are identical, this will be zero. 
*/ + const __m128i result = bitwise_xor(expect_two_conts, special_cases); + + *error = bitwise_or(*error, result); +} + +/* return non-zero if the input terminates with an incomplete code point */ +static inline __m128i +is_incomplete(const __m128i v) +{ + const __m128i max_array = + vset(0xFF, 0xFF, 0xFF, 0xFF, + 0xFF, 0xFF, 0xFF, 0xFF, + 0xFF, 0xFF, 0xFF, 0xFF, + 0xFF, MAX_THREE_BYTE_LEAD, MAX_TWO_BYTE_LEAD, MAX_CONTINUATION); + + return saturating_sub(v, max_array); +} + +/* + * Returns the number of bytes validated, or zero on error. + * + * If valid, it's still possible we ended within an incomplete multibyte + * sequence, so the caller is responsible for adjusting the returned result + * to make sure it represents the end of the last valid byte sequence. In + * addition, the returned length can only be a multiple of register-width, so + * the caller must verify any remaining bytes. + * + * In the error case, the caller must start over at the beginning and verify + * one byte at a time. + * + * See also the comment in common/wchar.c under "multibyte sequence + * validators". + */ +int +pg_validate_utf8_sse42(const unsigned char *s, int len) +{ + const unsigned char *start = s; + __m128i error = vzero(); + __m128i prev = vzero(); + __m128i prev_incomplete = vzero(); + __m128i input; + + while (len >= sizeof(input)) + { + input = vload(s); + check_for_zeros(input, &error); + + /* + * If the chunk is all ASCII, we can skip the full UTF-8 check, but we + * must still check the previous chunk for incomplete multibyte + * sequences at the end. We only update prev_incomplete if the chunk + * contains non-ASCII. + */ + if (is_highbit_set(input)) + { + check_utf8_bytes(prev, input, &error); + prev_incomplete = is_incomplete(input); + } + else + error = bitwise_or(error, prev_incomplete); + + prev = input; + s += sizeof(input); + len -= sizeof(input); + } + + /* If we saw an error during the loop, let the caller handle it. */ + if (to_bool(error)) + return 0; + else + return s - start; +} diff --git a/src/port/pg_utf8_sse42_choose.c b/src/port/pg_utf8_sse42_choose.c new file mode 100644 index 0000000000..973fe69225 --- /dev/null +++ b/src/port/pg_utf8_sse42_choose.c @@ -0,0 +1,68 @@ +/*------------------------------------------------------------------------- + * + * pg_utf8_sse42_choose.c + * Choose between SSE 4.2 and fallback implementation. + * + * On first call, checks if the CPU we're running on supports SSE 4.2. + * If it does, use SSE instructions for UTF-8 validation. Otherwise, + * fall back to the pure C implementation. + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/port/pg_utf8_sse42_choose.c + * + *------------------------------------------------------------------------- + */ + +#include "c.h" + +#ifdef HAVE__GET_CPUID +#include <cpuid.h> +#endif + +#ifdef HAVE__CPUID +#include <intrin.h> +#endif + +#include "port/pg_utf8.h" + +static bool +pg_utf8_sse42_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; + +#if defined(HAVE__GET_CPUID) + __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUID) + __cpuid(exx, 1); +#else + + /* + * XXX The equivalent check for CRC throws an error here because it + * detects CPUID presence at configure time. This is to avoid indirecting + * through a function pointer, but that's not important for UTF-8. 
+ */ + return false; +#endif /* HAVE__GET_CPUID */ + return (exx[2] & (1 << 20)) != 0; /* SSE 4.2 */ +} + +/* + * This gets called on the first call. It replaces the function pointer + * so that subsequent calls are routed directly to the chosen implementation. + */ +static int +pg_validate_utf8_choose(const unsigned char *s, int len) +{ + if (pg_utf8_sse42_available()) + pg_validate_utf8 = pg_validate_utf8_sse42; + else + pg_validate_utf8 = pg_validate_utf8_fallback; + + return pg_validate_utf8(s, len); +} + +int (*pg_validate_utf8) (const unsigned char *s, int len) = pg_validate_utf8_choose; diff --git a/src/test/regress/expected/conversion.out b/src/test/regress/expected/conversion.out index 04fdcba496..c318fee1b9 100644 --- a/src/test/regress/expected/conversion.out +++ b/src/test/regress/expected/conversion.out @@ -72,6 +72,118 @@ $$; -- -- UTF-8 -- +-- The description column must be unique. +CREATE TABLE utf8_verification_inputs (inbytes bytea, description text PRIMARY KEY); +insert into utf8_verification_inputs values + ('\x66006f', 'NUL byte'), + ('\xaf', 'bare continuation'), + ('\xc5', 'missing second byte in 2-byte char'), + ('\xc080', 'smallest 2-byte overlong'), + ('\xc1bf', 'largest 2-byte overlong'), + ('\xc280', 'next 2-byte after overlongs'), + ('\xdfbf', 'largest 2-byte'), + ('\xe9af', 'missing third byte in 3-byte char'), + ('\xe08080', 'smallest 3-byte overlong'), + ('\xe09fbf', 'largest 3-byte overlong'), + ('\xe0a080', 'next 3-byte after overlong'), + ('\xed9fbf', 'last before surrogates'), + ('\xeda080', 'smallest surrogate'), + ('\xedbfbf', 'largest surrogate'), + ('\xee8080', 'next after surrogates'), + ('\xefbfbf', 'largest 3-byte'), + ('\xf1afbf', 'missing fourth byte in 4-byte char'), + ('\xf0808080', 'smallest 4-byte overlong'), + ('\xf08fbfbf', 'largest 4-byte overlong'), + ('\xf0908080', 'next 4-byte after overlong'), + ('\xf48fbfbf', 'largest 4-byte'), + ('\xf4908080', 'smallest too large'), + ('\xfa9a9a8a8a', '5-byte'); +-- Test UTF-8 verification +select description, (test_conv(inbytes, 'utf8', 'utf8')).* from utf8_verification_inputs; + description | result | errorat | error +------------------------------------+------------+--------------+---------------------------------------------------------------- + NUL byte | \x66 | \x006f | invalid byte sequence for encoding "UTF8": 0x00 + bare continuation | \x | \xaf | invalid byte sequence for encoding "UTF8": 0xaf + missing second byte in 2-byte char | \x | \xc5 | invalid byte sequence for encoding "UTF8": 0xc5 + smallest 2-byte overlong | \x | \xc080 | invalid byte sequence for encoding "UTF8": 0xc0 0x80 + largest 2-byte overlong | \x | \xc1bf | invalid byte sequence for encoding "UTF8": 0xc1 0xbf + next 2-byte after overlongs | \xc280 | | + largest 2-byte | \xdfbf | | + missing third byte in 3-byte char | \x | \xe9af | invalid byte sequence for encoding "UTF8": 0xe9 0xaf + smallest 3-byte overlong | \x | \xe08080 | invalid byte sequence for encoding "UTF8": 0xe0 0x80 0x80 + largest 3-byte overlong | \x | \xe09fbf | invalid byte sequence for encoding "UTF8": 0xe0 0x9f 0xbf + next 3-byte after overlong | \xe0a080 | | + last before surrogates | \xed9fbf | | + smallest surrogate | \x | \xeda080 | invalid byte sequence for encoding "UTF8": 0xed 0xa0 0x80 + largest surrogate | \x | \xedbfbf | invalid byte sequence for encoding "UTF8": 0xed 0xbf 0xbf + next after surrogates | \xee8080 | | + largest 3-byte | \xefbfbf | | + missing fourth byte in 4-byte char | \x | \xf1afbf | invalid byte sequence for encoding 
"UTF8": 0xf1 0xaf 0xbf + smallest 4-byte overlong | \x | \xf0808080 | invalid byte sequence for encoding "UTF8": 0xf0 0x80 0x80 0x80 + largest 4-byte overlong | \x | \xf08fbfbf | invalid byte sequence for encoding "UTF8": 0xf0 0x8f 0xbf 0xbf + next 4-byte after overlong | \xf0908080 | | + largest 4-byte | \xf48fbfbf | | + smallest too large | \x | \xf4908080 | invalid byte sequence for encoding "UTF8": 0xf4 0x90 0x80 0x80 + 5-byte | \x | \xfa9a9a8a8a | invalid byte sequence for encoding "UTF8": 0xfa +(23 rows) + +-- Test UTF-8 verification with ASCII padding appended to provide +-- coverage for algorithms that work on multiple bytes at a time. +with test_bytes as ( + -- The error message for a sequence starting with a 4-byte lead + -- will contain all 4 bytes if they are present, so add 3 + -- ASCII bytes to the end to ensure consistent error messages. + select + inbytes, + description, + (test_conv(inbytes || repeat('.', 3)::bytea, 'utf8', 'utf8')).error + from utf8_verification_inputs +), test_padded as ( + select + description, + (test_conv(inbytes || repeat('.', 32)::bytea, 'utf8', 'utf8')).error + from test_bytes +) +select + description, + b.error as orig_error, + p.error as error_after_padding +from test_padded p +join test_bytes b +using (description) +where p.error is distinct from b.error +order by description; + description | orig_error | error_after_padding +-------------+------------+--------------------- +(0 rows) + +-- Test ASCII fast path with cases where incomplete UTF-8 sequences +-- fall at the end of a 16-byte boundary followed by more ASCII. +with test_bytes as ( + select + inbytes, + description, + (test_conv(inbytes || repeat('.', 3)::bytea, 'utf8', 'utf8')).error + from utf8_verification_inputs +), test_padded as ( + select + description, + (test_conv(repeat('.', 32 - length(inbytes))::bytea || inbytes || repeat('.', 32)::bytea, 'utf8', 'utf8')).error + from test_bytes +) +select + description, + b.error as orig_error, + p.error as error_after_padding +from test_padded p +join test_bytes b +using (description) +where p.error is distinct from b.error +order by description; + description | orig_error | error_after_padding +-------------+------------+--------------------- +(0 rows) + CREATE TABLE utf8_inputs (inbytes bytea, description text); insert into utf8_inputs values ('\x666f6f', 'valid, pure ASCII'), diff --git a/src/test/regress/sql/conversion.sql b/src/test/regress/sql/conversion.sql index 8358682432..bce39e5296 100644 --- a/src/test/regress/sql/conversion.sql +++ b/src/test/regress/sql/conversion.sql @@ -74,6 +74,87 @@ $$; -- -- UTF-8 -- +-- The description column must be unique. 
+CREATE TABLE utf8_verification_inputs (inbytes bytea, description text PRIMARY KEY); +insert into utf8_verification_inputs values + ('\x66006f', 'NUL byte'), + ('\xaf', 'bare continuation'), + ('\xc5', 'missing second byte in 2-byte char'), + ('\xc080', 'smallest 2-byte overlong'), + ('\xc1bf', 'largest 2-byte overlong'), + ('\xc280', 'next 2-byte after overlongs'), + ('\xdfbf', 'largest 2-byte'), + ('\xe9af', 'missing third byte in 3-byte char'), + ('\xe08080', 'smallest 3-byte overlong'), + ('\xe09fbf', 'largest 3-byte overlong'), + ('\xe0a080', 'next 3-byte after overlong'), + ('\xed9fbf', 'last before surrogates'), + ('\xeda080', 'smallest surrogate'), + ('\xedbfbf', 'largest surrogate'), + ('\xee8080', 'next after surrogates'), + ('\xefbfbf', 'largest 3-byte'), + ('\xf1afbf', 'missing fourth byte in 4-byte char'), + ('\xf0808080', 'smallest 4-byte overlong'), + ('\xf08fbfbf', 'largest 4-byte overlong'), + ('\xf0908080', 'next 4-byte after overlong'), + ('\xf48fbfbf', 'largest 4-byte'), + ('\xf4908080', 'smallest too large'), + ('\xfa9a9a8a8a', '5-byte'); + +-- Test UTF-8 verification +select description, (test_conv(inbytes, 'utf8', 'utf8')).* from utf8_verification_inputs; + +-- Test UTF-8 verification with ASCII padding appended to provide +-- coverage for algorithms that work on multiple bytes at a time. +with test_bytes as ( + -- The error message for a sequence starting with a 4-byte lead + -- will contain all 4 bytes if they are present, so add 3 + -- ASCII bytes to the end to ensure consistent error messages. + select + inbytes, + description, + (test_conv(inbytes || repeat('.', 3)::bytea, 'utf8', 'utf8')).error + from utf8_verification_inputs +), test_padded as ( + select + description, + (test_conv(inbytes || repeat('.', 32)::bytea, 'utf8', 'utf8')).error + from test_bytes +) +select + description, + b.error as orig_error, + p.error as error_after_padding +from test_padded p +join test_bytes b +using (description) +where p.error is distinct from b.error +order by description; + +-- Test ASCII fast path with cases where incomplete UTF-8 sequences +-- fall at the end of a 16-byte boundary followed by more ASCII. 
+with test_bytes as ( + select + inbytes, + description, + (test_conv(inbytes || repeat('.', 3)::bytea, 'utf8', 'utf8')).error + from utf8_verification_inputs +), test_padded as ( + select + description, + (test_conv(repeat('.', 32 - length(inbytes))::bytea || inbytes || repeat('.', 32)::bytea, 'utf8', 'utf8')).error + from test_bytes +) +select + description, + b.error as orig_error, + p.error as error_after_padding +from test_padded p +join test_bytes b +using (description) +where p.error is distinct from b.error +order by description; + CREATE TABLE utf8_inputs (inbytes bytea, description text); insert into utf8_inputs values ('\x666f6f', 'valid, pure ASCII'), diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm index 233ddbf4c2..9b8bad9044 100644 --- a/src/tools/msvc/Mkvcbuild.pm +++ b/src/tools/msvc/Mkvcbuild.pm @@ -120,10 +120,14 @@ sub mkvcbuild push(@pgportfiles, 'pg_crc32c_sse42_choose.c'); push(@pgportfiles, 'pg_crc32c_sse42.c'); push(@pgportfiles, 'pg_crc32c_sb8.c'); + push(@pgportfiles, 'pg_utf8_sse42_choose.c'); + push(@pgportfiles, 'pg_utf8_sse42.c'); + push(@pgportfiles, 'pg_utf8_fallback.c'); } else { push(@pgportfiles, 'pg_crc32c_sb8.c'); + push(@pgportfiles, 'pg_utf8_fallback.c'); } our @pgcommonallfiles = qw( diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm index fcb43b0ca0..db7a84e30a 100644 --- a/src/tools/msvc/Solution.pm +++ b/src/tools/msvc/Solution.pm @@ -490,6 +490,7 @@ sub GenerateFiles USE_ASSERT_CHECKING => $self->{options}->{asserts} ? 1 : undef, USE_BONJOUR => undef, USE_BSD_AUTH => undef, + USE_FALLBACK_UTF8 => undef, USE_ICU => $self->{options}->{icu} ? 1 : undef, USE_LIBXML => undef, USE_LIBXSLT => undef, @@ -502,6 +503,8 @@ sub GenerateFiles USE_SLICING_BY_8_CRC32C => undef, USE_SSE42_CRC32C => undef, USE_SSE42_CRC32C_WITH_RUNTIME_CHECK => 1, + USE_SSE42_UTF8 => undef, + USE_SSE42_UTF8_WITH_RUNTIME_CHECK => 1, USE_SYSTEMD => undef, USE_SYSV_SEMAPHORES => undef, USE_SYSV_SHARED_MEMORY => undef, -- 2.31.1