On Thu, Feb 6, 2025 at 3:49 AM Devulapalli, Raghuveer
<raghuveer.devulapa...@intel.com> wrote:
>
> This patch improves the performance of the SSE4.2 CRC32C algorithm. The current
> SSE4.2 implementation of CRC32C relies on the native crc32 instruction and
> processes 8 bytes at a time in a loop. The technique in this paper uses the
> pclmulqdq instruction and processes 64 bytes at a time. The algorithm is based
> on the SSE4.2 version of crc32 computation from Chromium's copy of zlib with
> modified constants for crc32c computation. See:
>
>
>
> https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c

Thanks for the patch!

> Microbenchmarks (generated with google benchmark using a standalone version 
> of the same algorithms):

|----------------------------------+---------------------+---------------+---------------|
| Benchmark                        | Buffer size (bytes) | Time Old (ns) | Time New (ns) |
|----------------------------------+---------------------+---------------+---------------|
| [scalar_crc32c vs. sse42_crc32c] | 64                  | 33            | 6             |
| [scalar_crc32c vs. sse42_crc32c] | 128                 | 88            | 9             |
| [scalar_crc32c vs. sse42_crc32c] | 256                 | 211           | 17            |
| [scalar_crc32c vs. sse42_crc32c] | 512                 | 486           | 30            |
| [scalar_crc32c vs. sse42_crc32c] | 1024                | 1037          | 57            |
| [scalar_crc32c vs. sse42_crc32c] | 2048                | 2140          | 116           |
|----------------------------------+---------------------+---------------+---------------|

I'm highly suspicious of these numbers because they show this version
is about 20x faster than "scalar", so relatively speaking 3x faster
than the AVX-512 proposal? I ran my own benchmarks with the attached
script using your test function from the other thread, and found this
patch is actually slower than master on anything smaller than 256
bytes. Luckily, that's easily fixable: it turns out the implementation
in the paper (and Chromium) has a very inefficient finalization step,
using more carryless multiplications and other operations than
necessary. After the main loop, it's much more efficient to convert
the 128-bit intermediate result directly into a CRC in the usual way.
See here for details:

https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
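
To make that concrete (a sketch in plain C, not code from any of the
attached patches; the model function names are mine): the crc32
instruction itself performs the polynomial reduction, so once the data
has been folded down to a single 128-bit value, two crc32 steps over
its 64-bit halves produce the final CRC, the way v2-0004 uses
_mm_crc32_u64, with no further carryless multiplications or Barrett
reduction:

```c
#include <stdint.h>

/* Bytewise CRC-32C step (reflected poly 0x82F63B78), no pre/post
 * inversion: this is what the x86 crc32 instruction computes. */
static uint32_t
crc32c_byte(uint32_t crc, uint8_t b)
{
	crc ^= b;
	for (int i = 0; i < 8; i++)
		crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	return crc;
}

/* Software model of _mm_crc32_u64: consume 8 bytes, least significant first. */
static uint32_t
mm_crc32_u64_model(uint32_t crc, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		crc = crc32c_byte(crc, (uint8_t) (v >> (8 * i)));
	return crc;
}

/* Finalization "in the usual way": once folding has left one 128-bit
 * intermediate (lo, hi), two crc32 steps reduce it to the CRC. */
static uint32_t
finalize128(uint64_t lo, uint64_t hi)
{
	uint32_t	crc = mm_crc32_u64_model(0, lo);

	return mm_crc32_u64_model(crc, hi);
}
```

The first two functions only model what the hardware instruction
already computes; the point is that the finalization becomes two
dependent crc32 steps instead of a chain of pclmulqdq operations.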

The author of this article also has an MIT-licensed program to
generate CRC implementations with specified requirements (including on
ARM):

https://github.com/corsix/fast-crc32

I generated a similar function for v2-0004 and this benchmark shows
it's faster than master on 128 bytes and above, which is encouraging
(see attached graph). As I mentioned in the other thread, we need to
be mindful of the fact that the latency of carryless multiplication
varies wildly on different microarchitectures. I did the benchmarks on
my older machine, which I believe has a latency of 7 cycles for this
instruction. Looking around *briefly*, the most recent x86 chips with
worse latency seem to be from around 2012-13 (e.g. Intel Silvermont
and AMD Piledriver). Chips newer than mine have a latency of 4-7
cycles. It seems okay to assume those who care the most about
performance will be on hardware less than 12 years old, but I'm open
to arguments to be more conservative here.
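
As an aside for anyone following along: carryless multiplication is
ordinary polynomial multiplication over GF(2), i.e. the partial
products are XORed rather than added, which is what makes CRC folding
possible. A toy software model (illustrative only; the real pclmulqdq
instruction multiplies two selected 64-bit lanes into a 128-bit
result):

```c
#include <stdint.h>

/* Carryless (GF(2) polynomial) multiply: shift-and-XOR instead of
 * shift-and-add, so no carries propagate between bit positions.
 * Operands are kept small enough for the result to fit in 64 bits. */
static uint64_t
clmul(uint64_t a, uint64_t b)
{
	uint64_t	r = 0;

	while (b)
	{
		if (b & 1)
			r ^= a;
		a <<= 1;
		b >>= 1;
	}
	return r;
}
```

For example, (x + 1) * (x + 1) = x^2 + 1 here, because the two middle
terms cancel instead of carrying: clmul(3, 3) == 5.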

Large inputs would make the graph hard to read, so I'll just put the
results for 8kB here:

master: 1504ms
v1:      543ms
v2:      533ms

Some thoughts on building (not a complete review):

--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -557,14 +557,19 @@ AC_DEFUN([PGAC_SSE42_CRC32_INTRINSICS],
 [define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics])])dnl
 AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32], [Ac_cachevar],
 [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <nmmintrin.h>
+    #include <wmmintrin.h>
     #if defined(__has_attribute) && __has_attribute (target)
-    __attribute__((target("sse4.2")))
+    __attribute__((target("sse4.2,pclmul")))

It's probably okay to fold these together in the same compile-time
check, since both are fairly old by now, but for those following
along, pclmul is not in SSE 4.2 and is a bit newer. So this would
cause machines building on Nehalem (2008) to fail the check and go
back to slicing-by-8 with it written this way. That's a long time ago,
so I'm not sure if anyone would notice, and I think we could fix it
for people using a packaged binary by having a fallback wrapper
function that just calls the SSE 4.2 "tail", as 0002 calls it.
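
To sketch what I mean (self-contained toy code, not from the patches;
the names and the cpuid stand-in are hypothetical): the runtime
dispatch would pick the pclmul path when available and otherwise a
wrapper that just calls the tail, so a packaged binary whose
compile-time check passed with pclmul still runs on older chips:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc32c_fn) (uint32_t crc, const void *data, size_t len);

/* Bytewise CRC-32C, standing in for the SSE 4.2 "tail" from 0002. */
static uint32_t
crc32c_tail(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	}
	return crc;
}

/* Hypothetical pclmul path; here it just defers to the tail for brevity. */
static uint32_t
crc32c_pclmul(uint32_t crc, const void *data, size_t len)
{
	return crc32c_tail(crc, data, len);
}

/* Fallback wrapper for chips without pclmul: just call the tail. */
static uint32_t
crc32c_sse42_only(uint32_t crc, const void *data, size_t len)
{
	return crc32c_tail(crc, data, len);
}

/* Stand-in for a runtime cpuid check. */
static int
have_pclmul(void)
{
	return 0;
}

static uint32_t
comp_crc32c(uint32_t crc, const void *data, size_t len)
{
	static crc32c_fn chosen = NULL;

	if (chosen == NULL)
		chosen = have_pclmul() ? crc32c_pclmul : crc32c_sse42_only;
	return chosen(crc, data, len);
}
```

Either way the caller sees the same function; only the pointer chosen
at first use differs, which is how the existing runtime checks for the
crc32 instruction itself already work.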

--
John Naylor
Amazon Web Services

Attachment: test-crc.sh
Description: application/shellscript

From 5b329ccf89986ab5e6dd170f8fa317ed206e2137 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amon...@intel.com>
Date: Mon, 6 May 2024 08:34:17 -0700
Subject: [PATCH v2 2/4] Add a Postgres SQL function for crc32c benchmarking

Add a drive_crc32c() function to use for benchmarking crc32c
computation. The function takes 2 arguments:

(1) count: number of times CRC32C is computed in a loop.
(2) num: number of bytes in the buffer to compute the CRC over.

XXX not for commit

Extracted from a patch by Raghuveer Devulapalli
---
 contrib/meson.build                          |  1 +
 contrib/test_crc32c/Makefile                 | 20 +++++++
 contrib/test_crc32c/expected/test_crc32c.out | 57 ++++++++++++++++++++
 contrib/test_crc32c/meson.build              | 34 ++++++++++++
 contrib/test_crc32c/sql/test_crc32c.sql      |  3 ++
 contrib/test_crc32c/test_crc32c--1.0.sql     |  1 +
 contrib/test_crc32c/test_crc32c.c            | 47 ++++++++++++++++
 contrib/test_crc32c/test_crc32c.control      |  4 ++
 8 files changed, 167 insertions(+)
 create mode 100644 contrib/test_crc32c/Makefile
 create mode 100644 contrib/test_crc32c/expected/test_crc32c.out
 create mode 100644 contrib/test_crc32c/meson.build
 create mode 100644 contrib/test_crc32c/sql/test_crc32c.sql
 create mode 100644 contrib/test_crc32c/test_crc32c--1.0.sql
 create mode 100644 contrib/test_crc32c/test_crc32c.c
 create mode 100644 contrib/test_crc32c/test_crc32c.control

diff --git a/contrib/meson.build b/contrib/meson.build
index 1ba73ebd67..06673db062 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ contrib_doc_args = {
   'install_dir': contrib_doc_dir,
 }
 
+subdir('test_crc32c')
 subdir('amcheck')
 subdir('auth_delay')
 subdir('auto_explain')
diff --git a/contrib/test_crc32c/Makefile b/contrib/test_crc32c/Makefile
new file mode 100644
index 0000000000..5b747c6184
--- /dev/null
+++ b/contrib/test_crc32c/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = test_crc32c
+OBJS = test_crc32c.o
+PGFILEDESC = "test"
+EXTENSION = test_crc32c
+DATA = test_crc32c--1.0.sql
+
+first: all
+
+# test_crc32c.o:	CFLAGS+=-g
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_crc32c
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/test_crc32c/expected/test_crc32c.out b/contrib/test_crc32c/expected/test_crc32c.out
new file mode 100644
index 0000000000..dff6bb3133
--- /dev/null
+++ b/contrib/test_crc32c/expected/test_crc32c.out
@@ -0,0 +1,57 @@
+CREATE EXTENSION test_crc32c;
+select drive_crc32c(1, i) from generate_series(100, 300, 4) i;
+ drive_crc32c 
+--------------
+    532139994
+   2103623867
+    785984197
+   2686825890
+   3213049059
+   3819630168
+   1389234603
+    534072900
+   2930108140
+   2496889855
+   1475239611
+    136366931
+   3067402116
+   2012717871
+   3682416023
+   2054270645
+   1817339875
+   4100939569
+   1192727539
+   3636976218
+    369764421
+   3161609879
+   1067984880
+   1235066769
+   3138425899
+    648132037
+   4203750233
+   1330187888
+   2683521348
+   1951644495
+   2574090107
+   3904902018
+   3772697795
+   1644686344
+   2868962106
+   3369218491
+   3902689890
+   3456411865
+    141004025
+   1504497996
+   3782655204
+   3544797610
+   3429174879
+   2524728016
+   3935861181
+     25498897
+    692684159
+    345705535
+   2761600287
+   2654632420
+   3945991399
+(51 rows)
+
diff --git a/contrib/test_crc32c/meson.build b/contrib/test_crc32c/meson.build
new file mode 100644
index 0000000000..d7bec4ba1c
--- /dev/null
+++ b/contrib/test_crc32c/meson.build
@@ -0,0 +1,34 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_crc32c_sources = files(
+  'test_crc32c.c',
+)
+
+if host_system == 'windows'
+  test_crc32c_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_crc32c',
+    '--FILEDESC', 'test_crc32c - test code for crc32c library',])
+endif
+
+test_crc32c = shared_module('test_crc32c',
+  test_crc32c_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += test_crc32c
+
+install_data(
+  'test_crc32c.control',
+  'test_crc32c--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_crc32c',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'test_crc32c',
+    ],
+  },
+}
diff --git a/contrib/test_crc32c/sql/test_crc32c.sql b/contrib/test_crc32c/sql/test_crc32c.sql
new file mode 100644
index 0000000000..95c6dfe448
--- /dev/null
+++ b/contrib/test_crc32c/sql/test_crc32c.sql
@@ -0,0 +1,3 @@
+CREATE EXTENSION test_crc32c;
+
+select drive_crc32c(1, i) from generate_series(100, 300, 4) i;
diff --git a/contrib/test_crc32c/test_crc32c--1.0.sql b/contrib/test_crc32c/test_crc32c--1.0.sql
new file mode 100644
index 0000000000..52b9772f90
--- /dev/null
+++ b/contrib/test_crc32c/test_crc32c--1.0.sql
@@ -0,0 +1 @@
+CREATE FUNCTION drive_crc32c  (count int, num int) RETURNS bigint AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/contrib/test_crc32c/test_crc32c.c b/contrib/test_crc32c/test_crc32c.c
new file mode 100644
index 0000000000..b350caf5ce
--- /dev/null
+++ b/contrib/test_crc32c/test_crc32c.c
@@ -0,0 +1,47 @@
+/* select drive_crc32c(1000000, 1024); */
+
+#include "postgres.h"
+#include "fmgr.h"
+#include "port/pg_crc32c.h"
+#include "common/pg_prng.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * drive_crc32c(count: int, num: int) returns bigint
+ *
+ * count is the number of loops to perform
+ *
+ * num is the number of bytes in the buffer to calculate
+ * crc32c over.
+ */
+PG_FUNCTION_INFO_V1(drive_crc32c);
+Datum
+drive_crc32c(PG_FUNCTION_ARGS)
+{
+	int64			count	= PG_GETARG_INT32(0);
+	int64			num		= PG_GETARG_INT32(1);
+	char*		data	= malloc((size_t)num);
+	pg_crc32c crc;
+	pg_prng_state state;
+	uint64 seed = 42;
+	pg_prng_seed(&state, seed);
+	/* set random data */
+	for (uint64 i = 0; i < num; i++)
+	{
+		data[i] = pg_prng_uint32(&state) % 255;
+	}
+
+	INIT_CRC32C(crc);
+
+	while(count--)
+	{
+		INIT_CRC32C(crc);
+		COMP_CRC32C(crc, data, num);
+		FIN_CRC32C(crc);
+	}
+
+	free((void *)data);
+
+	PG_RETURN_INT64((int64_t)crc);
+}
diff --git a/contrib/test_crc32c/test_crc32c.control b/contrib/test_crc32c/test_crc32c.control
new file mode 100644
index 0000000000..878a077ee1
--- /dev/null
+++ b/contrib/test_crc32c/test_crc32c.control
@@ -0,0 +1,4 @@
+comment = 'test'
+default_version = '1.0'
+module_pathname = '$libdir/test_crc32c'
+relocatable = true
-- 
2.48.1

From 0691791201bae023c2e4da73a6b370f9929f7cea Mon Sep 17 00:00:00 2001
From: Raghuveer Devulapalli <raghuveer.devulapa...@intel.com>
Date: Tue, 4 Feb 2025 15:20:13 -0800
Subject: [PATCH v2 3/4] Improve CRC32C performance on SSE4.2

The current SSE4.2 implementation of CRC32C relies on the native crc32
instruction and processes 8 bytes at a time in a loop. The technique in
this paper uses the pclmulqdq instruction and processing 64 bytes at
time.

Based on: "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
V. Gopal, E. Ozturk, et al., 2009

The algorithm is based on crc32_sse42_simd from Chromium's copy of zlib. See:
https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c

Microbenchmarks (generated with Google Benchmark using a standalone
version of the same algorithms).

Comparing scalar_crc32c (current version) to sse42_crc32c (proposed new
version):
|----------------------------------+---------------------+---------------+---------------|
| Benchmark                        | Buffer size (bytes) | Time Old (ns) | Time New (ns) |
|----------------------------------+---------------------+---------------+---------------|
| [scalar_crc32c vs. sse42_crc32c] | 64                  | 33            | 6             |
| [scalar_crc32c vs. sse42_crc32c] | 128                 | 88            | 9             |
| [scalar_crc32c vs. sse42_crc32c] | 256                 | 211           | 17            |
| [scalar_crc32c vs. sse42_crc32c] | 512                 | 486           | 30            |
| [scalar_crc32c vs. sse42_crc32c] | 1024                | 1037          | 57            |
| [scalar_crc32c vs. sse42_crc32c] | 2048                | 2140          | 116           |
|----------------------------------+---------------------+---------------+---------------|
---
 config/c-compiler.m4       |   7 +-
 configure                  |   7 +-
 meson.build                |   7 +-
 src/port/pg_crc32c_sse42.c | 127 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 141 insertions(+), 7 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 8534cc54c1..8b255b5cc8 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -557,14 +557,19 @@ AC_DEFUN([PGAC_SSE42_CRC32_INTRINSICS],
 [define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics])])dnl
 AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32], [Ac_cachevar],
 [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <nmmintrin.h>
+    #include <wmmintrin.h>
     #if defined(__has_attribute) && __has_attribute (target)
-    __attribute__((target("sse4.2")))
+    __attribute__((target("sse4.2,pclmul")))
     #endif
     static int crc32_sse42_test(void)
+
     {
+      __m128i x1 = _mm_set1_epi32(1);
       unsigned int crc = 0;
       crc = _mm_crc32_u8(crc, 0);
       crc = _mm_crc32_u32(crc, 0);
+      x1 = _mm_clmulepi64_si128(x1, x1, 0x00); // pclmul
+      crc = crc + _mm_extract_epi32(x1, 1);
       /* return computed value, to prevent the above being optimized away */
       return crc == 0;
     }],
diff --git a/configure b/configure
index ceeef9b091..f457e3d3bc 100755
--- a/configure
+++ b/configure
@@ -17178,14 +17178,19 @@ else
   cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
 #include <nmmintrin.h>
+    #include <wmmintrin.h>
     #if defined(__has_attribute) && __has_attribute (target)
-    __attribute__((target("sse4.2")))
+    __attribute__((target("sse4.2,pclmul")))
     #endif
     static int crc32_sse42_test(void)
+
     {
+      __m128i x1 = _mm_set1_epi32(1);
       unsigned int crc = 0;
       crc = _mm_crc32_u8(crc, 0);
       crc = _mm_crc32_u32(crc, 0);
+      x1 = _mm_clmulepi64_si128(x1, x1, 0x00);
+      crc = crc + _mm_extract_epi32(x1, 1);
       /* return computed value, to prevent the above being optimized away */
       return crc == 0;
     }
diff --git a/meson.build b/meson.build
index 1ceadb9a83..456c3fafc3 100644
--- a/meson.build
+++ b/meson.build
@@ -2227,15 +2227,18 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
 
     prog = '''
 #include <nmmintrin.h>
-
+#include <wmmintrin.h>
 #if defined(__has_attribute) && __has_attribute (target)
-__attribute__((target("sse4.2")))
+__attribute__((target("sse4.2,pclmul")))
 #endif
 int main(void)
 {
+    __m128i x1 = _mm_set1_epi32(1);
     unsigned int crc = 0;
     crc = _mm_crc32_u8(crc, 0);
     crc = _mm_crc32_u32(crc, 0);
+    x1 = _mm_clmulepi64_si128(x1, x1, 0x00); // pclmul
+    crc = crc + _mm_extract_epi32(x1, 1);
     /* return computed value, to prevent the above being optimized away */
     return crc == 0;
 }
diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index 22c2137df3..69f8917c7d 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -15,13 +15,13 @@
 #include "c.h"
 
 #include <nmmintrin.h>
-
+#include <wmmintrin.h>
 #include "port/pg_crc32c.h"
 
 pg_attribute_no_sanitize_alignment()
 pg_attribute_target("sse4.2")
-pg_crc32c
-pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
+static pg_crc32c
+pg_comp_crc32c_sse42_tail(pg_crc32c crc, const void *data, size_t len)
 {
 	const unsigned char *p = data;
 	const unsigned char *pend = p + len;
@@ -68,3 +68,124 @@ pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
 
 	return crc;
 }
+
+/*
+ * Based on: "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction" V. Gopal, E. Ozturk, et al., 2009
+ *
+ * The algorithm is based on crc32_sse42_simd from Chromium's copy of zlib.
+ * See:
+ * https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
+ */
+
+pg_attribute_no_sanitize_alignment()
+pg_attribute_target("sse4.2,pclmul")
+pg_crc32c
+pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t length)
+{
+    ssize_t len = (ssize_t) length;
+    const unsigned char *buf = data;
+    /*
+     * Definitions of the bit-reflected domain constants k1,k2,k3, etc and
+     * the CRC32+Barrett polynomials given at the end of the paper.
+     */
+    static const uint64_t pg_attribute_aligned(16) k1k2[] = { 0x740eef02, 0x9e4addf8 };
+    static const uint64_t pg_attribute_aligned(16) k3k4[] = { 0xf20c0dfe, 0x14cd00bd6 };
+    static const uint64_t pg_attribute_aligned(16) k5k0[] = { 0xdd45aab8, 0x000000000 };
+    static const uint64_t pg_attribute_aligned(16) poly[] = { 0x105ec76f1, 0xdea713f1 };
+    if (len >= 64) {
+        __m128i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+        /*
+         * There's at least one block of 64.
+         */
+        x1 = _mm_loadu_si128((__m128i *)(buf + 0x00));
+        x2 = _mm_loadu_si128((__m128i *)(buf + 0x10));
+        x3 = _mm_loadu_si128((__m128i *)(buf + 0x20));
+        x4 = _mm_loadu_si128((__m128i *)(buf + 0x30));
+        x1 = _mm_xor_si128(x1, _mm_cvtsi32_si128(crc));
+        x0 = _mm_load_si128((__m128i *)k1k2);
+        buf += 64;
+        len -= 64;
+        /*
+         * Parallel fold blocks of 64, if any.
+         */
+        while (len >= 64)
+        {
+            x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
+            x6 = _mm_clmulepi64_si128(x2, x0, 0x00);
+            x7 = _mm_clmulepi64_si128(x3, x0, 0x00);
+            x8 = _mm_clmulepi64_si128(x4, x0, 0x00);
+            x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
+            x2 = _mm_clmulepi64_si128(x2, x0, 0x11);
+            x3 = _mm_clmulepi64_si128(x3, x0, 0x11);
+            x4 = _mm_clmulepi64_si128(x4, x0, 0x11);
+            y5 = _mm_loadu_si128((__m128i *)(buf + 0x00));
+            y6 = _mm_loadu_si128((__m128i *)(buf + 0x10));
+            y7 = _mm_loadu_si128((__m128i *)(buf + 0x20));
+            y8 = _mm_loadu_si128((__m128i *)(buf + 0x30));
+            x1 = _mm_xor_si128(x1, x5);
+            x2 = _mm_xor_si128(x2, x6);
+            x3 = _mm_xor_si128(x3, x7);
+            x4 = _mm_xor_si128(x4, x8);
+            x1 = _mm_xor_si128(x1, y5);
+            x2 = _mm_xor_si128(x2, y6);
+            x3 = _mm_xor_si128(x3, y7);
+            x4 = _mm_xor_si128(x4, y8);
+            buf += 64;
+            len -= 64;
+        }
+        /*
+         * Fold into 128-bits.
+         */
+        x0 = _mm_load_si128((__m128i *)k3k4);
+        x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
+        x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
+        x1 = _mm_xor_si128(x1, x2);
+        x1 = _mm_xor_si128(x1, x5);
+        x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
+        x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
+        x1 = _mm_xor_si128(x1, x3);
+        x1 = _mm_xor_si128(x1, x5);
+        x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
+        x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
+        x1 = _mm_xor_si128(x1, x4);
+        x1 = _mm_xor_si128(x1, x5);
+        /*
+         * Single fold blocks of 16, if any.
+         */
+        while (len >= 16)
+        {
+            x2 = _mm_loadu_si128((__m128i *)buf);
+            x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
+            x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
+            x1 = _mm_xor_si128(x1, x2);
+            x1 = _mm_xor_si128(x1, x5);
+            buf += 16;
+            len -= 16;
+        }
+        /*
+         * Fold 128-bits to 64-bits.
+         */
+        x2 = _mm_clmulepi64_si128(x1, x0, 0x10);
+        x3 = _mm_setr_epi32(~0, 0, ~0, 0);
+        x1 = _mm_srli_si128(x1, 8);
+        x1 = _mm_xor_si128(x1, x2);
+        x0 = _mm_loadl_epi64((__m128i*)k5k0);
+        x2 = _mm_srli_si128(x1, 4);
+        x1 = _mm_and_si128(x1, x3);
+        x1 = _mm_clmulepi64_si128(x1, x0, 0x00);
+        x1 = _mm_xor_si128(x1, x2);
+        /*
+         * Barrett reduce to 32-bits.
+         */
+        x0 = _mm_load_si128((__m128i*)poly);
+        x2 = _mm_and_si128(x1, x3);
+        x2 = _mm_clmulepi64_si128(x2, x0, 0x10);
+        x2 = _mm_and_si128(x2, x3);
+        x2 = _mm_clmulepi64_si128(x2, x0, 0x00);
+        x1 = _mm_xor_si128(x1, x2);
+        crc = _mm_extract_epi32(x1, 1);
+    }
+
+    return pg_comp_crc32c_sse42_tail(crc, buf, len);
+}
-- 
2.48.1

From d9c69f435cfecf46f2397091e36010af8c965f79 Mon Sep 17 00:00:00 2001
From: Raghuveer Devulapalli <raghuveer.devulapa...@intel.com>
Date: Tue, 4 Feb 2025 12:56:00 -0800
Subject: [PATCH v2 1/4] Add more test coverage for crc32c

---
 src/test/regress/expected/crc32c.out | 42 ++++++++++++++++++++++++++++
 src/test/regress/parallel_schedule   |  2 ++
 src/test/regress/sql/crc32c.sql      | 12 ++++++++
 3 files changed, 56 insertions(+)
 create mode 100644 src/test/regress/expected/crc32c.out
 create mode 100644 src/test/regress/sql/crc32c.sql

diff --git a/src/test/regress/expected/crc32c.out b/src/test/regress/expected/crc32c.out
new file mode 100644
index 0000000000..f25965df4a
--- /dev/null
+++ b/src/test/regress/expected/crc32c.out
@@ -0,0 +1,42 @@
+--
+-- CRC32C
+-- Testing CRC32C SSE4.2 algorithm.
+-- The new algorithm has various code paths that need test coverage.
+-- We achieve that by computing CRC32C of text of various sizes: 15, 64, 128, 144, 159 and 256 bytes.
+--
+SELECT crc32c('');
+ crc32c 
+--------
+      0
+(1 row)
+
+SELECT crc32c('Hello 15 bytes!');
+   crc32c   
+------------
+ 3405757121
+(1 row)
+
+SELECT crc32c('This is a 64 byte piece of text to run through the main loop ...');
+  crc32c   
+-----------
+ 721494841
+(1 row)
+
+SELECT crc32c('This is a carefully constructed text that needs to be exactly 128 bytes long for testing purposes. Let me add more words to ....');
+   crc32c   
+------------
+ 1602016964
+(1 row)
+
+SELECT crc32c('This is a text that needs to be exactly 144 bytes long for testing purposes. I will add more words to reach that specific length. Now we are ...');
+   crc32c   
+------------
+ 1912862944
+(1 row)
+
+SELECT crc32c('This is a precisely crafted message that needs to be exactly 159 bytes in length for testing purposes. I will continue adding more text until we reach that ...');
+   crc32c   
+------------
+ 1245879782
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 1edd9e45eb..7c9dbf65db 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -56,6 +56,8 @@ test: create_aggregate create_function_sql create_cast constraints triggers sele
 # ----------
 test: sanity_check
 
+test: crc32c
+
 # ----------
 # Another group of parallel tests
 # aggregates depends on create_aggregate
diff --git a/src/test/regress/sql/crc32c.sql b/src/test/regress/sql/crc32c.sql
new file mode 100644
index 0000000000..5e481eab6f
--- /dev/null
+++ b/src/test/regress/sql/crc32c.sql
@@ -0,0 +1,12 @@
+--
+-- CRC32C
+-- Testing CRC32C SSE4.2 algorithm.
+-- The new algorithm has various code paths that need test coverage.
+-- We achieve that by computing CRC32C of text of various sizes: 15, 64, 128, 144, 159 and 256 bytes.
+--
+SELECT crc32c('');
+SELECT crc32c('Hello 15 bytes!');
+SELECT crc32c('This is a 64 byte piece of text to run through the main loop ...');
+SELECT crc32c('This is a carefully constructed text that needs to be exactly 128 bytes long for testing purposes. Let me add more words to ....');
+SELECT crc32c('This is a text that needs to be exactly 144 bytes long for testing purposes. I will add more words to reach that specific length. Now we are ...');
+SELECT crc32c('This is a precisely crafted message that needs to be exactly 159 bytes in length for testing purposes. I will continue adding more text until we reach that ...');
-- 
2.48.1

From 61295213e3133fd319e0b2bd18e1a9c16a4af140 Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sun, 9 Feb 2025 12:25:56 +0700
Subject: [PATCH v2 4/4] Shorter version from corsix

---
 src/port/pg_crc32c_sse42.c | 165 ++++++++++++++-----------------------
 1 file changed, 62 insertions(+), 103 deletions(-)

diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index 69f8917c7d..dec685d139 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -78,114 +78,73 @@ pg_comp_crc32c_sse42_tail(pg_crc32c crc, const void *data, size_t len)
  * https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
  */
 
+#define clmul_lo(a, b) (_mm_clmulepi64_si128((a), (b), 0))
+#define clmul_hi(a, b) (_mm_clmulepi64_si128((a), (b), 17))
+
 pg_attribute_no_sanitize_alignment()
 pg_attribute_target("sse4.2,pclmul")
 pg_crc32c
-pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t length)
+pg_comp_crc32c_sse42(pg_crc32c crc0, const void *data, size_t length)
 {
-    ssize_t len = (ssize_t) length;
+    size_t len = length;
     const unsigned char *buf = data;
-    /*
-     * Definitions of the bit-reflected domain constants k1,k2,k3, etc and
-     * the CRC32+Barrett polynomials given at the end of the paper.
-     */
-    static const uint64_t pg_attribute_aligned(16) k1k2[] = { 0x740eef02, 0x9e4addf8 };
-    static const uint64_t pg_attribute_aligned(16) k3k4[] = { 0xf20c0dfe, 0x14cd00bd6 };
-    static const uint64_t pg_attribute_aligned(16) k5k0[] = { 0xdd45aab8, 0x000000000 };
-    static const uint64_t pg_attribute_aligned(16) poly[] = { 0x105ec76f1, 0xdea713f1 };
-    if (len >= 64) {
-        __m128i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
-        /*
-         * There's at least one block of 64.
-         */
-        x1 = _mm_loadu_si128((__m128i *)(buf + 0x00));
-        x2 = _mm_loadu_si128((__m128i *)(buf + 0x10));
-        x3 = _mm_loadu_si128((__m128i *)(buf + 0x20));
-        x4 = _mm_loadu_si128((__m128i *)(buf + 0x30));
-        x1 = _mm_xor_si128(x1, _mm_cvtsi32_si128(crc));
-        x0 = _mm_load_si128((__m128i *)k1k2);
-        buf += 64;
-        len -= 64;
-        /*
-         * Parallel fold blocks of 64, if any.
-         */
-        while (len >= 64)
-        {
-            x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
-            x6 = _mm_clmulepi64_si128(x2, x0, 0x00);
-            x7 = _mm_clmulepi64_si128(x3, x0, 0x00);
-            x8 = _mm_clmulepi64_si128(x4, x0, 0x00);
-            x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
-            x2 = _mm_clmulepi64_si128(x2, x0, 0x11);
-            x3 = _mm_clmulepi64_si128(x3, x0, 0x11);
-            x4 = _mm_clmulepi64_si128(x4, x0, 0x11);
-            y5 = _mm_loadu_si128((__m128i *)(buf + 0x00));
-            y6 = _mm_loadu_si128((__m128i *)(buf + 0x10));
-            y7 = _mm_loadu_si128((__m128i *)(buf + 0x20));
-            y8 = _mm_loadu_si128((__m128i *)(buf + 0x30));
-            x1 = _mm_xor_si128(x1, x5);
-            x2 = _mm_xor_si128(x2, x6);
-            x3 = _mm_xor_si128(x3, x7);
-            x4 = _mm_xor_si128(x4, x8);
-            x1 = _mm_xor_si128(x1, y5);
-            x2 = _mm_xor_si128(x2, y6);
-            x3 = _mm_xor_si128(x3, y7);
-            x4 = _mm_xor_si128(x4, y8);
-            buf += 64;
-            len -= 64;
-        }
-        /*
-         * Fold into 128-bits.
-         */
-        x0 = _mm_load_si128((__m128i *)k3k4);
-        x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
-        x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
-        x1 = _mm_xor_si128(x1, x2);
-        x1 = _mm_xor_si128(x1, x5);
-        x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
-        x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
-        x1 = _mm_xor_si128(x1, x3);
-        x1 = _mm_xor_si128(x1, x5);
-        x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
-        x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
-        x1 = _mm_xor_si128(x1, x4);
-        x1 = _mm_xor_si128(x1, x5);
-        /*
-         * Single fold blocks of 16, if any.
-         */
-        while (len >= 16)
-        {
-            x2 = _mm_loadu_si128((__m128i *)buf);
-            x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
-            x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
-            x1 = _mm_xor_si128(x1, x2);
-            x1 = _mm_xor_si128(x1, x5);
-            buf += 16;
-            len -= 16;
-        }
-        /*
-         * Fold 128-bits to 64-bits.
-         */
-        x2 = _mm_clmulepi64_si128(x1, x0, 0x10);
-        x3 = _mm_setr_epi32(~0, 0, ~0, 0);
-        x1 = _mm_srli_si128(x1, 8);
-        x1 = _mm_xor_si128(x1, x2);
-        x0 = _mm_loadl_epi64((__m128i*)k5k0);
-        x2 = _mm_srli_si128(x1, 4);
-        x1 = _mm_and_si128(x1, x3);
-        x1 = _mm_clmulepi64_si128(x1, x0, 0x00);
-        x1 = _mm_xor_si128(x1, x2);
-        /*
-         * Barrett reduce to 32-bits.
-         */
-        x0 = _mm_load_si128((__m128i*)poly);
-        x2 = _mm_and_si128(x1, x3);
-        x2 = _mm_clmulepi64_si128(x2, x0, 0x10);
-        x2 = _mm_and_si128(x2, x3);
-        x2 = _mm_clmulepi64_si128(x2, x0, 0x00);
-        x1 = _mm_xor_si128(x1, x2);
-        crc = _mm_extract_epi32(x1, 1);
+
+  if (len >= 64) {
+    /* First vector chunk. */
+    __m128i x0 = _mm_loadu_si128((const __m128i*)buf), y0;
+    __m128i x1 = _mm_loadu_si128((const __m128i*)(buf + 16)), y1;
+    __m128i x2 = _mm_loadu_si128((const __m128i*)(buf + 32)), y2;
+    __m128i x3 = _mm_loadu_si128((const __m128i*)(buf + 48)), y3;
+    __m128i k;
+    k = _mm_setr_epi32(0x740eef02, 0, 0x9e4addf8, 0);
+    x0 = _mm_xor_si128(_mm_cvtsi32_si128(crc0), x0);
+    buf += 64;
+    len -= 64;
+    /* Main loop. */
+    while (len >= 64) {
+      y0 = clmul_lo(x0, k), x0 = clmul_hi(x0, k);
+      y1 = clmul_lo(x1, k), x1 = clmul_hi(x1, k);
+      y2 = clmul_lo(x2, k), x2 = clmul_hi(x2, k);
+      y3 = clmul_lo(x3, k), x3 = clmul_hi(x3, k);
+      y0 = _mm_xor_si128(y0, _mm_loadu_si128((const __m128i*)buf)), x0 = _mm_xor_si128(x0, y0);
+      y1 = _mm_xor_si128(y1, _mm_loadu_si128((const __m128i*)(buf + 16))), x1 = _mm_xor_si128(x1, y1);
+      y2 = _mm_xor_si128(y2, _mm_loadu_si128((const __m128i*)(buf + 32))), x2 = _mm_xor_si128(x2, y2);
+      y3 = _mm_xor_si128(y3, _mm_loadu_si128((const __m128i*)(buf + 48))), x3 = _mm_xor_si128(x3, y3);
+      buf += 64;
+      len -= 64;
+    }
+    /* Reduce x0 ... x3 to just x0. */
+    k = _mm_setr_epi32(0xf20c0dfe, 0, 0x493c7d27, 0);
+    y0 = clmul_lo(x0, k), x0 = clmul_hi(x0, k);
+    y2 = clmul_lo(x2, k), x2 = clmul_hi(x2, k);
+    y0 = _mm_xor_si128(y0, x1), x0 = _mm_xor_si128(x0, y0);
+    y2 = _mm_xor_si128(y2, x3), x2 = _mm_xor_si128(x2, y2);
+    k = _mm_setr_epi32(0x3da6d0cb, 0, 0xba4fc28e, 0);
+    y0 = clmul_lo(x0, k), x0 = clmul_hi(x0, k);
+    y0 = _mm_xor_si128(y0, x2), x0 = _mm_xor_si128(x0, y0);
+    /* Reduce 128 bits to 32 bits, and multiply by x^32. */
+    crc0 = _mm_crc32_u64(0, _mm_extract_epi64(x0, 0));
+    crc0 = _mm_crc32_u64(crc0, _mm_extract_epi64(x0, 1));
+  }
+  if (len >= 16) {
+    /* First vector chunk. */
+    __m128i x0 = _mm_loadu_si128((const __m128i*)buf), y0;
+    __m128i k;
+    k = _mm_setr_epi32(0xf20c0dfe, 0, 0x493c7d27, 0);
+    x0 = _mm_xor_si128(_mm_cvtsi32_si128(crc0), x0);
+    buf += 16;
+    len -= 16;
+    /* Main loop. */
+    while (len >= 16) {
+      y0 = clmul_lo(x0, k), x0 = clmul_hi(x0, k);
+      y0 = _mm_xor_si128(y0, _mm_loadu_si128((const __m128i*)buf)), x0 = _mm_xor_si128(x0, y0);
+      buf += 16;
+      len -= 16;
     }
+    /* Reduce 128 bits to 32 bits, and multiply by x^32. */
+    crc0 = _mm_crc32_u64(0, _mm_extract_epi64(x0, 0));
+    crc0 = _mm_crc32_u64(crc0, _mm_extract_epi64(x0, 1));
+  }
 
-    return pg_comp_crc32c_sse42_tail(crc, buf, len);
+    return pg_comp_crc32c_sse42_tail(crc0, buf, len);
 }
-- 
2.48.1
