On Thu, Feb 6, 2025 at 3:49 AM Devulapalli, Raghuveer
<raghuveer.devulapa...@intel.com> wrote:
>
> This patch improves the performance of the SSE4.2 CRC32C algorithm. The
> current SSE4.2 implementation of CRC32C relies on the native crc32
> instruction and processes 8 bytes at a time in a loop. The technique in
> this paper uses the pclmulqdq instruction and processes 64 bytes at a
> time. The algorithm is based on the sse42 version of crc32 computation
> from Chromium's copy of zlib, with modified constants for crc32c
> computation. See:
>
> https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
Thanks for the patch!

> Microbenchmarks (generated with google benchmark using a standalone
> version of the same algorithms):
>
> |----------------------------------+---------------------+---------------+---------------|
> | Benchmark                        | Buffer size (bytes) | Time Old (ns) | Time New (ns) |
> |----------------------------------+---------------------+---------------+---------------|
> | [scalar_crc32c vs. sse42_crc32c] |                  64 |            33 |             6 |
> | [scalar_crc32c vs. sse42_crc32c] |                 128 |            88 |             9 |
> | [scalar_crc32c vs. sse42_crc32c] |                 256 |           211 |            17 |
> | [scalar_crc32c vs. sse42_crc32c] |                 512 |           486 |            30 |
> | [scalar_crc32c vs. sse42_crc32c] |                1024 |          1037 |            57 |
> | [scalar_crc32c vs. sse42_crc32c] |                2048 |          2140 |           116 |
> |----------------------------------+---------------------+---------------+---------------|

I'm highly suspicious of these numbers because they show this version is
about 20x faster than "scalar", which relatively speaking would make it 3x
faster than the AVX-512 proposal. I ran my own benchmarks with the attached
script, using your test function from the other thread, and found this
patch is actually slower than master on anything smaller than 256 bytes.

Luckily, that's easily fixable: it turns out the implementation in the
paper (and Chromium) has a very inefficient finalization step, using more
carryless multiplications and plenty of other operations. After the main
loop, it's much more efficient to convert the 128-bit intermediate result
directly into a CRC in the usual way. See here for details:

https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq

The author of this article also has an MIT-licensed program to generate
CRC implementations with specified requirements (including on ARM):

https://github.com/corsix/fast-crc32

I generated a similar function for v2-0004, and this benchmark shows it's
faster than master on 128 bytes and above, which is encouraging (see
attached graph).
As I mentioned in the other thread, we need to be mindful of the fact that
the latency of carryless multiplication varies wildly on different
microarchitectures. I did the benchmarks on my older machine, which I
believe has a latency of 7 cycles for this instruction. Looking around
*briefly*, the most recent x86 chips with worse latency seem to be from
around 2012-13 (e.g. Intel Silvermont and AMD Piledriver). Chips newer
than mine have a latency of 4-7 cycles. It seems okay to assume those who
care the most about performance will be on hardware less than 12 years
old, but I'm open to arguments to be more conservative here.

Large inputs would make the graph hard to read, so I'll just put the
results for 8kB here:

master: 1504ms
v1: 543ms
v2: 533ms

Some thoughts on building (not a complete review):

--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -557,14 +557,19 @@ AC_DEFUN([PGAC_SSE42_CRC32_INTRINSICS],
 [define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics])])dnl
 AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32], [Ac_cachevar],
 [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <nmmintrin.h>
+    #include <wmmintrin.h>
     #if defined(__has_attribute) && __has_attribute (target)
-    __attribute__((target("sse4.2")))
+    __attribute__((target("sse4.2,pclmul")))

It's probably okay to fold these together in the same compile-time check,
since both are fairly old by now, but for those following along, pclmul is
not in SSE 4.2 and is a bit newer. Written this way, machines building on
Nehalem (2008) would fail the check and go back to slicing-by-8. That's a
long time ago, so I'm not sure if anyone would notice, and I think we
could fix it for people using a packaged binary by having a fallback
wrapper function that just calls the SSE 4.2 "tail", as 0002 calls it.

-- 
John Naylor
Amazon Web Services
test-crc.sh
Description: application/shellscript
From 5b329ccf89986ab5e6dd170f8fa317ed206e2137 Mon Sep 17 00:00:00 2001
From: Paul Amonson <paul.d.amon...@intel.com>
Date: Mon, 6 May 2024 08:34:17 -0700
Subject: [PATCH v2 2/4] Add a Postgres SQL function for crc32c benchmarking

Add a drive_crc32c() function to use for benchmarking crc32c
computation. The function takes 2 arguments:

(1) count: number of times CRC32C is computed in a loop.
(2) num: number of bytes in the buffer to calculate crc over.

XXX not for commit

Extracted from a patch by Raghuveer Devulapalli
---
 contrib/meson.build                          |  1 +
 contrib/test_crc32c/Makefile                 | 20 +++++++
 contrib/test_crc32c/expected/test_crc32c.out | 57 ++++++++++++++++++++
 contrib/test_crc32c/meson.build              | 34 ++++++++++++
 contrib/test_crc32c/sql/test_crc32c.sql      |  3 ++
 contrib/test_crc32c/test_crc32c--1.0.sql     |  1 +
 contrib/test_crc32c/test_crc32c.c            | 47 ++++++++++++++++
 contrib/test_crc32c/test_crc32c.control      |  4 ++
 8 files changed, 167 insertions(+)
 create mode 100644 contrib/test_crc32c/Makefile
 create mode 100644 contrib/test_crc32c/expected/test_crc32c.out
 create mode 100644 contrib/test_crc32c/meson.build
 create mode 100644 contrib/test_crc32c/sql/test_crc32c.sql
 create mode 100644 contrib/test_crc32c/test_crc32c--1.0.sql
 create mode 100644 contrib/test_crc32c/test_crc32c.c
 create mode 100644 contrib/test_crc32c/test_crc32c.control

diff --git a/contrib/meson.build b/contrib/meson.build
index 1ba73ebd67..06673db062 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ contrib_doc_args = {
   'install_dir': contrib_doc_dir,
 }
 
+subdir('test_crc32c')
 subdir('amcheck')
 subdir('auth_delay')
 subdir('auto_explain')
diff --git a/contrib/test_crc32c/Makefile b/contrib/test_crc32c/Makefile
new file mode 100644
index 0000000000..5b747c6184
--- /dev/null
+++ b/contrib/test_crc32c/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = test_crc32c
+OBJS = test_crc32c.o
+PGFILEDESC = "test"
+EXTENSION = test_crc32c
+DATA = test_crc32c--1.0.sql
+
+first: all
+
+# test_crc32c.o: CFLAGS+=-g
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_crc32c
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/test_crc32c/expected/test_crc32c.out b/contrib/test_crc32c/expected/test_crc32c.out
new file mode 100644
index 0000000000..dff6bb3133
--- /dev/null
+++ b/contrib/test_crc32c/expected/test_crc32c.out
@@ -0,0 +1,57 @@
+CREATE EXTENSION test_crc32c;
+select drive_crc32c(1, i) from generate_series(100, 300, 4) i;
+ drive_crc32c 
+--------------
+    532139994
+   2103623867
+    785984197
+   2686825890
+   3213049059
+   3819630168
+   1389234603
+    534072900
+   2930108140
+   2496889855
+   1475239611
+    136366931
+   3067402116
+   2012717871
+   3682416023
+   2054270645
+   1817339875
+   4100939569
+   1192727539
+   3636976218
+    369764421
+   3161609879
+   1067984880
+   1235066769
+   3138425899
+    648132037
+   4203750233
+   1330187888
+   2683521348
+   1951644495
+   2574090107
+   3904902018
+   3772697795
+   1644686344
+   2868962106
+   3369218491
+   3902689890
+   3456411865
+    141004025
+   1504497996
+   3782655204
+   3544797610
+   3429174879
+   2524728016
+   3935861181
+     25498897
+    692684159
+    345705535
+   2761600287
+   2654632420
+   3945991399
+(51 rows)
+
diff --git a/contrib/test_crc32c/meson.build b/contrib/test_crc32c/meson.build
new file mode 100644
index 0000000000..d7bec4ba1c
--- /dev/null
+++ b/contrib/test_crc32c/meson.build
@@ -0,0 +1,34 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+test_crc32c_sources = files(
+  'test_crc32c.c',
+)
+
+if host_system == 'windows'
+  test_crc32c_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_crc32c',
+    '--FILEDESC', 'test_crc32c - test code for crc32c library',])
+endif
+
+test_crc32c = shared_module('test_crc32c',
+  test_crc32c_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += test_crc32c
+
+install_data(
+  'test_crc32c.control',
+  'test_crc32c--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_crc32c',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'test_crc32c',
+    ],
+  },
+}
diff --git a/contrib/test_crc32c/sql/test_crc32c.sql b/contrib/test_crc32c/sql/test_crc32c.sql
new file mode 100644
index 0000000000..95c6dfe448
--- /dev/null
+++ b/contrib/test_crc32c/sql/test_crc32c.sql
@@ -0,0 +1,3 @@
+CREATE EXTENSION test_crc32c;
+
+select drive_crc32c(1, i) from generate_series(100, 300, 4) i;
diff --git a/contrib/test_crc32c/test_crc32c--1.0.sql b/contrib/test_crc32c/test_crc32c--1.0.sql
new file mode 100644
index 0000000000..52b9772f90
--- /dev/null
+++ b/contrib/test_crc32c/test_crc32c--1.0.sql
@@ -0,0 +1 @@
+CREATE FUNCTION drive_crc32c (count int, num int) RETURNS bigint AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/contrib/test_crc32c/test_crc32c.c b/contrib/test_crc32c/test_crc32c.c
new file mode 100644
index 0000000000..b350caf5ce
--- /dev/null
+++ b/contrib/test_crc32c/test_crc32c.c
@@ -0,0 +1,47 @@
+/* select drive_crc32c(1000000, 1024); */
+
+#include "postgres.h"
+#include "fmgr.h"
+#include "port/pg_crc32c.h"
+#include "common/pg_prng.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * drive_crc32c(count: int, num: int) returns bigint
+ *
+ * count is the number of loops to perform
+ *
+ * num is the number of bytes in the buffer to calculate
+ * crc32c over.
+ */
+PG_FUNCTION_INFO_V1(drive_crc32c);
+Datum
+drive_crc32c(PG_FUNCTION_ARGS)
+{
+	int64 count = PG_GETARG_INT64(0);
+	int64 num = PG_GETARG_INT64(1);
+	char* data = malloc((size_t)num);
+	pg_crc32c crc;
+	pg_prng_state state;
+	uint64 seed = 42;
+	pg_prng_seed(&state, seed);
+	/* set random data */
+	for (uint64 i = 0; i < num; i++)
+	{
+		data[i] = pg_prng_uint32(&state) % 255;
+	}
+
+	INIT_CRC32C(crc);
+
+	while(count--)
+	{
+		INIT_CRC32C(crc);
+		COMP_CRC32C(crc, data, num);
+		FIN_CRC32C(crc);
+	}
+
+	free((void *)data);
+
+	PG_RETURN_INT64((int64_t)crc);
+}
diff --git a/contrib/test_crc32c/test_crc32c.control b/contrib/test_crc32c/test_crc32c.control
new file mode 100644
index 0000000000..878a077ee1
--- /dev/null
+++ b/contrib/test_crc32c/test_crc32c.control
@@ -0,0 +1,4 @@
+comment = 'test'
+default_version = '1.0'
+module_pathname = '$libdir/test_crc32c'
+relocatable = true
-- 
2.48.1
From 0691791201bae023c2e4da73a6b370f9929f7cea Mon Sep 17 00:00:00 2001
From: Raghuveer Devulapalli <raghuveer.devulapa...@intel.com>
Date: Tue, 4 Feb 2025 15:20:13 -0800
Subject: [PATCH v2 3/4] Improve CRC32C performance on SSE4.2

The current SSE4.2 implementation of CRC32C relies on the native crc32
instruction and processes 8 bytes at a time in a loop. The technique in
this paper uses the pclmulqdq instruction and processes 64 bytes at a
time. Based on:

"Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
Instruction" V. Gopal, E. Ozturk, et al., 2009

The algorithm is based on crc32_sse42_simd from chromium's copy of
zlib. See:

https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c

Microbenchmarks (generated with google benchmark using a standalone
version of the same algorithms), comparing scalar_crc32c (current
version) to sse42_crc32c (proposed new version):

|----------------------------------+---------------------+---------------+---------------|
| Benchmark                        | Buffer size (bytes) | Time Old (ns) | Time New (ns) |
|----------------------------------+---------------------+---------------+---------------|
| [scalar_crc32c vs. sse42_crc32c] |                  64 |            33 |             6 |
| [scalar_crc32c vs. sse42_crc32c] |                 128 |            88 |             9 |
| [scalar_crc32c vs. sse42_crc32c] |                 256 |           211 |            17 |
| [scalar_crc32c vs. sse42_crc32c] |                 512 |           486 |            30 |
| [scalar_crc32c vs. sse42_crc32c] |                1024 |          1037 |            57 |
| [scalar_crc32c vs. sse42_crc32c] |                2048 |          2140 |           116 |
|----------------------------------+---------------------+---------------+---------------|
---
 config/c-compiler.m4       |   7 +-
 configure                  |   7 +-
 meson.build                |   7 +-
 src/port/pg_crc32c_sse42.c | 127 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 141 insertions(+), 7 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 8534cc54c1..8b255b5cc8 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -557,14 +557,19 @@ AC_DEFUN([PGAC_SSE42_CRC32_INTRINSICS],
 [define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics])])dnl
 AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32], [Ac_cachevar],
 [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <nmmintrin.h>
+    #include <wmmintrin.h>
     #if defined(__has_attribute) && __has_attribute (target)
-    __attribute__((target("sse4.2")))
+    __attribute__((target("sse4.2,pclmul")))
     #endif
     static int crc32_sse42_test(void)
+    {
+      __m128i x1 = _mm_set1_epi32(1);
       unsigned int crc = 0;
       crc = _mm_crc32_u8(crc, 0);
       crc = _mm_crc32_u32(crc, 0);
+      x1 = _mm_clmulepi64_si128(x1, x1, 0x00); // pclmul
+      crc = crc + _mm_extract_epi32(x1, 1);
       /* return computed value, to prevent the above being optimized away */
       return crc == 0;
     }],
diff --git a/configure b/configure
index ceeef9b091..f457e3d3bc 100755
--- a/configure
+++ b/configure
@@ -17178,14 +17178,19 @@ else
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
 #include <nmmintrin.h>
+    #include <wmmintrin.h>
     #if defined(__has_attribute) && __has_attribute (target)
-    __attribute__((target("sse4.2")))
+    __attribute__((target("sse4.2,pclmul")))
     #endif
     static int crc32_sse42_test(void)
+    {
+      __m128i x1 = _mm_set1_epi32(1);
       unsigned int crc = 0;
       crc = _mm_crc32_u8(crc, 0);
       crc = _mm_crc32_u32(crc, 0);
+      x1 = _mm_clmulepi64_si128(x1, x1, 0x00);
+      crc = crc + _mm_extract_epi32(x1, 1);
       /* return computed value, to prevent the above being optimized away */
       return crc == 0;
     }
diff --git a/meson.build b/meson.build
index 1ceadb9a83..456c3fafc3 100644
--- a/meson.build
+++ b/meson.build
@@ -2227,15 +2227,18 @@ if host_cpu == 'x86' or host_cpu == 'x86_64'
 
   prog = '''
 #include <nmmintrin.h>
-
+#include <wmmintrin.h>
 #if defined(__has_attribute) && __has_attribute (target)
-__attribute__((target("sse4.2")))
+__attribute__((target("sse4.2,pclmul")))
 #endif
 int main(void)
 {
+    __m128i x1 = _mm_set1_epi32(1);
     unsigned int crc = 0;
     crc = _mm_crc32_u8(crc, 0);
     crc = _mm_crc32_u32(crc, 0);
+    x1 = _mm_clmulepi64_si128(x1, x1, 0x00); // pclmul
+    crc = crc + _mm_extract_epi32(x1, 1);
     /* return computed value, to prevent the above being optimized away */
     return crc == 0;
 }
diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index 22c2137df3..69f8917c7d 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -15,13 +15,13 @@
 #include "c.h"
 
 #include <nmmintrin.h>
-
+#include <wmmintrin.h>
 #include "port/pg_crc32c.h"
 
 pg_attribute_no_sanitize_alignment()
 pg_attribute_target("sse4.2")
-pg_crc32c
-pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
+static pg_crc32c
+pg_comp_crc32c_sse42_tail(pg_crc32c crc, const void *data, size_t len)
 {
 	const unsigned char *p = data;
 	const unsigned char *pend = p + len;
@@ -68,3 +68,124 @@ pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
 
 	return crc;
 }
+
+/*
+ * Based on: "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
+ * Instruction" V. Gopal, E. Ozturk, et al., 2009
+ *
+ * The algorithm is based on crc32_sse42_simd from chromium's copy of zlib.
+ * See:
+ * https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
+ */
+
+pg_attribute_no_sanitize_alignment()
+pg_attribute_target("sse4.2,pclmul")
+pg_crc32c
+pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t length)
+{
+	ssize_t len = (ssize_t) length;
+	const unsigned char *buf = data;
+	/*
+	 * Definitions of the bit-reflected domain constants k1,k2,k3, etc and
+	 * the CRC32+Barrett polynomials given at the end of the paper.
+	 */
+	static const uint64_t pg_attribute_aligned(16) k1k2[] = { 0x740eef02, 0x9e4addf8 };
+	static const uint64_t pg_attribute_aligned(16) k3k4[] = { 0xf20c0dfe, 0x14cd00bd6 };
+	static const uint64_t pg_attribute_aligned(16) k5k0[] = { 0xdd45aab8, 0x000000000 };
+	static const uint64_t pg_attribute_aligned(16) poly[] = { 0x105ec76f1, 0xdea713f1 };
+	if (len >= 64) {
+		__m128i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
+		/*
+		 * There's at least one block of 64.
+		 */
+		x1 = _mm_loadu_si128((__m128i *)(buf + 0x00));
+		x2 = _mm_loadu_si128((__m128i *)(buf + 0x10));
+		x3 = _mm_loadu_si128((__m128i *)(buf + 0x20));
+		x4 = _mm_loadu_si128((__m128i *)(buf + 0x30));
+		x1 = _mm_xor_si128(x1, _mm_cvtsi32_si128(crc));
+		x0 = _mm_load_si128((__m128i *)k1k2);
+		buf += 64;
+		len -= 64;
+		/*
+		 * Parallel fold blocks of 64, if any.
+		 */
+		while (len >= 64)
+		{
+			x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
+			x6 = _mm_clmulepi64_si128(x2, x0, 0x00);
+			x7 = _mm_clmulepi64_si128(x3, x0, 0x00);
+			x8 = _mm_clmulepi64_si128(x4, x0, 0x00);
+			x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
+			x2 = _mm_clmulepi64_si128(x2, x0, 0x11);
+			x3 = _mm_clmulepi64_si128(x3, x0, 0x11);
+			x4 = _mm_clmulepi64_si128(x4, x0, 0x11);
+			y5 = _mm_loadu_si128((__m128i *)(buf + 0x00));
+			y6 = _mm_loadu_si128((__m128i *)(buf + 0x10));
+			y7 = _mm_loadu_si128((__m128i *)(buf + 0x20));
+			y8 = _mm_loadu_si128((__m128i *)(buf + 0x30));
+			x1 = _mm_xor_si128(x1, x5);
+			x2 = _mm_xor_si128(x2, x6);
+			x3 = _mm_xor_si128(x3, x7);
+			x4 = _mm_xor_si128(x4, x8);
+			x1 = _mm_xor_si128(x1, y5);
+			x2 = _mm_xor_si128(x2, y6);
+			x3 = _mm_xor_si128(x3, y7);
+			x4 = _mm_xor_si128(x4, y8);
+			buf += 64;
+			len -= 64;
+		}
+		/*
+		 * Fold into 128-bits.
+		 */
+		x0 = _mm_load_si128((__m128i *)k3k4);
+		x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
+		x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
+		x1 = _mm_xor_si128(x1, x2);
+		x1 = _mm_xor_si128(x1, x5);
+		x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
+		x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
+		x1 = _mm_xor_si128(x1, x3);
+		x1 = _mm_xor_si128(x1, x5);
+		x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
+		x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
+		x1 = _mm_xor_si128(x1, x4);
+		x1 = _mm_xor_si128(x1, x5);
+		/*
+		 * Single fold blocks of 16, if any.
+		 */
+		while (len >= 16)
+		{
+			x2 = _mm_loadu_si128((__m128i *)buf);
+			x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
+			x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
+			x1 = _mm_xor_si128(x1, x2);
+			x1 = _mm_xor_si128(x1, x5);
+			buf += 16;
+			len -= 16;
+		}
+		/*
+		 * Fold 128-bits to 64-bits.
+		 */
+		x2 = _mm_clmulepi64_si128(x1, x0, 0x10);
+		x3 = _mm_setr_epi32(~0, 0, ~0, 0);
+		x1 = _mm_srli_si128(x1, 8);
+		x1 = _mm_xor_si128(x1, x2);
+		x0 = _mm_loadl_epi64((__m128i*)k5k0);
+		x2 = _mm_srli_si128(x1, 4);
+		x1 = _mm_and_si128(x1, x3);
+		x1 = _mm_clmulepi64_si128(x1, x0, 0x00);
+		x1 = _mm_xor_si128(x1, x2);
+		/*
+		 * Barrett reduce to 32-bits.
+		 */
+		x0 = _mm_load_si128((__m128i*)poly);
+		x2 = _mm_and_si128(x1, x3);
+		x2 = _mm_clmulepi64_si128(x2, x0, 0x10);
+		x2 = _mm_and_si128(x2, x3);
+		x2 = _mm_clmulepi64_si128(x2, x0, 0x00);
+		x1 = _mm_xor_si128(x1, x2);
+		crc = _mm_extract_epi32(x1, 1);
+	}
+
+	return pg_comp_crc32c_sse42_tail(crc, buf, len);
+}
-- 
2.48.1
From d9c69f435cfecf46f2397091e36010af8c965f79 Mon Sep 17 00:00:00 2001
From: Raghuveer Devulapalli <raghuveer.devulapa...@intel.com>
Date: Tue, 4 Feb 2025 12:56:00 -0800
Subject: [PATCH v2 1/4] Add more test coverage for crc32c

---
 src/test/regress/expected/crc32c.out | 42 ++++++++++++++++++++++++++++
 src/test/regress/parallel_schedule   |  2 ++
 src/test/regress/sql/crc32c.sql      | 12 ++++++++
 3 files changed, 56 insertions(+)
 create mode 100644 src/test/regress/expected/crc32c.out
 create mode 100644 src/test/regress/sql/crc32c.sql

diff --git a/src/test/regress/expected/crc32c.out b/src/test/regress/expected/crc32c.out
new file mode 100644
index 0000000000..f25965df4a
--- /dev/null
+++ b/src/test/regress/expected/crc32c.out
@@ -0,0 +1,42 @@
+--
+-- CRC32C
+-- Testing the CRC32C SSE4.2 algorithm.
+-- The new algorithm has various code paths that need test coverage.
+-- We achieve that by computing CRC32C of text of various sizes:
+-- 15, 64, 128, 144, 159 and 256 bytes.
+--
+SELECT crc32c('');
+ crc32c 
+--------
+      0
+(1 row)
+
+SELECT crc32c('Hello 15 bytes!');
+   crc32c   
+------------
+ 3405757121
+(1 row)
+
+SELECT crc32c('This is a 64 byte piece of text to run through the main loop ...');
+  crc32c   
+-----------
+ 721494841
+(1 row)
+
+SELECT crc32c('This is a carefully constructed text that needs to be exactly 128 bytes long for testing purposes. Let me add more words to ....');
+   crc32c   
+------------
+ 1602016964
+(1 row)
+
+SELECT crc32c('This is a text that needs to be exactly 144 bytes long for testing purposes. I will add more words to reach that specific length. Now we are ...');
+   crc32c   
+------------
+ 1912862944
+(1 row)
+
+SELECT crc32c('This is a precisely crafted message that needs to be exactly 159 bytes in length for testing purposes. I will continue adding more text until we reach that ...');
+   crc32c   
+------------
+ 1245879782
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 1edd9e45eb..7c9dbf65db 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -56,6 +56,8 @@ test: create_aggregate create_function_sql create_cast constraints triggers sele
 # ----------
 test: sanity_check
 
+test: crc32c
+
 # ----------
 # Another group of parallel tests
 # aggregates depends on create_aggregate
diff --git a/src/test/regress/sql/crc32c.sql b/src/test/regress/sql/crc32c.sql
new file mode 100644
index 0000000000..5e481eab6f
--- /dev/null
+++ b/src/test/regress/sql/crc32c.sql
@@ -0,0 +1,12 @@
+--
+-- CRC32C
+-- Testing the CRC32C SSE4.2 algorithm.
+-- The new algorithm has various code paths that need test coverage.
+-- We achieve that by computing CRC32C of text of various sizes:
+-- 15, 64, 128, 144, 159 and 256 bytes.
+--
+SELECT crc32c('');
+SELECT crc32c('Hello 15 bytes!');
+SELECT crc32c('This is a 64 byte piece of text to run through the main loop ...');
+SELECT crc32c('This is a carefully constructed text that needs to be exactly 128 bytes long for testing purposes. Let me add more words to ....');
+SELECT crc32c('This is a text that needs to be exactly 144 bytes long for testing purposes. I will add more words to reach that specific length. Now we are ...');
+SELECT crc32c('This is a precisely crafted message that needs to be exactly 159 bytes in length for testing purposes. I will continue adding more text until we reach that ...');
-- 
2.48.1
From 61295213e3133fd319e0b2bd18e1a9c16a4af140 Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sun, 9 Feb 2025 12:25:56 +0700
Subject: [PATCH v2 4/4] Shorter version from corsix

---
 src/port/pg_crc32c_sse42.c | 165 ++++++++++++++-----------------------
 1 file changed, 62 insertions(+), 103 deletions(-)

diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index 69f8917c7d..dec685d139 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -78,114 +78,73 @@ pg_comp_crc32c_sse42_tail(pg_crc32c crc, const void *data, size_t len)
  * https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
  */
 
+#define clmul_lo(a, b) (_mm_clmulepi64_si128((a), (b), 0))
+#define clmul_hi(a, b) (_mm_clmulepi64_si128((a), (b), 17))
+
 pg_attribute_no_sanitize_alignment()
 pg_attribute_target("sse4.2,pclmul")
 pg_crc32c
-pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t length)
+pg_comp_crc32c_sse42(pg_crc32c crc0, const void *data, size_t length)
 {
-	ssize_t len = (ssize_t) length;
+	size_t len = length;
 	const unsigned char *buf = data;
-	/*
-	 * Definitions of the bit-reflected domain constants k1,k2,k3, etc and
-	 * the CRC32+Barrett polynomials given at the end of the paper.
-	 */
-	static const uint64_t pg_attribute_aligned(16) k1k2[] = { 0x740eef02, 0x9e4addf8 };
-	static const uint64_t pg_attribute_aligned(16) k3k4[] = { 0xf20c0dfe, 0x14cd00bd6 };
-	static const uint64_t pg_attribute_aligned(16) k5k0[] = { 0xdd45aab8, 0x000000000 };
-	static const uint64_t pg_attribute_aligned(16) poly[] = { 0x105ec76f1, 0xdea713f1 };
-	if (len >= 64) {
-		__m128i x0, x1, x2, x3, x4, x5, x6, x7, x8, y5, y6, y7, y8;
-		/*
-		 * There's at least one block of 64.
-		 */
-		x1 = _mm_loadu_si128((__m128i *)(buf + 0x00));
-		x2 = _mm_loadu_si128((__m128i *)(buf + 0x10));
-		x3 = _mm_loadu_si128((__m128i *)(buf + 0x20));
-		x4 = _mm_loadu_si128((__m128i *)(buf + 0x30));
-		x1 = _mm_xor_si128(x1, _mm_cvtsi32_si128(crc));
-		x0 = _mm_load_si128((__m128i *)k1k2);
-		buf += 64;
-		len -= 64;
-		/*
-		 * Parallel fold blocks of 64, if any.
-		 */
-		while (len >= 64)
-		{
-			x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
-			x6 = _mm_clmulepi64_si128(x2, x0, 0x00);
-			x7 = _mm_clmulepi64_si128(x3, x0, 0x00);
-			x8 = _mm_clmulepi64_si128(x4, x0, 0x00);
-			x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
-			x2 = _mm_clmulepi64_si128(x2, x0, 0x11);
-			x3 = _mm_clmulepi64_si128(x3, x0, 0x11);
-			x4 = _mm_clmulepi64_si128(x4, x0, 0x11);
-			y5 = _mm_loadu_si128((__m128i *)(buf + 0x00));
-			y6 = _mm_loadu_si128((__m128i *)(buf + 0x10));
-			y7 = _mm_loadu_si128((__m128i *)(buf + 0x20));
-			y8 = _mm_loadu_si128((__m128i *)(buf + 0x30));
-			x1 = _mm_xor_si128(x1, x5);
-			x2 = _mm_xor_si128(x2, x6);
-			x3 = _mm_xor_si128(x3, x7);
-			x4 = _mm_xor_si128(x4, x8);
-			x1 = _mm_xor_si128(x1, y5);
-			x2 = _mm_xor_si128(x2, y6);
-			x3 = _mm_xor_si128(x3, y7);
-			x4 = _mm_xor_si128(x4, y8);
-			buf += 64;
-			len -= 64;
-		}
-		/*
-		 * Fold into 128-bits.
-		 */
-		x0 = _mm_load_si128((__m128i *)k3k4);
-		x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
-		x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
-		x1 = _mm_xor_si128(x1, x2);
-		x1 = _mm_xor_si128(x1, x5);
-		x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
-		x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
-		x1 = _mm_xor_si128(x1, x3);
-		x1 = _mm_xor_si128(x1, x5);
-		x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
-		x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
-		x1 = _mm_xor_si128(x1, x4);
-		x1 = _mm_xor_si128(x1, x5);
-		/*
-		 * Single fold blocks of 16, if any.
-		 */
-		while (len >= 16)
-		{
-			x2 = _mm_loadu_si128((__m128i *)buf);
-			x5 = _mm_clmulepi64_si128(x1, x0, 0x00);
-			x1 = _mm_clmulepi64_si128(x1, x0, 0x11);
-			x1 = _mm_xor_si128(x1, x2);
-			x1 = _mm_xor_si128(x1, x5);
-			buf += 16;
-			len -= 16;
-		}
-		/*
-		 * Fold 128-bits to 64-bits.
-		 */
-		x2 = _mm_clmulepi64_si128(x1, x0, 0x10);
-		x3 = _mm_setr_epi32(~0, 0, ~0, 0);
-		x1 = _mm_srli_si128(x1, 8);
-		x1 = _mm_xor_si128(x1, x2);
-		x0 = _mm_loadl_epi64((__m128i*)k5k0);
-		x2 = _mm_srli_si128(x1, 4);
-		x1 = _mm_and_si128(x1, x3);
-		x1 = _mm_clmulepi64_si128(x1, x0, 0x00);
-		x1 = _mm_xor_si128(x1, x2);
-		/*
-		 * Barrett reduce to 32-bits.
-		 */
-		x0 = _mm_load_si128((__m128i*)poly);
-		x2 = _mm_and_si128(x1, x3);
-		x2 = _mm_clmulepi64_si128(x2, x0, 0x10);
-		x2 = _mm_and_si128(x2, x3);
-		x2 = _mm_clmulepi64_si128(x2, x0, 0x00);
-		x1 = _mm_xor_si128(x1, x2);
-		crc = _mm_extract_epi32(x1, 1);
-	}
+
+	if (len >= 64) {
+		/* First vector chunk. */
+		__m128i x0 = _mm_loadu_si128((const __m128i*)buf), y0;
+		__m128i x1 = _mm_loadu_si128((const __m128i*)(buf + 16)), y1;
+		__m128i x2 = _mm_loadu_si128((const __m128i*)(buf + 32)), y2;
+		__m128i x3 = _mm_loadu_si128((const __m128i*)(buf + 48)), y3;
+		__m128i k;
+		k = _mm_setr_epi32(0x740eef02, 0, 0x9e4addf8, 0);
+		x0 = _mm_xor_si128(_mm_cvtsi32_si128(crc0), x0);
+		buf += 64;
+		len -= 64;
+		/* Main loop. */
+		while (len >= 64) {
+			y0 = clmul_lo(x0, k), x0 = clmul_hi(x0, k);
+			y1 = clmul_lo(x1, k), x1 = clmul_hi(x1, k);
+			y2 = clmul_lo(x2, k), x2 = clmul_hi(x2, k);
+			y3 = clmul_lo(x3, k), x3 = clmul_hi(x3, k);
+			y0 = _mm_xor_si128(y0, _mm_loadu_si128((const __m128i*)buf)), x0 = _mm_xor_si128(x0, y0);
+			y1 = _mm_xor_si128(y1, _mm_loadu_si128((const __m128i*)(buf + 16))), x1 = _mm_xor_si128(x1, y1);
+			y2 = _mm_xor_si128(y2, _mm_loadu_si128((const __m128i*)(buf + 32))), x2 = _mm_xor_si128(x2, y2);
+			y3 = _mm_xor_si128(y3, _mm_loadu_si128((const __m128i*)(buf + 48))), x3 = _mm_xor_si128(x3, y3);
+			buf += 64;
+			len -= 64;
+		}
+		/* Reduce x0 ... x3 to just x0. */
+		k = _mm_setr_epi32(0xf20c0dfe, 0, 0x493c7d27, 0);
+		y0 = clmul_lo(x0, k), x0 = clmul_hi(x0, k);
+		y2 = clmul_lo(x2, k), x2 = clmul_hi(x2, k);
+		y0 = _mm_xor_si128(y0, x1), x0 = _mm_xor_si128(x0, y0);
+		y2 = _mm_xor_si128(y2, x3), x2 = _mm_xor_si128(x2, y2);
+		k = _mm_setr_epi32(0x3da6d0cb, 0, 0xba4fc28e, 0);
+		y0 = clmul_lo(x0, k), x0 = clmul_hi(x0, k);
+		y0 = _mm_xor_si128(y0, x2), x0 = _mm_xor_si128(x0, y0);
+		/* Reduce 128 bits to 32 bits, and multiply by x^32. */
+		crc0 = _mm_crc32_u64(0, _mm_extract_epi64(x0, 0));
+		crc0 = _mm_crc32_u64(crc0, _mm_extract_epi64(x0, 1));
+	}
+	if (len >= 16) {
+		/* First vector chunk. */
+		__m128i x0 = _mm_loadu_si128((const __m128i*)buf), y0;
+		__m128i k;
+		k = _mm_setr_epi32(0xf20c0dfe, 0, 0x493c7d27, 0);
+		x0 = _mm_xor_si128(_mm_cvtsi32_si128(crc0), x0);
+		buf += 16;
+		len -= 16;
+		/* Main loop. */
+		while (len >= 16) {
+			y0 = clmul_lo(x0, k), x0 = clmul_hi(x0, k);
+			y0 = _mm_xor_si128(y0, _mm_loadu_si128((const __m128i*)buf)), x0 = _mm_xor_si128(x0, y0);
+			buf += 16;
+			len -= 16;
+		}
+		/* Reduce 128 bits to 32 bits, and multiply by x^32. */
+		crc0 = _mm_crc32_u64(0, _mm_extract_epi64(x0, 0));
+		crc0 = _mm_crc32_u64(crc0, _mm_extract_epi64(x0, 1));
+	}
 
-	return pg_comp_crc32c_sse42_tail(crc, buf, len);
+	return pg_comp_crc32c_sse42_tail(crc0, buf, len);
 }
-- 
2.48.1