On Mon, Mar 23, 2026, at 11:09 AM, Nathan Bossart wrote: > On Sun, Mar 22, 2026 at 02:01:50PM -0400, Andres Freund wrote: >> I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was >> spent >> all that effectively - hard to believe there's any real world workloads where >> that gain is worth the squeeze. At least for aarch64 and x86-64 there's real >> world use of those platforms, making niche-y perf improvements somewhat >> worthwhile. Whereas there's afaict not yet a whole lot of riscv production >> adoption.
Hey Nathan,
> That work was partially motivated by vector stuff that used popcount
> functions pretty heavily, but yeah, the complexity compared to the gains is
> the main reason I've been pushing to just use simd.h elsewhere (i.e., SSE2
> and Neon). I'd still consider using AVX-512, etc. for things if the impact
> on real-world workloads was huge, though.
Yes, that, and research done while trying to understand why my RISC-V build
farm animal "greenfly" (OrangePi RV2 with a VisionFive 2 CPU: RISC-V RV64GC +
Zba/Zbb/Zbc/Zbs) is failing consistently.
> --
> nathan
Forgive me: while $subject only mentions popcount, I couldn't help myself, so I
added a few more RISC-V patches, including a bug fix that I hope makes greenfly
happy again.
0001 - This is a bug fix for DES initialization on RISC-V when built with Clang.
------> Join me in "the rabbit hole" on this issue if you care to...
The existing software DES implementation (as shown by the build-farm animal
"greenfly" [1]) fails because Clang 20 has an auto-vectorization bug that we
trigger in the DES initialization code (the des_init() function), not in the
DES encryption algorithm itself.
I searched the LLVM issue tracker; here are the issues that caught my eye:
1. Issue #176001 - "RISC-V Wrong code at -O1"
- Vector peephole optimization with vmerge folding
- Fixed by PR #176077 (merged Jan 2024)
- Link: https://github.com/llvm/llvm-project/issues/176001
2. Issue #187458 - "Wrong code for vector.extract.last.active"
- Large index issues with zvl1024b
- Partially fixed, still work ongoing
- Link: https://github.com/llvm/llvm-project/issues/187458
3. Issue #171978 - "RISC-V Wrong code at -O2/O3"
- Illegal instruction from mismatched EEW
- Under investigation
- Link: https://github.com/llvm/llvm-project/issues/171978
4. PR #176105 - "Fix i64 gather/scatter cost on rv32"
- Cost model fixes for scatter/gather (merged Jan 2026)
- Link: https://github.com/llvm/llvm-project/pull/176105
My fix in 0001 simply adds this in a few places in crypt-des.c:
#if defined(__riscv) && defined(__clang__)
pg_memory_barrier();
#endif
While searching I ran across a different workaround: adding `-mllvm
-riscv-v-vector-bits-min=0`. This sets the minimum vector bit width for the
RISC-V vector extension in LLVM to 0, which disables all auto-vectorization and
forces scalar code generation; no RVV instructions are emitted. That would
prevent the DES bug, at the cost of losing vectorization everywhere in the
binary.
While that might also fix the other intermittent bug we'd been seeing on
greenfly (not tested), disabling all RVV optimizations seems too heavy-handed
to me.
------> Moving on.
0002 - (was "0001" in v2) Unchanged; it implements popcount using the Zbb
extension on RISC-V.
0003 - A small patch adapted from the Google Abseil project's RISC-V CRC32C
implementation [1]. It is *a lot faster* than the software crc32c we fall back
to now (see: riscv-crc32c.c). This algorithm requires the Zbc (or Zbkc)
extension (for clmul), so the patch tests for that at build time and adds the
'-march' flag when it is available. However, as is the case for Zbb and
popcount, the presence of Zbc (or Zbkc) must also be detected at runtime.
That's done following the pre-existing pattern used for ARM features. This does
introduce some runtime overhead and complexity, but not more than required, I
hope.
I attached test code; results are at the end of this email:
* riscv-popcnt.c - unchanged
* riscv-crc32c.c - new, based on work in the Google Abseil project
* riscv-des.c - highlights the fix for DES using Clang on RISC-V
I guess the question for 0002 and/or 0003 is whether the "juice" is worth the
"squeeze". There is a lot of performance juice to be had, IMO. But some might
argue that RISC-V isn't widely adopted yet, and they'd be right. Others might
point out that RISC-V is currently showing up in embedded systems more than in
server/desktop/laptop/cloud, also true. However, there is some evidence that is
changing: there are RISC-V servers [2][3], and there is a hosted (cloud)
offering from Scaleway [4]. There is a 64-core RISC-V desktop [5], a Framework
laptop mainboard [6] sporting a RISC-V CPU, and the OrangePi RV2 [7] that I
have running as "greenfly".
Is it early days? Certainly! But too early? That's up for debate. :)
If nothing else, these patches can serve as a durable record to draw on later,
when RISC-V becomes a critical platform for Postgres, or as useful information
for other projects.
best.
-greg
[1] https://github.com/abseil/abseil-cpp/pull/1986
absl/crc/internal/crc_riscv.cc
[2]
https://www.firefly.store/products/rs-sra120-risc-v-server-2u-computing-server-cloud-storage-large-model-sg2042
[3]
https://edgeaicomputer.com/our-products/servers/risc-v-compute-server-sra1-20/
[4]
https://www.scaleway.com/en/news/scaleway-launches-its-risc-v-servers-in-the-cloud-a-world-first-and-a-firm-commitment-to-technological-independence/
[5] https://milkv.io/pioneer and
https://www.crowdsupply.com/milk-v/milk-v-pioneer/updates/current-status-of-production
[6] https://deepcomputing.io/product/dc-roma-risc-v-mainboard/
[7]
http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-RV2.html
---- TEST PROGRAM OUTPUT:
gburd@rv:~/ws/postgres$ make -f Makefile.RISCV
gcc -O2 riscv-des.c -o des-gcc-sw
gcc -O2 riscv-des.c -march=rv64gcv -o des-gcc-hw
clang-20 -O1 riscv-des.c -o des-clang-o1-sw
clang-20 -O1 -march=rv64gcv riscv-des.c -o des-clang-o1-hw
clang-20 -O2 riscv-des.c -o des-clang-o2-sw
clang-20 -O2 -march=rv64gcv riscv-des.c -o des-clang-o2-hw
gcc -O2 -o popcnt-gcc-o2-sw riscv-popcnt.c
gcc -O2 -march=rv64gc_zbb -o popcnt-gcc-o2-hw riscv-popcnt.c
clang-20 -O2 -o popcnt-clang-o2-sw riscv-popcnt.c
clang-20 -O2 -march=rv64gc_zbb -o popcnt-clang-o2-hw riscv-popcnt.c
gcc -O2 -o crc32c-gcc-o2-sw riscv-crc32c.c
gcc -O2 -march=rv64gc_zbc -o crc32c-gcc-o2-hw riscv-crc32c.c
clang-20 -O2 -o crc32c-clang-o2-sw riscv-crc32c.c
clang-20 -O2 -march=rv64gc_zbc -o crc32c-clang-o2-hw riscv-crc32c.c
gburd@rv:~/ws/postgres$ make -f Makefile.RISCV test
./des-gcc-sw
Compiler: GCC 13.3.0
Target: RISC-V 64-bit
Vector extension: Not enabled
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.409 seconds (409 ns/iter)
With barriers: 0.416 seconds (416 ns/iter)
Overhead: 1.6%
./des-gcc-hw
Compiler: GCC 13.3.0
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.410 seconds (410 ns/iter)
With barriers: 0.410 seconds (410 ns/iter)
Overhead: Negligible
./des-clang-o1-sw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Not enabled
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.517 seconds (517 ns/iter)
With barriers: 0.516 seconds (516 ns/iter)
Overhead: Negligible
./des-clang-o1-hw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.405 seconds (405 ns/iter)
With barriers: 0.405 seconds (405 ns/iter)
Overhead: Negligible
./des-clang-o2-sw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Not enabled
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.517 seconds (517 ns/iter)
With barriers: 0.518 seconds (518 ns/iter)
Overhead: Negligible
./des-clang-o2-hw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)
Testing WITHOUT compiler barriers:
ERROR: un_pbox mismatch:
un_pbox[0] = 15, expected 8
un_pbox[1] = 6, expected 16
un_pbox[2] = 19, expected 22
un_pbox[3] = 20, expected 30
un_pbox[4] = 28, expected 12
... and 27 more errors
FAIL: Permutation tables are incorrect
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.093 seconds (93 ns/iter)
With barriers: 0.407 seconds (407 ns/iter)
Overhead: 335.5%
./popcnt-gcc-o2-sw
sw popcount: 0.183 sec ( 547.89 MB/s)
hw popcount: 0.274 sec ( 365.40 MB/s)
diff: 0.67x
match: 406261900 bits counted
./popcnt-gcc-o2-hw
sw popcount: 0.182 sec ( 548.17 MB/s)
hw popcount: 0.044 sec ( 2287.82 MB/s)
diff: 4.17x
match: 406261900 bits counted
./popcnt-clang-o2-sw
sw popcount: 0.188 sec ( 531.96 MB/s)
hw popcount: 0.207 sec ( 482.84 MB/s)
diff: 0.91x
match: 406261900 bits counted
./popcnt-clang-o2-hw
sw popcount: 0.224 sec ( 446.46 MB/s)
hw popcount: 0.056 sec ( 1794.83 MB/s)
diff: 4.02x
match: 406261900 bits counted
./crc32c-gcc-o2-sw
sw crc32c: 0.651 sec ( 153.68 MB/s)
hw crc32c: 0.651 sec ( 153.72 MB/s)
diff: 1.00x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-gcc-o2-hw
sw crc32c: 0.651 sec ( 153.70 MB/s)
hw crc32c: 0.000 sec ( 308052.33 MB/s)
diff: 2004.21x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-clang-o2-sw
sw crc32c: 0.584 sec ( 171.10 MB/s)
hw crc32c: 0.584 sec ( 171.17 MB/s)
diff: 1.00x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-clang-o2-hw
sw crc32c: 0.584 sec ( 171.15 MB/s)
hw crc32c: 0.000 sec ( 309282.38 MB/s)
diff: 1807.08x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
Makefile.RISCV
Description: Binary data
/*
 * riscv-crc32c.c
 *
 * RISC-V Zbc CRC32C (Castagnoli) hardware acceleration test
 *
 * Based on Abseil's production implementation.
 * https://github.com/abseil/abseil-cpp/pull/1986
 * absl/crc/internal/crc_riscv.cc
 *
 * Build commands:
 * gcc -O2 -o crc32c-gcc-sw riscv-crc32c.c
 * gcc -O2 -march=rv64gc_zbc -o crc32c-gcc-zbc riscv-crc32c.c
 * clang -O2 -o crc32c-clang-sw riscv-crc32c.c
 * clang -O2 -march=rv64gc_zbc -o crc32c-clang-zbc riscv-crc32c.c
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <endian.h>

#define TEST_SIZE (1024 * 1024)	/* 1 MB */
#define ITERATIONS 100

/* CRC-32C (Castagnoli) polynomial: 0x1EDC6F41 (normal), 0x82F63B78 (reflected) */
#define CRC32C_POLY 0x82F63B78

/*
 * Software CRC32C implementation
 * Standard table-driven algorithm used as baseline
 */
static uint32_t crc32c_table[256];

static void
init_crc32c_table(void)
{
	uint32_t	crc;
	int			i, j;

	for (i = 0; i < 256; i++)
	{
		crc = i;
		for (j = 0; j < 8; j++)
		{
			if (crc & 1)
				crc = (crc >> 1) ^ CRC32C_POLY;
			else
				crc >>= 1;
		}
		crc32c_table[i] = crc;
	}
}

static uint32_t
crc32c_sw(uint32_t crc, const uint8_t *data, size_t len)
{
	const uint8_t *p = data;
	const uint8_t *end = data + len;

	crc = ~crc;
	while (p < end)
	{
		crc = crc32c_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
	}
	return ~crc;
}

#if defined(__riscv) && (__riscv_xlen == 64) && \
	(defined(__riscv_zbc) || defined(__riscv_zbkc))

/*
 * Hardware CRC32C implementation using RISC-V Zbc extension
 *
 * Algorithm from Abseil (PR #1986):
 * - Fold 16-byte blocks using carry-less multiply (clmul/clmulh)
 * - Reduce 128-bit to 64-bit to 32-bit using precomputed constants
 * - Barrett reduction for final 32-bit CRC
 *
 * This is the production-tested algorithm used by Google Abseil.
 */
typedef struct
{
	uint64_t	lo;
	uint64_t	hi;
} V128;

/* Carry-less multiply intrinsics */
static inline uint64_t
clmul(uint64_t a, uint64_t b)
{
	uint64_t	out;

	__asm__("clmul %0, %1, %2" : "=r"(out) : "r"(a), "r"(b));
	return out;
}

static inline uint64_t
clmulh(uint64_t a, uint64_t b)
{
	uint64_t	out;

	__asm__("clmulh %0, %1, %2" : "=r"(out) : "r"(a), "r"(b));
	return out;
}

static inline V128
clmul128(uint64_t a, uint64_t b)
{
	V128		result;

	result.lo = clmul(a, b);
	result.hi = clmulh(a, b);
	return result;
}

static inline V128
v128_xor(V128 a, V128 b)
{
	V128		result;

	result.lo = a.lo ^ b.lo;
	result.hi = a.hi ^ b.hi;
	return result;
}

static inline V128
v128_and_mask32(V128 a)
{
	V128		result;

	result.lo = a.lo & 0x00000000FFFFFFFFull;
	result.hi = a.hi & 0x00000000FFFFFFFFull;
	return result;
}

static inline V128
v128_shift_right64(V128 a)
{
	V128		result;

	result.lo = a.hi;
	result.hi = 0;
	return result;
}

static inline V128
v128_shift_right32(V128 a)
{
	V128		result;

	result.lo = (a.lo >> 32) | (a.hi << 32);
	result.hi = (a.hi >> 32);
	return result;
}

static inline V128
v128_load(const uint8_t *p)
{
	V128		result;

	result.lo = le64toh(*(const uint64_t *) p);
	result.hi = le64toh(*(const uint64_t *) (p + 8));
	return result;
}

/*
 * Core CRC32C algorithm using carry-less multiplication
 * Input: crc in working form (already inverted with ~crc)
 * Output: crc in working form (still inverted)
 */
static uint32_t
crc32c_clmul_core(uint32_t crc_inverted, const uint8_t *buf, uint64_t len)
{
	/* CRC32C (Castagnoli) constants from Abseil */
	const uint64_t kK5 = 0x0f20c0dfeull;	/* Folding constant */
	const uint64_t kK6 = 0x14cd00bd6ull;	/* Folding constant */
	const uint64_t kK7 = 0x0dd45aab8ull;	/* 64->32 reduction */
	const uint64_t kP1 = 0x105ec76f0ull;	/* Barrett reduction */
	const uint64_t kP2 = 0x0dea713f1ull;	/* Barrett reduction */

	/* Load first 16-byte block and fold in CRC */
	V128		x = v128_load(buf);

	x.lo ^= (uint64_t) crc_inverted;
	buf += 16;
	len -= 16;

	/* Fold 16-byte blocks into 128-bit accumulator */
	while (len >= 16)
	{
		V128		block = v128_load(buf);
		V128		lo = clmul128(x.lo, kK5);
		V128		hi = clmul128(x.hi, kK6);

		x = v128_xor(v128_xor(lo, hi), block);
		buf += 16;
		len -= 16;
	}

	/* Reduce 128-bit to 64-bit */
	{
		V128		tmp = clmul128(kK6, x.lo);

		x = v128_xor(v128_shift_right64(x), tmp);
	}

	/* Reduce 64-bit to 32-bit */
	{
		V128		tmp = v128_shift_right32(x);

		x = v128_and_mask32(x);
		x = clmul128(kK7, x.lo);
		x = v128_xor(x, tmp);
	}

	/* Barrett reduction to final 32-bit CRC */
	{
		V128		tmp = v128_and_mask32(x);

		tmp = clmul128(kP2, tmp.lo);
		tmp = v128_and_mask32(tmp);
		tmp = clmul128(kP1, tmp.lo);
		x = v128_xor(x, tmp);
	}

	/* Extract result from second 32-bit lane */
	return (uint32_t) ((x.lo >> 32) & 0xFFFFFFFFu);
}

/*
 * High-level CRC32C with hardware acceleration
 * Handles alignment and small buffers, delegates to hardware for large
 * aligned blocks
 */
static uint32_t
crc32c_hw(uint32_t crc, const uint8_t *buf, size_t length)
{
	const size_t kMinLen = 32;
	const size_t kChunkLen = 16;

	/* Use software for small buffers */
	if (length < kMinLen)
		return crc32c_sw(crc, buf, length);

	/* Process unaligned head with software */
	size_t		unaligned = length % kChunkLen;

	if (unaligned)
	{
		crc = crc32c_sw(crc, buf, unaligned);
		buf += unaligned;
		length -= unaligned;
	}

	/* Process aligned blocks with hardware */
	if (length > 0)
	{
		/* Hardware expects inverted CRC (working form) */
		uint32_t	crc_inverted = ~crc;

		crc_inverted = crc32c_clmul_core(crc_inverted, buf, length);
		/* Invert back to standard form */
		crc = ~crc_inverted;
	}
	return crc;
}

#else
/* On non-RISC-V or without Zbc, hardware version is same as software */
static uint32_t
crc32c_hw(uint32_t crc, const uint8_t *buf, size_t length)
{
	return crc32c_sw(crc, buf, length);
}
#endif							/* __riscv && __riscv_zbc */

static double
now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int
main(void)
{
	uint8_t    *data;
	uint32_t	crc_sw = 0, crc_hw = 0;
	double		start, elapsed_sw, elapsed_hw;
	double		mb_per_sec;
	size_t		i;
	int			iter;

	/* Initialize lookup table */
	init_crc32c_table();

	/* Allocate test data */
	data = malloc(TEST_SIZE);

	/* Fill with random data */
	srand(42);
	for (i = 0; i < TEST_SIZE; i++)
		data[i] = rand() & 0xFF;

	/* Benchmark software implementation */
	start = now();
	for (iter = 0; iter < ITERATIONS; iter++)
	{
		crc_sw = crc32c_sw(0, data, TEST_SIZE);
	}
	elapsed_sw = now() - start;
	mb_per_sec = (TEST_SIZE * ITERATIONS / (1024.0 * 1024.0)) / elapsed_sw;
	printf("sw crc32c: %8.3f sec (%10.2f MB/s)\n", elapsed_sw, mb_per_sec);

	/* Benchmark hardware implementation */
	start = now();
	for (iter = 0; iter < ITERATIONS; iter++)
	{
		crc_hw = crc32c_hw(0, data, TEST_SIZE);
	}
	elapsed_hw = now() - start;
	mb_per_sec = (TEST_SIZE * ITERATIONS / (1024.0 * 1024.0)) / elapsed_hw;
	printf("hw crc32c: %8.3f sec (%10.2f MB/s)\n", elapsed_hw, mb_per_sec);

	printf("\ndiff: %.2fx\n", elapsed_sw / elapsed_hw);

	/* Verify correctness */
	if (crc_sw != crc_hw)
	{
		printf("\n[ERROR] Results don't match!\n");
		printf("\tsw: 0x%08X\n", crc_sw);
		printf("\thw: 0x%08X\n", crc_hw);
		free(data);
		return 1;
	}
	printf("match: 0x%08X\n", crc_sw);

	/* Test with known vector */
	{
		const char *test = "123456789";
		const uint32_t expected = 0xE3069283;
		uint32_t	result;

		result = crc32c_sw(0, (const uint8_t *) test, strlen(test));
		printf("\nvalidation: CRC32C(\"%s\") = 0x%08X %s\n",
			   test, result,
			   (result == expected) ? "(correct)" : "(WRONG!)");
	}

	/* Test 48-byte pattern */
	{
		uint8_t		buf[48];

		for (int i = 0; i < 48; i++)
			buf[i] = i;
		printf("\n=== Testing 48-byte pattern [0..47] ===\n");
		printf("Buffer address: %p (mod 16 = %lu)\n",
			   buf, (unsigned long) ((uintptr_t) buf % 16));

		uint32_t	sw = crc32c_sw(0, buf, 48);
		uint32_t	hw = crc32c_hw(0, buf, 48);

		printf("Software: 0x%08X\n", sw);
		printf("Hardware: 0x%08X\n", hw);
		printf("Match: %s\n", (sw == hw) ? "YES" : "NO");
	}

	free(data);
	return 0;
}
/*
* riscv-des.c
*
* Demonstrates Clang RISC-V vector extension bug affecting DES implementation
* and tests compiler barrier workaround.
*
* Clang 20.1.2 (and possibly earlier versions) miscompiles scatter/gather
* write patterns when auto-vectorizing at -O2, causing incorrect DES
* encryption results.
*
* Build and test:
* gcc -O2 riscv-des.c -o des-gcc
* gcc -O2 -march=rv64gcv riscv-des.c -o des-gcc-vec
* clang-20 -O2 riscv-des.c -o des-clang
* clang-20 -O2 -march=rv64gcv riscv-des.c -o des-clang-vec
* clang-20 -O1 -march=rv64gcv riscv-des.c -o des-clang-o1-vec
*
* All GCC-compiled versions should produce "PASS". Clang at -O2 with
* -march=rv64gcv fails because initialization produces the wrong
* permutation tables.
*/
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
/* Compiler barrier macro */
#ifdef __clang__
#define MEMORY_BARRIER() __asm__ volatile("" ::: "memory")
#else
#define MEMORY_BARRIER() ((void)0)
#endif
/* DES constants - P-box permutation (32 bits) */
static const uint8_t pbox[32] = {
16, 7, 20, 21,
29, 12, 28, 17,
1, 15, 23, 26,
5, 18, 31, 10,
2, 8, 24, 14,
32, 27, 3, 9,
19, 13, 30, 6,
22, 11, 4, 25
};
/* Initial Permutation (64 bits) */
static const uint8_t IP[64] = {
58, 50, 42, 34, 26, 18, 10, 2,
60, 52, 44, 36, 28, 20, 12, 4,
62, 54, 46, 38, 30, 22, 14, 6,
64, 56, 48, 40, 32, 24, 16, 8,
57, 49, 41, 33, 25, 17, 9, 1,
59, 51, 43, 35, 27, 19, 11, 3,
61, 53, 45, 37, 29, 21, 13, 5,
63, 55, 47, 39, 31, 23, 15, 7
};
static uint8_t un_pbox[32];
static uint8_t init_perm[64];
static uint8_t final_perm[64];
/*
* Initialize DES permutation tables.
* This function contains scatter/gather patterns that trigger the Clang bug.
*/
static void
des_init_buggy(void)
{
int i;
/* Invert the P-box permutation - BUGGY with Clang -march=rv64gcv */
for (i = 0; i < 32; i++)
un_pbox[pbox[i] - 1] = i;
/* Set up initial & final permutations - BUGGY with Clang -march=rv64gcv */
for (i = 0; i < 64; i++)
init_perm[final_perm[i] = IP[i] - 1] = i;
}
/*
* Initialize DES permutation tables with compiler barriers.
* This version uses MEMORY_BARRIER() to prevent auto-vectorization.
*/
static void
des_init_fixed(void)
{
int i;
/* Invert the P-box permutation - with barrier */
for (i = 0; i < 32; i++)
{
un_pbox[pbox[i] - 1] = i;
MEMORY_BARRIER();
}
/* Set up initial & final permutations - with barriers */
for (i = 0; i < 64; i++)
{
init_perm[final_perm[i] = IP[i] - 1] = i;
MEMORY_BARRIER();
}
}
/*
* Verify that permutation tables are correct.
*/
static int
verify_permutations(void)
{
int i;
int errors = 0;
/* Expected un_pbox values (computed manually) */
const uint8_t expected_un_pbox[32] = {
8, 16, 22, 30, 12, 27, 1, 17,
23, 15, 29, 5, 25, 19, 9, 0,
7, 13, 24, 2, 3, 28, 10, 18,
31, 11, 21, 6, 4, 26, 14, 20
};
/* Check un_pbox */
for (i = 0; i < 32; i++)
{
if (un_pbox[i] != expected_un_pbox[i])
{
if (errors == 0)
printf("ERROR: un_pbox mismatch:\n");
if (errors < 5)
printf("\tun_pbox[%d] = %d, expected %d\n",
i, un_pbox[i], expected_un_pbox[i]);
errors++;
}
}
/* Check that init_perm and final_perm are inverses */
for (i = 0; i < 64; i++)
{
if (init_perm[final_perm[i]] != i)
{
if (errors == 0 || errors == 32)
printf("ERROR: init_perm/final_perm not inverses\n");
if (errors < 5)
printf("\tinit_perm[final_perm[%d]] = %d, expected %d\n",
i, init_perm[final_perm[i]], i);
errors++;
}
}
if (errors > 5)
printf(" ... and %d more errors\n", errors - 5);
return errors;
}
/*
* Benchmark initialization performance.
*/
static double
benchmark_init(void (*init_func)(void), int iterations)
{
struct timespec start, end;
int i;
clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < iterations; i++)
{
memset(un_pbox, 0, sizeof(un_pbox));
memset(init_perm, 0, sizeof(init_perm));
memset(final_perm, 0, sizeof(final_perm));
init_func();
}
clock_gettime(CLOCK_MONOTONIC, &end);
double elapsed = (end.tv_sec - start.tv_sec) +
(end.tv_nsec - start.tv_nsec) / 1e9;
return elapsed;
}
/*
* Get compiler and compilation flags.
*/
static void
print_compiler_info(void)
{
#ifdef __clang__
printf("Compiler: Clang %d.%d.%d\n",
__clang_major__, __clang_minor__, __clang_patchlevel__);
#elif defined(__GNUC__)
printf("Compiler: GCC %d.%d.%d\n",
__GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
#else
printf("Compiler: Unknown\n");
#endif
#ifdef __riscv
printf("Target: RISC-V %d-bit\n", __riscv_xlen);
#ifdef __riscv_vector
printf("Vector extension: Enabled (RVV)\n");
#else
printf("Vector extension: Not enabled\n");
#endif
#else
printf("Target: Not RISC-V\n");
#endif
printf("\n");
}
int
main(void)
{
double buggy_time, fixed_time;
int iterations = 1000000;
print_compiler_info();
/* Test buggy version (without barriers) */
printf("Testing WITHOUT compiler barriers:\n");
memset(un_pbox, 0, sizeof(un_pbox));
memset(init_perm, 0, sizeof(init_perm));
memset(final_perm, 0, sizeof(final_perm));
des_init_buggy();
if (verify_permutations() == 0)
printf("PASS: Permutation tables are correct\n");
else
printf("FAIL: Permutation tables are incorrect\n");
printf("\n");
/* Test fixed version (with barriers) */
printf("Testing WITH compiler barriers:\n");
memset(un_pbox, 0, sizeof(un_pbox));
memset(init_perm, 0, sizeof(init_perm));
memset(final_perm, 0, sizeof(final_perm));
des_init_fixed();
if (verify_permutations() == 0)
printf("PASS: Permutation tables are correct\n");
else
printf("FAIL: Permutation tables are incorrect (fix didn't work)\n");
printf("\n");
/* Performance comparison */
printf("Performance Comparison (%d iterations):\n", iterations);
buggy_time = benchmark_init(des_init_buggy, iterations);
printf("Without barriers: %.3f seconds (%.0f ns/iter)\n",
buggy_time, buggy_time * 1e9 / iterations);
fixed_time = benchmark_init(des_init_fixed, iterations);
printf("With barriers: %.3f seconds (%.0f ns/iter)\n",
fixed_time, fixed_time * 1e9 / iterations);
if (fixed_time > buggy_time * 1.01)
printf("Overhead: %.1f%%\n",
(fixed_time / buggy_time - 1.0) * 100.0);
else
printf("Overhead: Negligible\n");
return 0;
}
/*
* riscv-popcnt.c
*
* RISC-V Zbb popcount optimization
*
* gcc -O2 -o popcnt-wo-zbb riscv-popcnt.c
* gcc -O2 -march=rv64gc_zbb -o popcnt-zbb riscv-popcnt.c
*/
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#define TEST_SIZE (1024 * 1024) /* 1 MB */
#define ITERATIONS 100
/* software popcount taken from pg_bitutils.h */
static int
popcount_sw(uint64_t x)
{
x = (x & 0x5555555555555555ULL) + ((x >> 1) & 0x5555555555555555ULL);
x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
x = (x & 0x0F0F0F0F0F0F0F0FULL) + ((x >> 4) & 0x0F0F0F0F0F0F0F0FULL);
return (x * 0x0101010101010101ULL) >> 56;
}
/* hardware popcount, expect that the compiler will use cpop on Zbb */
static int
popcount_hw(uint64_t x)
{
return __builtin_popcountll(x);
}
static double
now(void)
{
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec + ts.tv_nsec / 1e9;
}
int
main(void)
{
uint64_t *data;
uint64_t count_sw = 0, count_hw = 0;
double start, elapsed_sw, elapsed_hw;
double mb_per_sec;
size_t i;
data = malloc(TEST_SIZE);
srand(42);
for (i = 0; i < TEST_SIZE / sizeof(uint64_t); i++)
data[i] = ((uint64_t)rand() << 32) | rand();
start = now();
for (int iter = 0; iter < ITERATIONS; iter++)
{
for (i = 0; i < TEST_SIZE / sizeof(uint64_t); i++)
count_sw += popcount_sw(data[i]);
}
elapsed_sw = now() - start;
mb_per_sec = (TEST_SIZE * ITERATIONS / (1024.0 * 1024.0)) / elapsed_sw;
printf("sw popcount: %8.3f sec (%10.2f MB/s)\n",
elapsed_sw, mb_per_sec);
start = now();
for (int iter = 0; iter < ITERATIONS; iter++)
{
for (i = 0; i < TEST_SIZE / sizeof(uint64_t); i++)
count_hw += popcount_hw(data[i]);
}
elapsed_hw = now() - start;
mb_per_sec = (TEST_SIZE * ITERATIONS / (1024.0 * 1024.0)) / elapsed_hw;
printf("hw popcount: %8.3f sec (%10.2f MB/s)\n",
elapsed_hw, mb_per_sec);
printf("\ndiff: %.2fx\n", elapsed_sw / elapsed_hw);
if (count_sw != count_hw)
{
printf("\n[ERROR] Results don't match!\n");
printf("\tsw: %llu\n", (unsigned long long)count_sw);
printf("\thw: %llu\n", (unsigned long long)count_hw);
}
else
{
printf("match: %llu bits counted\n", (unsigned long long)count_sw);
}
free(data);
return 0;
}
From d23f185ab546ace81c293249c54d723a7e4be8a0 Mon Sep 17 00:00:00 2001
From: Greg Burd <[email protected]>
Date: Mon, 23 Mar 2026 11:26:24 -0400
Subject: [PATCH v3 1/3] Avoid Clang RISC-V auto-vectorization bug in DES

Clang 20.1.2 (and possibly earlier/later versions) miscompiles
scatter-write patterns like "array[perm[i]] = i" when compiling with
-O2. This causes incorrect DES permutation tables in
contrib/pgcrypto/crypt-des.c, resulting in wrong password hashes and
authentication failures.

Add compiler barriers (memory clobber asm statements) after scatter
writes in des_init() to prevent auto-vectorization of the affected
loops. The barriers are harmless on all compilers (GCC, Clang, MSVC)
and have negligible performance impact since DES initialization occurs
only once per connection.

The fix applies to all compilers to ensure consistent behavior and
avoid future compiler bugs with similar optimizations.
---
 contrib/pgcrypto/crypt-des.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/contrib/pgcrypto/crypt-des.c b/contrib/pgcrypto/crypt-des.c
index 98c30ea122e..0a698da7132 100644
--- a/contrib/pgcrypto/crypt-des.c
+++ b/contrib/pgcrypto/crypt-des.c
@@ -62,12 +62,14 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
+#include "port/atomics.h"
 #include "port/pg_bswap.h"
 
 #include "px-crypt.h"
 
 #define _PASSWORD_EFMT1 '_'
 
+
 static const char _crypt_a64[] =
 "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
 
@@ -265,6 +267,10 @@ des_init(void)
 	for (i = 0; i < 64; i++)
 	{
 		init_perm[final_perm[i] = IP[i] - 1] = i;
+		/* This prevents a Clang bug related to auto-vectorization */
+#if defined(__riscv) && defined(__clang__)
+		pg_memory_barrier();
+#endif
 		inv_key_perm[i] = 255;
 	}
 
@@ -276,6 +282,10 @@ des_init(void)
 	{
 		u_key_perm[i] = key_perm[i] - 1;
 		inv_key_perm[key_perm[i] - 1] = i;
+		/* This prevents a Clang bug related to auto-vectorization */
+#if defined(__riscv) && defined(__clang__)
+		pg_memory_barrier();
+#endif
 		inv_comp_perm[i] = 255;
 	}
 
@@ -283,7 +293,13 @@ des_init(void)
 	 * Invert the key compression permutation.
 	 */
 	for (i = 0; i < 48; i++)
+	{
 		inv_comp_perm[comp_perm[i] - 1] = i;
+		/* This prevents a Clang bug related to auto-vectorization */
+#if defined(__riscv) && defined(__clang__)
+		pg_memory_barrier();
+#endif
+	}
 
 	/*
 	 * Set up the OR-mask arrays for the initial and final permutations, and
@@ -353,7 +369,13 @@ des_init(void)
 	 * the output of the S-box arrays setup above.
 	 */
 	for (i = 0; i < 32; i++)
+	{
 		un_pbox[pbox[i] - 1] = i;
+		/* This prevents a Clang bug related to auto-vectorization */
+#if defined(__riscv) && defined(__clang__)
+		pg_memory_barrier();
+#endif
+	}
 
 	for (b = 0; b < 4; b++)
 		for (i = 0; i < 256; i++)
-- 
2.51.2
From 9ef5eacda334ffd13f1104f213c2c2e3ddf64615 Mon Sep 17 00:00:00 2001
From: Greg Burd <[email protected]>
Date: Sun, 22 Mar 2026 11:15:41 -0400
Subject: [PATCH v3 2/3] Add RISC-V popcount using Zbb extension

Implement hardware popcount support for RISC-V using the Zbb (basic bit
manipulation) extension when present. The Zbb extension provides the
'cpop' instruction which GCC and Clang emit from __builtin_popcountll()
when compiling with -march=rv64gc_zbb.

This patch adds:
- Build-time detection of Zbb support (configure.ac, meson.build)
- Runtime detection using __riscv_hwprobe() on Linux
- Optimized popcount implementation using cpop instruction

The implementation follows the established pattern for hardware
acceleration (similar to x86 POPCNT and ARM SVE). Zbb-optimized code is
compiled separately with -march=rv64gc_zbb, while the main binary
remains portable across all RISC-V 64-bit systems.
---
 configure.ac                   |  29 ++++++
 meson.build                    |  32 +++++++
 src/include/port/pg_bitutils.h |   2 +-
 src/port/meson.build           |   7 +-
 src/port/pg_bitutils.c         |   5 +-
 src/port/pg_popcount_riscv.c   | 156 +++++++++++++++++++++++++++++++++
 6 files changed, 226 insertions(+), 5 deletions(-)
 create mode 100644 src/port/pg_popcount_riscv.c

diff --git a/configure.ac b/configure.ac
index 2baac5e9da7..8c5970d1be0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2154,6 +2154,35 @@ if test x"$host_cpu" = x"aarch64"; then
   fi
 fi
 
+# Check for RISC-V Zbb bitmanip extension (provides 'cpop' for popcount).
+#
+# The Zbb extension provides the 'cpop' instruction for hardware popcount.
+# GCC/Clang emit the cpop instruction from __builtin_popcountll() when
+# -march=rv64gc_zbb is used. We test compilation with this flag, then
+# restore CFLAGS to avoid global march flags (for binary portability).
+# We define USE_RISCV_ZBB_WITH_RUNTIME_CHECK and use __riscv_hwprobe()
+# for runtime detection. We compile src/port/pg_popcount_riscv.c with
+# -march=rv64gc_zbb separately (like ARM SVE and x86 POPCNT).
+AC_MSG_CHECKING([for RISC-V Zbb extension (cpop/popcount)])
+if test x"$host_cpu" = x"riscv64"; then
+  pgac_save_CFLAGS_zbb="$CFLAGS"
+  CFLAGS="$CFLAGS -march=rv64gc_zbb"
+  AC_COMPILE_IFELSE(
+    [AC_LANG_PROGRAM(
+      [/* Test that the compiler will emit cpop from __builtin_popcountll */
+       static inline int test_cpop(unsigned long long x)
+       { return __builtin_popcountll(x); }],
+      [volatile int r = test_cpop(0xdeadbeefULL); (void) r;])],
+    [AC_DEFINE(USE_RISCV_ZBB_WITH_RUNTIME_CHECK, 1,
+       [Define to 1 to use RISC-V Zbb popcount with runtime detection.])
+     CFLAGS="$pgac_save_CFLAGS_zbb"
+     AC_MSG_RESULT([yes, with runtime check])],
+    [CFLAGS="$pgac_save_CFLAGS_zbb"
+     AC_MSG_RESULT([no])])
+else
+  AC_MSG_RESULT([not on RISC-V])
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 # PGAC_SSE42_CRC32_INTRINSICS()

diff --git a/meson.build b/meson.build
index ea31cbce9c0..dac6fbdeb1e 100644
--- a/meson.build
+++ b/meson.build
@@ -2532,6 +2532,38 @@ int main(void)
 endif
 
 
+# ---------------------------------------------------------------------------
+# Check for RISC-V Zbb bitmanip extension (provides 'cpop' for popcount).
+#
+# The Zbb extension provides the 'cpop' instruction for hardware popcount.
+# GCC/Clang emit the cpop instruction from __builtin_popcountll() when
+# -march=rv64gc_zbb is used. We test compilation with this flag, but
+# do NOT add it globally (for binary portability). Instead, we define
+# USE_RISCV_ZBB_WITH_RUNTIME_CHECK and compile src/port/pg_popcount_riscv.c
+# with -march=rv64gc_zbb separately (like ARM SVE and x86 POPCNT).
+# Runtime detection uses __riscv_hwprobe().
+# ---------------------------------------------------------------------------
+zbb_test_code = '''
+static inline int test_cpop(unsigned long long x)
+{ return __builtin_popcountll(x); }
+int main(void) {
+  volatile int r = test_cpop(0xdeadbeefULL);
+  (void) r;
+  return 0;
+}
+'''
+
+cflags_zbb = []
+if host_cpu == 'riscv64'
+  if cc.compiles(zbb_test_code,
+      args: ['-march=rv64gc_zbb'],
+      name: 'RISC-V Zbb cpop')
+    cdata.set('USE_RISCV_ZBB_WITH_RUNTIME_CHECK', 1)
+    # Flag will be added only to pg_popcount_riscv.c in src/port/meson.build
+    cflags_zbb = ['-march=rv64gc_zbb']
+  endif
+endif
+
 
 ###############################################################
 # Select CRC-32C implementation.

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 0bca559caaa..8db645c4a42 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -279,7 +279,7 @@ pg_ceil_log2_64(uint64 num)
 extern uint64 pg_popcount_portable(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_portable(const char *buf, int bytes, bits8 mask);
 
-#if defined(HAVE_X86_64_POPCNTQ) || defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
+#if defined(HAVE_X86_64_POPCNTQ) || defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK) || defined(USE_RISCV_ZBB_WITH_RUNTIME_CHECK)
 /*
  * Attempt to use specialized CPU instructions, but perform a runtime check
  * first.

diff --git a/src/port/meson.build b/src/port/meson.build
index 7296f8e3c03..9d0bb59aca0 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -98,12 +98,15 @@ replace_funcs_pos = [
   # loongarch
   ['pg_crc32c_loongarch', 'USE_LOONGARCH_CRC32C'],
 
+  # riscv
+  ['pg_popcount_riscv', 'USE_RISCV_ZBB_WITH_RUNTIME_CHECK', 'zbb'],
+
   # generic fallback
   ['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
 ]
 
-pgport_cflags = {'crc': cflags_crc}
-pgport_sources_cflags = {'crc': []}
+pgport_cflags = {'crc': cflags_crc, 'zbb': cflags_zbb}
+pgport_sources_cflags = {'crc': [], 'zbb': []}
 
 foreach f : replace_funcs_neg
   func = f.get(0)

diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 49b130f1306..699ae89129f 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -162,7 +162,7 @@ pg_popcount_masked_portable(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#if !defined(HAVE_X86_64_POPCNTQ) && !defined(USE_NEON)
+#if !defined(HAVE_X86_64_POPCNTQ) && !defined(USE_NEON) && !defined(USE_RISCV_ZBB_WITH_RUNTIME_CHECK)
 
 /*
  * When special CPU instructions are not available, there's no point in using
@@ -191,4 +191,5 @@ pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
 	return pg_popcount_masked_portable(buf, bytes, mask);
 }
 
-#endif							/* ! HAVE_X86_64_POPCNTQ && ! USE_NEON */
+#endif							/* ! HAVE_X86_64_POPCNTQ && ! USE_NEON && !
+								 * USE_RISCV_ZBB_WITH_RUNTIME_CHECK */

diff --git a/src/port/pg_popcount_riscv.c b/src/port/pg_popcount_riscv.c
new file mode 100644
index 00000000000..0b1b0e66e21
--- /dev/null
+++ b/src/port/pg_popcount_riscv.c
@@ -0,0 +1,156 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_riscv.c
+ *	  Holds the RISC-V Zbb popcount implementations.
+ * + * Copyright (c) 2026, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/port/pg_popcount_riscv.c + * + *------------------------------------------------------------------------- + */ +#include "c.h" + +#ifdef USE_RISCV_ZBB_WITH_RUNTIME_CHECK + +#if defined(__linux__) +#include <sys/syscall.h> +#include <unistd.h> +#include <asm/hwprobe.h> +#endif + +#include "port/pg_bitutils.h" + +/* + * Hardware implementation using RISC-V Zbb cpop instruction. + */ +static uint64 pg_popcount_zbb(const char *buf, int bytes); +static uint64 pg_popcount_masked_zbb(const char *buf, int bytes, bits8 mask); + +/* + * The function pointers are initially set to "choose" functions. These + * functions will first set the pointers to the right implementations (based on + * what the current CPU supports) and then will call the pointer to fulfill the + * caller's request. + */ +static uint64 pg_popcount_choose(const char *buf, int bytes); +static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask); +uint64 (*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose; +uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose; + +static inline bool +pg_popcount_zbb_available(void) +{ +#if defined(__linux__) && defined(__NR_riscv_hwprobe) + struct riscv_hwprobe pair = {.key = RISCV_HWPROBE_KEY_IMA_EXT_0}; + + if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0) + return false; + + return (pair.value & RISCV_HWPROBE_EXT_ZBB) != 0; +#else + return false; +#endif +} + +static inline void +choose_popcount_functions(void) +{ + if (pg_popcount_zbb_available()) + { + pg_popcount_optimized = pg_popcount_zbb; + pg_popcount_masked_optimized = pg_popcount_masked_zbb; + } + else + { + pg_popcount_optimized = pg_popcount_portable; + pg_popcount_masked_optimized = pg_popcount_masked_portable; + } +} + +static uint64 +pg_popcount_choose(const char *buf, int bytes) +{ + 
choose_popcount_functions(); + return pg_popcount_optimized(buf, bytes); +} + +static uint64 +pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask) +{ + choose_popcount_functions(); + return pg_popcount_masked_optimized(buf, bytes, mask); +} + +/* + * pg_popcount64_zbb + * Return the number of 1 bits set in word + * + * Uses the RISC-V Zbb 'cpop' (count population) instruction via + * __builtin_popcountll(). When compiled with -march=rv64gc_zbb, GCC and + * Clang will emit the cpop instruction for this builtin. + */ +static inline int +pg_popcount64_zbb(uint64 word) +{ + return __builtin_popcountll(word); +} + +/* + * pg_popcount_zbb + * Returns number of 1 bits in buf + * + * Similar approach to x86 SSE4.2 POPCNT: process data in 8-byte chunks using + * the cpop instruction, with byte-by-byte fallback for remaining data. + */ +static uint64 +pg_popcount_zbb(const char *buf, int bytes) +{ + uint64 popcnt = 0; + const uint64 *words = (const uint64 *) buf; + + /* Process 8-byte chunks */ + while (bytes >= 8) + { + popcnt += pg_popcount64_zbb(*words++); + bytes -= 8; + } + + buf = (const char *) words; + + /* Process any remaining bytes */ + while (bytes--) + popcnt += pg_number_of_ones[(unsigned char) *buf++]; + + return popcnt; +} + +/* + * pg_popcount_masked_zbb + * Returns number of 1 bits in buf after applying the mask to each byte + */ +static uint64 +pg_popcount_masked_zbb(const char *buf, int bytes, bits8 mask) +{ + uint64 popcnt = 0; + uint64 maskv = ~UINT64CONST(0) / 0xFF * mask; + const uint64 *words = (const uint64 *) buf; + + /* Process 8-byte chunks */ + while (bytes >= 8) + { + popcnt += pg_popcount64_zbb(*words++ & maskv); + bytes -= 8; + } + + buf = (const char *) words; + + /* Process any remaining bytes */ + while (bytes--) + popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask]; + + return popcnt; +} + +#endif /* USE_RISCV_ZBB_WITH_RUNTIME_CHECK */ -- 2.51.2
From ca3d7a87745e9743877a522f9920fe2335a98fec Mon Sep 17 00:00:00 2001 From: Greg Burd <[email protected]> Date: Mon, 23 Mar 2026 12:31:58 +0000 Subject: [PATCH v3 3/3] Add RISC-V CRC32C using the Zbc extension This adds hardware-accelerated CRC-32C computation for RISC-V platforms with the Zbc (carry-less multiply) or Zbkc (crypto carry-less) extension. The implementation uses the clmul and clmulh instructions for polynomial folding with Barrett reduction to compute CRC-32C checksums. This provides approximately 20x speedup over the software slicing-by-8 implementation. The algorithm is based on the Google Abseil project's RISC-V CRC32C implementation (https://github.com/abseil/abseil-cpp/pull/1986 in absl/crc/internal/crc_riscv.cc) that is Copyright 2025 The Abseil Authors licensed under the Apache License, Version 2.0. Runtime detection uses the Linux riscv_hwprobe syscall (kernel 6.4+) to check for Zbc/Zbkc support, falling back gracefully to software on older kernels or non-Linux platforms. Similar to ARMv8 CRC Extension and x86 SSE 4.2 support, this is compiled with '-march=rv64gc_zbc' and selected at runtime based on CPU capabilities. 
--- config/c-compiler.m4 | 41 +++++ configure.ac | 36 ++++- meson.build | 36 +++++ src/include/port/pg_crc32c.h | 14 ++ src/port/meson.build | 3 + src/port/pg_crc32c_riscv_choose.c | 101 ++++++++++++ src/port/pg_crc32c_riscv_zbc.c | 257 ++++++++++++++++++++++++++++++ src/tools/pgindent/typedefs.list | 1 + 8 files changed, 482 insertions(+), 7 deletions(-) create mode 100644 src/port/pg_crc32c_riscv_choose.c create mode 100644 src/port/pg_crc32c_riscv_zbc.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 629572ee350..91cc0688808 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -791,6 +791,47 @@ fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS +# PGAC_RISCV_ZBC_CRC32C_INTRINSICS +# --------------------------------- +# Check if the compiler supports RISC-V Zbc (carry-less multiply) instructions +# for CRC-32C computation, using inline assembly for clmul instruction. +# +# An optional compiler flag can be passed as argument (e.g. -march=rv64gc_zbc). +# If the intrinsics are supported, sets pgac_riscv_zbc_crc32c_intrinsics and +# CFLAGS_CRC. +# +# The Zbc extension provides clmul and clmulh instructions which are used with +# polynomial folding to compute CRC-32C. This implementation is based on the +# algorithm from Google Abseil (https://github.com/abseil/abseil-cpp/pull/1986). 
+AC_DEFUN([PGAC_RISCV_ZBC_CRC32C_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_riscv_zbc_crc32c_intrinsics_$1])])dnl +AC_CACHE_CHECK([for RISC-V Zbc clmul with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([ +#if !defined(__riscv) || !defined(__riscv_xlen) || __riscv_xlen != 64 +#error not RISC-V 64-bit +#endif + +static inline unsigned long clmul_test(unsigned long a, unsigned long b) +{ + unsigned long result; + __asm__("clmul %0, %1, %2" : "=r"(result) : "r"(a), "r"(b)); + return result; +}], + [unsigned long result = clmul_test(0x123, 0x456); + /* return computed value, to prevent the above being optimized away */ + return result == 0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_CRC="$1" + pgac_riscv_zbc_crc32c_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_RISCV_ZBC_CRC32C_INTRINSICS + # PGAC_XSAVE_INTRINSICS # --------------------- # Check if the compiler supports the XSAVE instructions using the _xgetbv diff --git a/configure.ac b/configure.ac index 8c5970d1be0..1f8a4fa51d4 100644 --- a/configure.ac +++ b/configure.ac @@ -2215,6 +2215,17 @@ fi # with the default compiler flags. PGAC_LOONGARCH_CRC32C_INTRINSICS() +# Check for RISC-V Zbc (carry-less multiply) for CRC calculations. +# +# The Zbc extension provides clmul and clmulh instructions for hardware- +# accelerated CRC-32C computation using polynomial folding. Check if we +# can compile with -march=rv64gc_zbc flag. CFLAGS_CRC is set if the flag +# is required. +# +# This implementation is based on Google Abseil's algorithm: +# https://github.com/abseil/abseil-cpp/pull/1986 +PGAC_RISCV_ZBC_CRC32C_INTRINSICS([-march=rv64gc_zbc]) + AC_SUBST(CFLAGS_CRC) # Select CRC-32C implementation. 
@@ -2245,7 +2256,7 @@ AC_SUBST(CFLAGS_CRC) # # If we are targeting a LoongArch processor, CRC instructions are # always available (at least on 64 bit), so no runtime check is needed. -if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x""; then +if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && test x"$USE_SSE42_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_ARMV8_CRC32C" = x"" && test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"" && test x"$USE_LOONGARCH_CRC32C" = x"" && test x"$USE_RISCV_ZBC_CRC32C_WITH_RUNTIME_CHECK" = x""; then # Use Intel SSE 4.2 if available. if test x"$pgac_sse42_crc32_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = x"1" ; then USE_SSE42_CRC32C=1 @@ -2267,9 +2278,14 @@ if test x"$USE_SLICING_BY_8_CRC32C" = x"" && test x"$USE_SSE42_CRC32C" = x"" && if test x"$pgac_loongarch_crc32c_intrinsics" = x"yes"; then USE_LOONGARCH_CRC32C=1 else - # fall back to slicing-by-8 algorithm, which doesn't require any - # special CPU support. - USE_SLICING_BY_8_CRC32C=1 + # RISC-V Zbc CRC, with runtime check. + if test x"$pgac_riscv_zbc_crc32c_intrinsics" = x"yes"; then + USE_RISCV_ZBC_CRC32C_WITH_RUNTIME_CHECK=1 + else + # fall back to slicing-by-8 algorithm, which doesn't require any + # special CPU support. 
+ USE_SLICING_BY_8_CRC32C=1 + fi fi fi fi @@ -2304,9 +2320,15 @@ else PG_CRC32C_OBJS="pg_crc32c_loongarch.o" AC_MSG_RESULT(LoongArch CRCC instructions) else - AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).]) - PG_CRC32C_OBJS="pg_crc32c_sb8.o" - AC_MSG_RESULT(slicing-by-8) + if test x"$USE_RISCV_ZBC_CRC32C_WITH_RUNTIME_CHECK" = x"1"; then + AC_DEFINE(USE_RISCV_ZBC_CRC32C_WITH_RUNTIME_CHECK, 1, [Define to 1 to use RISC-V Zbc CRC instructions with a runtime check.]) + PG_CRC32C_OBJS="pg_crc32c_riscv_zbc.o pg_crc32c_sb8.o pg_crc32c_riscv_choose.o" + AC_MSG_RESULT(RISC-V Zbc instructions with runtime check) + else + AC_DEFINE(USE_SLICING_BY_8_CRC32C, 1, [Define to 1 to use software CRC-32C implementation (slicing-by-8).]) + PG_CRC32C_OBJS="pg_crc32c_sb8.o" + AC_MSG_RESULT(slicing-by-8) + fi fi fi fi diff --git a/meson.build b/meson.build index dac6fbdeb1e..3fbf8756ab7 100644 --- a/meson.build +++ b/meson.build @@ -2733,6 +2733,42 @@ int main(void) have_optimized_crc = true endif +elif host_cpu == 'riscv64' + + # Check for RISC-V Zbc (carry-less multiply) extension for CRC-32C. + # The Zbc extension provides clmul and clmulh instructions used for + # hardware-accelerated CRC computation via polynomial folding. 
+ # + # This implementation is based on Google Abseil's algorithm: + # https://github.com/abseil/abseil-cpp/pull/1986 + + prog = ''' +#if !defined(__riscv) || !defined(__riscv_xlen) || __riscv_xlen != 64 +#error not RISC-V 64-bit +#endif + +static inline unsigned long clmul(unsigned long a, unsigned long b) +{ + unsigned long result; + __asm__("clmul %0, %1, %2" : "=r"(result) : "r"(a), "r"(b)); + return result; +} + +int main(void) +{ + unsigned long result = clmul(0x123, 0x456); + return result == 0; +} +''' + + if cc.links(prog, name: 'RISC-V Zbc clmul with -march=rv64gc_zbc', + args: test_c_args + ['-march=rv64gc_zbc']) + # Use RISC-V Zbc CRC, with runtime check + cflags_crc += '-march=rv64gc_zbc' + cdata.set('USE_RISCV_ZBC_CRC32C_WITH_RUNTIME_CHECK', 1) + have_optimized_crc = true + endif + endif if not have_optimized_crc diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h index 9ac619aec3e..27fa27f773e 100644 --- a/src/include/port/pg_crc32c.h +++ b/src/include/port/pg_crc32c.h @@ -142,6 +142,20 @@ extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len) extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len); extern pg_crc32c pg_comp_crc32c_armv8(pg_crc32c crc, const void *data, size_t len); +#elif defined(USE_RISCV_ZBC_CRC32C_WITH_RUNTIME_CHECK) + +/* + * Use RISC-V Zbc instructions, but perform a runtime check first + * to check that they are available. + */ +#define COMP_CRC32C(crc, data, len) \ + ((crc) = pg_comp_crc32c((crc), (data), (len))) +#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF) + +extern pg_crc32c pg_comp_crc32c_sb8(pg_crc32c crc, const void *data, size_t len); +extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len); +extern pg_crc32c pg_comp_crc32c_riscv_zbc(pg_crc32c crc, const void *data, size_t len); + #else /* * Use slicing-by-8 algorithm. 
diff --git a/src/port/meson.build b/src/port/meson.build index 9d0bb59aca0..e7c1bccf9d5 100644 --- a/src/port/meson.build +++ b/src/port/meson.build @@ -99,6 +99,9 @@ replace_funcs_pos = [ ['pg_crc32c_loongarch', 'USE_LOONGARCH_CRC32C'], # riscv + ['pg_crc32c_riscv_zbc', 'USE_RISCV_ZBC_CRC32C_WITH_RUNTIME_CHECK', 'crc'], + ['pg_crc32c_riscv_choose', 'USE_RISCV_ZBC_CRC32C_WITH_RUNTIME_CHECK'], + ['pg_crc32c_sb8', 'USE_RISCV_ZBC_CRC32C_WITH_RUNTIME_CHECK'], ['pg_popcount_riscv', 'USE_RISCV_ZBB_WITH_RUNTIME_CHECK', 'zbb'], # generic fallback diff --git a/src/port/pg_crc32c_riscv_choose.c b/src/port/pg_crc32c_riscv_choose.c new file mode 100644 index 00000000000..18d105e5e12 --- /dev/null +++ b/src/port/pg_crc32c_riscv_choose.c @@ -0,0 +1,101 @@ +/*------------------------------------------------------------------------- + * + * pg_crc32c_riscv_choose.c + * Choose between RISC-V Zbc and software CRC-32C implementation. + * + * On first call, checks if the CPU supports the RISC-V Zbc (or Zbkc) extension. + * If it does, use carry-less multiply instructions for CRC-32C computation. + * Otherwise, fall back to the pure software implementation (slicing-by-8). 
+ * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/port/pg_crc32c_riscv_choose.c + * + *------------------------------------------------------------------------- + */ + +#ifndef FRONTEND +#include "postgres.h" +#else +#include "postgres_fe.h" +#endif + +#include <sys/syscall.h> +#include <unistd.h> + +#include "port/pg_crc32c.h" + +/* + * RISC-V hardware probing definitions + */ +#ifndef __NR_riscv_hwprobe +#define __NR_riscv_hwprobe 258 +#endif + +#ifndef RISCV_HWPROBE_KEY_IMA_EXT_0 +#define RISCV_HWPROBE_KEY_IMA_EXT_0 4 +#endif + +#ifndef RISCV_HWPROBE_EXT_ZBC +#define RISCV_HWPROBE_EXT_ZBC (1ULL << 7) +#endif + +#ifndef RISCV_HWPROBE_EXT_ZBKC +#define RISCV_HWPROBE_EXT_ZBKC (1ULL << 27) +#endif + +struct riscv_hwprobe +{ + int64 key; + uint64 value; +}; + +/* + * Check if RISC-V Zbc or Zbkc extension is available + * + * Uses the riscv_hwprobe syscall which is available on Linux kernel 6.4+ + * Falls back to software if the syscall fails or extensions are not available. + */ +static bool +pg_crc32c_riscv_zbc_available(void) +{ +#if defined(__linux__) && defined(__riscv) && (__riscv_xlen == 64) + struct riscv_hwprobe pair = {.key = RISCV_HWPROBE_KEY_IMA_EXT_0}; + + /* + * Make the syscall. If it fails (e.g., old kernel, non-Linux), fall back + * to software. + */ + if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0) + return false; + + /* + * Check if either Zbc (general bitmanip carry-less) or Zbkc (crypto + * carry-less) is available. Both provide clmul/clmulh instructions. + */ + return (pair.value & (RISCV_HWPROBE_EXT_ZBC | RISCV_HWPROBE_EXT_ZBKC)) != 0; +#else + /* Not on RISC-V Linux, or not 64-bit - use software fallback */ + return false; +#endif +} + +/* + * This gets called on the first call. It replaces the function pointer + * so that subsequent calls are routed directly to the chosen implementation. 
+ */ +static pg_crc32c +pg_comp_crc32c_choose(pg_crc32c crc, const void *data, size_t len) +{ + if (pg_crc32c_riscv_zbc_available()) + pg_comp_crc32c = pg_comp_crc32c_riscv_zbc; + else + pg_comp_crc32c = pg_comp_crc32c_sb8; + + return pg_comp_crc32c(crc, data, len); +} + +pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len) = pg_comp_crc32c_choose; diff --git a/src/port/pg_crc32c_riscv_zbc.c b/src/port/pg_crc32c_riscv_zbc.c new file mode 100644 index 00000000000..9eb845dca69 --- /dev/null +++ b/src/port/pg_crc32c_riscv_zbc.c @@ -0,0 +1,257 @@ +/*------------------------------------------------------------------------- + * + * pg_crc32c_riscv_zbc.c + * Compute CRC-32C checksum using RISC-V Zbc carry-less multiply instructions + * + * This implementation uses the RISC-V Zbc (or Zbkc) extension for hardware- + * accelerated CRC-32C computation. It uses carry-less multiplication (clmul + * and clmulh) with polynomial folding and Barrett reduction. + * + * The algorithm is based on Google Abseil's implementation: + * https://github.com/abseil/abseil-cpp/pull/1986 + * File: absl/crc/internal/crc_riscv.cc + * + * Copyright 2025 The Abseil Authors + * Licensed under the Apache License, Version 2.0 + * Adapted for PostgreSQL under PostgreSQL license + * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/port/pg_crc32c_riscv_zbc.c + * + *------------------------------------------------------------------------- + */ +#include "c.h" + +#ifdef WORDS_BIGENDIAN +#error "RISC-V Zbc CRC implementation does not support big-endian systems" +#endif + +#include "port/pg_crc32c.h" + +/* + * 128-bit value for polynomial arithmetic + */ +typedef struct +{ + uint64 lo; + uint64 hi; +} V128; + +/* + * Carry-less multiply instructions from RISC-V Zbc/Zbkc extension + */ +static inline uint64 +pg_clmul(uint64 a, uint64 b) +{ + uint64 _res; + + 
__asm__( + " clmul %0, %1, %2\n" +: "=r"(_res) +: "r"(a), "r"(b)); + + return _res; +} + +static inline uint64 +pg_clmulh(uint64 a, uint64 b) +{ + uint64 _res; + + __asm__( + " clmulh %0, %1, %2" +: "=r"(_res) +: "r"(a), "r"(b)); + + return _res; +} + +static inline V128 +pg_clmul128(uint64 a, uint64 b) +{ + V128 result; + + result.lo = pg_clmul(a, b); + result.hi = pg_clmulh(a, b); + return result; +} + +/* + * 128-bit operations + */ +static inline V128 +pg_v128_xor(V128 a, V128 b) +{ + V128 result; + + result.lo = a.lo ^ b.lo; + result.hi = a.hi ^ b.hi; + return result; +} + +static inline V128 +pg_v128_and_mask32(V128 a) +{ + V128 result; + + result.lo = a.lo & UINT64CONST(0x00000000FFFFFFFF); + result.hi = a.hi & UINT64CONST(0x00000000FFFFFFFF); + return result; +} + +static inline V128 +pg_v128_shift_right64(V128 a) +{ + V128 result; + + result.lo = a.hi; + result.hi = 0; + return result; +} + +static inline V128 +pg_v128_shift_right32(V128 a) +{ + V128 result; + + result.lo = (a.lo >> 32) | (a.hi << 32); + result.hi = (a.hi >> 32); + return result; +} + +static inline V128 +pg_v128_load(const unsigned char *p) +{ + V128 result; + + /* + * Load 16 bytes as two 64-bit values. Use direct loads like Abseil + * reference implementation. RISC-V is always little-endian so no byte + * swapping needed. + */ + result.lo = *(const uint64 *) p; + result.hi = *(const uint64 *) (p + 8); + return result; +} + +/* + * CRC-32C (Castagnoli) polynomial folding constants. These are computed + * for the polynomial 0x1EDC6F41 (normal form) or 0x82F63B78 (reflected). 
+ */ +static const uint64 kK5 = UINT64CONST(0x0f20c0dfe); /* Folding constant */ +static const uint64 kK6 = UINT64CONST(0x14cd00bd6); /* Folding constant */ +static const uint64 kK7 = UINT64CONST(0x0dd45aab8); /* 64->32 reduction */ +static const uint64 kP1 = UINT64CONST(0x105ec76f0); /* Barrett reduction */ +static const uint64 kP2 = UINT64CONST(0x0dea713f1); /* Barrett reduction */ + +/* + * Core CRC-32C computation using carry-less multiplication. + * + * Input: CRC in working form (already inverted with ~crc) + * Output: CRC in working form (still inverted) + * + * Precondition: len >= 32 and len % 16 == 0 + */ +static uint32 +pg_crc32c_clmul_core(uint32 crc_inverted, const unsigned char *buf, uint64 len) +{ + V128 x; + + /* Load first 16-byte block and XOR with inverted CRC */ + x = pg_v128_load(buf); + x.lo ^= (uint64) crc_inverted; + buf += 16; + len -= 16; + + /* Fold 16-byte blocks into 128-bit accumulator */ + while (len >= 16) + { + V128 block = pg_v128_load(buf); + V128 lo = pg_clmul128(x.lo, kK5); + V128 hi = pg_clmul128(x.hi, kK6); + + x = pg_v128_xor(pg_v128_xor(lo, hi), block); + buf += 16; + len -= 16; + } + + /* Reduce 128-bit to 64-bit */ + { + V128 tmp = pg_clmul128(kK6, x.lo); + + x = pg_v128_xor(pg_v128_shift_right64(x), tmp); + } + + /* Reduce 64-bit to 32-bit */ + { + V128 tmp = pg_v128_shift_right32(x); + + x = pg_v128_and_mask32(x); + x = pg_clmul128(kK7, x.lo); + x = pg_v128_xor(x, tmp); + } + + /* Barrett reduction to final 32-bit CRC */ + { + V128 tmp = pg_v128_and_mask32(x); + + tmp = pg_clmul128(kP2, tmp.lo); + tmp = pg_v128_and_mask32(tmp); + tmp = pg_clmul128(kP1, tmp.lo); + x = pg_v128_xor(x, tmp); + } + + /* Extract result from second 32-bit lane */ + return (uint32) ((x.lo >> 32) & UINT64CONST(0xFFFFFFFF)); +} + +/* + * Main CRC-32C computation function with RISC-V Zbc acceleration + */ +pg_crc32c +pg_comp_crc32c_riscv_zbc(pg_crc32c crc, const void *data, size_t len) +{ + const unsigned char *p = data; + const size_t kMinLen = 
32;
+	const size_t kChunkLen = 16;
+	size_t		tail;
+
+	/* Use software fallback for small buffers */
+	if (len < kMinLen)
+		return pg_comp_crc32c_sb8(crc, data, len);
+
+	/*
+	 * Process the tail (len % 16 bytes) with the software implementation
+	 * first, following the Abseil approach, so that the length handed to
+	 * the hardware folding loop is an exact multiple of 16 bytes.
+	 */
+	tail = len % kChunkLen;
+	if (tail)
+	{
+		crc = pg_comp_crc32c_sb8(crc, p, tail);
+		p += tail;
+		len -= tail;
+	}
+
+	/*
+	 * Process remaining bytes (now a multiple of 16) with hardware. The core
+	 * algorithm requires at least 32 bytes.
+	 */
+	if (len >= 32)
+	{
+		/*
+		 * The Abseil core algorithm expects to receive 0xFFFFFFFF as the
+		 * initial CRC value (corresponding to Abseil's initial value of 0
+		 * after inversion). PostgreSQL's convention already passes 0xFFFFFFFF
+		 * initially, so pass it directly. The core returns a value that needs
+		 * final XOR with 0xFFFFFFFF (done by the caller).
+		 */
+		crc = pg_crc32c_clmul_core(crc, p, len);
+	}
+
+	return crc;
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 112653c1680..49baf0bad39 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3330,6 +3330,7 @@ VirtualTransactionId
 VirtualTupleTableSlot
 VolatileFunctionStatus
 Vsrt
+V128
 WAIT_ORDER
 WALAvailability
 WALInsertLock
-- 
2.51.2
