Hi John Thank yo for working on this. I had one question about the mixed use of intrinsics and inline asm here.
> On Jan 12, 2026, at 1:27 AM, John Naylor <[email protected]> wrote: > > On Wed, May 14, 2025 I wrote: >> >> We did something similar for x86 for v18, and here is some progress >> towards Arm support. > > Coming back to this, since there's been recent interest in Arm support. > > v2 is a rebase, with a few changes. > > - I simplified it by leaving out the inlining for "assume CRC" builds, > since I wanted to avoid alignment considerations if I can. I think > always indirecting through a pointer will have less risk of > regressions in a realistic setting than for x86 since Arm chips > typically have low latency for carryless multiplication instructions. > With just a bit of code we can still use the direct call for small > constant inputs, so I did that to avoid regressions under WAL insert > lock. > > - One coding idiom for a vector literal in the generated code was > giving pgindent indigestion, I so rewrote it using Neon intrinsics and > verified it in Godbolt. > >> 0002: Like 3c6e8c12389 and in fact uses the same program to generate >> the code, by specifying Neon instructions with the Arm "crypto" >> extension instead. There are some interesting differences from x86 >> here as well: >> - The upstream implementation chose to use inline assembly instead of >> intrinsics for some reason. I initially thought that was a way to get >> broader compiler support, but it turns out you still need to pass the >> relevant flags to get the assembly to link. Since the implementation already uses NEON intrinsics such as vld1q_u64, I was wondering why the pmull / pmull2 + eor helpers still need to be inline asm rather than intrinsics. Is that due to compiler/toolchain support, or because the intrinsic-based version produced noticeably worse code? > To follow-up for curiosity's sake, [1] says that Apple chips can issue > PMULL + EOR as a single uop if they are next to each other in the > instruction stream. > >> - I only have Meson support for now, since I used MacOS on CI to test. >> That OS and compiler combination apparently targets the CRC extension, >> but the PMULL instruction runtime check uses Linux-only headers, I >> believe, so previously I hacked the choose function to return true for >> testing. The choose function in 0002 is untested in this form. > > This is still true, but now the CI hack lives in a separate > not-for-commit patch for clarity. > > autoconf support is a WIP, and I will share that after I do some > testing on an Arm Linux instance. > > [1] https://dougallj.github.io/applecpu/firestorm.html > > -- > John Naylor > Amazon Web Services > <v2-0001-Compute-CRC32C-on-ARM-using-the-Crypto-Extension-.patch><v2-0002-Force-testing-on-MacOS-CI-XXX-not-for-commit.patch> Regards Haibo
