On Wed, Mar 18, 2026 at 10:34 AM Haibo Yan <[email protected]> wrote:
>
> Hi John
>
> Thank you for working on this. I had one question about the mixed use of
> intrinsics and inline asm here.
> Since the implementation already uses NEON intrinsics such as vld1q_u64, I
> was wondering why the pmull / pmull2 + eor helpers still need to be inline
> asm rather than intrinsics.
>
> Is that due to compiler/toolchain support, or because the intrinsic-based
> version produced noticeably worse code?

I answered that in the email you replied to, re-quoted here:

> To follow-up for curiosity's sake, [1] says that Apple chips can issue
> PMULL + EOR as a single uop if they are next to each other in the
> instruction stream.

> [1] https://dougallj.github.io/applecpu/firestorm.html

I don't know if that's relevant for current server hardware, so it
could be pointless. I'm personally not a fan of inline assembly, but I
also didn't yet want to put in the effort to alter generated code. I
don't think it would be very hard to do, however.

-- 
John Naylor
Amazon Web Services
