On Tue, Mar 17, 2026 at 11:52 PM John Naylor <[email protected]> wrote:
> On Wed, Mar 18, 2026 at 10:34 AM Haibo Yan <[email protected]> wrote:
> >
> > Hi John
> >
> > Thank you for working on this. I had one question about the mixed use of
> > intrinsics and inline asm here.
> >
> > Since the implementation already uses NEON intrinsics such as vld1q_u64,
> > I was wondering why the pmull / pmull2 + eor helpers still need to be
> > inline asm rather than intrinsics.
> >
> > Is that due to compiler/toolchain support, or because the
> > intrinsic-based version produced noticeably worse code?
>
> I answered that in the email you replied to, re-quoted here:
>
> > To follow-up for curiosity's sake, [1] says that Apple chips can issue
> > PMULL + EOR as a single uop if they are next to each other in the
> > instruction stream.
> > [1] https://dougallj.github.io/applecpu/firestorm.html
>
> I don't know if that's relevant for current server hardware, so it
> could be pointless. I'm personally not a fan of inline assembly, but I
> also didn't yet want to put in the effort to alter generated code. I
> don't think it would be very hard to do, however.

Thanks, that makes sense as an explanation for why the inline asm is
there today. But it also sounds like this is more of a temporary
implementation choice than a conclusion that intrinsics are unsuitable.

If so, I wonder whether it would be better to treat an intrinsics-based
version as the preferred end state unless benchmarks show a clear
regression.

Regards
Haibo
