On Tue, Aug 3, 2021 at 10:43 PM John Naylor <john.nay...@enterprisedb.com> wrote:
> (Side note, but sort of related to #1 above: non-x86 platforms have to
> indirect through a function pointer even though they have no fast
> implementation to make it worth their while. It would be better for them if
> the "slow" implementation was called static inline or at least a direct
> function call, but that's a separate thread.)

+1

I haven't looked into whether we could benefit from it in real use cases, but it seems like it'd also be nice if pg_popcount() were a candidate for auto-vectorisation and inlining. For example, NEON has vector popcount, and for Intel/AMD there is a shuffle-based AVX2 trick that at least Clang produces automatically[1]. We're obstructing that by doing function dispatch at individual word level, and using inline assembler instead of builtins.

[1] https://arxiv.org/abs/1611.07612
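
To illustrate the shape I have in mind, here is a minimal sketch, not the actual pg_popcount() code. The name popcount_buf_sketch() is made up, and it assumes a GCC- or Clang-compatible compiler that provides __builtin_popcountll(). The point is only that the whole-buffer loop sits in one inlinable function with a builtin in its body, so the compiler is free to pick a vector implementation for the target. Whether it actually does depends on the compiler, flags and target, which I haven't measured.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: count the set bits in a buffer with no per-word
 * function-pointer dispatch and no inline assembler.  Assumes the
 * GCC/Clang builtins __builtin_popcountll() and __builtin_popcount().
 */
static inline uint64_t
popcount_buf_sketch(const char *buf, size_t bytes)
{
    uint64_t    count = 0;
    size_t      i;

    /* Word-at-a-time loop; memcpy avoids unaligned access assumptions. */
    while (bytes >= sizeof(uint64_t))
    {
        uint64_t    word;

        memcpy(&word, buf, sizeof(word));
        count += (uint64_t) __builtin_popcountll(word);
        buf += sizeof(word);
        bytes -= sizeof(word);
    }

    /* Handle the remaining tail bytes one at a time. */
    for (i = 0; i < bytes; i++)
        count += (uint64_t) __builtin_popcount((unsigned char) buf[i]);

    return count;
}
```

Callers would pass the whole buffer once, e.g. popcount_buf_sketch(ptr, len), rather than calling through a pointer for each word.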