On Fri, Mar 07, 2025 at 03:20:07AM +0000, chiranmoy.bhattacha...@fujitsu.com wrote:
> Sounds good. Let us know your findings.

Alright, here's what I saw on an R8g for drive_popcount(1000000, N):

8-byte words       master    v5-no-sve       v5-sve      v5-4reg
           1     2.540 ms     2.170 ms     1.807 ms     2.178 ms
           2     2.534 ms     2.180 ms     1.804 ms     2.167 ms
           4     3.988 ms     3.240 ms     1.590 ms     2.879 ms
           8     5.033 ms     4.672 ms     2.175 ms     2.525 ms
          16     8.252 ms    10.916 ms     3.235 ms     3.588 ms
          32    20.932 ms    22.883 ms     5.134 ms     5.395 ms
          64    40.446 ms    45.668 ms     9.817 ms     9.285 ms
         128    66.087 ms    91.386 ms    20.072 ms    17.175 ms
         256   153.852 ms   182.594 ms    40.447 ms    32.212 ms
         512   246.271 ms   300.941 ms    87.116 ms    60.729 ms
        1024   487.180 ms   607.289 ms   180.574 ms   116.948 ms
        2048   969.335 ms  1223.838 ms   363.595 ms   232.575 ms
        4096  1934.646 ms  2472.154 ms   729.525 ms   459.495 ms

(Note that there should be no need to test anything smaller than 8 bytes
because we use the inline version in pg_bitutils.h in that case.)

v5-no-sve is the result of using a function pointer, but pointing to the
"slow" versions instead of the SVE version.  v5-sve is the result of the
latest patch in this thread on a machine with SVE support, and v5-4reg is
the result of the latest patch in this thread modified to process 4
registers' worth of data at a time.

The biggest takeaways for me are as follows:

* The 4-register version does show some nice improvements as the data
  grows.

* Machines without SVE will likely incur a rather sizable regression from
  the newly introduced function pointer.

For the latter point, I think we should consider trying to add a separate
Neon implementation that we use as a fallback for machines that don't have
SVE.  My understanding is that Neon is virtually universally supported on
64-bit Arm gear, so that will not only help offset the function pointer
overhead but may even improve performance for a much wider set of machines.

--
nathan
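
P.S. To make the fallback idea concrete, here is a rough sketch of what dispatching through a function pointer to a Neon routine might look like. It is not taken from any patch in this thread. The names popcount_portable, popcount_neon, and choose_popcount are hypothetical, the SVE branch is left out entirely, and it assumes a GCC- or Clang-compatible compiler for the __builtin_popcount* calls. On non-Arm builds it simply falls back to the portable loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1		/* hypothetical macro, for this sketch only */
#endif

/* Portable path: one 8-byte word at a time, then any trailing bytes. */
static uint64_t
popcount_portable(const char *buf, size_t bytes)
{
	uint64_t	popcnt = 0;

	while (bytes >= 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* avoids unaligned access */
		popcnt += (uint64_t) __builtin_popcountll(word);
		buf += 8;
		bytes -= 8;
	}
	for (; bytes > 0; bytes--)
		popcnt += (uint64_t) __builtin_popcount((unsigned char) *buf++);

	return popcnt;
}

#ifdef HAVE_NEON_SKETCH
/* Neon path: one 16-byte register at a time, portable loop for the tail. */
static uint64_t
popcount_neon(const char *buf, size_t bytes)
{
	uint64_t	popcnt = 0;

	while (bytes >= 16)
	{
		uint8x16_t	vec = vld1q_u8((const uint8_t *) buf);

		/* vcntq_u8 counts bits per byte; the 16 lanes sum to at most 128 */
		popcnt += vaddvq_u8(vcntq_u8(vec));
		buf += 16;
		bytes -= 16;
	}
	return popcnt + popcount_portable(buf, bytes);
}
#endif

typedef uint64_t (*popcount_fn) (const char *buf, size_t bytes);

/*
 * Pick an implementation.  A real version would presumably check for SVE
 * at run time first and only fall through to Neon when it is missing.
 */
static popcount_fn
choose_popcount(void)
{
#ifdef HAVE_NEON_SKETCH
	return popcount_neon;
#else
	return popcount_portable;
#endif
}
```

A caller would resolve the pointer once, e.g. popcount_fn fn = choose_popcount();, and then call fn(buf, bytes) for each buffer. Whether a single-register loop like this is enough to offset the pointer overhead is exactly what would need measuring.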