On Wed, Mar 12, 2025 at 02:41:18AM +0000, nathandboss...@gmail.com wrote:
> v5-no-sve is the result of using a function pointer, but pointing to the
> "slow" versions instead of the SVE version. v5-sve is the result of the
> latest patch in this thread on a machine with SVE support, and v5-4reg is
> the result of the latest patch in this thread modified to process 4
> register's worth of data at a time.

Nice. I wonder why I did not observe any performance gain with the 4reg
version. Did you modify the 4reg code? One possible explanation is that
you used Graviton4-based instances, whereas I used Graviton3 instances.

> For the latter point, I think we should consider trying to add a separate
> Neon implementation that we use as a fallback for machines that don't have
> SVE. My understanding is that Neon is virtually universally supported on
> 64-bit Arm gear, so that will not only help offset the function pointer
> overhead but may even improve performance for a much wider set of machines.

I have added the NEON implementation in the latest patch. Here are the
numbers for drive_popcount(1000000, 1024) on m7g.8xlarge:

Scalar - 692ms
Neon   - 298ms
SVE    - 112ms

-Chiranmoy
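
For anyone skimming the thread, the fallback arrangement discussed above
can be sketched roughly as below. This is only an illustration and is not
taken from the attached patch: the names popcount_scalar, popcount_neon,
popcount_fn and choose_popcount are hypothetical, the SVE path and its
runtime detection are left out, and NEON is assumed to be available
whenever the compiler targets AArch64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1		/* assumption: NEON is baseline on AArch64 */
#endif

/* Portable fallback (hypothetical name): clear one set bit per iteration. */
static uint64_t
popcount_scalar(const unsigned char *buf, size_t len)
{
	uint64_t	total = 0;

	assert(buf != NULL || len == 0);
	while (len-- > 0)
	{
		unsigned char b = *buf++;

		while (b)
		{
			b &= (unsigned char) (b - 1);
			total++;
		}
	}
	return total;
}

#ifdef SKETCH_HAVE_NEON
/* NEON version (hypothetical name): 16 bytes per iteration. */
static uint64_t
popcount_neon(const unsigned char *buf, size_t len)
{
	uint64_t	total = 0;

	while (len >= 16)
	{
		uint8x16_t	counts = vcntq_u8(vld1q_u8(buf));	/* per-byte bit counts */

		total += vaddlvq_u8(counts);	/* horizontal sum, at most 128 */
		buf += 16;
		len -= 16;
	}
	/* Handle the remaining tail bytes with the scalar loop. */
	return total + popcount_scalar(buf, len);
}
#endif

typedef uint64_t (*popcount_fn) (const unsigned char *buf, size_t len);

/*
 * Pick an implementation once; callers then go through the pointer.
 * A real implementation would also check for SVE at run time here.
 */
static popcount_fn
choose_popcount(void)
{
#ifdef SKETCH_HAVE_NEON
	return popcount_neon;
#else
	return popcount_scalar;
#endif
}
```

A caller would store the result of choose_popcount() once and call through
it, which is where the function pointer overhead mentioned in the quoted
text comes from.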
v6-0001-SVE-and-NEON-support-for-popcount.patch