On Wed, Mar 12, 2025 at 10:34:46AM +0000, chiranmoy.bhattacha...@fujitsu.com wrote:
> On Wed, Mar 12, 2025 at 02:41:18AM +0000, nathandboss...@gmail.com wrote:
>
>> v5-no-sve is the result of using a function pointer, but pointing to the
>> "slow" versions instead of the SVE version. v5-sve is the result of the
>> latest patch in this thread on a machine with SVE support, and v5-4reg is
>> the result of the latest patch in this thread modified to process 4
>> register's worth of data at a time.
>
> Nice, I wonder why I did not observe any performance gain in the 4reg
> version. Did you modify the 4reg version code?
>
> One possible explanation is that you used Graviton4 based instances
> whereas I used Graviton3 instances.

Yeah, it looks like the number of vector registers is different [0].

>> For the latter point, I think we should consider trying to add a separate
>> Neon implementation that we use as a fallback for machines that don't have
>> SVE. My understanding is that Neon is virtually universally supported on
>> 64-bit Arm gear, so that will not only help offset the function pointer
>> overhead but may even improve performance for a much wider set of machines.
>
> I have added the NEON implementation in the latest patch.
>
> Here are the numbers for drive_popcount(1000000, 1024) on m7g.8xlarge:
> Scalar - 692ms
> Neon - 298ms
> SVE - 112ms

Those are nice results. I'm a little worried about the Neon implementation
for smaller inputs since it uses a per-byte loop for the remaining bytes,
though. If we can ensure there's no regression there, I think this patch
will be in decent shape.

[0] https://github.com/aws/aws-graviton-getting-started?tab=readme-ov-file#building-for-graviton

-- 
nathan
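
To make the small-input concern concrete, here is a rough sketch of one way
the tail could be handled without falling straight into a per-byte loop. It
is not the code from the patch: the name popcount_sketch is hypothetical, the
16-byte Neon loop is only compiled on AArch64, and __builtin_popcountll /
__builtin_popcount assume GCC or Clang. The idea is to consume leftover bytes
8 at a time first, so that at most 7 bytes ever reach the per-byte loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper for illustration; not the function from the patch. */
static uint64_t
popcount_sketch(const char *buf, size_t bytes)
{
	uint64_t	popcnt = 0;

#if defined(__aarch64__)
	/* Main loop: one 128-bit Neon register (16 bytes) per iteration. */
	while (bytes >= 16)
	{
		uint8x16_t	v = vld1q_u8((const uint8_t *) buf);

		/* Per-byte counts are at most 8, so the sum of 16 fits in 8 bits. */
		popcnt += vaddvq_u8(vcntq_u8(v));
		buf += 16;
		bytes -= 16;
	}
#endif

	/* Tail, step 1: 8 bytes at a time (memcpy avoids unaligned access). */
	while (bytes >= 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));
		popcnt += (uint64_t) __builtin_popcountll(word);
		buf += 8;
		bytes -= 8;
	}

	/* Tail, step 2: at most 7 bytes are left for the per-byte loop. */
	while (bytes > 0)
	{
		popcnt += (uint64_t) __builtin_popcount((unsigned char) *buf);
		buf++;
		bytes--;
	}

	return popcnt;
}
```

Whether this actually avoids a regression for short inputs would still need
to be measured, e.g. with drive_popcount() at small buffer sizes.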