On Wed, Mar 13, 2025 at 12:02:07AM +0000, nathandboss...@gmail.com wrote:
> Those are nice results.  I'm a little worried about the Neon implementation
> for smaller inputs since it uses a per-byte loop for the remaining bytes,
> though.  If we can ensure there's no regression there, I think this patch
> will be in decent shape.

True, the Neon implementation in patch v6 did perform worse for smaller
inputs. This is fixed in v7, where we added pg_popcount64 to speed up the
processing of smaller inputs and of the remaining bytes. Also, as with SVE,
the neon-2reg version performed better than neon-1reg, but neon-4reg showed
no further improvement.

The table below compares patches v6 and v7 on m7g.4xlarge.

Query: SELECT drive_popcount(1000000, 8-byte words);

 8-byte words | master | v6-neon-2reg | v7-neon-2reg | v7-sve
--------------+--------+--------------+--------------+--------
            1 |  4.051 |        6.239 |        3.431 |  3.343
            2 |  4.429 |       10.773 |        3.899 |  3.335
            3 |  4.844 |       14.066 |        4.398 |  3.348
            4 |  5.324 |        3.342 |        3.663 |  3.365
            5 |  5.900 |        7.108 |        4.349 |  4.441
            6 |  6.478 |       11.720 |        4.851 |  4.441
            7 |  7.192 |       15.686 |        5.551 |  4.447
            8 |  8.016 |        4.288 |        4.367 |  4.013

We modified [0] to get the numbers for pg_popcount_masked:

 8-byte words | master  | v7-neon-2reg | v7-sve
--------------+---------+--------------+---------
            1 |   4.289 |        4.202 |   3.827
            2 |   4.993 |        4.662 |   3.823
            3 |   5.981 |        5.459 |   3.834
            4 |   6.438 |        4.230 |   3.846
            5 |   7.169 |        5.236 |   5.072
            6 |   7.949 |        5.922 |   5.106
            7 |   9.130 |        6.535 |   5.060
            8 |   9.796 |        5.328 |   4.718
          512 | 387.543 |      182.801 |  77.077
         1024 | 760.644 |      360.660 | 150.519

[0] https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=g...@mail.gmail.com

-Chiranmoy
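
For readers following the tail-handling discussion above, here is a minimal
sketch of the general idea: consume the leftover bytes eight at a time with a
64-bit popcount, and fall back to a per-byte loop only for the final few
bytes. This is not the code from the v7 patch. The function name is
hypothetical, and the GCC/Clang builtins __builtin_popcountll and
__builtin_popcount are used here as an assumed stand-in for pg_popcount64 and
the byte lookup table.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.  Counts the set bits in
 * buf[0 .. bytes-1] without any vector instructions.
 */
static uint64_t
popcount_words_then_bytes(const char *buf, size_t bytes)
{
	uint64_t	popcnt = 0;

	/* Process 8 bytes per iteration instead of one byte at a time. */
	while (bytes >= sizeof(uint64_t))
	{
		uint64_t	word;

		/* memcpy avoids unaligned-access problems on any platform. */
		memcpy(&word, buf, sizeof(word));
		popcnt += (uint64_t) __builtin_popcountll(word);
		buf += sizeof(word);
		bytes -= sizeof(word);
	}

	/* At most 7 bytes remain, so the per-byte loop is now cheap. */
	while (bytes-- > 0)
		popcnt += (uint64_t) __builtin_popcount((unsigned char) *buf++);

	return popcnt;
}
```

In a vectorized implementation, a loop of this shape would only run after the
main SIMD loop has consumed all full vector-sized blocks, so it bounds the
scalar work to a handful of word-sized steps plus fewer than eight single
bytes.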
v7-0001-SVE-and-NEON-support-for-pg_popcount.patch