Hello, Attached v3 which is same as v2 with the added PCLMULQDQ runtime CPUID check.
> > I ran the same benchmark drive_crc32c with the postgres infrastructure and > found that your v2 sse42 version from corsix is slower than > pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes. > > That matches my findings as well. Never mind, I was building using the Makefile which doesn’t seem to add any optimization flag by default. I switched to using meson which uses -O2 and benchmarked using pgbench (using your script) and this behavior goes away on my TGL. Here is what I measure with your v2 (and v3): | bytes | master (ms) | sse4.2-v2 (ms) | ratio | | 64 | 9.627 | 6.306 | 1.52 | | 80 | 10.976 | 6.662 | 1.64 | | 96 | 12.411 | 8.212 | 1.51 | | 112 | 13.871 | 9.403 | 1.47 | | 128 | 15.283 | 7.724 | 1.97 | | 144 | 16.715 | 9.173 | 1.82 | | 160 | 18.18 | 11.292 | 1.60 | | 176 | 19.847 | 12.606 | 1.57 | | 192 | 22.043 | 10.16 | 2.16 | | 208 | 24.261 | 11.699 | 2.07 | | 224 | 26.63 | 13.607 | 1.95 | | 240 | 28.994 | 14.721 | 1.96 | | 256 | 31.418 | 13.132 | 2.39 | > On my machine that still regresses compared to master in that range (although > by > not as much) so I still think 128 bytes is the right threshold. On my TGL, buffer sizes as small as 64 bytes see performance benefits. > The effect of -O3 with gcc14.2 is that the single-block loop (after the > 4-block loop) > is unrolled. Unrolling adds branches and binary space, so it'd be nice to > avoid that > even for systems that build with -O3. Agreed. My perf data shows -O2 is just as good. > Okay, Nehalem is 17 years old, and the additional cpuid check would still > work on > hardware 14-15 years old, so I think it's fine to bump the requirement for > runtime > hardware support. Sounds good. I updated the runtime check to include PCLMULQDQ. New algorithm will run only on Westmere and newer CPU. Raghuveer
v3-0001-Add-more-test-coverage-for-crc32c.patch
Description: v3-0001-Add-more-test-coverage-for-crc32c.patch
v3-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch
Description: v3-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch
v3-0003-Improve-CRC32C-performance-on-SSE4.2.patch
Description: v3-0003-Improve-CRC32C-performance-on-SSE4.2.patch
v3-0004-Shorter-version-from-corsix.patch
Description: v3-0004-Shorter-version-from-corsix.patch