Hello, 

Attached v3 which is same as v2 with the added PCLMULQDQ runtime CPUID check.

> > I ran the same benchmark drive_crc32c with the postgres infrastructure and
> found that your v2 sse42 version from corsix is slower than
> pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.
> 
> That matches my findings as well.

Never mind, I was building using the Makefile which doesn’t seem to add any 
optimization flag by default. I switched to using meson which uses -O2 and 
benchmarked using pgbench (using your script) and this behavior goes away on my 
TGL. Here is what I measure with your v2 (and v3): 

| bytes | master (ms) | sse4.2-v2 (ms) | ratio |
| 64    | 9.627       | 6.306          | 1.52  |
| 80    | 10.976      | 6.662          | 1.64  |
| 96    | 12.411      | 8.212          | 1.51  |
| 112   | 13.871      | 9.403          | 1.47  |
| 128   | 15.283      | 7.724          | 1.97  |
| 144   | 16.715      | 9.173          | 1.82  |
| 160   | 18.18       | 11.292         | 1.60  |
| 176   | 19.847      | 12.606         | 1.57  |
| 192   | 22.043      | 10.16          | 2.16  |
| 208   | 24.261      | 11.699         | 2.07  |
| 224   | 26.63       | 13.607         | 1.95  |
| 240   | 28.994      | 14.721         | 1.96  |
| 256   | 31.418      | 13.132         | 2.39  |

 
> On my machine that still regresses compared to master in that range (although 
> by
> not as much) so I still think 128 bytes is the right threshold.

On my TGL, buffer sizes as small as 64 bytes see performance benefits. 

> The effect of -O3 with gcc14.2 is that the single-block loop (after the 
> 4-block loop)
> is unrolled. Unrolling adds branches and binary space, so it'd be nice to 
> avoid that
> even for systems that build with -O3. 

Agreed. My perf data shows -O2 is just as good. 

> Okay, Nehalem is 17 years old, and the additional cpuid check would still 
> work on
> hardware 14-15 years old, so I think it's fine to bump the requirement for 
> runtime
> hardware support.

Sounds good. I updated the runtime check to include PCLMULQDQ. New algorithm 
will run only on Westmere and newer CPU.

Raghuveer

Attachment: v3-0001-Add-more-test-coverage-for-crc32c.patch
Description: v3-0001-Add-more-test-coverage-for-crc32c.patch

Attachment: v3-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch
Description: v3-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch

Attachment: v3-0003-Improve-CRC32C-performance-on-SSE4.2.patch
Description: v3-0003-Improve-CRC32C-performance-on-SSE4.2.patch

Attachment: v3-0004-Shorter-version-from-corsix.patch
Description: v3-0004-Shorter-version-from-corsix.patch

Reply via email to