I've released a new paper at https://arxiv.org/abs/2412.16398, and this was the easiest algorithm from it to implement. It gives a 5-20% speedup for SSE/AVX1, with diminishing returns for AVX2/AVX512.
AMD Ryzen:

$ time ./cksum_bench_pclmul 262144 100000
Hash: 0AF85340, length: 262144

real 0m2.156s
user 0m2.196s
sys 0m0.000s

$ time ./cksum_bench_pclmul_chorba 262144 100000
Hash: 0AF85340, length: 262144

real 0m1.920s
user 0m1.949s
sys 0m0.000s

$ time ./cksum_bench_avx2 262144 100000
Hash: 0AF85340, length: 262144

real 0m1.419s
user 0m1.427s
sys 0m0.000s

$ time ./cksum_bench_avx2_chorba 262144 100000
Hash: 0AF85340, length: 262144

real 0m1.300s
user 0m1.323s
sys 0m0.000s

Ice Lake:

$ time ./cksum_bench_avx512 262144 100000
Hash: 0AF85340, length: 262144

real 0m1.475s
user 0m1.473s
sys 0m0.002s

$ time ./cksum_bench_avx512_chorba 262144 100000
Hash: 0AF85340, length: 262144

real 0m1.450s
user 0m1.449s
sys 0m0.002s
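To put the timings above in relative terms, here is a small sketch (not part of the patch) that turns each pair of "real" times into a percentage speedup. The names `runs` and `speedup_percent` are illustrative only, and the figures come from the single runs shown above, so treat them as rough.

```python
# Wall-clock ("real") seconds copied from the benchmark runs above:
# (baseline build, chorba build).
runs = {
    "pclmul (AMD Ryzen)": (2.156, 1.920),
    "avx2 (AMD Ryzen)": (1.419, 1.300),
    "avx512 (Ice Lake)": (1.475, 1.450),
}


def speedup_percent(baseline, chorba):
    """Throughput gain of the chorba build over the baseline, in percent."""
    return (baseline / chorba - 1.0) * 100.0


for name, (baseline, chorba) in runs.items():
    print(f"{name}: {speedup_percent(baseline, chorba):.1f}% faster")
```

On these numbers the gain is about 12.3% for PCLMUL, 9.2% for AVX2, and 1.7% for AVX512.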
0001-crc-Add-PCLMUL-implementation.patch
Description: Binary data