I've released a new paper at https://arxiv.org/abs/2412.16398, and this
was the easiest algorithm from it to implement. It gives a 5-20% speedup for
SSE/AVX1, with diminishing returns for AVX2/AVX512.

AMD Ryzen:

$ time ./cksum_bench_pclmul 262144 100000
Hash: 0AF85340, length: 262144

real    0m2.156s
user    0m2.196s
sys     0m0.000s
$ time ./cksum_bench_pclmul_chorba 262144 100000
Hash: 0AF85340, length: 262144

real    0m1.920s
user    0m1.949s
sys     0m0.000s
$ time ./cksum_bench_avx2 262144 100000
Hash: 0AF85340, length: 262144

real    0m1.419s
user    0m1.427s
sys     0m0.000s
$ time ./cksum_bench_avx2_chorba 262144 100000
Hash: 0AF85340, length: 262144

real    0m1.300s
user    0m1.323s
sys     0m0.000s

Intel Ice Lake:

$ time ./cksum_bench_avx512 262144 100000
Hash: 0AF85340, length: 262144

real    0m1.475s
user    0m1.473s
sys     0m0.002s
$ time ./cksum_bench_avx512_chorba 262144 100000
Hash: 0AF85340, length: 262144

real    0m1.450s
user    0m1.449s
sys     0m0.002s
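
For anyone who wants relative gains rather than raw times, here is a small
sketch that derives them from the "real" times above. The labels and the
speedup_percent helper are only illustrative names, and defining the gain as
baseline/chorba - 1 is my assumption about how to summarise the runs; it is
not part of the patch.

```python
# Wall-clock ("real") times in seconds, copied from the runs above:
# (baseline, chorba variant). The keys are illustrative labels only.
timings = {
    "pclmul (Ryzen)": (2.156, 1.920),
    "avx2 (Ryzen)": (1.419, 1.300),
    "avx512 (Ice Lake)": (1.475, 1.450),
}


def speedup_percent(baseline, chorba):
    """Throughput gain of the chorba build over the baseline, in percent."""
    return (baseline / chorba - 1.0) * 100.0


# Assumed definition of "speedup": how much more data per second the
# chorba build processes for the same buffer size and iteration count.
results = {name: speedup_percent(b, c) for name, (b, c) in timings.items()}

for name, pct in results.items():
    print(f"{name:18s} {pct:5.1f}%")
```

These are single runs, so the smaller differences (the AVX512 one in
particular) may be within measurement noise.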

Attachment: 0001-crc-Add-PCLMUL-implementation.patch
Description: Binary data
