That's interesting. I'm having issues across cfarm, as the machines often don't have the coreutils dependencies and won't work with the version of clib I'm building against.
Are you comparing the user times or the real times? IMO the user time is the important part, as the sys part of the timing just depends on disk I/O. The high I/O (and the fact that we're only reading in 64KB chunks) means that there's going to be large variance, but I'm still seeing a consistent improvement over 5-10 runs.

On amazon EC2 t3 (Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz)

ubuntu@ip-172-31-40-136:~$ time ./cksum_pclmul --debug file
cksum_pclmul: using pclmul hardware support
4215202376 4294967296 file
real 0m3.129s
user 0m0.422s
sys 0m2.705s

ubuntu@ip-172-31-40-136:~$ time ./cksum_pclmul --debug file
cksum_pclmul: using pclmul hardware support
4215202376 4294967296 file
real 0m3.025s
user 0m0.394s
sys 0m2.630s

ubuntu@ip-172-31-40-136:~$ time ./cksum_pclmul --debug file
cksum_pclmul: using pclmul hardware support
4215202376 4294967296 file
real 0m3.705s
user 0m0.517s
sys 0m3.187s

ubuntu@ip-172-31-40-136:~$ time ./cksum_pclmul --debug file
cksum_pclmul: using pclmul hardware support
4215202376 4294967296 file
real 0m3.334s
user 0m0.431s
sys 0m2.903s

ubuntu@ip-172-31-40-136:~$ time ./cksum_pclmul --debug file
cksum_pclmul: using pclmul hardware support
4215202376 4294967296 file
real 0m3.250s
user 0m0.420s
sys 0m2.829s

ubuntu@ip-172-31-40-136:~$ time ./cksum_pclmul_chorba --debug file
cksum_pclmul_chorba: avx512 support not detected
cksum_pclmul_chorba: using pclmul hardware support
4215202376 4294967296 file
real 0m2.888s
user 0m0.368s
sys 0m2.518s

ubuntu@ip-172-31-40-136:~$ time ./cksum_pclmul_chorba --debug file
cksum_pclmul_chorba: avx512 support not detected
cksum_pclmul_chorba: using pclmul hardware support
4215202376 4294967296 file
real 0m3.032s
user 0m0.366s
sys 0m2.665s

ubuntu@ip-172-31-40-136:~$ time ./cksum_pclmul_chorba --debug file
cksum_pclmul_chorba: avx512 support not detected
cksum_pclmul_chorba: using pclmul hardware support
4215202376 4294967296 file
real 0m2.938s
user 0m0.347s
sys 0m2.583s

ubuntu@ip-172-31-40-136:~$ time ./cksum_pclmul_chorba --debug file
cksum_pclmul_chorba: avx512 support not detected
cksum_pclmul_chorba: using pclmul hardware support
4215202376 4294967296 file
real 0m3.148s
user 0m0.419s
sys 0m2.728s

ubuntu@ip-172-31-40-136:~$ time ./cksum_pclmul_chorba --debug file
cksum_pclmul_chorba: avx512 support not detected
cksum_pclmul_chorba: using pclmul hardware support
4215202376 4294967296 file
real 0m2.808s
user 0m0.344s
sys 0m2.463s

cfarm13 (Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz)

pljeskavica@cfarm13:~/coreutils$ time ./cksum_pclmul file
4215202376 4294967296 file
real 0m1.103s
user 0m0.436s
sys 0m0.667s

pljeskavica@cfarm13:~/coreutils$ time ./cksum_pclmul file
4215202376 4294967296 file
real 0m1.320s
user 0m0.464s
sys 0m0.855s

pljeskavica@cfarm13:~/coreutils$ time ./cksum_pclmul file
4215202376 4294967296 file
real 0m1.641s
user 0m0.416s
sys 0m1.224s

pljeskavica@cfarm13:~/coreutils$ time ./cksum_pclmul file
4215202376 4294967296 file
real 0m1.714s
user 0m0.496s
sys 0m1.214s

pljeskavica@cfarm13:~/coreutils$ time ./cksum_pclmul file
4215202376 4294967296 file
real 0m1.107s
user 0m0.457s
sys 0m0.650s

pljeskavica@cfarm13:~/coreutils$ time ./cksum_pclmul_chorba file
4215202376 4294967296 file
real 0m1.091s
user 0m0.485s
sys 0m0.606s

pljeskavica@cfarm13:~/coreutils$ time ./cksum_pclmul_chorba file
4215202376 4294967296 file
real 0m1.083s
user 0m0.483s
sys 0m0.600s

pljeskavica@cfarm13:~/coreutils$ time ./cksum_pclmul_chorba file
4215202376 4294967296 file
real 0m1.102s
user 0m0.403s
sys 0m0.699s

pljeskavica@cfarm13:~/coreutils$ time ./cksum_pclmul_chorba file
4215202376 4294967296 file
real 0m1.081s
user 0m0.412s
sys 0m0.669s

pljeskavica@cfarm13:~/coreutils$ time ./cksum_pclmul_chorba file
4215202376 4294967296 file
real 0m1.077s
user 0m0.412s
sys 0m0.665s

If anyone has an i7 server I can test on, I'd be happy to get more results.
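To compare the user times by more than eyeballing them, something like the following Python sketch could average the runs and report the relative change. The helper names (user_seconds, improvement) are made up for illustration and are not part of coreutils. The numbers are copied from the transcripts above, and five runs per binary is a small sample, so treat the result as indicative only.

```python
import re
from statistics import mean

# bash `time` prints lines such as "user 0m0.422s"; capture minutes and seconds.
USER_RE = re.compile(r"user\s+(\d+)m([\d.]+)s")

def user_seconds(transcript: str) -> list[float]:
    """Return every user time found in a pasted `time` transcript, in seconds."""
    return [int(m) * 60 + float(s) for m, s in USER_RE.findall(transcript)]

def improvement(baseline: list[float], candidate: list[float]) -> float:
    """Relative reduction in mean user time, as a percentage of the baseline."""
    return 100.0 * (mean(baseline) - mean(candidate)) / mean(baseline)

# User times (seconds) copied from the runs above.
ec2_pclmul = [0.422, 0.394, 0.517, 0.431, 0.420]
ec2_chorba = [0.368, 0.366, 0.347, 0.419, 0.344]
cfarm13_pclmul = [0.436, 0.464, 0.416, 0.496, 0.457]
cfarm13_chorba = [0.485, 0.483, 0.403, 0.412, 0.412]

if __name__ == "__main__":
    print(f"EC2 t3:  {improvement(ec2_pclmul, ec2_chorba):.1f}%")
    print(f"cfarm13: {improvement(cfarm13_pclmul, cfarm13_chorba):.1f}%")
```

On these numbers the mean user time drops by about 15.6% on the EC2 t3 instance and about 3.3% on cfarm13. user_seconds() can be fed a pasted transcript directly instead of hand-copied lists.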
I had another change I was working on earlier that's also a 5-10% improvement, which can likewise get lost in the noise of the variance. I can combine the two if we need a stronger improvement to consider taking this change.

On Wed, 25 Dec 2024 at 00:52, Pádraig Brady <p...@draigbrady.com> wrote:

> On 24/12/2024 20:43, Sam Russell wrote:
> > ah sorry, clicked on the wrong patch file, here is the real one
> >
> > On Tue, Dec 24, 2024, 19:36 Pádraig Brady <p...@draigbrady.com> wrote:
> >
> > On 24/12/2024 16:03, Sam Russell wrote:
> > > I've released a new paper here https://arxiv.org/abs/2412.16398 and this
> > > was the easiest algorithm to implement from it. It gets a 5-20% speedup for
> > > SSE/AVX1 and diminishing returns for AVX2/AVX512
> >
> > Ignoring this as looks applicable to gnulib not coreutils,
> > and I think you've already landed this in gnulib.
>
> Ah thanks,
> However this is a regression on i7-5600U at least:
>
> $ truncate -s4G file
>
> $ time src/cksum --debug file
> cksum: avx512 support not detected
> cksum: avx2 support not detected
> cksum: using pclmul hardware support
> 4215202376 4294967296 file
> real 0m1.445s
> user 0m0.250s
> sys 0m1.132s
>
> $ git am < ~/0001-cksum-Implement-Chorba-algorithm-in-PCLMUL.patch
> $ make
>
> $ time src/cksum --debug file
> cksum: avx512 support not detected
> cksum: avx2 support not detected
> cksum: using pclmul hardware support
> 4215202376 4294967296 file
> real 0m1.969s
> user 0m0.263s
> sys 0m1.683s
>
> (I've run this a few times, with similar timings).
>
> cheers,
> Pádraig