Hi, I implemented another improvement for cksum to speed it up further. The x86 pclmul hardware instruction can be used for the CRC32 calculation. The patch detects support for it using CPUID, and falls back to the slice-by-8 algorithm if it is not supported. I also added detection in autoconf, so the pclmul code will only be compiled on supported targets.
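To illustrate the detect-and-fall-back idea, here is a minimal sketch in C. It is not the code from the patch: all function names are made up for this example, the "pclmul" routine is only a placeholder that calls the portable code, and the portable code is a simple bit-by-bit loop rather than slice-by-8. It assumes GCC or Clang, which provide `<cpuid.h>` on x86.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
# include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Hypothetical helper: true if the CPU reports PCLMULQDQ support. */
bool have_pclmul(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx >> 1) & 1;   /* CPUID leaf 1, ECX bit 1 = PCLMULQDQ */
#else
    return false;            /* not x86: always use the portable path */
#endif
}

/* Portable bit-by-bit update with the cksum polynomial 0x04C11DB7
   (MSB first). It stands in for slice-by-8 in this sketch. */
uint32_t cksum_update_portable(uint32_t crc, const unsigned char *buf,
                               size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)buf[i] << 24;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
    }
    return crc;
}

/* Placeholder only: a real version would fold the data with carry-less
   multiplication. Here it just calls the portable routine. */
uint32_t cksum_update_pclmul(uint32_t crc, const unsigned char *buf,
                             size_t len)
{
    return cksum_update_portable(crc, buf, len);
}

typedef uint32_t (*cksum_update_fn)(uint32_t, const unsigned char *, size_t);

/* Pick the implementation once, at run time. */
cksum_update_fn select_cksum_update(void)
{
    return have_pclmul() ? cksum_update_pclmul : cksum_update_portable;
}

/* cksum appends the length (least significant byte first) and then
   complements the result. */
uint32_t cksum_finish(uint32_t crc, size_t total_len)
{
    for (; total_len != 0; total_len >>= 8) {
        unsigned char b = total_len & 0xFF;
        crc = cksum_update_portable(crc, &b, 1);
    }
    return ~crc;
}
```

A caller would fetch the function pointer once with select_cksum_update(), call it for every block it reads, and pass the running value and the total byte count to cksum_finish() at the end.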
In my testing, the checksum calculation is sped up about 6x compared to the slice-by-8 algorithm (looking at user time). However, since the time the process spends in the read syscalls behind fread is still the same, the actual real-time speedup is only about 3x. It would be an interesting exercise to try async IO, so you could checksum one block while reading the next. Maybe I will try that one day.

As a side note, x86 also has a crc32 hardware instruction, but it uses a different polynomial than cksum does, so it cannot be used here.

Some benchmarking with a file already in the file cache:

Oldest version (byte by byte):

  ztion@rita:~/coreutils/coreutils-8.32/src$ time ./cksum /disk2/download/bigfile2G
  real    0m7,311s
  user    0m7,039s
  sys     0m0,262s

Slice-by-8 version:

  ztion@rita:~/coreutils/coreutils-8.32/src$ time ./cksum.slice /disk2/download/bigfile2G
  real    0m1,546s
  user    0m1,267s
  sys     0m0,247s

pclmul version (this patch):

  ztion@rita:~/coreutils/coreutils_fork/src$ time ./cksum /disk2/download/bigfile2G
  real    0m0,462s
  user    0m0,191s
  sys     0m0,271s

The patch is at:
https://github.com/coreutils/coreutils/pull/48

-- 
/Kristoffer Brånemyr
