On 28/11/2024 22:10, Pádraig Brady wrote:
On 28/11/2024 19:59, Sam Russell wrote:
I've ported the PCLMUL code to ARMv8. It looks to be roughly an 80%
reduction in user time over the generic CPU code on an EC2 T4g instance:
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32
atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
# ubuntu 24.04 package
$ time cksum ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.136s
user 0m2.044s
sys 0m1.691s
# built from head
$ time ./cksum_old ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.217s
user 0m2.022s
sys 0m1.770s
# this patch using only pmull opcodes
$ time ./cksum_neon ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.135s
user 0m0.353s
sys 0m1.819s
# this patch using pmull and pmull2 opcodes
$ time ./cksum_neon2 ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.136s
user 0m0.346s
sys 0m1.819s
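For context on the opcodes named above: PMULL performs a carry-less
(polynomial) multiply of two 64-bit values into a 128-bit result, and
PMULL2 does the same on the upper halves of its source registers. A minimal
portable sketch of that operation follows. It is not taken from the patch,
and clmul64 and struct clmul128 are hypothetical names used only for
illustration:

```c
#include <stdint.h>

/* 128-bit result of a carry-less multiply, split into two halves.
   Hypothetical type, for illustration only.  */
struct clmul128 { uint64_t hi, lo; };

/* Multiply A and B as polynomials over GF(2): partial products are
   combined with XOR instead of addition, so no carries propagate.  */
static struct clmul128
clmul64 (uint64_t a, uint64_t b)
{
  struct clmul128 r = { 0, 0 };
  for (int i = 0; i < 64; i++)
    if ((b >> i) & 1)
      {
        r.lo ^= a << i;
        if (i != 0)              /* avoid an undefined 64-bit shift */
          r.hi ^= a >> (64 - i);
      }
  return r;
}
```

For example, clmul64 (5, 7) gives lo = 27, since
(x^2 + 1)(x^2 + x + 1) = x^4 + x^3 + x + 1.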
Benchmark scripts (I used the crc_sum_stream() function, so the hash output
differs from cksum's, but I have verified it against the pclmul script
functions locally):
$ time ./cksum_bench_old 65536 400000
Hash: 8984ED89, length: 65536
real 0m19.300s
user 0m19.299s
sys 0m0.001s
$ time ./cksum_bench_neon2 65536 400000
Hash: 828F9BAC, length: 65536
real 0m5.001s
user 0m4.997s
sys 0m0.003s
For hash validation:
$ time ./cksum_bench_neon2 1048576 40000
Hash: EFA0B24F, length: 1048576
real 0m7.540s
user 0m7.538s
sys 0m0.001s
$ time ./cksum_bench_pclmul 1048576 10000
Hash: EFA0B24F, length: 1048576
real 0m3.018s
user 0m3.018s
sys 0m0.000s
-O3 does most of the optimisation work for us. There may be more savings
to find, but this is already a good improvement.
Some questions
- There's no direct equivalent of "__builtin_cpu_supports" for ARM, but the
hwcaps interface seems to be the way to test this [1] [2]
- ARM is a much more diverse ecosystem than x86_64, so it's possible that
some platforms (e.g. phones) would see a slowdown. Is this something we
want to give maintainers a flag to disable?
- ARMv8 also has a CRC32() opcode. A quick test showed it wasn't super
efficient, but it's possible that interleaving it with the folding
approach might add extra speedups. This is an exercise for the reader.
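On the first question, here is a minimal sketch of a hwcaps-based check,
assuming Linux on aarch64 with glibc's getauxval(). The function name
linux_pmull_available is hypothetical, and on any other platform this
sketch simply reports false:

```c
#include <stdbool.h>

#if defined (__linux__) && defined (__aarch64__)
# include <sys/auxv.h>          /* getauxval, AT_HWCAP */
# include <asm/hwcap.h>         /* HWCAP_PMULL */
#endif

/* Return true if the kernel advertises the PMULL instructions.
   Hypothetical helper, for illustration only.  */
static bool
linux_pmull_available (void)
{
#if defined (__linux__) && defined (__aarch64__) && defined (HWCAP_PMULL)
  return (getauxval (AT_HWCAP) & HWCAP_PMULL) != 0;
#else
  return false;                 /* no hwcaps interface assumed here */
#endif
}
```

The result could be cached once at startup and used to choose between the
generic and the PMULL code paths.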
Cool. I'll try this out on some of the arm64 machines at:
https://portal.cfarm.net/machines/list/
It doesn't support macOS currently, as it uses the Linux-only getauxval()
to determine CPU support. That's fine for now. A very quick search suggests
something like the following may work instead on macOS, which would then
support M1 and later, and which I may test later:
#if __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ >= 110000
#include <stdbool.h>
#include <sys/types.h>
#include <sys/sysctl.h>
/* Return true if the kernel reports the PMULL CPU feature.  */
bool
macos_pmull_available (void)
{
  int v = 0;
  size_t l = sizeof v;
  return sysctlbyname ("hw.optional.arm.FEAT_PMULL", &v, &l, 0, 0) == 0
         && v != 0;
}
#endif
It doesn't build with GCC 6 on Debian 9.13, as that compiler doesn't have
support for the vget_lane_p64() intrinsics etc. Again that's fine as
that's old.
I did find a more modern aarch64 (AMD Opteron A1100) Linux system
running OpenSUSE 15 (GCC 7), where the code worked fine and showed a
significant improvement in performance:
$ truncate -s 4G file
$ time src/cksum --debug file
cksum: using vmull hardware support
4215202376 4294967296 file
real 0m2.520s
# edit src/cksum.c to not use vmull
$ time src/cksum --debug file
4215202376 4294967296 file
real 0m6.266s
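As a cross-check on the checksum shown above, here is a minimal
bit-at-a-time sketch of the POSIX cksum CRC (polynomial 0x04C11DB7, zero
initial value, the length appended least significant byte first, then the
result complemented). It is a slow reference under those assumptions, not
the coreutils implementation, and the function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Feed one byte into the CRC, most significant bit first.  */
static uint32_t
crc_step (uint32_t crc, uint8_t byte)
{
  crc ^= (uint32_t) byte << 24;
  for (int bit = 0; bit < 8; bit++)
    crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
  return crc;
}

/* Checksum BUF as cksum would: data bytes, then the length bytes
   (least significant first, until none remain), then complement.  */
static uint32_t
posix_cksum (const unsigned char *buf, size_t len)
{
  uint32_t crc = 0;
  for (size_t i = 0; i < len; i++)
    crc = crc_step (crc, buf[i]);
  for (size_t n = len; n != 0; n >>= 8)
    crc = crc_step (crc, n & 0xFF);
  return ~crc;
}
```

Zero bytes leave a zero CRC unchanged, so for a file of zeros only the
final nonzero length byte matters. A single zero byte therefore gives
4215202376, the same value as the 4 GiB file of zeros above, whose length
bytes are 00 00 00 00 01.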
BTW I ran cksum_vmull.c through `indent -nut`, and I'll push this later.
thanks!
Pádraig