Looks good, the code is more readable now.

> For both Neon and SVE, I do see improvements with looping over 4
> registers at a time, so IMHO it's worth doing so even if it performs the
> same as 2-register blocks on some hardware.

There was no regression on Graviton 3 when using the 4-register version, so we can keep it.

-Chiranmoy