Joachim Strömbergson <[email protected]> writes:

> By vectorizing you mean running quarterrounds in parallel?

I mean putting several uint32_t values in a SIMD register, and using
SIMD instructions.

> Have you looked at the asm code by DJB?

Not really. I find the generated assembly pretty hard to read, and I
haven't tried to understand his qhasm tool.

> He does up to four blocks in
> parallel and do some tricks with the shifts. xmm-5 should be relevant.

To me, it looks like all rotates are done with psrld + pslld. But I
might be missing something. On the few machines where I have
benchmarked the code (I haven't been very systematic), pshufhw +
pshuflw seems to be slightly faster. It saves one por instruction.

I'm pretty sure doing a couple of blocks at a time in parallel,
interleaving the instructions, will give some speedup.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
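
To make the rotate comparison concrete, here is a minimal sketch using SSE2 intrinsics. It assumes the rotate in question is by 16 bits, since that is the only count for which a pure 16-bit word shuffle can replace the shifts. The function names (rot16_shift_or, rot16_shuffle, rot16_x4) are made up for illustration and are not taken from nettle or from DJB's code.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Rotate each of the four 32-bit lanes left by 16 using shifts:
   pslld + psrld + por, three instructions. */
static __m128i
rot16_shift_or(__m128i x)
{
  return _mm_or_si128(_mm_slli_epi32(x, 16), _mm_srli_epi32(x, 16));
}

/* Same rotate done as a word shuffle: pshuflw + pshufhw, two
   instructions and no por. The immediate 0xB1 selects source words
   1,0,3,2, i.e. it swaps the two 16-bit halves of each 32-bit lane. */
static __m128i
rot16_shuffle(__m128i x)
{
  return _mm_shufflehi_epi16(_mm_shufflelo_epi16(x, 0xB1), 0xB1);
}

/* Hypothetical helper for trying both variants on plain arrays. */
static void
rot16_x4(const uint32_t in[4], uint32_t out[4], int use_shuffle)
{
  __m128i v = _mm_loadu_si128((const __m128i *) in);
  v = use_shuffle ? rot16_shuffle(v) : rot16_shift_or(v);
  _mm_storeu_si128((__m128i *) out, v);
}
```

Both variants should give identical results for any input. Which one is faster depends on the machine, so this only shows where the saved por comes from.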
