Joachim Strömbergson <[email protected]> writes:

> By vectorizing you mean running quarterrounds in parallel?

I mean putting several uint32_t values in a simd register, and using
simd instructions.
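
As a rough illustration of what that means (a sketch only, not code from
Nettle; it assumes an x86 target with SSE2, and the function name add4_u32
is made up for this example), four 32-bit words can be loaded into one
128-bit register and added with a single instruction:

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target */

/* Hypothetical helper: add four pairs of 32-bit words at once.
   The addition compiles to a single paddd instruction. */
static inline void
add4_u32(uint32_t dst[4], const uint32_t a[4], const uint32_t b[4])
{
  /* Unaligned loads, so the arrays need no special alignment. */
  __m128i va = _mm_loadu_si128((const __m128i *) a);
  __m128i vb = _mm_loadu_si128((const __m128i *) b);

  /* Lane-wise 32-bit addition, wrapping modulo 2^32 like uint32_t. */
  _mm_storeu_si128((__m128i *) dst, _mm_add_epi32(va, vb));
}
```

The same pattern applies to the xor and rotate steps of a quarterround:
each instruction then works on four words instead of one.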
 
> Have you looked at the asm code by DJB?

Not really. I find the generated assembly pretty hard to read, and I
haven't tried to understand his qhasm tool.

> He does up to four blocks in
> parallel and do some tricks with the shifts. xmm-5 should be relevant.

To me, it looks like all rotates are done with psrld + pslld. But I
might be missing something. On the few machines where I have benchmarked
the code (I haven't been very systematic), pshufhw + pshuflw seems to be
slightly faster for the rotate by 16 bits. It saves one por instruction.
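
To make the comparison concrete, here is a minimal sketch of the two ways
to rotate each 32-bit lane by 16 bits, written with SSE2 intrinsics rather
than assembly. It is an illustration under assumptions (x86 with SSE2; all
function names are made up here), not the code that was benchmarked. The
shuffle variant only works for a rotation by exactly 16 bits, since
pshuflw and pshufhw permute whole 16-bit words.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target */

/* Rotate each 32-bit lane left by 16 with two shifts and an OR
   (pslld + psrld + por): three instructions. */
static inline __m128i
rotl16_shift(__m128i x)
{
  return _mm_or_si128(_mm_slli_epi32(x, 16), _mm_srli_epi32(x, 16));
}

/* Same result with pshuflw + pshufhw: swap the two 16-bit halves of
   every 32-bit lane. Two instructions, no por needed. */
static inline __m128i
rotl16_shuffle(__m128i x)
{
  /* Selector (2,3,0,1) swaps adjacent 16-bit words within each half. */
  x = _mm_shufflelo_epi16(x, _MM_SHUFFLE(2, 3, 0, 1));
  return _mm_shufflehi_epi16(x, _MM_SHUFFLE(2, 3, 0, 1));
}

/* Hypothetical test helper: broadcast v to all lanes, rotate with the
   chosen variant, and return lane 0. */
static inline uint32_t
rotl16_lane0(uint32_t v, int use_shuffle)
{
  __m128i x = _mm_set1_epi32((int) v);
  x = use_shuffle ? rotl16_shuffle(x) : rotl16_shift(x);
  return (uint32_t) _mm_cvtsi128_si32(x);
}
```

Both variants return the same value for any input, so which one is faster
is purely a question of instruction count and how the target CPU schedules
shifts versus shuffles.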

I'm pretty sure doing a couple of blocks at a time in parallel,
interleaving the instructions, will give some speedup.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
