Interestingly, the new Intel optimization manual says that Skylake can issue 
(V)PMULUDQ twice per cycle instead of once, presumably even for 4-way 
(256-bit) multiplications.  So “Sandy2x” and other designs with similar 
strategies (including for Goldilocks) should be competitive on Skylake as well.
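
For anyone unfamiliar with the instruction, here is a rough scalar model in C
of what a 32x32->64 vectorized multiply computes per lane. It is only a sketch
of the lane semantics, not code from Sandy2x or from any real implementation;
the function name mul_epu32_model and the "lanes" parameter are made up for
illustration (lanes = 2 would correspond to the 128-bit form, lanes = 4 to the
256-bit form).

```c
#include <stdint.h>

/* Hypothetical scalar model of (V)PMULUDQ lane semantics: each 64-bit
   lane contributes only its low 32 bits, and the full 64-bit product
   is stored in the corresponding output lane. */
void mul_epu32_model(const uint64_t *a, const uint64_t *b,
                     uint64_t *out, int lanes)
{
    for (int i = 0; i < lanes; i++) {
        uint64_t lo_a = a[i] & 0xffffffffu;  /* low 32 bits of lane i of a */
        uint64_t lo_b = b[i] & 0xffffffffu;  /* low 32 bits of lane i of b */
        out[i] = lo_a * lo_b;                /* 32x32->64, cannot overflow */
    }
}
```

The upper 32 bits of each input lane are ignored, which is why designs built on
this multiplier keep field elements in limbs narrower than 32 bits.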

Cheers,
— Mike

> On Sep 29, 2015, at 10:29 PM, Trevor Perrin <[email protected]> wrote:
> 
> Tung Chou's "Sandy2x" code for 25519 on Sandy Bridge and Ivy Bridge is
> around 10-20% faster than other implementations:
> 
> https://eprint.iacr.org/2015/943
> 
> Speedup is attributed to using the 2-way 32x32->64 vectorized
> multiplier (vpmuludq) instead of the 64x64->128 serialized multiplier.
> 
> The paper doesn't say whether this strategy also pays off on Haswell
> (which seems to be lagging in 25519 performance?):
> 
> https://docs.google.com/spreadsheets/d/1SO3NGX-EgIZ1slw9uExb5FoeFy5TVkuA2lEutP6roYI/edit#gid=0
> 
> 
> Trevor
> _______________________________________________
> Curves mailing list
> [email protected]
> https://moderncrypto.org/mailman/listinfo/curves
