Tung Chou's "Sandy2x" code for 25519 on Sandy Bridge and Ivy Bridge is around 10-20% faster than other implementations:
https://eprint.iacr.org/2015/943

Speedup is attributed to using the 2-way 32x32->64 vectorized multiplier (vpmuludq) instead of the 64x64->128 serialized multiplier.

The paper doesn't say whether this strategy also pays off on Haswell (which seems to be lagging in 25519 performance?):

https://docs.google.com/spreadsheets/d/1SO3NGX-EgIZ1slw9uExb5FoeFy5TVkuA2lEutP6roYI/edit#gid=0

Trevor

_______________________________________________
Curves mailing list
[email protected]
https://moderncrypto.org/mailman/listinfo/curves
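
For anyone who wants the two multipliers side by side, here is a rough Python model of their arithmetic. It is only a sketch of the instruction semantics as I understand them, not code from Sandy2x: the function names are made up for illustration, the vector case models the 128-bit (two-lane) form of vpmuludq, and nothing here says anything about latency or throughput.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def vpmuludq_model(a, b):
    """Hypothetical model of the 2-way 32x32->64 vector multiply.

    a and b are sequences of two 64-bit lanes. Only the low 32 bits of
    each lane are used; each result lane holds the full 64-bit product.
    """
    return [((x & MASK32) * (y & MASK32)) & MASK64 for x, y in zip(a, b)]


def mul64_wide_model(x, y):
    """Hypothetical model of the scalar 64x64->128 multiply.

    Returns the 128-bit product as a (high, low) pair of 64-bit halves.
    """
    product = (x & MASK64) * (y & MASK64)
    return product >> 64, product & MASK64
```

The vector form yields two independent 64-bit products per instruction from 32-bit inputs, while the scalar form yields one 128-bit product from 64-bit inputs. For example, `vpmuludq_model([3, 0xFFFFFFFF], [5, 0xFFFFFFFF])` gives both lane products in one call, and any bits above bit 31 of an input lane are ignored.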
