Interestingly, the new Intel optimization manual says that Skylake can issue (V)PMULUDQ twice every cycle instead of once, presumably even for 4-way (256-bit) multiplications. So “Sandy2x” and other designs with similar strategies (including for Goldilocks) should be competitive on Skylake as well.
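For readers unfamiliar with the instruction, here is a rough scalar model in portable C of what one (V)PMULUDQ computes: each 64-bit lane contributes only its low 32 bits, and the lane result is the full 64-bit product. The function name pmuludq_model and the array-based interface are made up for illustration; this is a sketch of the lane semantics only, not code from Sandy2x, and it says nothing about throughput.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar model of the (V)PMULUDQ lane semantics (hypothetical helper,
 * for illustration only). For every 64-bit lane, the high 32 bits of
 * both inputs are ignored and the low 32 bits are multiplied as
 * unsigned integers into a full 64-bit result.
 * lanes = 2 models the 128-bit form, lanes = 4 the 256-bit form.
 */
static void pmuludq_model(uint64_t *out, const uint64_t *a,
                          const uint64_t *b, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++) {
        uint64_t lo_a = a[i] & 0xffffffffu; /* low 32 bits of lane i */
        uint64_t lo_b = b[i] & 0xffffffffu;
        out[i] = lo_a * lo_b;               /* 32x32 -> 64, cannot overflow */
    }
}
```

With lanes set to 4, one call corresponds to the four 32x32->64 products that a single 256-bit instruction would produce.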
Cheers,
— Mike

> On Sep 29, 2015, at 10:29 PM, Trevor Perrin <[email protected]> wrote:
>
> Tung Chou's "Sandy2x" code for 25519 on Sandy Bridge and Ivy Bridge is
> around 10-20% faster than other implementations:
>
> https://eprint.iacr.org/2015/943
>
> Speedup is attributed to using the 2-way 32x32->64 vectorized
> multiplier (vpmuludq) instead of the 64x64->128 serialized multiplier.
>
> The paper doesn't say whether this strategy also pays off on Haswell
> (which seems to be lagging in 25519 performance?):
>
> https://docs.google.com/spreadsheets/d/1SO3NGX-EgIZ1slw9uExb5FoeFy5TVkuA2lEutP6roYI/edit#gid=0
>
>
> Trevor
> _______________________________________________
> Curves mailing list
> [email protected]
> https://moderncrypto.org/mailman/listinfo/curves
