Hi Trevor,

Sandy2x takes 156076 Haswell cycles for an X25519 shared-secret computation, which is very close to its Ivy Bridge cycle count. Note, however, that the non-vectorized implementation from the Ed25519 paper performs much better on Haswell than on Ivy Bridge: 161648 cycles versus 182708 cycles.
Armando Faz-Hernández and Julio López have a Latincrypt paper this year about an X25519 implementation targeting Haswell. They claim 1565xx Haswell cycles for shared-secret computation. They use a 4-way vectorized multiplier to perform 2 field multiplications/squarings at the same time. I think a better approach would be to find 4 independent multiplications/squarings in the formula and vectorize across them, but I haven't tried.

Best regards,
Tung Chou

On Tue, Sep 29, 2015 at 10:29 PM, Trevor Perrin <[email protected]> wrote:
> Tung Chou's "Sandy2x" code for 25519 on Sandy Bridge and Ivy Bridge is
> around 10-20% faster than other implementations:
>
> https://eprint.iacr.org/2015/943
>
> Speedup is attributed to using the 2-way 32x32->64 vectorized
> multiplier (vpmuludq) instead of the 64x64->128 serialized multiplier.
>
> The paper doesn't say whether this strategy also pays off on Haswell
> (which seems to be lagging in 25519 performance?):
>
>
> https://docs.google.com/spreadsheets/d/1SO3NGX-EgIZ1slw9uExb5FoeFy5TVkuA2lEutP6roYI/edit#gid=0
>
>
> Trevor
> _______________________________________________
> Curves mailing list
> [email protected]
> https://moderncrypto.org/mailman/listinfo/curves
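
For readers following the thread, the idea of vectorizing across 4 independent multiplications can be sketched roughly as below. This is only an illustration of the data layout, not code from Sandy2x or from the Latincrypt paper. It assumes a uniform radix of 2^26 with 10 limbs (real implementations typically use mixed 25/26-bit limbs), it models the four SIMD lanes with plain Python lists, and it leaves out the carry chain and the overflow bounds that real 64-bit lanes would need. All function names are hypothetical.

```python
# Sketch only: four field multiplications mod 2^255 - 19, one per "lane".
# Each lane-wise multiply below stands in for one 4-lane 32x32->64 vector
# multiply; Python ints never overflow, so no carry handling is shown.
P = 2**255 - 19
RADIX = 26                    # assumption: uniform 26-bit limbs
LIMBS = 10                    # 10 * 26 = 260 bits
MASK = (1 << RADIX) - 1
WRAP = 608                    # 2^260 = 2^5 * 2^255, and 2^255 = 19 (mod P)
LANES = 4

def to_limbs(x):
    """Split an integer below 2^260 into 10 limbs of 26 bits each."""
    return [(x >> (RADIX * i)) & MASK for i in range(LIMBS)]

def from_limbs(limbs):
    """Recombine (possibly unreduced) limbs and reduce mod P."""
    return sum(l << (RADIX * i) for i, l in enumerate(limbs)) % P

def interleave(values):
    """Pack 4 field elements so that vector i holds limb i of each element."""
    per_elem = [to_limbs(v % P) for v in values]
    return [[per_elem[lane][i] for lane in range(LANES)] for i in range(LIMBS)]

def deinterleave(vecs):
    """Unpack 10 limb vectors back into 4 integers mod P."""
    return [from_limbs([vecs[i][lane] for i in range(LIMBS)])
            for lane in range(LANES)]

def lane_mul(a, b):
    """Lane-wise product of two 4-element vectors."""
    return [x * y for x, y in zip(a, b)]

def lane_add(a, b):
    """Lane-wise sum of two 4-element vectors."""
    return [x + y for x, y in zip(a, b)]

def mul4(f, g):
    """Schoolbook product of interleaved operands: lane k gets f_k * g_k mod P."""
    h = [[0] * LANES for _ in range(LIMBS)]
    for i in range(LIMBS):
        for j in range(LIMBS):
            prod = lane_mul(f[i], g[j])
            if i + j >= LIMBS:
                # Limb weight 2^(26*(i+j)) wraps around 2^260, which is 608 mod P.
                prod = [WRAP * t for t in prod]
            k = (i + j) % LIMBS
            h[k] = lane_add(h[k], prod)
    return h

# Usage: four unrelated products computed with one pass over the limb vectors.
a = [3, 2**254 + 5, P - 1, 123456789]
b = [7, 2**200 + 11, P - 2, 987654321]
assert deinterleave(mul4(interleave(a), interleave(b))) == [
    (x * y) % P for x, y in zip(a, b)
]
```

The point of the layout is that every vector instruction does useful work in all 4 lanes without any shuffling inside a single multiplication. The cost is that the curve formulas must actually expose 4 independent multiplications or squarings at each step.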
