Thank you, Bodo, for the comments. Here are some quick answers.

>>> It seems that the BN_MONT_CTX-related code

The optimization made for the computation of the modular inverse in ECDSA sign uses constant-time modular exponentiation. Indeed, this is independent of the rest of the patch, and it can be used independently (for other usages of the library). We included this addition in the patch for its particular use in ECDSA.

Regarding the paper: it will be posted soon.

>>> Note that in your code, OPENSSL_ia32cap_P-dependent initialization of
>>> global variables is not done in a thread-safe way.

This initialization is used for selecting a code path that uses the ADCX/ADOX instructions when the processor supports them. The outcome depends only on the appropriate CPUID bits. Therefore, there is no thread-safety issue (because any thread would select the same path). Of course, feel free to use the patch code and modify this initialization to match OpenSSL conventions.

>>> Your ec_p256_points_mul implementation is much worse than necessary when
>>> the input comprises many points

Indeed, that is right. However, this patch is intended to optimize ECDSA sign/verify (and ECDH). This usage does not require adding more than a single point. If there are interesting cases, optimized multi-point addition can be added.

Regards,
Shay Gueron

-----Original Message-----
From: Bodo Moeller via RT [mailto:r...@openssl.org]
Sent: Thursday, October 24, 2013 19:18
To: Gueron, Shay
Cc: openssl-dev@openssl.org
Subject: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms

Thanks for the submission!

It seems that the BN_MONT_CTX-related code (used in crypto/ecdsa for constant-time signing) is entirely independent of the remainder of the patch, and should be considered separately.

Regarding your reference 'S.Gueron and V.Krasnov, "Fast Prime Field Elliptic Curve Cryptography with 256 Bit Primes"' for your NIST P-256 code, is that document available? (Web search only pointed me back to your patch.)
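
For readers following the thread, the idea of computing a modular inverse through modular exponentiation can be sketched as below. This is only an illustration, assuming the inverse is obtained via Fermat's little theorem (a^(p-2) mod p for a prime p); it is not the code from the patch. The names toy_mulmod and toy_mod_inverse are hypothetical, and the 32-bit arithmetic stands in for the Montgomery multiplications a real implementation would use. The point is that the exponent p-2 is public, so the sequence of squarings and multiplications does not depend on the secret value being inverted.

```c
#include <stdint.h>

/* Multiply modulo p. Both operands are below 2^32, so the product fits
 * in 64 bits. A real implementation would use constant-time Montgomery
 * multiplication; the % operator here is only a stand-in. */
static uint32_t toy_mulmod(uint32_t a, uint32_t b, uint32_t p)
{
    return (uint32_t)(((uint64_t)a * b) % p);
}

/* Hypothetical helper: inverse of a modulo a prime p, computed as
 * a^(p-2) mod p (Fermat's little theorem). Assumes p is prime and
 * a is not a multiple of p. */
uint32_t toy_mod_inverse(uint32_t a, uint32_t p)
{
    uint32_t e = p - 2;       /* public exponent */
    uint32_t result = 1 % p;
    uint32_t base = a % p;

    /* Right-to-left square-and-multiply over all 32 exponent bits.
     * Branching on the bits of e reveals nothing secret, because e
     * depends only on the public modulus. */
    for (int i = 0; i < 32; i++) {
        if ((e >> i) & 1u)
            result = toy_mulmod(result, base, p);
        base = toy_mulmod(base, base, p);
    }
    return result;
}
```

By contrast, a Euclid-style inversion takes a number of steps that depends on the value being inverted, which is the usual reason for preferring exponentiation when that value is secret.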

I've noticed that for secret-independent constant-time memory access, your code relies on the scattering approach. However, http://cryptojedi.org/peter/data/chesrump-20130822.pdf points out that apparently this doesn't actually work as intended. (Dan Bernstein's earlier references: Sections 14, 15 in http://cr.yp.to/papers.html#cachetiming; http://cr.yp.to/mac/athlon.html.)

Note that in your code, OPENSSL_ia32cap_P-dependent initialization of global variables is not done in a thread-safe way. How about entirely avoiding this global state, and passing pointers down to the implementations?

Your ec_p256_points_mul implementation is much worse than necessary when the input comprises many points (more precisely, more than one point other than the group generator), because you call ec_p256_windowed_mul multiple times separately and add the results. I'd suggest implementing this modeled on ec_GFp_nistp256_points_mul instead, to benefit from interleaved left-to-right point multiplication. (This avoids the additional point-double operations from the separate point multiplication algorithm executions going through each additional scalar.) Your approach for precomputation also is different (using fewer point operations based on a larger precomputed table than the one we currently use in ec_GFp_nistp256_points_mul) -- that table size still seems appropriate, so keeping that probably makes sense.
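
The saving from interleaved left-to-right multiplication mentioned in this thread can be illustrated with a small sketch. It is not OpenSSL code: to stay self-contained it replaces the elliptic-curve group with a toy group (integers modulo TOY_N under addition), and it uses plain bit-by-bit double-and-add rather than windowing. All names (TOY_N, toy_add, toy_double, toy_mul_separate, toy_mul_interleaved) are hypothetical. The branches on scalar bits are not constant-time; the sketch only counts doublings.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the curve group: integers modulo TOY_N under addition.
 * "Point addition" is modular addition, "point doubling" is x + x. */
#define TOY_N 1000003u

static uint32_t toy_add(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a + b) % TOY_N);
}

static uint32_t toy_double(uint32_t a)
{
    return toy_add(a, a);
}

/* Separate multiplications: every scalar runs its own left-to-right
 * double-and-add, so each one pays for its own 32 doublings. */
uint32_t toy_mul_separate(const uint32_t *k, const uint32_t *p, size_t n,
                          unsigned *doublings)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t acc = 0;
        for (int bit = 31; bit >= 0; bit--) {
            acc = toy_double(acc);
            (*doublings)++;
            if ((k[i] >> bit) & 1u)
                acc = toy_add(acc, p[i]);
        }
        total = toy_add(total, acc);
    }
    return total;
}

/* Interleaved multiplication: one shared accumulator, so the doublings
 * are performed once regardless of how many scalars there are. */
uint32_t toy_mul_interleaved(const uint32_t *k, const uint32_t *p, size_t n,
                             unsigned *doublings)
{
    uint32_t acc = 0;
    for (int bit = 31; bit >= 0; bit--) {
        acc = toy_double(acc);
        (*doublings)++;
        for (size_t i = 0; i < n; i++)
            if ((k[i] >> bit) & 1u)
                acc = toy_add(acc, p[i]);
    }
    return acc;
}
```

With n scalars of 32 bits each, the separate version performs 32n doublings and the interleaved version performs 32, while the number of additions is the same in both. For a single point (the ECDSA and ECDH case discussed above) the two versions do the same amount of work.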