Hi,

#1. It's prohibitively painful to maintain [and develop] code without
access to the relevant hardware. Is there a way to arrange access to a
Nano-based system [remote is preferred]?

#2. Please do provide up-to-date documentation. The sample code doesn't
provide the required details.

>     This is a patch set which updates PadLock engine for VIA C7
> and Nano CPUs. It refers to AES with ECB/CBC/CFB/OFB, SHA1/224/256,
> RSA sign/verify and RNG, and all of them are accelerated by PadLock 
> hardware of VIA C7 and Nano CPUs. Some parts of this patch set are
> based on the codes originally written by Michal Ludvig and Timo Teras.
>     The patch set is available for both 32-bit/64-bit GNU compilers and 
> MS compilers, and it is produced from OpenSSL-1.0.0-stable branch 
> on OpenSSL CVS server. If other versions of this patch is needed, such
> as OpenSSL-0.9.8-stable branch, please tell me, I will send it soon.

The protocol is to develop in the development HEAD branch and then
back-port if[!] possible/applicable. As for 1.0.0 and earlier versions,
a decision in principle was taken to effectively freeze them and accept
only genuine bug fixes. This means that the suggested patch won't be
committed to CVS. The alternative is to provide something similar to the
"Intel Acceleration Engine", see
http://www.mail-archive.com/openssl-dev@openssl.org/msg29626.html,
provided, naturally, that the new code works in the development branch
first.

I've just committed an overhauled version that makes the Padlock engine
independent of inline assembler (and therefore independent of the
compiler), see http://cvs.openssl.org/chngview?cn=21360. The assembler
modules already include SHA primitives (not tested though), but no PMM
yet (planned). The x86_64 module has not actually been tested, so
feedback is appreciated. Either way, this is the new baseline to build
upon.

As for the submission per se.

How come bn_mul_mont_padlock doesn't check whether the result is larger
than the modulus? For reference, crypto/bn/asm/via-mont.pl does.
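
To illustrate why the check matters, here is a minimal single-word
sketch. It is not the via-mont.pl code; the names mont_n0, mont_mul and
mod_mul_via_mont are made up for illustration, and the modulus is
assumed to be odd and below 2^31 so that everything fits in 64 bits.
The Montgomery reduction step only guarantees a result below 2n, so a
final conditional subtraction is needed to get back into [0, n).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns -n^{-1} mod 2^32 for odd n. */
uint32_t mont_n0(uint32_t n)
{
    uint32_t inv = n;               /* n*n == 1 mod 8, so 3 bits are right */
    for (int i = 0; i < 5; i++)
        inv *= 2 - n * inv;         /* Newton step doubles the correct bits */
    return 0 - inv;
}

/* Returns a*b/2^32 mod n, assuming odd n < 2^31 and a, b < n. */
uint32_t mont_mul(uint32_t a, uint32_t b, uint32_t n, uint32_t n0)
{
    assert((n & 1) && n < 0x80000000u && a < n && b < n);
    uint64_t t = (uint64_t)a * b;
    uint32_t m = (uint32_t)t * n0;              /* t + m*n is 0 mod 2^32 */
    uint64_t r = (t + (uint64_t)m * n) >> 32;   /* only r < 2n is assured */
    if (r >= n)                                 /* the check in question */
        r -= n;
    return (uint32_t)r;
}

/* Hypothetical wrapper: a*b mod n computed through Montgomery form. */
uint32_t mod_mul_via_mont(uint32_t a, uint32_t b, uint32_t n)
{
    uint32_t n0 = mont_n0(n);
    uint32_t am = (uint32_t)(((uint64_t)a << 32) % n);  /* a*2^32 mod n */
    uint32_t bm = (uint32_t)(((uint64_t)b << 32) % n);  /* b*2^32 mod n */
    uint32_t pm = mont_mul(am, bm, n, n0);              /* a*b*2^32 mod n */
    return mont_mul(pm, 1, n, n0);                      /* a*b mod n */
}
```

Without the "if (r >= n)" step the output can land anywhere in [0, 2n),
which breaks any caller that expects a fully reduced value. The plain
branch here is about correctness only; it says nothing about timing.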

crypto/bn/asm/via-mont.pl effectively says that the surrounding C code
is critical for performance, yet we see malloc in the suggested code.
But don't rush to replace it with e.g. alloca, because the "right thing
to do" is to allocate buffers in the modular exponentiation subroutine
and thus take buffer handling out of the Montgomery multiplication
subroutine [as well as out of the loop calling it]. Moreover, in the
modular exponentiation context there is no need for a
BN_mod_mul_montgomery_padlock that handles all possible lengths. See the
latest bn_exp.c in the development branch for ideas (for reference,
there will be further optimizations). To summarize: the modular
exponentiation subroutine should allocate all the buffers for Montgomery
multiplication (on the stack or with malloc, depending on how much data
is required) and call the latter directly with minimal overhead.
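
A rough sketch of that structure, in portable C rather than Padlock
code. All names here (mont_mul_words, mod_exp_u32, demo_pow2_mod_m61)
are hypothetical, the exponent is limited to 32 bits to keep it short,
and rr = 2^(64*s) mod n is assumed to be precomputed by the caller. The
point is only where the buffers live: one allocation in the
exponentiation routine, none in the multiplication.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical word-serial Montgomery multiplication:
 * r = a*b/2^(32*s) mod n. The caller supplies scratch t[s+2]; nothing
 * is allocated here. r may alias a or b. Assumes odd n, a, b < n and
 * n0 = -n[0]^{-1} mod 2^32.
 */
void mont_mul_words(uint32_t *r, const uint32_t *a, const uint32_t *b,
                    const uint32_t *n, uint32_t n0, size_t s, uint32_t *t)
{
    uint64_t c, borrow = 0;
    size_t i, j;

    memset(t, 0, (s + 2) * sizeof(*t));
    for (i = 0; i < s; i++) {
        for (c = 0, j = 0; j < s; j++) {        /* t += a * b[i] */
            c += (uint64_t)a[j] * b[i] + t[j];
            t[j] = (uint32_t)c;
            c >>= 32;
        }
        c += t[s];
        t[s] = (uint32_t)c;
        t[s + 1] = (uint32_t)(c >> 32);

        uint32_t m = t[0] * n0;                 /* t = (t + m*n) / 2^32 */
        c = ((uint64_t)m * n[0] + t[0]) >> 32;
        for (j = 1; j < s; j++) {
            c += (uint64_t)m * n[j] + t[j];
            t[j - 1] = (uint32_t)c;
            c >>= 32;
        }
        c += t[s];
        t[s - 1] = (uint32_t)c;
        t[s] = t[s + 1] + (uint32_t)(c >> 32);
    }
    for (j = 0; j < s; j++) {                   /* r = t - n */
        uint64_t d = (uint64_t)t[j] - n[j] - borrow;
        r[j] = (uint32_t)d;
        borrow = d >> 63;
    }
    if (t[s] < borrow)                          /* t was below n: keep t */
        memcpy(r, t, s * sizeof(*r));
}

/* Hypothetical r = base^e mod n; returns 1 on success, 0 on failure. */
int mod_exp_u32(uint32_t *r, const uint32_t *base, uint32_t e,
                const uint32_t *n, const uint32_t *rr, uint32_t n0, size_t s)
{
    /* One allocation, outside the loop: acc[s], bm[s], one[s], t[s+2]. */
    uint32_t *buf = calloc(4 * s + 2, sizeof(*buf));
    uint32_t *acc = buf, *bm, *one, *t;

    assert(s > 0 && (n[0] & 1));
    if (buf == NULL)
        return 0;
    bm = buf + s; one = buf + 2 * s; t = buf + 3 * s;
    one[0] = 1;
    mont_mul_words(bm, base, rr, n, n0, s, t);  /* into Montgomery form */
    mont_mul_words(acc, one, rr, n, n0, s, t);
    for (int i = 31; i >= 0; i--) {             /* plain square-and-multiply */
        mont_mul_words(acc, acc, acc, n, n0, s, t);
        if ((e >> i) & 1)
            mont_mul_words(acc, acc, bm, n, n0, s, t);
    }
    mont_mul_words(r, acc, one, n, n0, s, t);   /* out of Montgomery form */
    free(buf);
    return 1;
}

/* Usage sketch: 2^e mod (2^61 - 1). There 2^64 = 8, hence rr = 64. */
uint64_t demo_pow2_mod_m61(uint32_t e)
{
    const uint32_t n[2] = { 0xFFFFFFFFu, 0x1FFFFFFFu };
    const uint32_t rr[2] = { 64, 0 }, two[2] = { 2, 0 };
    uint32_t r[2];

    /* n[0] is -1 mod 2^32, so n0 = -n[0]^{-1} mod 2^32 is simply 1. */
    if (!mod_exp_u32(r, two, e, n, rr, 1, 2))
        return UINT64_MAX;
    return ((uint64_t)r[1] << 32) | r[0];
}
```

Note that this sketch deliberately ignores timing: both the exponent
scan and the final subtraction branch on data. It is meant to show
buffer placement only.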

The suggested modular exponentiation follows an implementation path that
was shown to be prone to side-channel attacks. Yes, the attack is
applicable on multi-core designs, but is there any indication that there
won't be multi-core Padlock-capable CPUs in the future? I'd argue that
there is no reason not to take the "constant time" approach.
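
One ingredient of such an approach, as a minimal sketch: when a
fixed-window exponentiation fetches a precomputed power, it reads every
table entry and keeps the wanted one with a mask, so the addresses
touched do not depend on the secret window value. The names
ct_table_select and demo_select and the table layout are assumptions
made for illustration, not code from bn_exp.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copies entry idx of a table (entries x words, row-major) into out.
 * Every entry is read regardless of idx, so the sequence of addresses
 * touched does not depend on the (secret) index.
 */
void ct_table_select(uint32_t *out, const uint32_t *table,
                     size_t entries, size_t words, size_t idx)
{
    size_t i, j;

    assert(idx < entries);
    for (j = 0; j < words; j++)
        out[j] = 0;
    for (i = 0; i < entries; i++) {
        size_t x = i ^ idx;                     /* zero only when i == idx */
        size_t ne = (x | (0 - x)) >> (sizeof(x) * 8 - 1); /* 1 if x != 0 */
        uint32_t mask = (uint32_t)ne - 1;       /* all-ones iff i == idx */

        for (j = 0; j < words; j++)
            out[j] |= table[i * words + j] & mask;
    }
}

/* Usage sketch with a made-up 4x2 table; entry i is {10*i, 10*i + 1}. */
uint32_t demo_select(size_t idx)
{
    uint32_t table[8], out[2];

    for (size_t i = 0; i < 4; i++) {
        table[2 * i] = (uint32_t)(10 * i);
        table[2 * i + 1] = (uint32_t)(10 * i + 1);
    }
    ct_table_select(out, table, 4, 2, idx);
    return out[0] * 100 + out[1];
}
```

The cost is a full pass over the table per lookup, which is small next
to the multiplications. Whether a given compiler keeps the masking
branch-free is something to verify on the generated code.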

As for SHA. It was shown that there is a way to use SHA even on
pre-Nano, see
http://www.mail-archive.com/openssl-dev@openssl.org/msg21787.html.
The challenge is to make it multi-thread safe. It would take allocating
a dynamic lock and serializing access to the "crash page" allocated at
engine load.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org
