>> - this code is not cache attack resistant,
>> but the performance penalty would be just too high.
>
> "Too high" is a non-objective measure:-) Anyway, the "compressed tables" I
> was referring to don't necessarily mean the minimum of 256B. I was rather
> referring to 1KB or 2KB tables. I mean, what is the relation between Te0-3?
> A rotate operation, i.e. something that can be derived from a single table.
> See other assembler modules for examples. Implementations using
> compressed 1KB or 2KB tables were observed to perform adequately, see
> sparcv9, parisc, arm4 modules...
For MIPS I suggest a 1KB table accessed with lwl/lwr pairs substituting
for lw and rotate, meaning that there will be one lw and three lwl/lwr
pairs per "iteration." To give an example, on big-endian (that's what I
have) Te[x]<<<8 is done with 'lwl $x,1($i); lwr $x,0($i)'. Furthermore,
something needs to be done about the way you obtain the addresses of the
Te/Td tables. Normally I'd settle for placing Te in the .text segment and then
	bal	.+8
	nop
	$PTR_ADD $x,$31,Te-.
Unfortunately there is a toolchain that does not allow placing data in
the .text segment, MIPSpro to be specific, and for a reason. Therefore we'd
need to find a unified way to address the tables through $gp, the global
pointer, in a position-independent manner.
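
To make the lwl/lwr point concrete, here is a minimal C model of the
big-endian behaviour of the two instructions. It is only a sketch: the
helper names (lwl_be, lwr_be, te_rotl) are made up for illustration, the
table is treated as a plain byte array, and nothing here is taken from an
existing module. It shows that a pair at offsets n and n-1 inside one
table entry yields the entry rotated left by 8*n bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of big-endian lwl: bytes from addr up to the end of the aligned
 * word go into the most significant bytes of reg; the rest is kept. */
static uint32_t lwl_be(uint32_t reg, const uint8_t *mem, size_t addr)
{
    size_t k = addr & 3;
    const uint8_t *w = mem + (addr & ~(size_t)3);

    for (size_t i = k; i < 4; i++) {
        unsigned shift = 24 - 8 * (unsigned)(i - k);
        reg = (reg & ~(0xFFu << shift)) | ((uint32_t)w[i] << shift);
    }
    return reg;
}

/* Model of big-endian lwr: bytes from the start of the aligned word up
 * to addr go into the least significant bytes of reg; the rest is kept. */
static uint32_t lwr_be(uint32_t reg, const uint8_t *mem, size_t addr)
{
    size_t k = addr & 3;
    const uint8_t *w = mem + (addr & ~(size_t)3);

    for (size_t i = 0; i <= k; i++) {
        unsigned shift = 8 * (unsigned)(k - i);
        reg = (reg & ~(0xFFu << shift)) | ((uint32_t)w[i] << shift);
    }
    return reg;
}

/* Hypothetical helper: entry 'index' of a table of 4-byte big-endian
 * words, rotated left by 8*n bits (n = 1..3), using one lwl/lwr pair. */
static uint32_t te_rotl(const uint8_t *table, size_t index, unsigned n)
{
    size_t base = 4 * index;
    uint32_t x = 0;

    assert(n >= 1 && n <= 3);
    x = lwl_be(x, table, base + n);      /* lwl $x,n($i)   */
    x = lwr_be(x, table, base + n - 1);  /* lwr $x,n-1($i) */
    return x;
}
```

With an entry stored as bytes 11 22 33 44, te_rotl(table, 0, 1) gives
0x22334411, which is the Te[x]<<<8 case above. On little-endian the roles
of the two instructions are mirrored, which this model does not cover.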
For SH4 I suggest a 2KB table accessed with a combination of a mov.l/swap.w
pair and movua.l. I mean there will be one mov.l, one mov.l/swap.w pair and
two movua.l per "iteration." Also, as far as I understand, code position
independence is a problem even in aes-sh4...
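
One way to picture the 2KB variant, as a sketch only: I am assuming here
that each 32-bit entry is stored twice in a row (8 bytes per entry, 256
entries), so that unaligned loads at byte offsets 1 and 3 pick up rotated
copies. The layout, the big-endian byte order and all names below are
assumptions made for illustration, not something taken from aes-sh4.

```c
#include <stdint.h>

/* Assumed layout: the word stored twice, big-endian, in an 8-byte slot. */
static void pack_entry(uint8_t slot[8], uint32_t w)
{
    for (int i = 0; i < 8; i++)
        slot[i] = (uint8_t)(w >> (24 - 8 * (i & 3)));
}

/* Model of a 32-bit big-endian load from any byte address; this stands
 * in for both mov.l (aligned) and movua.l (unaligned). */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Model of swap.w: exchange the upper and lower 16-bit halves. */
static uint32_t swap_w(uint32_t x)
{
    return (x << 16) | (x >> 16);
}

/* The four values needed per "iteration", as left rotations of one entry. */
typedef struct {
    uint32_t r0, r8, r16, r24;
} rotations;

static rotations lookup_all(const uint8_t slot[8])
{
    rotations r;

    r.r0  = load_be32(slot);          /* mov.l           */
    r.r16 = swap_w(load_be32(slot));  /* mov.l + swap.w  */
    r.r8  = load_be32(slot + 1);      /* movua.l at +1   */
    r.r24 = load_be32(slot + 3);      /* movua.l at +3   */
    return r;
}
```

With little-endian storage the same two unaligned loads would give right
rotations instead of left ones, so the offsets would have to be assigned
accordingly.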
I'd insist on dropping aes/asm-asm-key.c. aes_core.c compiled with
-DAES_ASM does the job, see $sparcv9_asm in Configure for example.
>> 6/
>> I moved the bn_xxx_word to separate bn/asm/(sh4/mips32).S files, only
>> the comba functions were left as a C file.
>
> It's possible to compile without comba functions. It can be achieved
> e.g. by compiling with -DOPENSSL_SMALL_FOOTPRINT, which is more than
> appropriate for embedded systems. Benchmark that too...
As for MIPS: based on the results I collected on my system, I'd insist on
omitting mips32.S (as well as the accompanying comba routines) in favor of
mips-mont.pl. The latter provides better performance at 512- and 1024-bit
key lengths and adequate performance at longer ones.
As for SH4, I don't find the inline assembler in bn-asm-comba.c appropriate.
Most notably, shouldn't the multiplication be 'asm ("dmulu.l
%1,%2":"=x"(mac):"r"(a),"r"(b))', provided that mac is declared unsigned
long long? Then why not use "r"(0) and let the compiler allocate a register
for the value 0? BTW, I reckon bn_mul_mont would do well on SH4 too, likely
better than on MIPS, so it should be the preferred option... A.
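
P.S. A sketch of what I have in mind for the SH4 multiplication, with a
portable fallback so it can be compiled anywhere. Note that it does not
use the "x" constraint mentioned above: it reads MACL/MACH explicitly with
sts, which is the form I am more certain compiles. The function names
mul32x32 and mul_add_word are made up for this example, and the SH path is
untested here.

```c
#include <stdint.h>

/* 32x32 -> 64-bit unsigned multiply. */
static uint64_t mul32x32(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && defined(__sh__)
    uint32_t lo, hi;

    /* dmulu.l leaves the 64-bit product in MACH:MACL */
    __asm__ ("dmulu.l %2,%3\n\t"
             "sts macl,%0\n\t"
             "sts mach,%1"
             : "=r"(lo), "=r"(hi)
             : "r"(a), "r"(b)
             : "macl", "mach");
    return ((uint64_t)hi << 32) | lo;
#else
    /* Fallback: let the compiler pick the widening multiply. */
    return (uint64_t)a * b;
#endif
}

/* One word step of r + a*w + carry: returns the low word and stores the
 * high word back into *carry. The sum cannot overflow 64 bits. */
static uint32_t mul_add_word(uint32_t r, uint32_t a, uint32_t w,
                             uint32_t *carry)
{
    uint64_t t = mul32x32(a, w) + r + *carry;

    *carry = (uint32_t)(t >> 32);
    return (uint32_t)t;
}
```

The point of keeping the product in a single 64-bit value is that the
compiler, not hand-written asm, handles the carry propagation.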
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List