> > > Perhaps, I'll produce some actual numbers using OpenSSL and
> > > both implementations to prove my case.
> >
> > Note and respect that OpenSSL is a cross-platform toolkit,
> > meaning that we might face and resolve a trade-off.
> 
> Totally understood. OpenSSL should work well on all the
> platforms it supports.  If compromise is necessary however,
> shouldn't OpenSSL be tuned to the platform that is in (by
> far) the widest use?

Well, if you're so enthusiastic then why don't you throw in a hand-coded
assembler implementation:-) But to get serious. I made some preliminary
tests with the public domain implementation (again, the implementations
are practically identical and we won't see dramatic performance
differences) and it didn't look particularly good when I tried to
produce position-independent code: gcc 2.95 generated 50% (i.e. 2
*times*) slower code, gcc 3.0 - 40%, even icc (Intel C++ compiler)
surrendered 20% of performance... I don't see that it actually has to
be the case... Analysis revealed that the performance degradation was
caused by the fact that the generated code was operating with multiple
pointers to those tables. If I concatenate the tables into a single
two-dimensional array Te[5][256] (it would be rijndael_e[4][256] in the
proposed implementation) the results improve: gcc 2.95 goes 30-35%
slower, gcc 3.0 - 20%, and icc - only a couple of percent slower. The
latter result does prove that PIC doesn't really have to cost anything
in this case. But as very few people have icc (I myself have only a
demo version) an assembler implementation might actually be of
interest...
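
To make the table layout point concrete, here is a rough sketch (the
identifiers are illustrative, not the actual toolkit names, and only
four rows are shown for brevity): with separate tables the PIC code has
to carry one GOT-derived base pointer per table through the round,
whereas a single two-dimensional array needs only one base pointer and
the row turns into a constant offset folded into the addressing.

    #include <stdint.h>
    typedef uint32_t u32;

    /* separate tables: up to four base pointers live through the round */
    extern const u32 Te0[256], Te1[256], Te2[256], Te3[256];

    /* concatenated: one base pointer, constant row offsets */
    extern const u32 Te[4][256];

    u32 round_column_separate(u32 s0, u32 s1, u32 s2, u32 s3)
    {
            return Te0[(s0 >> 24) & 0xff] ^ Te1[(s1 >> 16) & 0xff] ^
                   Te2[(s2 >>  8) & 0xff] ^ Te3[ s3        & 0xff];
    }

    u32 round_column_concat(u32 s0, u32 s1, u32 s2, u32 s3)
    {
            return Te[0][(s0 >> 24) & 0xff] ^ Te[1][(s1 >> 16) & 0xff] ^
                   Te[2][(s2 >>  8) & 0xff] ^ Te[3][ s3        & 0xff];
    }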

There is another idea I've tested out. As IA-32 has to keep s0-s3 and
t0-t3 (dst[0-3] and block[0-3] in the proposed implementation) in
memory in either case, why not have a byte load do the shift-n-and's
job? I mean, for example, the BYTE1(x) macro in the proposed
implementation could read (*(((u8 *)(&(x)))+1))... Even though it
brings one extra load per round, performance doesn't suffer at all. But
note that this trick minimizes "the pressure" on the ALUs, potentially
leaving enough room for run-time S-box rotations. Compilers, however,
generated a bit slower code with run-time rotations, but a dummy
assembler implementation of mine exhibited perfectly competitive
performance when compared to code utilizing multiple tables. So your
previous "those rotations at runtime are *very* costly" comment doesn't
really hold... But it should still be noted (as already mentioned
earlier) that on platforms other than IA-32 it never pays off to
perform those rotations at run-time (it's only 20 loads per round and
the code starts starving for ALUs much earlier).
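
For what it's worth, a minimal sketch of the two flavours of that macro
(the byte-load definition is the one quoted above; index 1 picks the
intended byte only on a little-endian host such as IA-32):

    #include <stdint.h>
    typedef uint8_t u8;

    /* shift-and-mask flavour: two ALU ops per extracted byte */
    #define BYTE1_ALU(x)  (((x) >> 8) & 0xff)

    /* byte-load flavour: one extra load, no shift/and;
     * assumes a little-endian host and a memory-resident x */
    #define BYTE1_LOAD(x) (*(((u8 *)&(x)) + 1))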

> (SSE
> accesses of 128-bit chunks do even better and that is
> part of the reason the code is structured as it is)

What compiler are we talking about anyway? And how much do those SSE
instructions actually buy you? Indeed, at the very best the core code
(unrolled loop) runs at 350 ticks (35 loads per round, there is one
load port, and there are enough loads to group all the arithmetic ops
and stores with). Now SSE means trading 4 loads + 4 xors + 4 stores for
1 128-bit load + 1 pxor + 1 128-bit store, right? So it's a 3% win at
the very best (in reality it will be even less, as I didn't count SSE
latencies at all), at the cost of making the code PIII+ specific.
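
Just to show what I'm counting, a sketch of the trade-off (SSE2
intrinsics used purely for illustration - the pxor-on-xmm form strictly
speaking needs SSE2, but the accounting is the same; function names are
made up): xoring the 16-byte block with a round key the scalar way
versus the 128-bit way.

    #include <emmintrin.h>      /* SSE2 integer intrinsics */
    #include <stdint.h>

    /* 4 loads + 4 xors + 4 stores */
    void whiten_scalar(uint32_t s[4], const uint32_t rk[4])
    {
            s[0] ^= rk[0]; s[1] ^= rk[1]; s[2] ^= rk[2]; s[3] ^= rk[3];
    }

    /* 1 128-bit load + 1 pxor + 1 128-bit store
     * (the rk load normally folds into pxor's memory operand) */
    void whiten_sse(uint32_t s[4], const uint32_t rk[4])
    {
            __m128i v = _mm_loadu_si128((const __m128i *)s);
            v = _mm_xor_si128(v, _mm_loadu_si128((const __m128i *)rk));
            _mm_storeu_si128((__m128i *)s, v);
    }

Counted that way you replace roughly a dozen scalar ops with three per
block, which against the ~350-tick budget is where the "3% at the very
best" comes from.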

> You know, given the complexity of modern architectures and
> this algorithm, it's difficult to say without actually doing
> the benchmarking.

The statement isn't accurate. It's not "architectures" but "IA-32
implementations" you must actually be talking about. You indeed can't
tell anything definite without benchmarking, but that's because, being
impaired by the ancient instruction set, they (IA-32 implementations)
are rather unpredictable and you basically never know what your code
will get punished for. For example, I remember exchanging 'dec %ecx'
for 'sub $1,%ecx' cost 10% in performance someplace... On another
occasion I got excited over a 15% performance improvement on Pentium;
to my great disappointment the "tuned" code exhibited a 25% performance
*decrease* on Pentium Pro... Take even this algorithm. Code generated
by icc runs 33% slower on AMD (gcc-generated code doesn't suffer that
much, only ~10%). But in either case it's still possible to *estimate*
the very best-case scenario, and the complexity (well, out-of-order
execution) actually makes it easier. All you have to do is count the
number of operations, compare it with the number and types of the
execution units, and then count on the out-of-order execution core to
reorder the instruction stream and hide the latencies.

> For the
> block oriented modes, at least, I suspect the data is
> almost always going to be 32-bit aligned.

Wrong! At the very least, for the time being, both libssl and ssh pass
unaligned pointers to libcrypto.
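
One way the calling (mode) routines could cope with that without
touching the core - a sketch only, the names are made up: copy the
block through memcpy, which compilers typically inline into plain
loads/stores where unaligned access is tolerated and which stays
correct where it isn't.

    #include <string.h>
    #include <stdint.h>

    /* 'in'/'out' may be unaligned; memcpy keeps this portable */
    void load_block(uint32_t s[4], const unsigned char *in)
    {
            memcpy(s, in, 16);
    }

    void store_block(unsigned char *out, const uint32_t s[4])
    {
            memcpy(out, s, 16);
    }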

> > My question is, if I do the work to implement the above and do
> > so correctly, will it be accepted by the development team?
> 
> If it really is a win for all platforms, I don't see why not.

However, we can't guarantee that the very next hour somebody else (say
from AMD:-) won't submit better code, tested and benchmarked on several
*architectures*, can we:-) But in either case... To summarize:

- the tables should preferably be declared as two-dimensional arrays in
order to improve PIC performance (yes, the proposed implementation
makes this easier to arrange);
- the code should preferably operate on automatic variables, as it does
improve performance on RISCs (e.g. SPARC performance drops by 40% if I
declare s0-3 and t0-3 as s[4] and t[4] in the public domain
implementation) (yes, the proposed implementation again makes this
easier to arrange; see the sketch after this list);
- the alignment issue (as well as endianness) can be addressed in the
routines calling the core RIJNDAEL_[en|de]crypt (at least that's how
it's done in the other encryption routines implemented in the toolkit;
performance-wise, however, both issues belong in the core routines);
- if the endianness issue is addressed in the ecb, cbc, cfb, ofb
routines, then we have to have two different tables for big- and
little-endian hosts, but at the same time the code must produce a
correct result even if the [B|L]_ENDIAN macro is not defined.
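
And, as promised above, a tiny sketch of the second point (illustrative
names only, nothing taken from the proposed implementation): the state
kept in plain automatic variables rather than in a small array, so
register-rich targets (SPARC and friends) can keep it in registers
instead of spilling an s[4]/t[4] to the stack.

    #include <stdint.h>
    typedef uint32_t u32;

    /* array flavour: compilers are far more likely to keep s[] in memory */
    u32 mix_array(const u32 in[4])
    {
            u32 s[4] = { in[0], in[1], in[2], in[3] };
            return s[0] ^ s[1] ^ s[2] ^ s[3];
    }

    /* automatic-variable flavour: trivially register-allocated */
    u32 mix_scalars(const u32 in[4])
    {
            u32 s0 = in[0], s1 = in[1], s2 = in[2], s3 = in[3];
            return s0 ^ s1 ^ s2 ^ s3;
    }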

Cheers. Andy.
