Re: Rijndael patch

Andy Polyakov Tue, 17 Jul 2001 06:08:58 -0700
Hi,

> I've put together an optimized implementation of Rijndael.
> Could a member of the core development team either apply
> the patch to the development tree or, at least, send
> feedback?  The patch was originally generated against the
> 20010711 snapshot but applies cleanly to any of the
> following snapshots.  The URL is
> 
>   http://www.geocities.com/andy_henroid/openssl-patch.txt
> 
> This is the second round for this patch.  I've recently
> fixed the licensing/header, which was the only objection
> I had heard after the original submission.

Well, here is more:-)

The proposed code (hereafter meaning the *core* en-/decryption
routines found at the URL above) is byte-order dependent and will fail
on all big-endian platforms. The original implementation (hereafter
meaning the public domain one already present in the development tree)
is not, and it produces correct results on all platforms at a nominal
cost (I estimate at most 5% across all platforms) of assembling 32-bit
values from 4 byte loads and the accompanying shift and or operations
(or a couple of rotates and an or if compiled with Microsoft C).
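
To illustrate the byte-load approach described above, here is a minimal
sketch. The names load_be32 and store_be32 are hypothetical and are not
taken from either implementation; the point is only that the result does
not depend on the host byte order or on the alignment of the pointer.

```c
#include <stdint.h>

/* Assemble a 32-bit word, most significant byte first, from four
 * byte loads. Works for any alignment of p and any host byte order. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}

/* Counterpart: write a 32-bit word back as four byte stores. */
static void store_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >>  8);
    p[3] = (unsigned char)v;
}
```

A direct word load through a cast pointer would save the shifts and ors,
but it ties the result to the host byte order and assumes the platform
tolerates misaligned addresses.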

The proposed code is IA-32 specific, as IA-32 is the only platform
immune to misaligned memory references. The original implementation
doesn't use word loads/stores unless compiled with the Microsoft C
compiler (see also the next paragraph).

Even though IA-32 is immune to misaligned access, the proposed code will
perform rather poorly in comparison to the original implementation when
the destination vector happens to be misaligned. The catch is that the
implementations are extremely sensitive to load and even store latency.
Since the proposed code performs encryption "in place", half of the
loads and stores (see also the next paragraph on stores) will be
misaligned, which will significantly hurt performance. The original code
addresses this problem by operating on automatic variables (which are
guaranteed to be aligned if they get allocated in memory, see also the
next paragraph).

Even if the data is aligned, the proposed code might perform worse than
the original implementation (well, there is no "might" if we start
discussing RISC or IA-64 platforms, see below). The problem is that the
proposed code performs encryption in place, and most compilers (being
unable to rule out aliasing at compile time) will start issuing
(otherwise) redundant stores. Well, you can argue that on IA-32 you have
to issue stores in either case (because the register bank is too damn
small), that the portion of *redundant* stores will probably be
negligible, and that the out-of-order execution core will compensate for
the redundant ones by aggressively reordering instructions at run-time
anyway... It surely does, but think of larger register banks, i.e.
everything else but IA-32. If the register bank is large enough to
accommodate *all* automatic variables required by the original
implementation, then *all* intermediate stores become redundant.
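
The difference between the two styles can be sketched with a toy
example. This is NOT Rijndael: mix() is a made-up stand-in for the real
round function, and all names here are hypothetical. It only shows where
the stores come from.

```c
#include <stdint.h>

/* Made-up stand-in for one round step: rotate left by 1, xor in key. */
static uint32_t mix(uint32_t a, uint32_t k)
{
    return ((a << 1) | (a >> 31)) ^ k;
}

/* In-place style: every update goes through the pointer s. Unless the
 * compiler can prove that rk never overlaps s, it has to perform each
 * store and reload rk[] afterwards. */
static void toy_rounds_in_place(uint32_t *s, const uint32_t *rk, int rounds)
{
    for (int r = 0; r < rounds; r++) {
        s[0] = mix(s[0], rk[4 * r + 0]);
        s[1] = mix(s[1], rk[4 * r + 1]);
        s[2] = mix(s[2], rk[4 * r + 2]);
        s[3] = mix(s[3], rk[4 * r + 3]);
    }
}

/* Automatic-variable style: the state lives in locals whose address is
 * never taken, so nothing can alias them. With enough registers the
 * only stores left are the four at the very end. */
static void toy_rounds_locals(uint32_t *s, const uint32_t *rk, int rounds)
{
    uint32_t s0 = s[0], s1 = s[1], s2 = s[2], s3 = s[3];

    for (int r = 0; r < rounds; r++) {
        s0 = mix(s0, rk[4 * r + 0]);
        s1 = mix(s1, rk[4 * r + 1]);
        s2 = mix(s2, rk[4 * r + 2]);
        s3 = mix(s3, rk[4 * r + 3]);
    }
    s[0] = s0; s[1] = s1; s[2] = s2; s[3] = s3;
}
```

Both functions give the same result when s and rk do not overlap; what
differs is how much freedom the compiler has in scheduling the memory
traffic.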

Bottom line: I see no reason to discard the original implementation in
favor of your core routines. But how about merging the surrounding
code, i.e. EVP support and other things?

And finally, yet another reason for the slow response. I also wondered
if one really has to spend a whole page (4K) on those tables. I mean,
one could instead have a single 1K table and perform the rotation at
run-time, as there might be enough time while waiting for data to become
available. The idea is to decrease the footprint and therefore the
"warm-up" time. The latter gets reduced because effectively the whole 1K
table will be brought into the L1 cache after encryption of the first
block alone. Closer analysis revealed that the only platforms which
might benefit from such a change are "wider" (i.e. *upcoming*) IA-64
implementations. On other platforms the issue (well, if it actually
turns out to be an issue:-) should be addressed by explicitly scheduling
load-use pairs for the L2 cache (most likely in assembler). Apparently
it doesn't pay off to introduce more computational instructions, as the
resources are already spent on offset calculations while the load-use
prologue isn't long enough to compensate for the increased need for
computational units.
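
The single-table idea can be sketched as follows. It assumes the four
1K tables are byte rotations of one another (T1[x] is T0[x] rotated
right by 8 bits, T2[x] by 16, T3[x] by 24), which should be checked
against the actual tables in use. The function and parameter names are
hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits; n must be in 1..31. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* One output column using four separate 1K tables (4K in total). */
static uint32_t column_4k(const uint32_t *T0, const uint32_t *T1,
                          const uint32_t *T2, const uint32_t *T3,
                          uint32_t s0, uint32_t s1, uint32_t s2,
                          uint32_t s3, uint32_t rk)
{
    return T0[s0 >> 24] ^ T1[(s1 >> 16) & 0xff] ^
           T2[(s2 >> 8) & 0xff] ^ T3[s3 & 0xff] ^ rk;
}

/* Same column using only T0 (1K) plus three rotates at run-time. */
static uint32_t column_1k(const uint32_t *T0,
                          uint32_t s0, uint32_t s1, uint32_t s2,
                          uint32_t s3, uint32_t rk)
{
    return T0[s0 >> 24] ^
           rotr32(T0[(s1 >> 16) & 0xff], 8) ^
           rotr32(T0[(s2 >> 8) & 0xff], 16) ^
           rotr32(T0[s3 & 0xff], 24) ^ rk;
}
```

The 1K variant touches a quarter of the cache lines but adds three
rotates per column, which is exactly the extra computational load
weighed above.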

Andy.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [EMAIL PROTECTED]
Automated List Manager                           [EMAIL PROTECTED]