But given the amplitude of the performance swing (2.5x one way or the other), we would most likely have to detect P4 at run-time and fall back to an alternative code path... In *both* the 32- and 64-bit cases...
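To make the run-time detection idea concrete, here is a minimal sketch in C. It is not the mechanism OpenSSL actually uses: the names `is_pentium4`, `select_rc4_path` and the `rc4_path` enum are hypothetical, and the rule "vendor GenuineIntel plus CPUID base family 15 means P4" is an assumption made for illustration. The vendor check matters because Opteron also reports base family 15.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical labels for the two code paths. */
typedef enum { RC4_PATH_BLENDED, RC4_PATH_P4 } rc4_path;

/*
 * Hypothetical helper. 'vendor' is the 12-character CPUID leaf 0 vendor
 * string (NUL-terminated), 'leaf1_eax' is EAX as returned by CPUID leaf 1.
 * Assumption: base family 15 on an Intel part identifies a P4 core.
 */
static int is_pentium4(const char *vendor, uint32_t leaf1_eax)
{
    uint32_t base_family = (leaf1_eax >> 8) & 0xF; /* bits 11..8 */
    return strcmp(vendor, "GenuineIntel") == 0 && base_family == 0xF;
}

/* Pick the P4-only path on P4, the blended path everywhere else. */
static rc4_path select_rc4_path(const char *vendor, uint32_t leaf1_eax)
{
    return is_pentium4(vendor, leaf1_eax) ? RC4_PATH_P4 : RC4_PATH_BLENDED;
}
```

The decision is kept separate from the CPUID instruction itself so that it can be exercised with recorded signature values on any machine; a real build would feed it the output of CPUID once at start-up and cache the result.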

I failed to compose blended code that would perform satisfactorily on both P4 and non-P4 cores, so I implemented an alternative RC4_CHAR assembler code path that is engaged only when executing on P4.
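For readers unfamiliar with what an RC4_CHAR variant changes, the following C sketch generates the same RC4 routine twice, once with a 32-bit state table and once with an 8-bit one. It only illustrates the difference in table element width; it is not the assembler from rc4-586.pl, the names `DEFINE_RC4`, `rc4_int_*` and `rc4_char_*` are hypothetical, and the key length is assumed to be non-zero.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Generate an RC4 implementation whose state table S[] has element type T.
 * Both instantiations below compute the same keystream; only the memory
 * footprint and access width of the table differ.
 */
#define DEFINE_RC4(NAME, T)                                                   \
    typedef struct { T x, y, S[256]; } NAME##_key;                            \
    static void NAME##_set_key(NAME##_key *k, const uint8_t *key, size_t len) \
    {                                                                         \
        unsigned i, j = 0;                                                    \
        for (i = 0; i < 256; i++)                                             \
            k->S[i] = (T)i;                                                   \
        for (i = 0; i < 256; i++) {                                           \
            T t = k->S[i];                                                    \
            j = (j + t + key[i % len]) & 0xFF;                                \
            k->S[i] = k->S[j];                                                \
            k->S[j] = t;                                                      \
        }                                                                     \
        k->x = k->y = 0;                                                      \
    }                                                                         \
    static void NAME##_crypt(NAME##_key *k, size_t n,                         \
                             const uint8_t *in, uint8_t *out)                 \
    {                                                                         \
        unsigned x = k->x, y = k->y;                                          \
        for (size_t i = 0; i < n; i++) {                                      \
            x = (x + 1) & 0xFF;                                               \
            T tx = k->S[x];                                                   \
            y = (y + tx) & 0xFF;                                              \
            T ty = k->S[y];                                                   \
            k->S[x] = ty;                                                     \
            k->S[y] = tx;                                                     \
            out[i] = in[i] ^ (uint8_t)k->S[(tx + ty) & 0xFF];                 \
        }                                                                     \
        k->x = (T)x;                                                          \
        k->y = (T)y;                                                          \
    }

DEFINE_RC4(rc4_int, uint32_t)  /* 1024-byte table, word-sized accesses */
DEFINE_RC4(rc4_char, uint8_t)  /* 256-byte table, byte-sized accesses  */
```

Which of the two is faster depends on the micro-architecture, which is why the choice is made per CPU rather than once for all targets.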

But it was possible to compose blended code that performs better on AMD cores without hurting "legacy" u-archs such as PIII and Pentium. See http://cvs.openssl.org/rlog?f=openssl/crypto/rc4/asm/rc4-586.pl. With regard to http://etud.epita.fr/~bevand_m/papers/rc4-amd64.html, my final 0.9.8 RC4 performance matrix looks as follows:

Opteron 24x 1.4GHz (32-bit)     197MBps 1.52x 0.9.7 performance
Opteron 24x 1.4GHz (64-bit)     253MBps 3.3x ...
Pentium 4 2.4GHz (32-bit)       249MBps 2.8x ...
Itanium 2 900MHz                209MBps [see the code]

So even if 0.9.7 was "offering some of the fastest implementations," it's no longer the case:-)

Note that 32-bit code delivers almost 80% of "64-bit" performance on AMD. As mentioned in the commentary section of rc4-586.pl, it's actually possible to achieve 224MBps in 32-bit mode, but I chose to stop at the above figure in order to avoid performance deterioration on legacy platforms and to provide the best all-round performance.

It's trivial to back-port the IA-64 and AMD improvements to 0.9.7, but not the P4 ones. Is there interest in this? The IA-64 code can be improved [as discussed in the commentary section], and might be at some later point.

Cheers. A.
OpenSSL Project http://www.openssl.org
Development Mailing List [EMAIL PROTECTED]
Automated List Manager [EMAIL PROTECTED]
