Thus spake "Andy Polyakov" <[EMAIL PROTECTED]>:
>> Ok. How about now?
> Subject to SIGBUS on most platforms. It's easy to get carried away,
> score on x86, and render support for other platforms void, isn't it? I
> mean, do mind unaligned access!
Ah, that may have been why I didn't "fix" that code to use u32. More
likely, it was a happy accident that I inherited the portability of the code
I copied. I certainly introduced a few logic bugs of my own (which were
quickly fixed by others)...
I'm curious if there's a significant performance difference between using
u32 and u64; the former should be portable to all supported platforms,
and may make the latter unnecessary.
> I'd recommend [or even insist on] for (i=0;i<16/sizeof(long);i++)
> loops and let the compiler unroll them: 4x4-byte chunks on 32-bit
> platforms and 2x8-byte chunks on 64-bit ones, without a single shred
> of "#if that-or-that" spaghetti and no unnecessary dependency on the
> totally unrelated bn.h. And once again, unaligned input/output is to
> be treated byte by byte.
My experience is that, for blocks as short as we're discussing here, the
tests for unaligned blocks usually defeat the benefit you get in the aligned
case. Functions like memcpy() generally require a minimum size before they
try any such trickery due to the cost of the test, and 16 bytes is probably
on the edge for most platforms.
If you're using a platform that will transparently handle unaligned access
(either in hardware or software), it's worth it, but IMHO not on code that
has to work on platforms that don't.
>> Plus, if we're going to go that route, we should consider that some
>> platforms have 128-bit XOR support in hardware; is it worth
>> implementing that too?
> Is it really that widely used/important a mode? Enough to justify that
> much extra complexity for so little gain?
I hacked up a version of the AES code a while back that used SSE registers
to pass the blocks around, do bitwise operations, etc. It was faster than
the current version, but (IMHO) not enough to justify adding so much
unportable hackery to the project. If one desperately needs speed, the
existing approach is to use platform-specific asm, and that seems
sufficient.
>> How much of this should be extended to other ciphers? Should xorN()
>> and moveN() be part of the bignum code for reuse in other modules?
> I'd be opposed to this. If performance gets that important, a function
> call will hardly beat inline code anyway. Even if the function is,
> say, 128-bit SSE2 and the inline version is just 4x32-bit. A.
When I find such things useful, I tend to put them in a module's headers as
a static inline function; that gets the speed of a macro with the semantics
and safety of a "real" function. Unfortunately, that approach probably
won't work on all of the platforms OpenSSL supports due to all the ancient
compilers floating around.
S

Stephen Sprunk        "Stupid people surround themselves with smart
CCIE #3723             people. Smart people surround themselves with
K5SSS                  smart people who disagree with them." --Aaron Sorkin
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-dev@openssl.org
Automated List Manager [EMAIL PROTECTED]