Thus spake "Andy Polyakov" <[EMAIL PROTECTED]>
Ok. How about now?

Subject to SIGBUS on most platforms. It's easy to get carried away, score on x86, and render support for other platforms void, isn't it? I mean, do mind unaligned access!

Ah, that may have been why I didn't "fix" that code to use u32. More likely, it was a happy accident that I inherited the portability of the code I copied. I certainly introduced a few logic bugs of my own (which were quickly fixed by others)...

I'm curious if there's a significant performance difference between using u32 and u64; the former should be portable to all supported platforms, and may make the latter unnecessary.

I'd recommend [or even insist on] for (i=0;i<16/sizeof(long);i++) loops and letting the compiler unroll them: 4x4-byte chunks on 32-bit platforms and 2x8-byte chunks on 64-bit ones, without a single shred of "#if that-or-that" spaghetti and no unnecessary dependency on the totally unrelated bn.h. And once again, unaligned input/output is to be treated byte by byte.
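For reference, I take that to mean something like the following (the names are mine, not OpenSSL's actual API, and the word-at-a-time path assumes the caller can guarantee long alignment):

#include <stddef.h>

/* Hypothetical helper: caller must guarantee that out, in and iv are
 * aligned for long access, otherwise this is exactly the SIGBUS case. */
static void xor16_aligned(unsigned char *out, const unsigned char *in,
                          const unsigned char *iv)
{
    size_t i;

    /* 4x4-byte chunks on 32-bit platforms, 2x8-byte chunks on 64-bit ones */
    for (i = 0; i < 16 / sizeof(long); i++)
        ((long *)out)[i] = ((const long *)in)[i] ^ ((const long *)iv)[i];
}

/* Unaligned input/output handled byte by byte, as suggested above. */
static void xor16_bytes(unsigned char *out, const unsigned char *in,
                        const unsigned char *iv)
{
    size_t i;

    for (i = 0; i < 16; i++)
        out[i] = in[i] ^ iv[i];
}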

My experience is that, for blocks as short as we're discussing here, the tests for unaligned blocks usually defeat the benefit you get in the aligned case. Functions like memcpy() generally require a minimum size before they try any such trickery due to the cost of the test, and 16 bytes is probably on the edge for most platforms.

If you're using a platform that will transparently handle unaligned access (either in hardware or software), it's worth it, but IMHO not on code that has to work on platforms that don't.
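To be concrete, the sort of test I mean is something like this (a sketch reusing the hypothetical xor16_aligned/xor16_bytes helpers above); for a 16-byte block the pointer check and branch can cost about as much as the byte loop they're trying to skip:

#include <stddef.h>

/* Hypothetical dispatch; the alignment check itself is the cost in question. */
static void xor16(unsigned char *out, const unsigned char *in,
                  const unsigned char *iv)
{
    if ((((size_t)out | (size_t)in | (size_t)iv) % sizeof(long)) == 0)
        xor16_aligned(out, in, iv);   /* word-at-a-time fast path */
    else
        xor16_bytes(out, in, iv);     /* byte-by-byte fallback */
}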

Plus, if we're going to go that route, we should consider that some platforms have 128-bit XOR support in hardware; is it worth implementing that too?
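For example, with SSE2 intrinsics the whole 16-byte XOR collapses to a single pxor plus unaligned loads and stores; a sketch only, not something I'm actually proposing:

#include <emmintrin.h>

/* SSE2 sketch, hypothetical name: one 128-bit XOR for the whole block. */
static void xor16_sse2(unsigned char *out, const unsigned char *in,
                       const unsigned char *iv)
{
    __m128i a = _mm_loadu_si128((const __m128i *)in);
    __m128i b = _mm_loadu_si128((const __m128i *)iv);

    _mm_storeu_si128((__m128i *)out, _mm_xor_si128(a, b));
}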

Is it really that widely used/important a mode? Enough to justify that much extra complexity for so little gain?

I hacked up a version of the AES code a while back that used SSE registers to pass the blocks around, do bitwise operations, etc. It was faster than the current version, but (IMHO) not enough to justify adding so much unportable hackery to the project. If one desperately needs speed, the existing approach is to use platform-specific asm, and that seems sufficient.

How much of this should be extended to other ciphers?  Should
xorN() and moveN() be part of the bignum code for reuse in other
modules?

I'd be opposed to this. If performance gets that important, a function call will hardly beat inline code anyway, even if the function is, say, 128-bit SSE2 and the inline code is just 4x32-bit. A.

When I find such things useful, I tend to put them in a module's headers as a static inline function; that gets the speed of a macro with the semantics and safety of a "real" function. Unfortunately, that approach probably won't work on all of the platforms OpenSSL supports due to all the ancient compilers floating around.
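Something like the following in a module header is what I have in mind (names hypothetical); the inline keyword itself is exactly the part those ancient compilers choke on:

/* Header-resident helper: macro-like speed, function-like semantics. */
static inline void xor16_hdr(unsigned char *out, const unsigned char *in,
                             const unsigned char *iv)
{
    int i;

    for (i = 0; i < 16; i++)
        out[i] = in[i] ^ iv[i];
}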

S

Stephen Sprunk        "Stupid people surround themselves with smart
CCIE #3723             people.  Smart people surround themselves with
K5SSS                  smart people who disagree with them." --Aaron Sorkin
