The techniques used in this plain v9 implementation are:
1) Use little-endian 32-bit loads when input data is aligned.
2) Avoid having to accumulate into the context hash values every
loop iteration.
3) In the aligned case try to separate the loads from the first
use by as many instructions as possible, without sacrificing
the schedule too much.
4) Attempt to dual-issue as much as possible on UltraSPARC-I/II/III/IV
and SPARC-T4.
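
To make techniques 1 and 2 concrete, here is a minimal C sketch. It is not
the SPARC assembly under discussion and not a real hash: toy_ctx,
load_le32, toy_mix and toy_update are hypothetical names invented for
illustration, and the mixing step is a placeholder. It only shows the shape
of the idea: read input words as little-endian 32-bit values, keep the
chaining values in locals (registers) across all blocks, and write them back
to the context once after the loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context: four 32-bit chaining values. */
typedef struct { uint32_t h[4]; } toy_ctx;

/* Portable little-endian 32-bit load; valid for any alignment.
 * (An aligned fast path could use a single word load instead.) */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Placeholder mixing step, NOT a real hash round. */
static uint32_t toy_mix(uint32_t a, uint32_t b, uint32_t w)
{
    a += (b ^ w) + 0x9e3779b9u;
    return (a << 7) | (a >> 25);          /* rotate left by 7 */
}

/* Process nblocks 16-byte blocks. */
static void toy_update(toy_ctx *ctx, const unsigned char *in, size_t nblocks)
{
    /* Technique 2: read the context once, before the loop. */
    uint32_t a = ctx->h[0], b = ctx->h[1], c = ctx->h[2], d = ctx->h[3];

    while (nblocks--) {
        /* Technique 1 (and 3): issue all loads up front, ahead of first use. */
        uint32_t w0 = load_le32(in);
        uint32_t w1 = load_le32(in + 4);
        uint32_t w2 = load_le32(in + 8);
        uint32_t w3 = load_le32(in + 12);

        /* Saved copies for the per-block feed-forward. */
        uint32_t sa = a, sb = b, sc = c, sd = d;

        a = toy_mix(a, b, w0);
        d = toy_mix(d, a, w1);
        c = toy_mix(c, d, w2);
        b = toy_mix(b, c, w3);

        /* Feed-forward stays in locals; the context is not touched here. */
        a += sa; b += sb; c += sc; d += sd;
        in += 16;
    }

    /* Technique 2: write the context once, after the loop. */
    ctx->h[0] = a; ctx->h[1] = b; ctx->h[2] = c; ctx->h[3] = d;
}
```

Processing two blocks in one call gives the same result as two one-block
calls; the single-call path simply skips the intermediate store and reload of
the context.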
I had an old module lying around and dusted it off in
http://cvs.openssl.org/chngview?cn=22842. It's 20% faster than your
version on pre-Tx UltraSPARC. The improvement is likely to be even
greater on T1, because it keeps everything in the register bank and there
are no loads except for input. Not really relevant, but it's nominally
faster even on T4.
Could you discuss something like this before checking in such
changes instead of just silently dismissing work I've posted?
All right, will do.
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-dev@openssl.org
Automated List Manager majord...@openssl.org