> as I promised, here is the optimized code for SHA-256 hash, x86
> platform. Should work faster on Core 2/iX up to 20%.

I can't replicate the results, not on Intel CPUs. Well, I can get 20% on
Sandy Bridge if I replace rotate with double-precision shift (shrd), but
that's not a fair comparison, in the sense that the same switch would
improve the original code as well. I did observe more than 20% on
Opteron, but on Core2 and Sandy Bridge I get only 13% and 11%
respectively... I've taken one of the ideas, the alternative Maj, and
managed to squeeze out ~5% on Opteron and Sandy Bridge, nothing on Core2
and a whole 13% on Atom, see http://cvs.openssl.org/chngview?cn=22587.
Compared to this updated code, I observe your code being
+20%/+13%/+6%/-18% faster/slower on Opteron/Core2/Sandy Bridge/Atom. So
full unrolling helps, but apparently less so on the most recent CPUs
(modulo the lack of results for AMD Bulldozer and Bobcat). Something to
attempt at some later point... From the Sandy Bridge viewpoint it makes
more sense to arrange a run-time switch to a shrd-based non-unrolled
code path. This way the code size increase would be minimal, while the
performance difference between the tight and fully unrolled loops would
be nominal.
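
For readers who want to see the two tricks mentioned above in isolation,
here is a minimal C sketch. It is not the code from the commit: the
function names (maj_ref, maj_alt, rotr32, shrd32, check_identities) are
made up for illustration, and I am assuming that "alternative Maj" refers
to the identity Maj(a,b,c) = ((a^b)&(b^c))^b. shrd32 is only a C model of
what the x86 shrd instruction computes for shift counts 1..31; whether it
is actually faster than ror depends on the CPU, as discussed above.

```c
#include <assert.h>
#include <stdint.h>

/* Textbook Maj: bitwise majority of a, b and c. */
static uint32_t maj_ref(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) ^ (a & c) ^ (b & c);
}

/* Alternative form (assumed to be the one meant above): where a and b
 * agree the result is b, otherwise it is c. In the SHA-256 round loop
 * the new (b ^ c) equals the previous round's (a ^ b), so that value
 * can be carried over instead of recomputed. */
static uint32_t maj_alt(uint32_t a, uint32_t b, uint32_t c)
{
    return ((a ^ b) & (b ^ c)) ^ b;
}

/* Plain 32-bit rotate right, valid for 0 < n < 32. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* C model of "shrd dst, src, n" for 0 < n < 32: shift dst right by n
 * and fill the vacated high bits with the low bits of src. */
static uint32_t shrd32(uint32_t dst, uint32_t src, unsigned n)
{
    return (dst >> n) | (src << (32 - n));
}

/* Checks both identities on a stream of pseudo-random words. */
static void check_identities(void)
{
    uint32_t x = 0x6a09e667u;
    for (int i = 0; i < 1000; i++) {
        uint32_t a = x;
        uint32_t b = x * 0x9e3779b9u + 1u;
        uint32_t c = rotr32(x, 7) ^ 0xdeadbeefu;

        assert(maj_alt(a, b, c) == maj_ref(a, b, c));

        /* shrd with the same register as both operands is a rotate. */
        for (unsigned n = 1; n < 32; n++)
            assert(shrd32(a, a, n) == rotr32(a, n));

        x = x * 1664525u + 1013904223u; /* simple LCG step */
    }
}
```

Calling check_identities() from a test harness exercises both
equivalences. The point of maj_alt is that it needs one fresh xor and one
and per round once (a ^ b) is kept around from the previous round.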

> I guess
> you should make it PIC, as any other code for x86 (I didn't make it
> because I don't need it in my projects).

Pure code, i.e. code without references to data, is always
position-independent. As you effectively embed the constants into the
instructions, it already is PIC.

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]