> as I promised, here is the optimized code for SHA-256 hash, x86
> platform. Should work faster on Core 2/iX up to 20%.

I can't replicate the results, not on Intel CPUs. I can get 20% on Sandy Bridge if I replace the rotate with a double-precision shift, but that's not a fair comparison, in the sense that the same switch would improve the original code as well. I did observe more than 20% on Opteron, but on Core2/Sandy Bridge I get only 13-11%...

I've taken one of the ideas, the alternative Maj, and managed to squeeze out ~5% on Opteron and Sandy Bridge, nothing on Core2, and a whole 13% on Atom; see http://cvs.openssl.org/chngview?cn=22587. Compared to this updated code, I observe your code being +20%/+13%/+6%/-18% faster/slower on Opteron/Core2/Sandy Bridge/Atom. So the full unroll helps, but apparently less on the most recent CPUs (modulo the lack of results for AMD Bulldozer and Bobcat). Something to attempt at some later point...

From the Sandy Bridge viewpoint it makes more sense to arrange a run-time switch to a shrd-based, non-unrolled code path. This way the increase in code size would be minimal, while the performance difference between the tight loop and the fully unrolled one would be nominal.

> I guess
> you should make it PIC, as any other code for x86 (I didn't make it
> because I don't need it in my projects).

Pure code, i.e. code without references to data, is always position-independent. As you effectively embed the constants into the instructions, it already is PIC.

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
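
For readers unfamiliar with the two tricks mentioned above, here is a minimal C sketch. It is an illustration only, not the code from the patch or from the referenced commit. The message does not spell out which alternative Maj formulation was used, so the one below is an assumption: a commonly cited identity. All function names (maj_ref, maj_alt, rotr32, rotr32_shrd, check_identities) are hypothetical. The second pair of functions models, in portable C, why a double-precision shift with the same register as both operands gives the same result as a rotate.

```c
#include <assert.h>
#include <stdint.h>

/* Textbook SHA-256 majority function. */
static uint32_t maj_ref(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) ^ (a & c) ^ (b & c);
}

/* Assumed alternative form: where a and b agree the result is b,
 * where they differ the result is c. In a round function the
 * term (b ^ c) can be carried over from the previous round's (a ^ b). */
static uint32_t maj_alt(uint32_t a, uint32_t b, uint32_t c)
{
    return ((a ^ b) & (b ^ c)) ^ b;
}

/* Plain 32-bit rotate right. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

/* Model of a double-precision right shift of the pair x:x.
 * Keeping the low 32 bits yields the same value as a rotate. */
static uint32_t rotr32_shrd(uint32_t x, unsigned n)
{
    uint64_t pair = ((uint64_t)x << 32) | x;
    return (uint32_t)(pair >> (n & 31));
}

/* Returns 1 if both identities hold on a pseudo-random sample, else 0. */
static int check_identities(void)
{
    uint32_t s = 0x9e3779b9u;   /* xorshift32 state, arbitrary non-zero seed */
    for (int i = 0; i < 100000; i++) {
        uint32_t v[3];
        for (int k = 0; k < 3; k++) {
            s ^= s << 13;
            s ^= s >> 17;
            s ^= s << 5;
            v[k] = s;
        }
        if (maj_ref(v[0], v[1], v[2]) != maj_alt(v[0], v[1], v[2]))
            return 0;
        for (unsigned n = 0; n < 32; n++)
            if (rotr32(v[0], n) != rotr32_shrd(v[0], n))
                return 0;
    }
    return 1;
}
```

Calling check_identities() from a small main() is enough to confirm that both forms agree. Whether either form is actually faster depends on the CPU, as the numbers above show.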
