Re: SHA-256 implementation improvement

Andy Polyakov Mon, 28 May 2012 12:12:38 -0700

http://cvs.openssl.org/chngview?cn=22599
http://cvs.openssl.org/chngview?cn=22600

For reference. As for the full unroll, I've taken a different approach. Instead of trying to accommodate an additional a-h variable in the freed register, I keep a^b->b^c in a "rotating" pair of registers rather than on the stack. I've also taken the instruction sequences from the folded loop. As a result I get better performance even in cases where your code exhibits a regression; the biggest gap is >40% on Atom. Note that the unrolled loop is executed for inputs >=1024 bytes. If you want to experiment, adjust the $unroll_after variable. I also avoid the unrolled loop on P4 for reasons discussed below.
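To illustrate the a^b->b^c idea, here is a minimal Python sketch, not the actual perlasm. It assumes the trick rests on the identity Maj(a,b,c) = ((a^b) & (b^c)) ^ b, so that the a^b computed in one round can be reused as b^c in the next round, once a-h have rotated. The function names, the starting state and the per-round K+W inputs are made up for the illustration.

```python
MASK = 0xFFFFFFFF  # keep all arithmetic in 32 bits


def rotr(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def big_sigma0(x):
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)


def big_sigma1(x):
    return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25)


def ch(e, f, g):
    return (e & f) ^ (~e & g & MASK)


def maj(a, b, c):
    """Textbook majority function."""
    return (a & b) ^ (a & c) ^ (b & c)


def rounds_reference(state, kw):
    """SHA-256 style rounds with the textbook Maj; kw holds K[i]+W[i] values."""
    a, b, c, d, e, f, g, h = state
    for x in kw:
        t1 = (h + big_sigma1(e) + ch(e, f, g) + x) & MASK
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)


def rounds_rotating_xor(state, kw):
    """Same rounds, but b^c is carried over from the previous round's a^b."""
    a, b, c, d, e, f, g, h = state
    b_xor_c = b ^ c  # computed once, before the first round
    for x in kw:
        a_xor_b = a ^ b
        m = (a_xor_b & b_xor_c) ^ b  # equals Maj(a, b, c)
        t1 = (h + big_sigma1(e) + ch(e, f, g) + x) & MASK
        t2 = (big_sigma0(a) + m) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
        b_xor_c = a_xor_b  # old a, b are the new b, c
    return (a, b, c, d, e, f, g, h)


if __name__ == "__main__":
    # Arbitrary state and round inputs, for the comparison only.
    state = tuple((0x01234567 * (i + 1)) & MASK for i in range(8))
    kw = [(i * 0x9E3779B1 + 0x7F4A7C15) & MASK for i in range(64)]
    print(rounds_reference(state, kw) == rounds_rotating_xor(state, kw))
```

In the sketch the two XOR values simply swap roles each round; in assembly that would correspond to alternating between two registers, so no a^b value has to be spilled to the stack.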


> > Could you retest 1.7 on your P4?


Surely, on Monday, I'll test on Northwood and Prescott.

It appears that I was wrong about "my" P4 being the "initial" version. It's 2.4GHz and has to be Northwood, i.e. "second wave", as well. But there were "better" P4s released later; at least the 64-bit capable ones ought to be of that kind. On "my" P4 I measure 30 cpb for the folded loop and a whole 40[!] cpb for the unrolled one, while you reported an improvement for the unrolled loop. Presumably this is how sensitive it *can* get to larger code size. As for "better" P4s, I've found a 64-bit capable P4 that executes the folded loop in 23.6 cpb and the unrolled one in 19.7. Yet, as P4 is not "hot" anymore, I've chosen to opt for the folded loop. If you want to experiment with my unrolled loop on P4, locate "check for P4" in sha256-586.pl and comment out the following jump instruction.
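To put the cpb figures in perspective, here is a rough Python sketch that converts cycles per byte into throughput and relative speed. It assumes cpb means cycles per processed byte at the nominal clock; the helper names are made up, and only the 2.4GHz machine has a known clock, so throughput is computed for that one alone.

```python
def throughput_bytes_per_sec(clock_hz, cpb):
    """Bytes hashed per second at a given clock and cycles-per-byte figure."""
    return clock_hz / cpb


def relative_speed(cpb_baseline, cpb_candidate):
    """How many times faster the candidate is than the baseline (>1 is faster)."""
    return cpb_baseline / cpb_candidate


if __name__ == "__main__":
    clock = 2_400_000_000  # the 2.4GHz Northwood mentioned above
    print(throughput_bytes_per_sec(clock, 30) / 1e6, "MB/s folded")
    print(throughput_bytes_per_sec(clock, 40) / 1e6, "MB/s unrolled")
    # 64-bit capable P4: unrolled (19.7 cpb) relative to folded (23.6 cpb)
    print(round(relative_speed(23.6, 19.7), 2))
```

By this arithmetic the folded loop on the 2.4GHz part works out to 80 MB/s against 60 MB/s for the unrolled one, while on the 64-bit capable P4 the unrolled loop is roughly 1.2 times faster than the folded one.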

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
