http://cvs.openssl.org/chngview?cn=22599 http://cvs.openssl.org/chngview?cn=22600
For reference. As for the full unroll, I've taken a different approach. Instead of trying to accommodate an additional a-h variable in the freed register, I keep a^b->b^c in a "rotating" pair of registers instead of on the stack. I've also taken the instruction sequences from the folded loop. As a result I get better performance even in the cases where your code exhibits a regression; the biggest gap is >40% on Atom. Note that the unrolled loop is executed only for inputs >=1024 bytes. If you want to experiment, adjust the $unroll_after variable. I also avoid the unrolled loop on P4 for reasons discussed below.
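To illustrate the a^b->b^c idea, here is a minimal sketch in Python rather than perlasm. It assumes the usual identity Maj(a,b,c) = ((a^b) & (b^c)) ^ b, and relies on the fact that after a round's register rotation (c <- b, b <- a) the old a^b is the new b^c, so only one fresh XOR is needed per round. The function names are hypothetical, and random values stand in for the real per-round "new a"; this is not the code from sha256-586.pl.

```python
import random

MASK32 = 0xFFFFFFFF

def maj_reference(a, b, c):
    # Textbook SHA-256 majority function.
    return (a & b) ^ (a & c) ^ (b & c)

def maj_rotating(a, b, c, new_a_values):
    """Yield Maj(a, b, c) per round, reusing the previous round's a^b as b^c."""
    b_xor_c = b ^ c                 # computed once, before the first round
    for new_a in new_a_values:
        a_xor_b = a ^ b             # the only fresh XOR needed this round
        yield (a_xor_b & b_xor_c) ^ b
        # Register rotation of a round: c <- b, b <- a, a <- new value.
        a, b, c = new_a & MASK32, a, b
        b_xor_c = a_xor_b           # old a^b is exactly the new b^c

def check(rounds=64, seed=0):
    """Compare the rotating-pair variant against the textbook formula."""
    rng = random.Random(seed)
    a, b, c = (rng.getrandbits(32) for _ in range(3))
    new_values = [rng.getrandbits(32) for _ in range(rounds)]
    got = list(maj_rotating(a, b, c, new_values))
    expected = []
    for new_a in new_values:
        expected.append(maj_reference(a, b, c))
        a, b, c = new_a, a, b
    return got == expected
```

In assembly the two XOR values would live in two registers that swap roles every round, which is what makes the pair "rotating".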
> > Could you retest 1.7 on your P4?
Surely, on Monday, I'll test on Northwood and Prescott.
It appears that I was wrong about "my" P4 being the "initial" version. It's 2.4GHz and has to be Northwood, i.e. "second wave", as well. But there were "better" P4s released later; at least the 64-bit capable ones ought to be of that kind. On "my" P4 I measure 30 cpb for the folded loop and a whole 40[!] cpb for the unrolled one, while you reported an improvement for the unrolled loop. Presumably this is how sensitive it *can* get to larger code size. As for "better" P4s, I've found a 64-bit capable P4 that executes the folded loop in 23.6 cpb and the unrolled one in 19.7. Yet, as P4 is not "hot" anymore, I've chosen to opt for the folded loop. If you want to experiment with my unrolled loop on P4, locate "check for P4" in sha256-586.pl and comment out the following jump instruction.
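To put the cycles-per-byte (cpb) figures above in perspective, the arithmetic can be sketched as follows. The helper names are hypothetical. The throughput conversion only applies to the 2.4GHz machine, since the clock of the 64-bit capable P4 is not stated, and it treats 1 MB as 10^6 bytes.

```python
def throughput_mb_per_s(clock_hz, cpb):
    # Bytes hashed per second = cycles per second / cycles per byte.
    return clock_hz / cpb / 1e6

def relative_change(folded_cpb, unrolled_cpb):
    # Positive result: the unrolled loop needs more cycles per byte.
    return unrolled_cpb / folded_cpb - 1.0

# 2.4GHz Northwood: folded 30 cpb, unrolled 40 cpb.
northwood_folded = throughput_mb_per_s(2.4e9, 30)
northwood_unrolled = throughput_mb_per_s(2.4e9, 40)
northwood_change = relative_change(30, 40)

# 64-bit capable P4: folded 23.6 cpb, unrolled 19.7 cpb (clock unknown).
p4_64bit_change = relative_change(23.6, 19.7)

print(f"Northwood: {northwood_folded:.0f} vs {northwood_unrolled:.0f} MB/s, "
      f"unrolled {northwood_change:+.1%} cycles")
print(f"64-bit P4: unrolled {p4_64bit_change:+.1%} cycles")
```

So on the Northwood the unrolled loop costs about a third more cycles per byte, while on the 64-bit capable part it saves roughly a sixth.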
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
