> [1] S. Gueron, V. Krasnov: "Parallelizing message schedules to accelerate
> the computations of hash functions", http://eprint.iacr.org/2012/067.pdf
>
> The AVX1 implementation:
> ========================
>
> The speedup (measured on Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz)
> ==================================================================
>
> The speedup offered by this patch (compared to OpenSSL 1.0.1) is:
>
> Up to 1.45X for SHA256
> Up to 1.30X for SHA512
http://cvs.openssl.org/chngview?cn=22648 delivers 1.50x for SHA256 and 1.38x
for SHA512 on Sandy Bridge. This is *without* parallelizing message schedules
across multiple blocks (a.k.a. SMS in the referenced article); instead the
message schedule for a single input block is parallelized.

As for SMS. Based on the number of instructions in the round calculations, I
estimate the performance improvement for 4xSMS SHA256 to be at most 4%. This
is relative to, and based on, the primitives in the new code. The estimate is
[over-]optimistic, as it doesn't account for the additional cost of data
gathering, i.e. the real-life coefficient will definitely be noticeably *less
than 4%*. And it comes with the price tag of a larger temporary memory
requirement, 64 vs. 2048+256 bytes, plus additional effort for implementation
and long-term support. The bottom line is that 4xSMS SHA256 is hardly
justifiable. 2xSMS SHA512 is even less justifiable, because the corresponding
optimistic improvement estimate is 2%.

As for Haswell. As discussed, it's capable of executing 8xSMS SHA256 and
4xSMS SHA512, i.e. loading 8/4 input blocks and pre-processing them
simultaneously. The improvement estimates are much higher, 14% for SHA256 and
20% for SHA512. On the other hand, the processor is also capable of loading
2x data and pre-processing already parallelized schedules simultaneously...
More careful consideration will be given at a later point.

The AMD Bulldozer result is worth discussing. Note that the performance of
the SSSE3 and XOP code paths is equivalent, despite the fact that the XOP
code contains far fewer SIMD instructions. This is because Bulldozer's
execution ports are not functionally equivalent: two ports can execute
*only* integer instructions, and two ports can execute *only* SIMD
instructions. This means that if the proportion of SIMD instructions in the
mixture is low enough, the execution time is fully determined by the integer
instructions, as if the SIMD instructions were not even there.
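The port-binding argument can be sketched with a toy throughput model. The
instruction counts below are made-up placeholders for illustration, not
measurements from the actual sha256 code paths; only the port topology (two
integer-only ports, two SIMD-only ports) follows the description above:

```python
import math

# Toy model of strict port binding: integer and SIMD instructions issue on
# disjoint port groups, so the per-iteration time is bounded by whichever
# domain keeps its ports busy longer.
def cycles(int_insns, simd_insns, int_ports=2, simd_ports=2):
    """Lower-bound cycles for one iteration under strict port binding."""
    return max(math.ceil(int_insns / int_ports),
               math.ceil(simd_insns / simd_ports))

# Hypothetical counts: same integer work, fewer SIMD ops in the XOP path.
ssse3 = cycles(int_insns=20, simd_insns=12)  # SIMD already below the limit
xop   = cycles(int_insns=20, simd_insns=8)   # fewer SIMD ops, same bound

print(ssse3, xop)  # both 10: integer work alone determines the time
```

With these assumed counts, both paths come out at 10 cycles: once the SIMD
work fits under the integer-side bound, removing further SIMD instructions
cannot shorten the iteration, which is the observed SSSE3-vs-XOP equivalence.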
It also means that at this point further reduction of the number of SIMD
instructions won't improve performance, and that is what is observed. This in
turn means that SMS won't make any sense on Bulldozer, because the limit has
obviously been surpassed already in the SSSE3 code.

This can be generalized: if integer *ILP* is limited by the algorithm (as
opposed to limited resources, as in the Bulldozer case), then reducing the
number of SIMD instructions below some limit likewise won't affect
performance. Indeed, consider for example three unified execution ports, with
only two being requested for integer operations per cycle... While this
hardly applies to SHA256/SHA512, it might partially be the case for SHA1.
Consider the "best-case" body_20_39 in sha1-x86_64.pl, which is 8
instructions long and has a 3-cycle critical path. This means that the 3
arithmetic/logical execution ports found in Core CPUs can't be fully utilized
*all the time*. I'm not saying that the SIMD instructions will "disappear",
but at least reducing their number won't have the *expected* effect. Even
less so on Ivy Bridge [and presumably Haswell], because zero-latency move
instructions have to be discounted, i.e. body_20_39 would effectively be 7
instructions long...

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org