>  [1] S. Gueron, V. Krasnov: "Parallelizing message schedules to accelerate the
>  computations of hash functions", http://eprint.iacr.org/2012/067.pdf         
>         
> 
>  The AVX1 implementation:
>  ========================
> 
>  The speedup (measured on Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz)
>  ==================================================================
> 
>  The speedup offered by this patch (compared to OpenSSL 1.0.1) is:
> 
>      Up to 1.45X for SHA256 
>      Up to 1.30X for SHA512 

http://cvs.openssl.org/chngview?cn=22648 delivers 1.50x for SHA256 and 
1.38x for SHA512 on Sandy Bridge. This is *without* parallelizing 
message schedules for multiple blocks (a.k.a. SMS in the referred 
article); instead the message schedule for a single input block is 
parallelized. As for SMS: based on the number of instructions in the 
round calculations, I estimate the performance improvement for 4xSMS 
SHA256 to be at most 4%. This is relative to, and based on, the 
primitives in the new code. The estimate is [over-]optimistic, as it 
doesn't account for the additional cost of data gathering, i.e. the 
real-life coefficient will definitely be noticeably *less than 4%*. And 
it comes with the price tag of a larger temporary memory requirement, 
64 vs. 2048+256 bytes, plus additional implementation and long-term 
support effort. Bottom line: 4xSMS SHA256 is hardly justifiable. 2xSMS 
SHA512 is even less justifiable, because the corresponding optimistic 
improvement estimate is 2%.
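For readers not steeped in the code: the "message schedule" in question is the standard SHA-256 expansion of a 16-word block into 64 words. A minimal scalar sketch in C (not the perlasm in the patch, just the recurrence both approaches vectorize) shows why: W[t] depends on W[t-2], so parallelism *within* one block is limited, while SMS runs several blocks' schedules side by side at the cost of keeping a full W array per block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate right, as used by the SHA-256 sigma functions. */
static uint32_t ror32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

static uint32_t sigma0(uint32_t x) { return ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3); }
static uint32_t sigma1(uint32_t x) { return ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10); }

/* Expand one 16-word input block into the 64-word schedule W.
 * The W[t-2] dependency caps parallelism within a single block;
 * SMS instead interleaves the schedules of several blocks, which
 * is what drives the larger temporary-memory requirement. */
static void sha256_schedule(const uint32_t m[16], uint32_t W[64])
{
    memcpy(W, m, 16 * sizeof(uint32_t));
    for (int t = 16; t < 64; t++)
        W[t] = sigma1(W[t - 2]) + W[t - 7] + sigma0(W[t - 15]) + W[t - 16];
}
```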

As for Haswell: as discussed, it's capable of executing 8xSMS SHA256 
and 4xSMS SHA512, i.e. loading 8/4 input blocks and pre-processing them 
simultaneously. The improvement estimates are much higher, 14% for 
SHA256 and 20% for SHA512. On the other hand, the processor is also 
capable of loading 2x data and pre-processing already parallelized 
schedules simultaneously... More careful consideration will be given at 
a later point.

It's worth discussing the AMD Bulldozer result. Note that performance 
for the SSSE3 and XOP code paths is equivalent, despite the fact that 
the number of SIMD instructions is much lower in the XOP code. This is 
because Bulldozer execution ports are not functionally equivalent: 
there are two ports that can execute *only* integer instructions, and 
two ports that can execute *only* SIMD instructions. This means that if 
the proportion of SIMD instructions in the mixture is low enough, the 
execution time is fully determined by the integer instructions, as if 
the SIMD instructions were not even there. It also means that at that 
point further reducing the number of SIMD instructions won't improve 
performance. And that is what is observed. This in turn means that SMS 
won't make any sense on Bulldozer, because the limit is obviously 
already surpassed in the SSSE3 code.
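The argument can be put as a toy steady-state bound. With dedicated 
ports, time is roughly max(integer work / integer ports, SIMD work / 
SIMD ports); once the SIMD side drops below the integer bound, shaving 
more SIMD instructions changes nothing. A deliberately simplistic 
sketch (hypothetical instruction counts, latencies and dependencies 
ignored) of that model:

```c
#include <assert.h>

/* Toy throughput lower bound for a Bulldozer-like core with two
 * integer-only ports and two SIMD-only ports. Each port retires at
 * most one instruction per cycle; latency and dependency chains are
 * deliberately ignored -- this is illustrative, not a simulator. */
static int cycles_lower_bound(int int_insns, int simd_insns)
{
    int int_cycles  = (int_insns  + 1) / 2;  /* 2 integer ports */
    int simd_cycles = (simd_insns + 1) / 2;  /* 2 SIMD ports */
    return int_cycles > simd_cycles ? int_cycles : simd_cycles;
}
```

With, say, 40 integer instructions per round group, any SIMD count at 
or below 40 yields the same bound, which is the SSSE3-vs-XOP tie in 
miniature.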

This can be generalized: if integer *ILP* is limited by the algorithm 
(as opposed to by limited resources, as in the Bulldozer case), then 
reducing the number of SIMD instructions below some limit likewise 
won't affect performance. Indeed, consider for example three unified 
execution ports, with only two of them requested for integer operations 
per cycle... While this is hardly applicable to SHA256/SHA512, it might 
be partially the case for SHA1. Consider the "best-case" body_20_39 in 
sha1-x86_64.pl, which is 8 instructions long and has a 3-cycle critical 
path. This means that the 3 arithmetic/logical execution ports found in 
Core CPUs can't be fully utilized *all the time*. I'm not saying that 
SIMD instructions will "disappear", but at least reducing their number 
won't have the *expected* effect. Even less so on Ivy Bridge [and 
presumably Haswell], because zero-latency move instructions will have 
to be discounted, i.e. body_20_39 would effectively be 7 instructions 
long...
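For reference, what body_20_39 computes is the standard XOR-parity 
SHA-1 round for t = 20..39. A plain-C restatement (the recurrence, not 
the actual perlasm) makes the dependency structure visible: of the 
round's instructions, only the additions feeding the next round's a 
sit on the serial chain, which is where the 3-cycle critical path 
comes from.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rol32(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }

/* One SHA-1 round for t = 20..39, the rounds body_20_39 implements.
 * The updates to e (and the subsequent renaming) form the serial
 * dependency chain; b's rotation and the schedule word w are off the
 * critical path and can execute in parallel with it. */
static void sha1_round_20_39(uint32_t *a, uint32_t *b, uint32_t *c,
                             uint32_t *d, uint32_t *e, uint32_t w)
{
    const uint32_t K = 0x6ED9EBA1;               /* round constant for t = 20..39 */
    *e += rol32(*a, 5) + (*b ^ *c ^ *d) + w + K; /* the serial additions */
    *b  = rol32(*b, 30);
    /* rename working variables: (a,b,c,d,e) <- (e,a,b,c,d) */
    uint32_t t = *e; *e = *d; *d = *c; *c = *b; *b = *a; *a = t;
}
```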


______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org
