Interleaved below are my results translated to your units, i.e. simply multiplied by 64 and rounded to three significant digits.

                    1.5    1.6    1.7    1.8    my
P III (Coppermine) 1821 / 1850 / 1742 / 1574 / 1614
                                          1540
P4 (Prescott)      1544 / 1546 / 1541 / 1375 / 1450
                                          1510
P4 (Northwood)     2200 / 1963 / 1931 / 2483 / 1957
                                          1920
AMD Sempron        1537 / 1450 / 1394 / 1205 / 1305
                                          n/a
AMD K10            1270 / 1210 / 1215 / 988  / 1057
                                          990
Core 2             1170 / 1131 / 1130 / 985  / 984
                                          1010
i5 Lynnfield       1250 / 1426 / 1271 / 1121 / 1033
                                          1100
Sandy Bridge       1265 / 1225 / 1228 / 1115 / 981 (*) with shrd
                                          1010 (folded loop with shrd)
Atom               2300 / 2050 / 1984 / 1700 / 2455
                                          1660
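
For reference, the unit conversion used for the interleaved numbers (multiply by 64, round to three significant digits) can be sketched as below. This is only an illustration: the function name and the sample inputs are made up, and reading the factor of 64 as a block size in bytes is an assumption.

```python
from math import floor, log10

BLOCK_FACTOR = 64  # factor named in the message; assumed to be bytes per block


def to_your_units(value, sig=3):
    """Multiply by 64 and round to `sig` significant digits."""
    scaled = value * BLOCK_FACTOR
    if scaled == 0:
        return 0.0
    # Number of decimal places that leaves `sig` significant digits.
    digits = sig - 1 - floor(log10(abs(scaled)))
    return round(scaled, digits)


if __name__ == "__main__":
    # Hypothetical inputs, not measurements from this thread.
    for sample in (24.06, 15.47):
        print(sample, "->", to_your_units(sample))
```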

Results are consistent except for P4, Core 2 and Sandy Bridge.

As for P4, it's probably best to just shrug, accept whatever the result is and forget about it. That's a bit hard to accept, but it's hardly worth figuring out why our results vary that much.

As for Core 2: the difference is nominal, and if I execute my binary with a varying stack seed(*) I can also measure 990 cycles per block. In other words, the variation can be explained by environmental factors such as cache contention.

As for Sandy Bridge: I don't know... I could observe nominal variations of 2-3% on my machine, but nothing close to 10%, so this is odd... If you have the energy, test with a varying stack seed(*)...

(*) Because environment variables reside below the stack, the simplest way to reseed the stack is to run

    env A=`perl -e 'print "A"x1024'` ...

and experiment with the number after x.
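
If it helps, the same experiment can be scripted so that several padding sizes are tried in one go. This is a rough sketch only: the helper names are made up, and the command in the example is a stand-in that just reports the padding length, to be replaced with the actual benchmark binary.

```python
import os
import subprocess
import sys


def padded_env(pad_bytes, name="A"):
    """Copy the current environment and add one variable of pad_bytes bytes."""
    env = dict(os.environ)
    env[name] = "A" * pad_bytes
    return env


def run_with_padding(cmd, pad_sizes):
    """Run cmd once per padding size and return {pad_bytes: stdout}."""
    results = {}
    for pad in pad_sizes:
        proc = subprocess.run(cmd, env=padded_env(pad),
                              capture_output=True, text=True, check=True)
        results[pad] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    # Stand-in command; substitute the benchmark binary and its arguments.
    cmd = [sys.executable, "-c", "import os; print(len(os.environ['A']))"]
    for pad, out in run_with_padding(cmd, [256, 1024, 4096]).items():
        print(pad, out)
```

Each run gets a different amount of environment data, which shifts where the stack starts, so comparing the reported cycle counts across sizes shows how much of the variation is environmental.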

> So, 1.8 version is quite good. It's the best for almost all old/slow
> architectures, and my version is still the best for modern/powerful ones.

Come on, apart from your Sandy Bridge result for 1.8, they're virtually equivalent. The nominal difference can be explained by environmental factors, and if not, it's a really low price to pay for a >40% improvement on Atom. Besides, it's actually the "slow" architectures that need optimization more :-)
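
To put numbers on that, here is one way to read the table, assuming lower cycle counts are better and the percentage is taken relative to the faster variant (the function name is made up; the inputs are the Atom and Core 2 rows above):

```python
def extra_cycles_pct(slower, faster):
    """Percentage of extra cycles `slower` needs relative to `faster`."""
    return (slower / faster - 1) * 100


if __name__ == "__main__":
    # Atom: "my" column vs. the 1.8 column, then vs. the interleaved figure.
    print(round(extra_cycles_pct(2455, 1700), 1))  # 44.4
    print(round(extra_cycles_pct(2455, 1660), 1))  # 47.9
    # Core 2: "my" column vs. the 1.8 column.
    print(round(extra_cycles_pct(984, 985), 1))    # -0.1
```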

Cheers.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
