>> ... it's however a little bit faster.
>> It means the optimization still exist, and I think there should be more
>> gain if this code is used in OpenSSL SSSE3 codepath (I use completely
>> different SSSE3 code generation that is possibly less effective).
>
> Not necessarily. Out-of-order execution logic can be forgiving (and
> current Intel out-of-order logic *is* forgiving enough) in sense that
> different sequences, even ones considered "less effective", can be
> executed equally fast. Sometimes it's more about sheer amount of u-ops
> than specific sequence of machine codes, and this case is very much like
> this.

The optimization does indeed exist: it reduces the number of u-ops by 2%
(I had overlooked this). If the execution logic is forgiving enough, that
translates into a 2% performance improvement.
http://cvs.openssl.org/chngview?cn=22725 effectively shows that Core2
(minor regression in 64-bit code), Sandy Bridge (no effect in 32-bit
mode(*)) and VIA Nano (minor regression in 32-bit code) are not always
forgiving. Elsewhere the improvement is really minor, but once the work
is done it hardly makes sense to leave it unused.

(*) The 32-bit AVX result changed from 5.1 to 5.2, but 5.1 must have been
a typo, because I couldn't reproduce it, while the relative improvement,
70%, was reproducible.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
