Hi,

> ... This code runs faster on Core 2, Core iX, AMD K-10 and possibly
> other processors.
But since contemporary processors are SSSE3-capable, it makes more sense
to benchmark *older* processors when evaluating integer-only
optimizations. And the trouble is that the suggested code does not run
faster on P4, PIII or Pentium, the worst case being P4, a whole 15%
slower. It is a tad faster on Opteron, but only marginally, I measured
1%, which means that the trade-off is not in the suggested code's
favour.

> (Also, if X array setup is inserted into 0-15 rounds, the code is yet
> faster.)
>
> I also use this codepath in the ssse3 implementation, which results in
> 6.2 and 4.9 (!) cycles per byte on Core 2 and Core i5-750 respectively,

How do you measure on i5? Specifically, is so-called Turbo Boost off or
on? And if it's on, do you compensate for it? I mean 4.9 does sound
impressive, but at the same time it sounds too good (taking into
consideration that "jumping $B" *is* implemented in the current SSSE3
code path), hence the question.

For reference, all results mentioned in the commentaries are collected
at a *fixed* CPU frequency and obtained by dividing this frequency by
the 'openssl speed' result for the largest block size (a rough sketch of
both methods follows at the end of this message). This was compared to
an RDTSC-based method on earlier occasions and was found to give
adequate results. As for RDTSC, keep in mind that on *contemporary
processors* it returns readings of an invariant, fixed-rate oscillator,
i.e. as the actual CPU frequency increases, measured intervals decrease
and the derived cycle counts understate the core cycles really spent.

> but it can never beat the current Sandy Bridge implementation - the
> best I've got is 5.8.

Sandy Bridge is a quite strange processor ;-) Its strangeness is
discussed in the commentary to the AVX code. It's the rotate instruction
that is responsible...
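For illustration, here is a rough sketch of the two measurement
approaches discussed above. It is only an assumption-laden example, not
code from this thread: the 2.66 GHz frequency and the 430 MB/s
throughput are made-up placeholders, and SHA1() from libcrypto merely
stands in for whatever routine is actually being benchmarked. It assumes
a GCC- or Clang-compatible x86-64 toolchain (for __rdtsc() from
<x86intrin.h>) and links against -lcrypto.

/*
 * Sketch of the two cycles-per-byte estimates discussed above.
 * All figures marked "hypothetical" are placeholders, not results.
 * Build with e.g.:  cc -O2 sketch.c -lcrypto
 */
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>          /* __rdtsc() */
#include <openssl/sha.h>        /* SHA1(), used only as a stand-in workload */

int main(void)
{
    /* 1. 'openssl speed' method: divide a *fixed* CPU frequency by the
     *    reported throughput for the largest block size.                */
    double cpu_hz        = 2.66e9;   /* CPU pinned at 2.66 GHz (hypothetical) */
    double bytes_per_sec = 430.0e6;  /* e.g. 'openssl speed -evp sha1',
                                        8192-byte blocks (hypothetical)  */
    printf("speed-based estimate: %.2f cycles/byte\n",
           cpu_hz / bytes_per_sec);

    /* 2. RDTSC method: on contemporary CPUs the TSC is an invariant,
     *    fixed-rate oscillator, so (t1 - t0) tracks wall-clock time
     *    rather than core cycles.  If Turbo Boost raises the core clock,
     *    the same work takes fewer TSC ticks and the result looks better
     *    than the core cycles actually spent.                           */
    static unsigned char buf[8192];
    unsigned char md[SHA_DIGEST_LENGTH];
    memset(buf, 0xa5, sizeof(buf));

    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < 1000; i++)
        SHA1(buf, sizeof(buf), md);          /* placeholder workload */
    unsigned long long t1 = __rdtsc();

    printf("TSC-based estimate:   %.2f ticks/byte (not necessarily core cycles)\n",
           (double)(t1 - t0) / (1000.0 * sizeof(buf)));
    return 0;
}

With Turbo Boost left on, both figures come out lower than the cycles
the core actually spends, which is exactly why the 4.9 number invites
the question above.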
