Hi,

> ... This code runs
> faster on Core 2, Core iX, AMD K-10 and possibly other processors.

But since contemporary processors are SSSE3-capable, it makes more
sense to benchmark *older* processors when evaluating integer-only
optimizations. And the trouble is that the suggested code does not run
faster on P4, PIII or Pentium, the worst case being P4, where it is a
whole 15% slower. It is a tad faster on Opteron, but only marginally:
I measured 1%, which means the trade-off is not in the suggested
code's favour.

> (Also, if the X array setup is inserted into rounds 0-15, the code
> is faster still.)
> 
> I also use this code path in the ssse3 implementation, which results
> in 6.2 and 4.9 (!) cycles per byte on Core 2 and Core i5-750
> respectively,

How do you measure on i5? Specifically, is the so-called Turbo Boost
off or on? And if it is on, do you compensate for it? I mean 4.9 does
sound impressive, but at the same time it sounds too good (taking into
consideration that "jumping $B" *is* implemented in the current SSSE3
code path), hence the question. For reference, all results mentioned
in the commentaries are collected at a *fixed* CPU frequency and
obtained by dividing that frequency by the 'openssl speed' result for
the largest block size. This was compared to an RDTSC-based method on
earlier occasions and was found to provide adequate results. As for
RDTSC, keep in mind that on *contemporary processors* it returns
readings of an invariant oscillator, i.e. as the CPU frequency
increases, the measured intervals decrease.
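
To make that arithmetic concrete, here is a minimal sketch of the
calculation; the frequency and throughput below are made-up
placeholders, not measurements, and 8192 bytes is assumed to be the
largest 'openssl speed' block size:

#include <stdio.h>

int main(void)
{
    /* Sketch of the cycles-per-byte arithmetic described above.
     * Both numbers are hypothetical placeholders:
     *   cpu_hz      - the *fixed* core clock the machine is pinned to;
     *   bytes_per_s - the 'openssl speed' throughput for the largest
     *                 block size (assumed to be the 8192-byte column). */
    double cpu_hz      = 2.83e9;
    double bytes_per_s = 456.0e6;

    printf("%.2f cycles per byte\n", cpu_hz / bytes_per_s);
    return 0;
}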

> but it can never beat the current Sandy Bridge implementation - the
> best I've got is 5.8. Sandy Bridge is quite a strange processor ;-)

Its strangeness is discussed in the commentary to the AVX code. It's
the rotate instruction that is responsible...
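
For context, the rotate in question is the 32-bit rotation used
throughout the SHA round functions; a generic C rendering, not the
actual OpenSSL code, looks like this:

#include <stdint.h>
#include <stdio.h>

/* Generic 32-bit left-rotate as used in SHA-1/SHA-256 round code;
 * n must be in the range 1..31.  Compilers normally map this pattern
 * to a single rotate instruction on x86, which is the operation the
 * Sandy Bridge remark above refers to. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

int main(void)
{
    /* e.g. the ROTL(a,5) step of a SHA-1 round */
    printf("%08x\n", rotl32(0x67452301u, 5));
    return 0;
}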