Re: Linus' sha1 is much faster!

2009-08-17 Thread Steven Noonan
On Mon, Aug 17, 2009 at 3:51 AM, Giuseppe Scrivano <gscriv...@gnu.org> wrote:
 Pádraig Brady <p...@draigbrady.com> writes:

   -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
   -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586
   -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1

 Thanks.  I reran all the tests on my machine using these same options.
 I repeated each test 6 times and took the median, disregarding the
 first result.  Apart from that discarded first run, I didn't see much
 variance between results of the same test.
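
The procedure described above (six runs, the first discarded as a warm-up, median of the rest) can be sketched as follows. This is a minimal illustration, not the script used in this thread: `median_excluding_first` and `time_runs` are hypothetical names, and the workload is timed in-process rather than with the shell's `time`.

```python
import statistics
import time


def median_excluding_first(samples):
    """Median of all samples except the first (treated as a warm-up run)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return statistics.median(samples[1:])


def time_runs(workload, runs=6):
    """Call workload() `runs` times; return the wall-clock seconds of each call."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    # Hypothetical workload standing in for one sha1sum invocation.
    samples = time_runs(lambda: sum(range(100_000)))
    print("median of runs 2..6: %.6f s" % median_excluding_first(samples))
```

Dropping the first run keeps cold-cache effects out of the figure, and the median is less sensitive to a single outlier than the mean.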


 gcc 4.3.3

 gnulib sha1:            real    0m2.543s
 gnulib sha1 lookup:     real    0m1.906s (-25%)
 linus's sha1:           real    0m2.468s (-3%)
 linus's sha1 no asm:    real    0m2.289s (-9%)


 gcc 4.4.1

 gnulib sha1:            real    0m3.386s
 gnulib sha1 lookup:     real    0m3.110s (-8%)
 linus's sha1:           real    0m1.701s (-49%)
 linus's sha1 no asm:    real    0m1.284s (-62%)
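
The percentages in parentheses are relative to the plain gnulib sha1 time for the same compiler. A small sketch of that arithmetic, using the timings from the two tables above (`relative_change` and `RESULTS` are names chosen here for illustration):

```python
def relative_change(baseline_s, candidate_s):
    """Signed percentage change of the candidate time relative to the baseline."""
    return 100.0 * (candidate_s - baseline_s) / baseline_s


# Timings in seconds, copied from the tables above.
RESULTS = {
    "gcc 4.3.3": {
        "gnulib sha1": 2.543,
        "gnulib sha1 lookup": 1.906,
        "linus's sha1": 2.468,
        "linus's sha1 no asm": 2.289,
    },
    "gcc 4.4.1": {
        "gnulib sha1": 3.386,
        "gnulib sha1 lookup": 3.110,
        "linus's sha1": 1.701,
        "linus's sha1 no asm": 1.284,
    },
}

if __name__ == "__main__":
    for compiler, rows in RESULTS.items():
        baseline = rows["gnulib sha1"]
        for name, seconds in rows.items():
            change = relative_change(baseline, seconds)
            print("%s  %-22s %.3fs  %+.1f%%" % (compiler, name, seconds, change))
```

Rounded to the nearest integer, each computed value is within one percentage point of the figure posted.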


 I don't see differences in the asm generated by gcc 4.4.1 and gcc
 4.3.3 big enough to explain this performance gap.  What I noticed
 immediately is that the gcc-4.4 generated asm has more lea instructions
 (+30%), but I doubt this is the reason for these poor results.  Anyway, I
 haven't looked into the details yet.

 Cheers,
 Giuseppe

Interesting. I compared Linus' implementation to the public domain one
by Steve Reid[1], which is used in OpenLDAP and a few other projects.
Anyone with some experience testing these kinds of things in a
statistically sound manner want to try it out? In my tests, I got
this:

(average of 5 runs)
Linus' sha1: 283MB/s
Steve Reid's sha1: 305MB/s

- Steven

[1] http://gpl.nas-central.org/SYNOLOGY/x07-series/514_UNTARED/source/openldap-2.3.11/libraries/liblutil/sha1.c




Re: Linus' sha1 is much faster!

2009-08-17 Thread Steven Noonan
On Mon, Aug 17, 2009 at 9:22 AM, Linus Torvalds <torva...@linux-foundation.org> wrote:


 On Mon, 17 Aug 2009, Steven Noonan wrote:

 Interesting. I compared Linus' implementation to the public domain one
 by Steve Reid[1]

 You _really_ need to talk about what kind of environment you have.

 There are three major issues:
  - Netburst vs non-netburst
  - 32-bit vs 64-bit
  - compiler version

Right. I'm running a Core 2 Merom 2.33GHz. The code was compiled for
x86_64 with GCC 4.2.1. I didn't _expect_ it to compile for x86_64, but
apparently the version of GCC that ships with Xcode 3.2 defaults to
compiling 64-bit code on machines that are capable of running it.


 Steve Reid's code looks great, but the way it is coded, gcc makes a mess
 of it, which is exactly what my SHA1 tries to avoid.

 [ In contrast, gcc does very well on just about _any_ straightforward
  unrolled SHA1 C code if the target architecture is something like PPC or
  ia64 that has enough registers to keep it all in registers.

  I haven't really tested other compilers - a less aggressive compiler
  would actually do _better_ on SHA1, because the problem with gcc is that
  it turns the whole temporary 16-entry word array into register accesses,
  and tries to do register allocation on that _array_.

  That is wonderful for the above-mentioned PPC and IA64, but it makes gcc
  create totally crazy code when there aren't enough registers, and then
  gcc starts spilling randomly (ie it starts spilling a-e etc). This is
  why the compiler and version matters so much. ]
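
For readers unfamiliar with the "temporary 16-entry word array": SHA-1 expands each 64-byte block into 80 words, but only the most recent 16 are ever needed, so implementations typically keep a 16-word ring alongside the five working variables a-e. The sketch below shows that access pattern in Python. It is an illustration written for this note, not Linus' or Steve Reid's C code, and Python cannot show the register allocation issue itself; it only makes visible how many values (16 words plus a-e and temporaries) are live at once, which is what a compiler targeting a register-starved architecture has to juggle.

```python
import struct


def _rol(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def sha1_compress(state, block):
    """Process one 64-byte block, keeping the message schedule in a 16-word ring."""
    w = list(struct.unpack(">16I", block))  # the temporary 16-entry word array
    a, b, c, d, e = state
    for t in range(80):
        if t >= 16:
            # W[t] overwrites W[t-16], so only 16 schedule words are ever live.
            w[t & 15] = _rol(w[(t - 3) & 15] ^ w[(t - 8) & 15]
                             ^ w[(t - 14) & 15] ^ w[t & 15], 1)
        if t < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif t < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif t < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        tmp = (_rol(a, 5) + f + e + k + w[t & 15]) & 0xFFFFFFFF
        a, b, c, d, e = tmp, a, _rol(b, 30), c, d
    return tuple((s + v) & 0xFFFFFFFF for s, v in zip(state, (a, b, c, d, e)))


def sha1_hex(data):
    """SHA-1 of a bytes object, returned as a hex string."""
    state = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
    # Pad with 0x80, zeros, and the 64-bit big-endian bit length.
    padded = (data + b"\x80" + b"\x00" * ((55 - len(data)) % 64)
              + struct.pack(">Q", 8 * len(data)))
    for i in range(0, len(padded), 64):
        state = sha1_compress(state, padded[i:i + 64])
    return "".join("%08x" % s for s in state)


if __name__ == "__main__":
    print(sha1_hex(b"abc"))
```

In unrolled C the ring indices become compile-time constants, which is what lets a compiler treat each of the 16 slots as a separate candidate for a register.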

 (average of 5 runs)
 Linus' sha1: 283MB/s
 Steve Reid's sha1: 305MB/s

 So I get very different results:

        #             TIME[s] SPEED[MB/s]
        Reid            2.742       222.6
        linus           1.464         417
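
The TIME and SPEED columns are tied together by the input size (speed = size / time). As a sanity check, and assuming both rows hashed the same input (the message does not state its size), multiplying the columns back should give about the same size for each row. The helper names below are chosen here for illustration.

```python
def throughput_mb_s(size_mb, seconds):
    """Throughput in MB/s for hashing size_mb megabytes in the given time."""
    return size_mb / seconds


def implied_size_mb(speed_mb_s, seconds):
    """Input size in MB implied by a reported speed and elapsed time."""
    return speed_mb_s * seconds


if __name__ == "__main__":
    # (time in seconds, speed in MB/s) as posted in the table above.
    rows = {"Reid": (2.742, 222.6), "linus": (1.464, 417.0)}
    for name, (seconds, speed) in rows.items():
        print("%-6s implied input size: %.1f MB" % (name, implied_size_mb(speed, seconds)))
    print("time ratio Reid/linus: %.2f" % (rows["Reid"][0] / rows["linus"][0]))
```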

Added -m32:

Steve Reid: 156MB/s
Linus: 209MB/s

So on x86, your code really kicks butt.

 this is Intel Nehalem, but compiled for 32-bit mode (which is the more
 challenging one because x86-32 only has 7 general-purpose registers), and
 with gcc-4.4.0.

                        Linus