Re: Linus' sha1 is much faster!

2009-08-17 Thread Linus Torvalds


On Mon, 17 Aug 2009, Steven Noonan wrote:
 
 Interesting. I compared Linus' implementation to the public domain one
 by Steve Reid[1]

You _really_ need to talk about what kind of environment you have.

There are three major issues:
 - Netburst vs non-netburst
 - 32-bit vs 64-bit
 - compiler version

Steve Reid's code looks great, but the way it is coded, gcc makes a mess 
of it, which is exactly what my SHA1 tries to avoid.

[ In contrast, gcc does very well on just about _any_ straightforward 
  unrolled SHA1 C code if the target architecture is something like PPC or 
  ia64 that has enough registers to keep it all in registers.

  I haven't really tested other compilers - a less aggressive compiler 
  would actually do _better_ on SHA1, because the problem with gcc is that 
  it turns the whole temporary 16-entry word array into register accesses, 
  and tries to do register allocation on that _array_.

  That is wonderful for the above-mentioned PPC and IA64, but it makes gcc 
  create totally crazy code when there aren't enough registers, and then 
  gcc starts spilling randomly (ie it starts spilling a-e etc). This is 
  why the compiler and version matters so much. ]

 (average of 5 runs)
 Linus' sha1: 283MB/s
 Steve Reid's sha1: 305MB/s

So I get very different results:

# TIME[s] SPEED[MB/s]
Reid2.742   222.6
linus   1.464 417

this is Intel Nehalem, but compiled for 32-bit mode (which is the more 
challenging one because x86-32 only has 7 general-purpose registers), and 
with gcc-4.4.0.

Linus




Re: Linus' sha1 is much faster!

2009-08-16 Thread Linus Torvalds


On Sun, 16 Aug 2009, Giuseppe Scrivano wrote:
 
 My GCC version is gcc (Debian 4.3.3-14) 4.3.3 and the CPU is: Intel(R)
 Pentium(R) D CPU 3.20GHz.

Netburst is very sensitive to random spill effects, and you can basically 
tune things by just code shuffling that just has random effects on the 
generated asm code.

 I also spent some time trying to improve the gnulib SHA1 implementation
 and it seems a lookup table can improve things a bit.

I pretty much can guarantee you that it improves things only because it 
makes gcc generate crap code, which then hides some of the P4 issues.

I'd also suggest you try gcc-4.4, since that apparently fixes some of the 
oddest spill issues.

Linus




Re: Linus' sha1 is much faster!

2009-08-16 Thread Linus Torvalds


On Mon, 17 Aug 2009, Giuseppe Scrivano wrote:
 
 Thanks for the hint.  I tried gcc-4.4 and it produces slower code than
 4.3 on the gnulib SHA1 implementation and my patch makes it even more!

Check out the asm, see if you can see why. One of the most common problems 
with P4's is literally that you end up loading from the same stack slot 
that you just stored to (gcc can do some really crazy spills), and that 
causes a store buffer hazard replay.

My personal opinion is that Netburst is useless for trying to optimize C 
code for. It's just too random.

 I noticed that on my machine your implementation is ~30-40% faster using
 SHA_ROT for rol/ror instructions than inline assembly, at least with the
 test-case Pádraig wrote.  Am I the only one reporting it?

I bet it's the same thing. Small perturbations of the source causing small 
changes to register allocation and thus spilling, and then Netburst goes 
crazy one way or another. It's interestign trying to fix it, and very 
frustrating.

My workstation is a Nehalem (but Core 2 will have pretty much the same 
behavior), and it doesn't have the crazy netburst behavior. Shorter and 
simpler code generally performs better (which is _not_ true on Netburst). 

On my machine, for example, forcing gcc to do those rotates on registers 
is the difference between ~381MB/s and 415MB/s. And that's mainly because 
it makes gcc keep A-E in registers, rather than trying to cache the 
array[] references.

Linus




Re: Linus' sha1 is much faster!

2009-08-15 Thread Linus Torvalds


On Sat, 15 Aug 2009, Linus Torvalds wrote:
 
 That said, I don't know if the MPL is ok for X11. I've not looked at 
 compatibility issues with MPL. For git, we could just ignore the MPL, 
 since the GPLv2 was acceptable regardless of it.

If MPL isn't ok for X11, then we'd need to make sure that even the 
silliest Mozilla crud has been rewritten. There really isn't much, but 
hey, the _history_ is based on the mozilla code, and who knows - the 
'blk_SHA_CTX' struct has things like the fields in the same order as the 
Mozilla equivalent, for all those historical reasons.

(Heh. Looking at that, I probably should move the 'size' field first, 
since that would have different alignment rules, and the struct would be 
more tightly packed that way, and initialize better).

Afaik, none of the actual code remains (the mozilla SHA1 thing did the 
wrong thing for performance even for just the final bytes, and did those a 
byte at a time etc, so I rewrote even the trivial SHA1_Final parts).

Of course, maybe the Mozilla people would be interested in taking my 
faster version, and say that the new-BSD license is ok, and make everybody 
happy. The only listed author for the Mozilla SHA1 is Paul Kocher. I added 
him to the Cc.

Paul, for your information, we're talking about a faster rewritten mostly 
portable SHA1 routines that you can find at

http://git.kernel.org/?p=git/git.git;a=tree;f=block-sha1;hb=pu

(follow the blob pointers to see sha1.c and sha1.h). I don't know if 
you're active with Mozilla/Firefox or whether you even care, but you seem 
to be the logical choice of person to ask.

Linus