Re: Linus' sha1 is much faster!
On Mon, 17 Aug 2009, Steven Noonan wrote: Interesting. I compared Linus' implementation to the public domain one by Steve Reid[1] You _really_ need to talk about what kind of environment you have. There are three major issues: - Netburst vs non-netburst - 32-bit vs 64-bit - compiler version Steve Reid's code looks great, but the way it is coded, gcc makes a mess of it, which is exactly what my SHA1 tries to avoid. [ In contrast, gcc does very well on just about _any_ straightforward unrolled SHA1 C code if the target architecture is something like PPC or ia64 that has enough registers to keep it all in registers. I haven't really tested other compilers - a less aggressive compiler would actually do _better_ on SHA1, because the problem with gcc is that it turns the whole temporary 16-entry word array into register accesses, and tries to do register allocation on that _array_. That is wonderful for the above-mentioned PPC and IA64, but it makes gcc create totally crazy code when there aren't enough registers, and then gcc starts spilling randomly (ie it starts spilling a-e etc). This is why the compiler and version matters so much. ] (average of 5 runs) Linus' sha1: 283MB/s Steve Reid's sha1: 305MB/s So I get very different results: # TIME[s] SPEED[MB/s] Reid2.742 222.6 linus 1.464 417 this is Intel Nehalem, but compiled for 32-bit mode (which is the more challenging one because x86-32 only has 7 general-purpose registers), and with gcc-4.4.0. Linus
Re: Linus' sha1 is much faster!
On Sun, 16 Aug 2009, Giuseppe Scrivano wrote: My GCC version is gcc (Debian 4.3.3-14) 4.3.3 and the CPU is: Intel(R) Pentium(R) D CPU 3.20GHz. Netburst is very sensitive to random spill effects, and you can basically tune things by just code shuffling that just has random effects on the generated asm code. I also spent some time trying to improve the gnulib SHA1 implementation and it seems a lookup table can improve things a bit. I pretty much can guarantee you that it improves things only because it makes gcc generate crap code, which then hides some of the P4 issues. I'd also suggest you try gcc-4.4, since that apparently fixes some of the oddest spill issues. Linus
Re: Linus' sha1 is much faster!
On Mon, 17 Aug 2009, Giuseppe Scrivano wrote: Thanks for the hint. I tried gcc-4.4 and it produces slower code than 4.3 on the gnulib SHA1 implementation and my patch makes it even more! Check out the asm, see if you can see why. One of the most common problems with P4's is literally that you end up loading from the same stack slot that you just stored to (gcc can do some really crazy spills), and that causes a store buffer hazard replay. My personal opinion is that Netburst is useless for trying to optimize C code for. It's just too random. I noticed that on my machine your implementation is ~30-40% faster using SHA_ROT for rol/ror instructions than inline assembly, at least with the test-case Pádraig wrote. Am I the only one reporting it? I bet it's the same thing. Small perturbations of the source causing small changes to register allocation and thus spilling, and then Netburst goes crazy one way or another. It's interestign trying to fix it, and very frustrating. My workstation is a Nehalem (but Core 2 will have pretty much the same behavior), and it doesn't have the crazy netburst behavior. Shorter and simpler code generally performs better (which is _not_ true on Netburst). On my machine, for example, forcing gcc to do those rotates on registers is the difference between ~381MB/s and 415MB/s. And that's mainly because it makes gcc keep A-E in registers, rather than trying to cache the array[] references. Linus
Re: Linus' sha1 is much faster!
On Sat, 15 Aug 2009, Linus Torvalds wrote: That said, I don't know if the MPL is ok for X11. I've not looked at compatibility issues with MPL. For git, we could just ignore the MPL, since the GPLv2 was acceptable regardless of it. If MPL isn't ok for X11, then we'd need to make sure that even the silliest Mozilla crud has been rewritten. There really isn't much, but hey, the _history_ is based on the mozilla code, and who knows - the 'blk_SHA_CTX' struct has things like the fields in the same order as the Mozilla equivalent, for all those historical reasons. (Heh. Looking at that, I probably should move the 'size' field first, since that would have different alignment rules, and the struct would be more tightly packed that way, and initialize better). Afaik, none of the actual code remains (the mozilla SHA1 thing did the wrong thing for performance even for just the final bytes, and did those a byte at a time etc, so I rewrote even the trivial SHA1_Final parts). Of course, maybe the Mozilla people would be interested in taking my faster version, and say that the new-BSD license is ok, and make everybody happy. The only listed author for the Mozilla SHA1 is Paul Kocher. I added him to the Cc. Paul, for your information, we're talking about a faster rewritten mostly portable SHA1 routines that you can find at http://git.kernel.org/?p=git/git.git;a=tree;f=block-sha1;hb=pu (follow the blob pointers to see sha1.c and sha1.h). I don't know if you're active with Mozilla/Firefox or whether you even care, but you seem to be the logical choice of person to ask. Linus