https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167
--- Comment #3 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is a reassociation, scheduling issue and register allocation issue.
>
> Plus your example might be slower due to dependencies.
>
> Without a full example of where gcc ra goes wrong, gcc actually produces
> much better code for this example due to register renaming in hw.
> Note many x86_64 also does register renaming for the stack too

https://github.com/openssl/openssl/blob/a8572674f12ceb39f7e66ccbaa8918b922c76739/crypto/sha/asm/sha512-x86_64.pl#L16

They mentioned that before. 40% improvement over compiler-generated code.
"I really wonder why gcc [being armed with inline assembler] fails to generate as fast code."

# sha256/512_block procedure for x86_64.
#
# 40% improvement over compiler-generated code on Opteron. On EM64T
# sha256 was observed to run >80% faster and sha512 - >40%. No magical
# tricks, just straight implementation... I really wonder why gcc
# [being armed with inline assembler] fails to generate as fast code.
# The only thing which is cool about this module is that it's very
# same instruction sequence used for both SHA-256 and SHA-512. In
# former case the instructions operate on 32-bit operands, while in
# latter - on 64-bit ones. All I had to do is to get one flavor right,
# the other one passed the test right away:-)
