https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559
--- Comment #11 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Jakub Jelinek from comment #9)
> Aren't these optimizations actually a pessimization for -mmovbe if the inner
> bswap is on a read from memory? Assuming the load and bswap instruction is
> cheap, then e.g. loading two values with bswap on them and doing say xor on
> them afterwards might be cheaper than load the two values, xor them and then
> bswap them (because for that bswap you don't have a load+bswap instruction).

(simplify (bitop (bswap @0) (bswap @1))
 (bswap (bitop @0 @1)))

This one should be:

(simplify (bswap (bitop (bswap @0) (bswap @1)))
 (bitop @0 @1))

This is what builtin-bswap-8.c tests, and I believe it will address Jakub's
concerns.
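
For illustration only, here is a minimal C sketch of the two source shapes the patterns above are about. It is not the contents of builtin-bswap-8.c; the function names (xor_inner_only, xor_outer_and_inner, self_check) are made up for this example, and xor stands in for any of the bitwise ops. The identity holds because bswap only permutes bytes, so it commutes with bitwise operations, and two bswaps cancel.

```c
#include <assert.h>
#include <stdint.h>

/* Shape matched by the first pattern: bitop (bswap a) (bswap b).
   Rewriting this to bswap (bitop a b) trades two bswaps for one, but
   with -mmovbe the two inner bswaps may be folded into the loads.  */
static uint32_t xor_inner_only(uint32_t a, uint32_t b)
{
    return __builtin_bswap32(a) ^ __builtin_bswap32(b);
}

/* Shape matched by the proposed pattern:
   bswap (bitop (bswap a) (bswap b)).  All three bswaps cancel,
   so this is equivalent to plain a ^ b.  */
static uint32_t xor_outer_and_inner(uint32_t a, uint32_t b)
{
    return __builtin_bswap32(__builtin_bswap32(a) ^ __builtin_bswap32(b));
}

/* Hypothetical self-check of both identities on one pair of values.  */
static void self_check(void)
{
    uint32_t a = 0x11223344u, b = 0xAABBCCDDu;

    /* bswap commutes with the bitwise op.  */
    assert(xor_inner_only(a, b) == __builtin_bswap32(a ^ b));
    /* With the outer bswap, no bswap is needed at all.  */
    assert(xor_outer_and_inner(a, b) == (a ^ b));
}
```

In the second shape the simplification removes every bswap, so there is no load+bswap instruction left to lose, which is presumably why it avoids the -mmovbe concern.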