https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153
--- Comment #9 from ncm at cantrip dot org ---

I did experiment with -m[no-]bmi[2] a fair bit. The flags made a significant difference in the instructions emitted, but exactly zero difference in runtime. That's actually not surprising at all: those instructions get decomposed into micro-ops that exactly match those from the equivalent non-BMI instruction sequences, the micro-ops are cached, and the loops that dominate runtime execute out of the micro-op cache.

The only real effect is maybe slightly shorter object code, which could matter in a program dominated by bus traffic with loops too big to cache well. I say "maybe slightly shorter" because instruction-set extension instructions are actually huge, mostly prefixes. I.e., most of the BMI stuff is marketing fluff, added mainly to make the competition waste money matching them instead of improving the product.
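For anyone who wants to see the codegen difference for themselves, here is a minimal sketch of the kind of source that is affected. The function names are made up for illustration and are not taken from the test case in this bug. The expressions are plain C idioms that GCC may turn into single BMI1 instructions (BLSR, BLSI, ANDN) when -mbmi is in effect, and into short sequences of ordinary instructions otherwise. Exactly what gets emitted depends on the GCC version and optimization level.

```c
#include <stdint.h>

/* Hypothetical helpers, named for illustration only. */

/* Clear the lowest set bit. With -mbmi this is a candidate for BLSR;
   without it, typically a LEA/SUB followed by AND. */
uint32_t clear_lowest_set(uint32_t x)
{
    return x & (x - 1);
}

/* Isolate the lowest set bit. With -mbmi this is a candidate for BLSI;
   without it, typically NEG followed by AND. */
uint32_t isolate_lowest_set(uint32_t x)
{
    return x & (0u - x);
}

/* Bitwise and-not. With -mbmi this is a candidate for ANDN;
   without it, typically NOT followed by AND. */
uint32_t and_not(uint32_t a, uint32_t b)
{
    return ~a & b;
}
```

Compiling the file twice, e.g. `gcc -O2 -mbmi -S` and `gcc -O2 -mno-bmi -S`, and diffing the assembly should show the difference in emitted instructions. The results are identical either way, so any runtime difference has to be measured, not assumed.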