https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70557
--- Comment #4 from Albert Cahalan <acahalan at gmail dot com> ---

Mostly it's more like PR58741 because of the long long issue.

PR22141 (and PR23684, which is a better match) is about merging small things. Two of the six examples here show that problem, those being the ones with a loop over char.

The problem that prompted this bug report and determined the bug title is different. In some ways it's the opposite. When I ask gcc to store a 64-bit zero value, gcc makes a 64-bit zero value in registers (two identical 32-bit halves in a pair of 32-bit registers) and then stores that to memory.

There are many ways that this is wrong, and I worry that fixing one problem may hide the other problems. Depending on compiler internals that I don't understand, this could perhaps be four bugs:

1. When the two halves of a 64-bit value are identical, there is no need to load values into two different registers. This is true for many constant values, though obviously -1 and 0 would be most popular. Other popular values would be the constants for computing a Hamming weight. AFAIK, this optimization should apply whenever dealing with values that are larger than registers, such as 128-bit values on 64-bit platforms.

2. When the address is to be encoded in the instruction that writes to memory, it is best to directly clear the memory without first generating the constant in registers. AFAIK, this optimization should apply to most CISC machines. The fact that there is a special instruction for storing a 0 makes the optimization more important.

3. When the address is to be encoded in an instruction, sometimes it is best to place the address in a register and then use that register to supply the address for storing to memory. This tends to apply when doing lots of writes, when an address register happens to be available, and when optimizing for size. AFAIK this optimization applies to most machines.

4. When using an address register to supply the location for storing, often it is best to use autoincrement addressing instead of distinct offsets. This usually generates smaller code. AFAIK this applies to many machines, including at least: arm, m68k, and ppc.

(and also the store-merge issue, which makes 5)
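To make the four points concrete, here is a minimal source-level sketch. It is not one of the six examples from this report: the names `counters`, `clear_direct`, and `clear_walking` are hypothetical and chosen only for illustration. The first function stores each 64-bit zero to its own fixed address (the shape behind points 1 and 2); the second walks a pointer (the shape behind points 3 and 4). Whether a given gcc emits different code for the two depends on the target, the version, and the flags, and the optimizer may well reduce both to the same thing.

```c
#include <stdint.h>

/* Hypothetical global: stands in for memory whose address is known at
 * link time, so it can be encoded directly in a store instruction. */
uint64_t counters[4];

/* Each store names its own fixed address. On a 32-bit target every
 * 64-bit zero has two identical 32-bit halves (points 1 and 2). */
void clear_direct(void)
{
    counters[0] = 0;
    counters[1] = 0;
    counters[2] = 0;
    counters[3] = 0;
}

/* Same effect through a walking pointer, which at the source level
 * resembles an address register with autoincrement (points 3 and 4). */
void clear_walking(void)
{
    uint64_t *p = counters;
    *p++ = 0;
    *p++ = 0;
    *p++ = 0;
    *p++ = 0;
}
```

Compiling this for a 32-bit target with something like `-O2 -S` or `-Os -S` and reading the assembly is one way to see which of the patterns above the compiler actually chooses.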