https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77485
Bug ID: 77485 Summary: Missed dead store elimination when returning constexpr generated data Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: petschy at gmail dot com Target Milestone: --- Created attachment 39563 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39563&action=edit C++ source A string is generated at compile time with constexpr function, then it is returned by value, in a struct having a char array. There are two versions of the generator: foo() generates into a larger array, foo_sized() first calculates the needed size, then generates into an array with the exactly needed size. The char array must be initialized due to constexpr, and these zero stores are not all eliminated when overwritten with the actual generated characters. BUT, only if the array in the struct is larger than the generated data: foo_sized(): Dump of assembler code for function foo_sized(): 0x00000000004005e0 <+0>: movdqa 0xa8(%rip),%xmm0 # 0x4006b0 0x00000000004005e8 <+8>: mov %rdi,%rax 0x00000000004005eb <+11>: movups %xmm0,(%rdi) 0x00000000004005ee <+14>: movdqa 0xaa(%rip),%xmm0 # 0x4006c0 0x00000000004005f6 <+22>: movups %xmm0,0x10(%rdi) 0x00000000004005fa <+26>: retq 0x4006b0 : "0 1 2 3 4 5 6 7 8 9 10 11 12 13" gcc 6.2.1 and 7.0.0 generate the exact same code. Since the size of the buffer equals to the number of generated characters (32), two 16 byte load+stores are used. foo() with gcc 7.0: Dump of assembler code for function foo(): 0x0000000000400560 <+0>: mov %rdi,%rdx ; rdi = rdx = 0x7fffffffe1b0 0x0000000000400563 <+3>: movq $0x0,0xc0(%rdi) ; zero out the last 8 bytes of the buffer 0x000000000040056e <+14>: lea 0x8(%rdi),%rdi ; rdi += 8 => 0x7fffffffe1b8 0x0000000000400572 <+18>: mov %rdx,%rcx ; rcx = 0x7fffffffe1b0 0x0000000000400575 <+21>: movdqa 0x113(%rip),%xmm0 # 0x4006b0 0x000000000040057d <+29>: and $0xfffffffffffffff8,%rdi ; already aligned 0x0000000000400581 <+33>: xor %eax,%eax 0x0000000000400583 <+35>: sub %rdi,%rcx ; rcx = -8 0x0000000000400586 <+38>: add $0xc8,%ecx ; rcx = 0xc0, 0xc8 is the size of the buffer, ie 200 0x000000000040058c <+44>: shr $0x3,%ecx ; rcx = 0x18 0x000000000040058f <+47>: rep stos %rax,%es:(%rdi) ; rdi is buf+8, ecx is (size-8)/8, so will zero out the char buffer from index 8 to 0xbf 0x0000000000400592 <+50>: movups %xmm0,(%rdx) 0x0000000000400595 <+53>: movb $0x38,0x10(%rdx) 0x0000000000400599 <+57>: movb $0x20,0x11(%rdx) 0x000000000040059d <+61>: mov %rdx,%rax 0x00000000004005a0 <+64>: movb $0x39,0x12(%rdx) 0x00000000004005a4 <+68>: movb $0x20,0x13(%rdx) 0x00000000004005a8 <+72>: movb $0x31,0x14(%rdx) 0x00000000004005ac <+76>: movb $0x30,0x15(%rdx) 0x00000000004005b0 <+80>: movb $0x20,0x16(%rdx) 0x00000000004005b4 <+84>: movb $0x31,0x17(%rdx) 0x00000000004005b8 <+88>: movb $0x31,0x18(%rdx) 0x00000000004005bc <+92>: movb $0x20,0x19(%rdx) 0x00000000004005c0 <+96>: movb $0x31,0x1a(%rdx) 0x00000000004005c4 <+100>: movb $0x32,0x1b(%rdx) 0x00000000004005c8 <+104>: movb $0x20,0x1c(%rdx) 0x00000000004005cc <+108>: movb $0x31,0x1d(%rdx) 0x00000000004005d0 <+112>: movb $0x33,0x1e(%rdx) 0x00000000004005d4 <+116>: retq 0x4006b0 : "0 1 2 3 4 5 6 7 8 9 10 11 12 13" This is the same data used in foo_sized(). I wrote the register values at the end of each line of interest. The issues I found: 0) Not all byte stores are merged. The first 16 bytes are copied w/ xmm0, but the rest initialized byte-wise. I was told in bug 77456 that there is work in progress solving that (see [1]), but I'm a bit confused, since it seems that some merging does occur here already using xmm0, but the second 16 bytes are not merged for some reason. The last byte of the second 16 byte pack is zero, so it might be that that zero write is eliminated (no "movb 0x00, 0x1f(%rdx)" in the disasm), due to the previous zero fill, and the remaining 15 bytes written byte wise as it's not an exact fit for xmm0, but I'm just speculating here. Code generated with gcc 6.2.1 does not use xmm0, all bytes are stored with movb; the zero fill is the same, and the zero store to 0x1f is missing, too. 1) The zero fill of the buffer does not take into account that the beginning of the buffer is overwritten with the characters: - first it wipes the last 8 bytes - then aligns the buffer address to 8, and zero fills 192 bytes - I guess the zero fill is done this way so that it works efficiently even if the buffer address is not aligned to 8 - then stores the first 16 bytes w/ xmm0, and the rest w/ byte wise stores The optimal code would only zero fill the buffer starting at index 32, with the needed fiddling for the unaligned case, then the 32 bytes copied with xmm0, potentially unaligned. The platform used was Debian Jessie AMD64, the gcc versions: $ g++-6.2.1 -v Using built-in specs. COLLECT_GCC=g++-6.2.1 COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/6.2.1/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../configure --enable-languages=c,c++ --disable-multilib --program-suffix=-6.2.1 --disable-bootstrap CFLAGS='-O2 -march=native' CXXFLAGS='-O2 -march=native' Thread model: posix gcc version 6.2.1 20160831 (GCC) git b823cdd4ccc1499a674e3863ce875c7459207727 g++-7.0.0 -v Using built-in specs. COLLECT_GCC=g++-7.0.0 COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/7.0.0/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../configure --enable-languages=c,c++ --disable-multilib --program-suffix=-7.0.0 --disable-bootstrap CFLAGS='-O2 -march=native' CXXFLAGS='-O2 -march=native' Thread model: posix gcc version 7.0.0 20160831 (experimental) (GCC) git 14c36b15d931bf299bbc214707b903d0af124449 [1] https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01512.html