http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50246
             Bug #: 50246
           Summary: SRA: Writes to class members are not combined
    Classification: Unclassified
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: jus...@fathomdb.com

Created attachment 25147
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25147
test program

Modeling this bug report on the very similar Bug 36318; this issue is still around though (at least with "gcc version 4.6.1 (Debian 4.6.1-4)", uname -r "3.0.0-1-amd64").

In this test file:

class P {
public:
  char s1;
  char s2;
  P(char i) : s1(i), s2(i) {}
};

void f(char j, P* p) {
  *p = P(j);
}

The constructor's writes to the two fields of P should be combined, resulting in a single store instead of two stores.

When compiled with -O3 (or -Os), two stores are produced:

cc -c -O3 -g test.cc -o test.o
objdump -dS test.o
...
0000000000000000 <_Z1fcP1P>:
  char s1;
  char s2;
  P(char i) : s1(i), s2(i) {}
};

void f(char j, P* p) {
  *p = P(j);
   0:   40 88 3e                mov    %dil,(%rsi)
   3:   40 88 7e 01             mov    %dil,0x1(%rsi)
}
   7:   c3                      retq

-fno-tree-sra fixes the issue:

cc -c -O3 -fno-tree-sra -g test.cc -o test.o
objdump -dS test.o
...
0000000000000000 <_Z1fcP1P>:
class P {
public:
  char s1;
  char s2;
  P(char i) : s1(i), s2(i) {}
   0:   31 c0                   xor    %eax,%eax
   2:   48 89 fa                mov    %rdi,%rdx
   5:   40 88 f8                mov    %dil,%al
   8:   88 d4                   mov    %dl,%ah
};

void f(char j, P* p) {
  *p = P(j);
   a:   66 89 06                mov    %ax,(%rsi)
}
   d:   c3                      retq

I'm not sure that this test case is actually valid, in that it's not necessarily obvious that the single store is better (given the code is bigger).

This is my attempt at a highly reduced test case from a much more severe real-world problem I encountered. With class P { uint16_t a; uint8_t b; }, calling std::vector::push_back results in a 16-bit write, an 8-bit write, and then a 32-bit read on the same address. That results in a serious performance hotspot, I believe because the CPU can't figure out the memory dependencies.
Doing manual bit-packing into a uint32_t (with the same memory layout) dramatically improves the performance there. If this test isn't valid, let me know why not and I'll try to find a reduction that is valid!
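
For reference, the manual bit-packing workaround might look roughly like the sketch below. The names PackedP and append_packed are made up for illustration and are not from the real program. It assumes a little-endian target (such as the amd64 machine above), so that a sits in bytes 0-1 and b in byte 2, which is the layout the compiler would typically choose for the two-field class. Whether this avoids the hotspot on a given compiler and CPU is something to measure rather than assume.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical packed stand-in for: class P { uint16_t a; uint8_t b; };
// Both fields live in one 32-bit word, so constructing or copying the
// object is a single 32-bit access instead of a 16-bit plus an 8-bit one.
struct PackedP {
    std::uint32_t bits;  // a in bits 0..15, b in bits 16..23, top byte unused

    PackedP(std::uint16_t a, std::uint8_t b)
        : bits(static_cast<std::uint32_t>(a) |
               (static_cast<std::uint32_t>(b) << 16)) {}

    std::uint16_t a() const {
        return static_cast<std::uint16_t>(bits & 0xFFFFu);
    }
    std::uint8_t b() const {
        return static_cast<std::uint8_t>((bits >> 16) & 0xFFu);
    }
};

// The packed form must be exactly one 32-bit word.
static_assert(sizeof(PackedP) == sizeof(std::uint32_t),
              "PackedP should occupy a single 32-bit word");

// Illustrative helper: append one element and, in debug builds, check
// that the fields round-trip through the packed representation.
inline void append_packed(std::vector<PackedP>& v,
                          std::uint16_t a, std::uint8_t b) {
    v.push_back(PackedP(a, b));
    assert(v.back().a() == a && v.back().b() == b);
}
```

The accessors keep call sites close to the original field accesses (p.a() instead of p.a), while the only member the compiler has to store is one uint32_t.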