http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50246

             Bug #: 50246
           Summary: SRA: Writes to class members are not combined
    Classification: Unclassified
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: jus...@fathomdb.com


Created attachment 25147
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25147
test program

This bug report is modeled on the very similar Bug 36318. The issue is still
present, at least with "gcc version 4.6.1 (Debian 4.6.1-4)" (uname -r
"3.0.0-1-amd64").

In this test file:

class P {
    public:
    char s1; char s2; 
    P(char i) : s1(i), s2(i) {}
};

void f(char j, P* p) {
    *p = P(j);
}

The constructor's writes to the two fields of P should be combined, resulting
in a single store instead of two stores.
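
For illustration, a hand-written sketch of what the combined write could look
like at the source level. The names f_combined and check_f_combined are
hypothetical and not part of the attached test program; the sketch assumes
sizeof(P) == 2, which holds on common ABIs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

class P {
    public:
    char s1; char s2;
    P(char i) : s1(i), s2(i) {}
};

// Assumption of this sketch: P is exactly two bytes with no padding.
static_assert(sizeof(P) == sizeof(std::uint16_t), "P must be 2 bytes");

// Original form from the report: two one-byte member writes.
void f(char j, P* p) {
    *p = P(j);
}

// Hypothetical hand-combined form: build both bytes in one 16-bit value
// and copy it in a single 2-byte memcpy, which a compiler can lower to
// one 16-bit store. Both bytes are equal, so byte order does not matter.
void f_combined(char j, P* p) {
    std::uint8_t b = static_cast<std::uint8_t>(j);
    std::uint16_t both = static_cast<std::uint16_t>(b | (b << 8));
    std::memcpy(p, &both, sizeof both);
}

// Usage check: both versions leave the same field values in *p.
void check_f_combined(char j) {
    P a(0), b(0);
    f(j, &a);
    f_combined(j, &b);
    assert(a.s1 == b.s1 && a.s2 == b.s2);
}
```

Whether the single 16-bit store is actually faster than two byte stores is a
separate question; the sketch only shows that the two forms are equivalent.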

When compiled with -O3 (or -Os), two stores are produced:
cc -c -O3 -g test.cc -o test.o
objdump -dS test.o

...
0000000000000000 <_Z1fcP1P>:
    char s1; char s2; 
    P(char i) : s1(i), s2(i) {}
};

void f(char j, P* p) {
    *p = P(j);
   0:   40 88 3e                mov    %dil,(%rsi)
   3:   40 88 7e 01             mov    %dil,0x1(%rsi)
}
   7:   c3                      retq   


-fno-tree-sra fixes the issue:
cc -c -O3 -fno-tree-sra -g test.cc -o test.o
objdump -dS test.o

...
0000000000000000 <_Z1fcP1P>:
class P {
    public:
    char s1; char s2; 
    P(char i) : s1(i), s2(i) {}
   0:   31 c0                   xor    %eax,%eax
   2:   48 89 fa                mov    %rdi,%rdx
   5:   40 88 f8                mov    %dil,%al
   8:   88 d4                   mov    %dl,%ah
};

void f(char j, P* p) {
    *p = P(j);
   a:   66 89 06                mov    %ax,(%rsi)
}
   d:   c3                      retq   

I'm not sure that this test case is actually valid, in that it's not
necessarily obvious that the single store is better (given the code is bigger).
This is my attempt at a highly reduced test case from a much more severe
real-world problem I encountered: with class P { uint16_t a; uint8_t b; },
calling std::vector::push_back results in a 16-bit write, an 8-bit write, and
then a 32-bit read on the same address. This is a serious performance hotspot,
I believe because the CPU can't figure out the memory dependencies.  Doing
manual bit-packing into a uint32_t (with the same memory layout) dramatically
improves the performance there.  If this test isn't valid, let me know why not
and I'll try to find a reduction that is valid!
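
A minimal sketch of the two variants described above, for reference. The
names Rec, PackedRec, append, append_packed and check_packing are hypothetical
stand-ins for the real-world code, which is not shown here. The packing order
(a in the low 16 bits, b in bits 16-23) matches the struct's memory layout
only on a little-endian target, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Unpacked form: constructing it writes a 16-bit field and an 8-bit field.
struct Rec {
    std::uint16_t a;
    std::uint8_t b;
    Rec(std::uint16_t a_, std::uint8_t b_) : a(a_), b(b_) {}
};

// Manually packed form: both fields are combined into one 32-bit word,
// so construction is a single 32-bit write.
struct PackedRec {
    std::uint32_t bits;
    PackedRec(std::uint16_t a, std::uint8_t b)
        : bits(static_cast<std::uint32_t>(a) |
               (static_cast<std::uint32_t>(b) << 16)) {}
    std::uint16_t a() const { return static_cast<std::uint16_t>(bits & 0xFFFFu); }
    std::uint8_t b() const { return static_cast<std::uint8_t>((bits >> 16) & 0xFFu); }
};

// The pattern from the report: construct a temporary and push it back.
void append(std::vector<Rec>& v, std::uint16_t a, std::uint8_t b) {
    v.push_back(Rec(a, b));
}

void append_packed(std::vector<PackedRec>& v, std::uint16_t a, std::uint8_t b) {
    v.push_back(PackedRec(a, b));
}

// Usage check: both variants store the same logical field values.
void check_packing() {
    std::vector<Rec> v;
    std::vector<PackedRec> w;
    append(v, 0x1234, 0x56);
    append_packed(w, 0x1234, 0x56);
    assert(v[0].a == w[0].a());
    assert(v[0].b == w[0].b());
}
```

Comparing the objdump output of append and append_packed at -O3 should show
whether the narrow writes followed by a wider read appear in a given GCC
version.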
