http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57244

            Bug ID: 57244
           Summary: Missed optimization: dead register move before
                    noreturn fn call & unnecessary store/load or reg
           Product: gcc
           Version: 4.8.1
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: petschy at gmail dot com

These are two separate issues, however, both occured in the same function, so I
think it's simpler to report them together.

The reduced test case is the reader equivalent of the writer code I posted
earlier today in #57236. The workings are very similar: using a helper class to
amortize the number of buffer refills.

The compiler unrolled the loop, with five iterations. The read pointer is kept
in a register (rbx), not incremented, but used with increasing offsets. The
potential end pointer is kept in r12, updated in each iteration to the value
value corresponding to the bytes read (ebx + N), and stored back to the memory
at the end. However, I noticed a third issue now when looking through the code:

0) Just before the exit, the store of the end pointer looks like this:

   0x000000000040090a <+58>:    sub    %ebx,%r12d
   0x000000000040090d <+61>:    add    %r12,%rbx
   0x0000000000400910 <+64>:    mov    %rbx,0x10(%rbp)

ebx is the initial read pointer, r12 is the new pointer after N bytes were
read. r12d-ebx = N, rbx + N = new end = r12d. The sub and the add is
unnecessary, a single mov %r12d,0x10(%rbp) would do the very same.

+61 and +64 are not branch targets, so I think the code could be optimized
more.

1) if there are no bytes at all in the buffer after the refill, we should throw
an exception:

   0x00000000004008f1 <+33>:    cmp    %rcx,%rbx
   0x00000000004008f4 <+36>:    jae    0x4009a1 <_Z5read2R6Stream+209>
...
   0x00000000004009a1 <+209>:    mov    %rbx,%r12
   0x00000000004009a4 <+212>:    callq  0x400830 <_Z9throw_eofv>

The mov at +209 is unnecessary. r12 holds the new end pointer (which is the
same as the start, ebx, since no bytes were read), but it is only useful if the
code ever reaches +58 (see above), where it gets stored back to memory. But
that won't happen, since throw_eof() throws an exception and doesn't ever
return.

The other branches that throw jump to +212, so no dead move there.

2) the last iteration of the unrolled loop misses a check at the end, this
changes the register assignments and introduces an unnecessary extra store/load
to/from rsp.

4th iteration:
   0x000000000040096a <+154>:    movzbl 0x3(%rbx),%edx
   0x000000000040096e <+158>:    shl    $0x7,%eax
   0x0000000000400971 <+161>:    lea    0x4(%rbx),%r12
   0x0000000000400975 <+165>:    mov    %edx,%esi
   0x0000000000400977 <+167>:    and    $0x7f,%esi
   0x000000000040097a <+170>:    or     %esi,%eax
   0x000000000040097c <+172>:    test   %dl,%dl
   0x000000000040097e <+174>:    js     0x40090a <_Z5read2R6Stream+58>
5th iteration:
   0x0000000000400985 <+181>:    movzbl 0x4(%rbx),%esi
   0x0000000000400989 <+185>:    shl    $0x7,%eax
   0x000000000040098c <+188>:    lea    0x5(%rbx),%r12
   0x0000000000400990 <+192>:    mov    %sil,(%rsp)
   0x0000000000400994 <+196>:    mov    (%rsp),%edx
   0x0000000000400997 <+199>:    and    $0x7f,%edx
   0x000000000040099a <+202>:    or     %edx,%eax
   0x000000000040099c <+204>:    jmpq   0x40090a <_Z5read2R6Stream+58>

This is a regression probably, as 4.7 generates code w/o the store/load, and
the byte is at the beginning loaded into %edx, not %esi, just like in the
earlier iterations. 4.8.1, 4.9.0 generates the above suboptimal code.

The tested gcc versions and flags are the very same as in #57236.

Regards, Peter

Reply via email to