http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57244
Bug ID: 57244 Summary: Missed optimization: dead register move before noreturn fn call & unnecessary store/load or reg Product: gcc Version: 4.8.1 Status: UNCONFIRMED Severity: minor Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: petschy at gmail dot com These are two separate issues, however, both occured in the same function, so I think it's simpler to report them together. The reduced test case is the reader equivalent of the writer code I posted earlier today in #57236. The workings are very similar: using a helper class to amortize the number of buffer refills. The compiler unrolled the loop, with five iterations. The read pointer is kept in a register (rbx), not incremented, but used with increasing offsets. The potential end pointer is kept in r12, updated in each iteration to the value value corresponding to the bytes read (ebx + N), and stored back to the memory at the end. However, I noticed a third issue now when looking through the code: 0) Just before the exit, the store of the end pointer looks like this: 0x000000000040090a <+58>: sub %ebx,%r12d 0x000000000040090d <+61>: add %r12,%rbx 0x0000000000400910 <+64>: mov %rbx,0x10(%rbp) ebx is the initial read pointer, r12 is the new pointer after N bytes were read. r12d-ebx = N, rbx + N = new end = r12d. The sub and the add is unnecessary, a single mov %r12d,0x10(%rbp) would do the very same. +61 and +64 are not branch targets, so I think the code could be optimized more. 1) if there are no bytes at all in the buffer after the refill, we should throw an exception: 0x00000000004008f1 <+33>: cmp %rcx,%rbx 0x00000000004008f4 <+36>: jae 0x4009a1 <_Z5read2R6Stream+209> ... 0x00000000004009a1 <+209>: mov %rbx,%r12 0x00000000004009a4 <+212>: callq 0x400830 <_Z9throw_eofv> The mov at +209 is unnecessary. r12 holds the new end pointer (which is the same as the start, ebx, since no bytes were read), but it is only useful if the code ever reaches +58 (see above), where it gets stored back to memory. But that won't happen, since throw_eof() throws an exception and doesn't ever return. The other branches that throw jump to +212, so no dead move there. 2) the last iteration of the unrolled loop misses a check at the end, this changes the register assignments and introduces an unnecessary extra store/load to/from rsp. 4th iteration: 0x000000000040096a <+154>: movzbl 0x3(%rbx),%edx 0x000000000040096e <+158>: shl $0x7,%eax 0x0000000000400971 <+161>: lea 0x4(%rbx),%r12 0x0000000000400975 <+165>: mov %edx,%esi 0x0000000000400977 <+167>: and $0x7f,%esi 0x000000000040097a <+170>: or %esi,%eax 0x000000000040097c <+172>: test %dl,%dl 0x000000000040097e <+174>: js 0x40090a <_Z5read2R6Stream+58> 5th iteration: 0x0000000000400985 <+181>: movzbl 0x4(%rbx),%esi 0x0000000000400989 <+185>: shl $0x7,%eax 0x000000000040098c <+188>: lea 0x5(%rbx),%r12 0x0000000000400990 <+192>: mov %sil,(%rsp) 0x0000000000400994 <+196>: mov (%rsp),%edx 0x0000000000400997 <+199>: and $0x7f,%edx 0x000000000040099a <+202>: or %edx,%eax 0x000000000040099c <+204>: jmpq 0x40090a <_Z5read2R6Stream+58> This is a regression probably, as 4.7 generates code w/o the store/load, and the byte is at the beginning loaded into %edx, not %esi, just like in the earlier iterations. 4.8.1, 4.9.0 generates the above suboptimal code. The tested gcc versions and flags are the very same as in #57236. Regards, Peter