https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122529

            Bug ID: 122529
           Summary: Optimizing for size --- unnecessary x86 instructions
           Product: gcc
           Version: 15.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zero at smallinteger dot com
  Target Milestone: ---

Created attachment 62688
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62688&action=edit
Sample code

Consider the attached code, compiled with -Oz.  Per Godbolt, the output for GCC
15.2 is as follows.

test:
        xor     eax, eax
        mov     ecx, 1024
        mov     rdx, rdi
        rep stosd
        xor     eax, eax
.L2:
        mov     DWORD PTR [rdx+rax*4], eax
        inc     rax
        cmp     rax, 160
        jne     .L2
        ret

Observe that the second xor eax, eax is unnecessary because rax is still zero
after rep stosd.  Removing the for loop eliminates the second xor eax, eax.  It
seems as if GCC is assuming rax is clobbered by rep stosd.
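
The attachment is not reproduced in this message.  A minimal sketch of what
such a sample might look like is given here, assuming a function that zeroes
1024 ints and then stores the loop index into the first 160 of them.  The
signature, the element type, and the use of memset are guesses made from the
assembly above, not the contents of attachment 62688.

```c
#include <string.h>

/* Hypothetical reconstruction; NOT the actual attachment 62688. */
void test(int *p)
{
    /* Zero 1024 ints (4096 bytes); presumably what becomes rep stosd. */
    memset(p, 0, 1024 * sizeof *p);

    /* Store each index into the first 160 elements; presumably loop .L2. */
    for (int i = 0; i < 160; i++)
        p[i] = i;
}
```

Compiling something like this with -Oz and then again without the for loop
should be enough to compare the two cases described above, though the exact
output may differ from the listings in this report.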

Moreover, per Godbolt the output for GCC trunk is as follows.

"test":
        xor     eax, eax
        mov     ecx, 1023
        mov     rdx, rdi
        and     DWORD PTR [rdi+4092], 0
        rep stosd
        xor     eax, eax
.L2:
        mov     DWORD PTR [rdx+rax*4], eax
        inc     rax
        cmp     rax, 160
        jne     .L2
        ret

Observe that in this case one of the 1024 stores has been peeled out of rep
stosd for some reason (ecx is now 1023, with a separate and on [rdi+4092]),
resulting in even larger code.  Again with GCC trunk, -Os adds even more
instructions.

"test":
        xor     eax, eax
        mov     ecx, 1023
        mov     rdx, rdi
        mov     DWORD PTR [rdi+4092], eax
        xor     eax, eax
        rep stosd
        xor     eax, eax
.L2:
        mov     DWORD PTR [rdx+rax*4], eax
        inc     rax
        cmp     rax, 160
        jne     .L2
        ret

Now there are three instances of xor eax, eax.

I could not eliminate the additional unnecessary instructions by enabling
specific optimizations individually (I tried this because -Oz enables most,
but not all, of -O2).

For comparison, per Godbolt clang trunk does this.

test:
        mov     rdx, rdi
        xor     esi, esi
        mov     ecx, 1024
        xor     eax, eax
        rep stosd es:[rdi], eax
.LBB0_1:
        cmp     rsi, 160
        je      .LBB0_3
        mov     dword ptr [rdx + 4*rsi], esi
        inc     rsi
        jmp     .LBB0_1
.LBB0_3:
        ret

Observe that here too there are unnecessary instructions: esi is zeroed
separately instead of reusing eax, which is still zero after rep stosd.
