https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

            Bug ID: 108322
           Summary: Using __register parameter with -ftree-vectorize
                    (default with -O2) results in massive code bloat
           Product: gcc
           Version: 12.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gerbilsoft at gerbilsoft dot com
  Target Milestone: ---

While examining some code using the bloaty tool, I found that a function for
deinterleaving Super Magic Drive ROM images was taking up ~5 KB when it should
have been less than 1 KB. On examining the disassembly, there appeared to be a
lot of unnecessary instructions; compiling with clang and MSVC resulted in
significantly fewer instructions. Either removing __restrict from the function
parameters (two pointers), or specifying -fno-tree-vectorize to disable
auto-vectorization, fixes this issue with gcc-12.

The generated code isn't buggy as far as I can tell, and it benchmarks around
the same as the non-bloated version.

I've narrowed it down to the following minimal test case:

#include <stdint.h>
#define SMD_BLOCK_SIZE 16384

void decodeBlock_cpp(uint8_t *__restrict pDest, const uint8_t *__restrict pSrc)
{
        // First 8 KB of the source block is ODD bytes.
        const uint8_t *pSrc_end = pSrc + (SMD_BLOCK_SIZE / 2);
        for (uint8_t *pDest_odd = pDest + 1; pSrc < pSrc_end; pDest_odd += 2,
pSrc += 1) {
                pDest_odd[0] = pSrc[0];
        }
}

Assembly output with `g++ -O2 -fno-tree-vectorize` (or removing the __restrict
qualifiers):

decodeBlock_cpp(unsigned char*, unsigned char const*):
        xor     eax, eax
.L2:
        movzx   edx, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+1+rax*2], dl
        add     rax, 1
        cmp     rax, 8192
        jne     .L2
        ret

Assembly output with `g++ -O2` (implying -ftree-vectorize with gcc-12) and
__restrict qualifiers:

decodeBlock_cpp(unsigned char*, unsigned char const*):
        push    r15
        lea     rax, [rsi+8192]
        add     rdi, 1
        push    r14
        push    r13
        push    r12
        push    rbp
        push    rbx
        mov     QWORD PTR [rsp-8], rax
.L2:
        movzx   ecx, BYTE PTR [rsi+10]
        movzx   eax, BYTE PTR [rsi+14]
        add     rsi, 16
        add     rdi, 32
        movzx   edx, BYTE PTR [rsi-3]
        movzx   r15d, BYTE PTR [rsi-1]
        movzx   r11d, BYTE PTR [rsi-10]
        movzx   ebx, BYTE PTR [rsi-11]
        mov     BYTE PTR [rsp-11], cl
        movzx   ecx, BYTE PTR [rsi-16]
        movzx   ebp, BYTE PTR [rsi-12]
        mov     BYTE PTR [rsp-9], al
        movzx   r12d, BYTE PTR [rsi-13]
        movzx   eax, BYTE PTR [rsi-4]
        mov     BYTE PTR [rsp-10], dl
        movzx   r13d, BYTE PTR [rsi-14]
        movzx   edx, BYTE PTR [rsi-5]
        movzx   r14d, BYTE PTR [rsi-15]
        movzx   r8d, BYTE PTR [rsi-7]
        movzx   r9d, BYTE PTR [rsi-8]
        movzx   r10d, BYTE PTR [rsi-9]
        mov     BYTE PTR [rdi-32], cl
        movzx   ecx, BYTE PTR [rsp-11]
        mov     BYTE PTR [rdi-10], dl
        mov     BYTE PTR [rdi-30], r14b
        mov     BYTE PTR [rdi-28], r13b
        mov     BYTE PTR [rdi-26], r12b
        mov     BYTE PTR [rdi-24], bpl
        mov     BYTE PTR [rdi-22], bl
        mov     BYTE PTR [rdi-20], r11b
        mov     BYTE PTR [rdi-18], r10b
        mov     BYTE PTR [rdi-16], r9b
        mov     BYTE PTR [rdi-14], r8b
        mov     BYTE PTR [rdi-12], cl
        mov     BYTE PTR [rdi-8], al
        movzx   eax, BYTE PTR [rsp-9]
        movzx   edx, BYTE PTR [rsp-10]
        mov     BYTE PTR [rdi-2], r15b
        mov     BYTE PTR [rdi-4], al
        mov     rax, QWORD PTR [rsp-8]
        mov     BYTE PTR [rdi-6], dl
        cmp     rsi, rax
        jne     .L2
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        ret

$ gcc --version
gcc (Gentoo Hardened 12.2.1_p20221008 p1) 12.2.1 20221008
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Reply via email to