https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322
Bug ID: 108322
Summary: Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat
Product: gcc
Version: 12.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: gerbilsoft at gerbilsoft dot com
Target Milestone: ---

While examining some code using the bloaty tool, I found that a function for deinterleaving Super Magic Drive ROM images was taking up ~5 KB when it should have been less than 1 KB. The disassembly showed a lot of unnecessary instructions; compiling the same code with clang and MSVC resulted in significantly fewer instructions.

With gcc-12, either of the following fixes the issue:

- removing __restrict from the function parameters (two pointers), or
- specifying -fno-tree-vectorize to disable auto-vectorization.

The generated code isn't buggy as far as I can tell, and it benchmarks around the same as the non-bloated version.

I've narrowed it down to the following minimal test case:

    #include <stdint.h>
    #define SMD_BLOCK_SIZE 16384

    void decodeBlock_cpp(uint8_t *__restrict pDest, const uint8_t *__restrict pSrc)
    {
        // First 8 KB of the source block is ODD bytes.
        const uint8_t *pSrc_end = pSrc + (SMD_BLOCK_SIZE / 2);
        for (uint8_t *pDest_odd = pDest + 1; pSrc < pSrc_end; pDest_odd += 2, pSrc += 1) {
            pDest_odd[0] = pSrc[0];
        }
    }

Assembly output with `g++ -O2 -fno-tree-vectorize` (or removing the __restrict qualifiers):

    decodeBlock_cpp(unsigned char*, unsigned char const*):
            xor     eax, eax
    .L2:
            movzx   edx, BYTE PTR [rsi+rax]
            mov     BYTE PTR [rdi+1+rax*2], dl
            add     rax, 1
            cmp     rax, 8192
            jne     .L2
            ret

Assembly output with `g++ -O2` (implying -ftree-vectorize with gcc-12) and __restrict qualifiers:

    decodeBlock_cpp(unsigned char*, unsigned char const*):
            push    r15
            lea     rax, [rsi+8192]
            add     rdi, 1
            push    r14
            push    r13
            push    r12
            push    rbp
            push    rbx
            mov     QWORD PTR [rsp-8], rax
    .L2:
            movzx   ecx, BYTE PTR [rsi+10]
            movzx   eax, BYTE PTR [rsi+14]
            add     rsi, 16
            add     rdi, 32
            movzx   edx, BYTE PTR [rsi-3]
            movzx   r15d, BYTE PTR [rsi-1]
            movzx   r11d, BYTE PTR [rsi-10]
            movzx   ebx, BYTE PTR [rsi-11]
            mov     BYTE PTR [rsp-11], cl
            movzx   ecx, BYTE PTR [rsi-16]
            movzx   ebp, BYTE PTR [rsi-12]
            mov     BYTE PTR [rsp-9], al
            movzx   r12d, BYTE PTR [rsi-13]
            movzx   eax, BYTE PTR [rsi-4]
            mov     BYTE PTR [rsp-10], dl
            movzx   r13d, BYTE PTR [rsi-14]
            movzx   edx, BYTE PTR [rsi-5]
            movzx   r14d, BYTE PTR [rsi-15]
            movzx   r8d, BYTE PTR [rsi-7]
            movzx   r9d, BYTE PTR [rsi-8]
            movzx   r10d, BYTE PTR [rsi-9]
            mov     BYTE PTR [rdi-32], cl
            movzx   ecx, BYTE PTR [rsp-11]
            mov     BYTE PTR [rdi-10], dl
            mov     BYTE PTR [rdi-30], r14b
            mov     BYTE PTR [rdi-28], r13b
            mov     BYTE PTR [rdi-26], r12b
            mov     BYTE PTR [rdi-24], bpl
            mov     BYTE PTR [rdi-22], bl
            mov     BYTE PTR [rdi-20], r11b
            mov     BYTE PTR [rdi-18], r10b
            mov     BYTE PTR [rdi-16], r9b
            mov     BYTE PTR [rdi-14], r8b
            mov     BYTE PTR [rdi-12], cl
            mov     BYTE PTR [rdi-8], al
            movzx   eax, BYTE PTR [rsp-9]
            movzx   edx, BYTE PTR [rsp-10]
            mov     BYTE PTR [rdi-2], r15b
            mov     BYTE PTR [rdi-4], al
            mov     rax, QWORD PTR [rsp-8]
            mov     BYTE PTR [rdi-6], dl
            cmp     rsi, rax
            jne     .L2
            pop     rbx
            pop     rbp
            pop     r12
            pop     r13
            pop     r14
            pop     r15
            ret

    $ gcc --version
    gcc (Gentoo Hardened 12.2.1_p20221008 p1) 12.2.1 20221008
    Copyright (C) 2022 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
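
The report only tests the command-line flag -fno-tree-vectorize. As a possible per-function alternative, the sketch below applies GCC's `optimize` function attribute to a copy of the test case and checks that both copies produce the same output. The names `NO_TREE_VECTORIZE`, `decodeBlock_novec` and `variantsAgree` are hypothetical and do not appear in the report. Whether the attribute actually restores the compact loop on affected gcc-12 builds is an assumption that has not been verified here, and the GCC documentation describes this attribute as intended for debugging rather than production use.

```cpp
#include <cstddef>
#include <stdint.h>
#include <vector>

#define SMD_BLOCK_SIZE 16384

// Hypothetical helper macro (not from the report): expands to GCC's
// per-function optimize attribute, and to nothing on other compilers.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_TREE_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
#else
#define NO_TREE_VECTORIZE
#endif

// Test case from the report, unchanged apart from whitespace.
void decodeBlock_cpp(uint8_t *__restrict pDest, const uint8_t *__restrict pSrc)
{
    // First 8 KB of the source block is ODD bytes.
    const uint8_t *pSrc_end = pSrc + (SMD_BLOCK_SIZE / 2);
    for (uint8_t *pDest_odd = pDest + 1; pSrc < pSrc_end; pDest_odd += 2, pSrc += 1) {
        pDest_odd[0] = pSrc[0];
    }
}

// Same loop, with auto-vectorization requested off for this function only
// (assumption: equivalent in effect to -fno-tree-vectorize for this body).
NO_TREE_VECTORIZE
void decodeBlock_novec(uint8_t *__restrict pDest, const uint8_t *__restrict pSrc)
{
    const uint8_t *pSrc_end = pSrc + (SMD_BLOCK_SIZE / 2);
    for (uint8_t *pDest_odd = pDest + 1; pSrc < pSrc_end; pDest_odd += 2, pSrc += 1) {
        pDest_odd[0] = pSrc[0];
    }
}

// Returns true if both variants write the source bytes to the odd offsets
// and leave the even offsets untouched.
bool variantsAgree()
{
    const std::size_t half = SMD_BLOCK_SIZE / 2;
    std::vector<uint8_t> src(half);
    for (std::size_t i = 0; i < half; ++i) {
        src[i] = static_cast<uint8_t>(i * 31 + 7);
    }

    // One spare byte so the loop's final pDest_odd value is at most one
    // past the end of the buffer.
    std::vector<uint8_t> a(SMD_BLOCK_SIZE + 1, 0xAA);
    std::vector<uint8_t> b(SMD_BLOCK_SIZE + 1, 0xAA);
    decodeBlock_cpp(a.data(), src.data());
    decodeBlock_novec(b.data(), src.data());

    if (a != b) {
        return false;
    }
    for (std::size_t i = 0; i < half; ++i) {
        if (a[2 * i] != 0xAA || a[2 * i + 1] != src[i]) {
            return false;
        }
    }
    return true;
}
```

Compiling this file with `g++ -O2 -S` and comparing the two function bodies in the assembly output would show whether the attribute has the same effect as the flag on a given GCC version. `variantsAgree()` only checks that the two functions behave identically; it says nothing about code size.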