https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123524
Bug ID: 123524
Summary: 6% performance regression in gcc-16 compared to gcc-15 when compiling an interpreter
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: mikulas at artax dot karlin.mff.cuni.cz
Target Milestone: ---
Hi
GCC 16 (20251214 from Debian Sid) generates worse code when compiling the
interpreter for the Ajla programming language.
I uploaded the preprocessed source code for the file ipret.c here. Compile it
with gcc -O2.
http://www.jikos.cz/~mikulas/testcases/gcc/ipret-gcc15.e
http://www.jikos.cz/~mikulas/testcases/gcc/ipret-gcc16.e
This is a piece of interpreter code that checks tags, sums two signed 64-bit
numbers, checks for overflow, stores the result and jumps to the next
instruction.
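For readers who do not want to dig through the preprocessed file, the handler
has roughly the following shape. This is a minimal sketch and not the actual
Ajla source: the struct frame layout, the name add_int64_fast and the fixed
table sizes are assumptions made for illustration. Only the operand positions
(bytes 2, 3 and 4 of the instruction) and the 6-byte instruction length are
taken from the disassembly below. __builtin_add_overflow is the GCC builtin
for checked addition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical frame layout: one tag byte and one 64-bit slot per variable. */
struct frame {
    uint8_t tags[256];
    int64_t slots[256];
};

/*
 * One "add" step of the interpreter (illustrative only).
 * ip points at the current instruction; bytes 2, 3 and 4 hold the variable
 * indices of the first operand, the second operand and the result.
 * Returns true if the fast path stored the sum, false if the caller has to
 * take the slow "escape" path (tagged operand or signed overflow).
 */
static bool add_int64_fast(struct frame *fp, const uint8_t *ip)
{
    uint8_t a = ip[2], b = ip[3], r = ip[4];
    int64_t sum;

    /* Escape if at least one argument is tagged. */
    if (fp->tags[a] | fp->tags[b])
        return false;

    /* Escape on signed 64-bit overflow. */
    if (__builtin_add_overflow(fp->slots[a], fp->slots[b], &sum))
        return false;

    fp->slots[r] = sum;
    return true;
}
```

In the real interpreter a successful step is followed by advancing the
instruction pointer by 6 and an indirect jump through a table indexed by the
next opcode; the sketch leaves that dispatch to the caller.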
On gcc-15 we can see that the code is almost optimal:
gcc-15:
36931: 41 0f b6 54 24 02 movzbl 0x2(%r12),%edx <--- load variable offsets from the bytecode
36937: 41 0f b6 4c 24 03 movzbl 0x3(%r12),%ecx
3693d: 41 0f b6 74 24 04 movzbl 0x4(%r12),%esi
36943: 0f b6 3c 0b movzbl (%rbx,%rcx,1),%edi <--- check tags
36947: 40 0a 3c 13 or (%rbx,%rdx,1),%dil
3694b: 75 22 jne 3696f <u_run+0x31d7f> <--- escape if at least one argument is tagged
3694d: 48 8b 14 d3 mov (%rbx,%rdx,8),%rdx <--- load the first argument
36951: 48 03 14 cb add (%rbx,%rcx,8),%rdx <--- add the second argument
36955: 70 18 jo 3696f <u_run+0x31d7f> <--- escape on overflow
36957: 48 89 14 f3 mov %rdx,(%rbx,%rsi,8) <--- store the result
3695b: 41 0f b7 54 24 06 movzwl 0x6(%r12),%edx <--- load the next instruction opcode
36961: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax <--- load the base of the jump table
36968: 49 83 c4 06 add $0x6,%r12 <--- increase the opcode pointer
3696c: ff 24 d0 jmp *(%rax,%rdx,8) <--- jump to the next instruction
The equivalent code generated by gcc-16 is this:
gcc-16:
19cca: 41 0f b6 55 02 movzbl 0x2(%r13),%edx
19ccf: 41 0f b6 4d 03 movzbl 0x3(%r13),%ecx
19cd4: 41 0f b6 7d 04 movzbl 0x4(%r13),%edi
19cd9: 0f b6 34 13 movzbl (%rbx,%rdx,1),%esi
19cdd: 40 0a 34 0b or (%rbx,%rcx,1),%sil
19ce1: 0f 85 c7 81 02 00 jne 41eae <u_run+0x3d26e>
19ce7: 48 8d 34 d3 lea (%rbx,%rdx,8),%rsi
19ceb: 48 8d 14 cb lea (%rbx,%rcx,8),%rdx
19cef: 40 0f b6 cf movzbl %dil,%ecx
19cf3: 48 8d 0c cb lea (%rbx,%rcx,8),%rcx
19cf7: 48 8b 06 mov (%rsi),%rax
19cfa: 48 03 02 add (%rdx),%rax
19cfd: 0f 80 ab 81 02 00 jo 41eae <u_run+0x3d26e>
19d03: 48 89 01 mov %rax,(%rcx)
19d06: 41 0f b7 55 06 movzwl 0x6(%r13),%edx
19d0b: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # 19d12 <u_run+0x150d2>
19d12: 49 83 c5 06 add $0x6,%r13
19d16: 48 8b 04 d0 mov (%rax,%rdx,8),%rax
19d1a: e9 61 af fe ff jmp 4c80 <u_run+0x40>
4c80: ba 02 00 02 00 mov $0x20002,%edx
4c85: 66 0f 6e e2 movd %edx,%xmm4
4c89: 66 0f 70 fc 00 pshufd $0x0,%xmm4,%xmm7
4c8e: 0f 29 3c 24 movaps %xmm7,(%rsp)
4c92: ff e0 jmp *%rax
We can see that gcc-16 doesn't use the scaled addressing modes when accessing
the variables, and that there is nonsensical code at address 4c80 that stores a
pattern to the stack frame (I don't know where this comes from; the source
code doesn't contain any attempt to store the constant 0x20002 at that point).
Note that if I use the flag -fno-tree-vectorize, the code that stores 0x20002
to the stack frame is not generated (but gcc still doesn't use the scaled
addressing modes).
Due to these regressions, the code generated by gcc-16 is bigger and slower:
"objdump -d ipret.o |wc -l"
gcc-15: 69695 lines
gcc-16: 75414 lines
Benchmark:
1. Download Ajla from https://www.ajla-lang.cz/
2. Compile it with CC='gcc-15 -DDEBUG_ENV -O2' and with CC='gcc-16 -DDEBUG_ENV -O2'
(the DEBUG_ENV macro makes it respond to debugging environment variables)
3. Run time CG=none ./scripts/update.sh
(this compiles the language itself, I use it as a benchmark)
(the CG=none variable disables the code generator, so that it uses only the
interpreter)
The results:
Core i7-2640M:
gcc-15: 51 seconds
gcc-16: 54 seconds
Ryzen 7 PRO 7840U:
gcc-15: 9.2 seconds
gcc-16: 9.9 seconds
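To put the figures above on a common scale, the relative slowdown can be
computed as (after - before) / before. The helper below is only a sketch for
that arithmetic; the name slowdown_percent is made up for this report. With
the numbers above it gives about 5.9% for the Core i7-2640M, about 7.6% for
the Ryzen 7 PRO 7840U, and about 8.2% growth in the objdump line count.

```c
/*
 * Relative increase from "before" to "after", in percent.
 * Works for run times as well as for code-size figures.
 */
static double slowdown_percent(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

For example, slowdown_percent(51.0, 54.0) covers the Core i7-2640M timing and
slowdown_percent(69695.0, 75414.0) covers the disassembly size.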