https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122118
Bug ID: 122118
Summary: Combining writes causes unnecessary runtime
relocations
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: andreas.rheinhardt at outlook dot com
Target Milestone: ---
Created attachment 62481
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62481&action=edit
small file with the same testcase as the bug report
static int a, b;

void init(int *ptrs[]) {
    ptrs[0] = &a;
    ptrs[1] = &b;
}

__attribute__((cold)) void init_cold(int *ptrs[]) {
    ptrs[0] = &a;
    ptrs[1] = &b;
}
When the above code is compiled for x86-64 with today's GCC
(commit c6865e7e15bc9a1337df00d2ca03604e1712a2dd) and -fpic -c -O3, the result is:
Disassembly of section .text:
0000000000000000 <init>:
0: f3 0f 7e 05 00 00 00 movq 0x0(%rip),%xmm0 # 8 <init+0x8>
7: 00
4: R_X86_64_PC32 .data.rel.ro.local-0x4
8: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # f <init+0xf>
b: R_X86_64_PC32 .bss-0x4
f: 66 48 0f 6e c8 movq %rax,%xmm1
14: 66 0f 6c c1 punpcklqdq %xmm1,%xmm0
18: 0f 11 07 movups %xmm0,(%rdi)
1b: c3 ret
Disassembly of section .text.unlikely:
0000000000000000 <init_cold>:
0: f3 0f 7e 05 00 00 00 movq 0x0(%rip),%xmm0 # 8 <init_cold+0x8>
7: 00
4: R_X86_64_PC32 .data.rel.ro.local-0x4
8: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # f <init_cold+0xf>
b: R_X86_64_PC32 .bss-0x4
f: 66 48 0f 6e c8 movq %rax,%xmm1
14: 66 0f 6c c1 punpcklqdq %xmm1,%xmm0
18: 0f 11 07 movups %xmm0,(%rdi)
1b: c3 ret
As can be seen, GCC combines the two writes, and this causes several issues:
1. The most important one is that the first load (the movq from
.data.rel.ro.local) cannot be relaxed to an lea by the linker at all: GCC puts
a pointer into .data.rel.ro and adds a runtime relocation for this pointer,
both of which are of course completely unnecessary. (Given that a and b are
static here, even emitting code that needs relaxing by the linker would be
suboptimal in itself.) A hand-written sketch of what this amounts to at the
assembly level follows after this list.
2. When -fno-tree-vectorize is added, the stores are not combined and the code
for both functions looks like this:
0: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # 7 <init_cold+0x7>
3: R_X86_64_PC32 .bss
7: 48 89 07 mov %rax,(%rdi)
a: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # 11 <init_cold+0x11>
d: R_X86_64_PC32 .bss-0x4
11: 48 89 47 08 mov %rax,0x8(%rdi)
15: c3 ret
This is six bytes shorter than the function compiled with the default
vectorization settings. I'd expect this alone to make this version the
preferred one for the function marked __attribute__((cold)), but it isn't.
3. I would be surprised if the vectorized version of this code were actually
faster than the unvectorized one: it performs an actual memory load, which can
be slow if the accessed cache line is not already in a CPU cache; it has to
combine the two pointers; and the xmm store may cross a cache-line boundary,
which is slow on older CPUs.
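To make issue 1 more concrete, here is a hand-written sketch of what the
vectorized function amounts to at the assembly level (reconstructed by hand
from the object file above, so label names, symbol order and directives are my
guesses rather than GCC's actual output):

        .local  b
        .comm   b,4,4                   # static int b
        .local  a
        .comm   a,4,4                   # static int a

        .section .data.rel.ro.local,"aw"
        .align  8
.LC0:
        .quad   a                       # &a stored as data; under -fpic the dynamic
                                        # linker has to patch this slot at load time
                                        # (an R_X86_64_RELATIVE entry in a DSO)

        .text
        .globl  init
init:
        movq    .LC0(%rip), %xmm0       # the problematic load of &a from the pool
        leaq    b(%rip), %rax           # &b is still computed with a plain lea
        movq    %rax, %xmm1
        punpcklqdq %xmm1, %xmm0         # combine the two pointers
        movups  %xmm0, (%rdi)           # one 16-byte store into ptrs[]
        ret

The pool slot and its load-time patching are exactly the unnecessary overhead
described in point 1; the unvectorized version needs neither.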
Trying distro builds of GCC shows that this changed between GCC 13 and GCC 14.
With GCC 13, the output was:
0: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # 7 <init_cold+0x7>
3: R_X86_64_PC32 .bss-0x4
7: 48 8d 15 00 00 00 00 lea 0x0(%rip),%rdx # e <init_cold+0xe>
a: R_X86_64_PC32 .bss
e: 66 48 0f 6e c2 movq %rdx,%xmm0
13: 66 48 0f 6e c8 movq %rax,%xmm1
18: 66 0f 6c c1 punpcklqdq %xmm1,%xmm0
1c: 0f 11 07 movups %xmm0,(%rdi)
1f: c3 ret
Still using a vector store, but without runtime relocations.
PS: If GCC thinks that using an xmm store is beneficial here, why does it not
initialize xmm1 the same way as xmm0 instead of going through rax?
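For clarity, by "the same way" I mean something along these lines (a purely
hypothetical sketch reusing the pool from the sketch above, with an assumed
extra slot .LC1 holding &b; it would of course add a second pool entry and
runtime relocation, so it only illustrates the asymmetry, not a better
alternative):

        .section .data.rel.ro.local,"aw"
        .align  8
.LC0:   .quad   a                       # as currently emitted
.LC1:   .quad   b                       # hypothetical second slot

        .text
        movq    .LC0(%rip), %xmm0       # &a loaded straight into xmm0, as now
        movq    .LC1(%rip), %xmm1       # &b loaded the same way, no detour via rax
        punpcklqdq %xmm1, %xmm0
        movups  %xmm0, (%rdi)
        ret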