Due to differences between the 64-bit Microsoft and System V ABIs, any
msabi function that calls a sysv function must consider RSI, RDI and
XMM6-15 as clobbered. The result is that such functions are bloated with
SSE saves/restores costing as much as 106 bytes each (up to 200-ish
bytes per function). This patch set targets 64-bit Wine and aims to
mitigate some of those costs.
A few save & restore stubs are added to the static portion of libgcc and
the pro/epilogues of such functions use these stubs instead, thus
reducing .text size. While we're already tinkering with stubs, they also
manage the save/restore of up to 6 additional registers. Analysis of
building Wine 64 demonstrates a reduction of .text by around 20%. I
haven't produced performance data yet, but since this is my first attempt
to modify gcc, I would rather ask for comments earlier in this process.
The basic theory is that a reduction of I-cache misses will offset the
extra instructions required for implementation. In addition, since there
are only a handful of stubs that will be in memory, I'm using the larger
mov instructions instead of push/pop to facilitate better parallelization.
Here is a sample of what these prologues/epilogues look like:
Prologue (in this case, SP adjustment was properly combined with later
stack allocation):
7b833800: 48 8d 44 24 88 lea -0x78(%rsp),%rax
7b833805: 48 81 ec 58 01 00 00 sub $0x158,%rsp
7b83380c: e8 95 6f 05 00 callq 7b88a7a6 <__savms64_17>
Epilogue (r10 stores the value to restore the stack pointer to):
7b83386c: 48 8d b4 24 e0 00 00 lea 0xe0(%rsp),%rsi
7b833873: 00
7b833874: 4c 8d 56 78 lea 0x78(%rsi),%r10
7b833878: e9 c9 6f 05 00 jmpq 7b88a846 <__resms64x_17>
Prologue, stack realignment case (this shows the uncombined SP
modifications, described below):
7b833800: 55 push %rbp
7b833801: 48 8d 44 24 90 lea -0x70(%rsp),%rax
7b833806: 48 89 e5 mov %rsp,%rbp
7b833809: 48 83 e0 f0 and $0xfffffffffffffff0,%rax
7b83380d: 48 8d 60 90 lea -0x70(%rax),%rsp
7b833811: e8 cc 79 05 00 callq 7b88b1e2 <__savms64r_17>
7b833816: 48 89 cb mov %rcx,%rbx # reordered insn from body
7b833819: 48 83 ec 70 sub $0x70,%rsp
Epilogue, stack realignment case:
7b833875: 48 8d b4 24 e0 00 00 lea 0xe0(%rsp),%rsi
7b83387c: 00
7b83387d: e9 ac 79 05 00 jmpq 7b88b22e <__resms64rx_17>
Questions and (known) outstanding issues:
1. I have added the new -f optimization to common.opt, but since it
only impacts x86_64, should this be a machine-specific -m option
instead?
2. In the prologues that realign the stack, stack pointer modifications
aren't combining, presumably since I'm using a lea after realigning
using rax.
3. My x86 assembly expertise is limited, so I would appreciate any
feedback on my stubs & emitted code.
4. Documentation is still missing.
5. A Changelog entry is still missing.
6. This is my first major work on a GNU project and I have not yet
fully reviewed all of the relevant GNU coding conventions, so I
might still have some non-compliant code.
7. Regression tests have only been run on my old Phenom. I have not yet
tested on an AVX CPU (which should use vmovaps instead of movaps).
8. My test program is inadequate (and is not included in this patch
set). During development it failed to expose many of the optimization
errors that I got when building Wine. I've been building 64-bit
Wine and running Wine's tests in the meantime.
9. I need to devise a meaningful benchmarking strategy.
10. I have not yet examined how this may or may not affect -flto or
where additional optimization opportunities in the lto driver may exist.
11. There are a few more optimization opportunities that I haven't
attempted to exploit yet and prefer to leave for later projects.
* In the case of stack realignment and all 17 registers being
clobbered, I can combine the majority of the prologue
(alignment, saving frame pointer, etc.) in the stub.
* With these stubs being in the static portion of libgcc, each
Wine "dll" gets a separate copy. The average number of dlls a
Windows program loads seems to be at least 15, so a mechanism
that allows them to be linked dynamically from libwine.so
could save a little bit more .text and icache.
* Ultimately, good static analysis of local sysv functions can
completely eliminate the need to save SSE registers in some cases.
12. Use of hard frame pointers disables the optimization unless we're
also realigning the stack. I've implemented this in another (local)
branch, but haven't tested it yet.
gcc/common.opt | 7 +
gcc/config/i386/i386.c | 729 ++++++++++++++++++++++++++++++++++++++---
gcc/config/i386/i386.h | 22 +-
gcc/config/i386/predicates.md | 148 +++++++++
gcc/config/i386/sse.md | 56 ++++
libgcc/config.host | 2 +-
libgcc/config/i386/i386-asm.h | 82 +++++
libgcc/config/i386/resms64.S | 63 ++++
libgcc/config/i386/resms64f.S | 59 ++++
libgcc/config/i386/resms64fx.S | 61 ++++
libgcc/config/i386/resms64x.S | 65 ++++
libgcc/config/i386/savms64.S | 63 ++++
libgcc/config/i386/savms64f.S | 64 ++++
libgcc/config/i386/t-msabi | 7 +
14 files changed, 1379 insertions(+), 49 deletions(-)
Feedback and comments would be most appreciated!
Thanks,
Daniel