Due to differences between the 64-bit Microsoft and System V ABIs, any msabi function that calls a sysv function must consider RSI, RDI and XMM6-15 as clobbered. The result is that such functions are bloated with SSE saves/restores costing as much as 106 bytes each (up to 200-ish bytes per function). This patch set targets 64-bit Wine and aims to mitigate some of those costs.
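
To make the ABI mismatch concrete, here is a minimal sketch of the kind of call that triggers these saves. The function names (sysv_scale, ms_entry) are made up for illustration and do not come from Wine or from this patch set; the only real interfaces used are the ms_abi, sysv_abi and noinline function attributes. The macros fall back to nothing on other targets so the file still compiles there, although no ABI crossing happens in that case.

```c
/* Illustrative only: hypothetical functions, not code from Wine or the patches. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define MS_ABI   __attribute__((ms_abi))
#define SYSV_ABI __attribute__((sysv_abi))
#define NOINLINE __attribute__((noinline))
#else
/* Fallback so the sketch compiles elsewhere; no ABI crossing occurs. */
#define MS_ABI
#define SYSV_ABI
#define NOINLINE
#endif

/* System V callee: the ABI lets it overwrite RSI, RDI and XMM6-15 freely. */
SYSV_ABI NOINLINE double sysv_scale(double x, int n)
{
    return x * n;
}

/* Microsoft-ABI caller: its own caller expects RSI, RDI and XMM6-15 to be
   preserved, so the compiler must treat them as clobbered by the call below
   and save/restore them in this function's prologue/epilogue. */
MS_ABI NOINLINE double ms_entry(double x, int n)
{
    return sysv_scale(x, n) + 1.0;
}
```

Compiling something like this with -O2 -S on x86-64 and inspecting ms_entry should show the SSE saves and restores around the call; the exact code emitted depends on the compiler version.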

A few save & restore stubs are added to the static portion of libgcc, and the pro/epilogues of such functions use these stubs instead, thus reducing .text size. While we're already tinkering with stubs, they also manage the save/restore of up to 6 additional registers. Analysis of building Wine 64 demonstrates a reduction of .text by around 20%. I haven't produced performance data yet, but since this is my first attempt to modify gcc, I would rather ask for comments early in the process.

The basic theory is that a reduction of I-cache misses will offset the extra instructions required for implementation. In addition, since there are only a handful of stubs that will be in memory, I'm using the larger mov instructions instead of push/pop to facilitate better parallelization.

Here is a sample of what these prologues/epilogues look like:

Prologue (in this case, SP adjustment was properly combined with later stack allocation):
    7b833800:   48 8d 44 24 88          lea    -0x78(%rsp),%rax
    7b833805:   48 81 ec 58 01 00 00    sub    $0x158,%rsp
    7b83380c:   e8 95 6f 05 00          callq  7b88a7a6 <__savms64_17>

Epilogue (r10 stores the value to restore the stack pointer to):
    7b83386c:   48 8d b4 24 e0 00 00    lea    0xe0(%rsp),%rsi
    7b833873:   00
    7b833874:   4c 8d 56 78             lea    0x78(%rsi),%r10
    7b833878:   e9 c9 6f 05 00          jmpq   7b88a846 <__resms64x_17>

Prologue, stack realignment case (this shows the uncombined SP modifications, described below):
    7b833800:   55                      push   %rbp
    7b833801:   48 8d 44 24 90          lea    -0x70(%rsp),%rax
    7b833806:   48 89 e5                mov    %rsp,%rbp
    7b833809:   48 83 e0 f0             and    $0xfffffffffffffff0,%rax
    7b83380d:   48 8d 60 90             lea    -0x70(%rax),%rsp
    7b833811:   e8 cc 79 05 00          callq  7b88b1e2 <__savms64r_17>
    7b833816:   48 89 cb                mov    %rcx,%rbx    # reordered insn from body
    7b833819:   48 83 ec 70             sub    $0x70,%rsp
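
The address arithmetic in the realigning prologue above can be followed with a small model. This is only a sketch: the struct and function names are hypothetical, it assumes the stub returns with the stack pointer unchanged (an ordinary call/ret pair), and it makes no claim about what each 0x70-byte region holds.

```c
#include <stdint.h>

/* Hypothetical model of the registers touched by the listing above. */
struct prologue_regs {
    uint64_t rsp;
    uint64_t rbp;
    uint64_t rax;
};

/* Replays the prologue's arithmetic for a given stack pointer at entry. */
struct prologue_regs model_realign_prologue(uint64_t rsp_at_entry)
{
    struct prologue_regs r;

    r.rsp = rsp_at_entry - 8;     /* push %rbp */
    r.rax = r.rsp - 0x70;         /* lea -0x70(%rsp),%rax */
    r.rbp = r.rsp;                /* mov %rsp,%rbp */
    r.rax &= ~(uint64_t)0xf;      /* and $0xfffffffffffffff0,%rax */
    r.rsp = r.rax - 0x70;         /* lea -0x70(%rax),%rsp */
    /* callq __savms64r_17: assumed to leave rsp as it found it. */
    r.rsp -= 0x70;                /* sub $0x70,%rsp (the uncombined adjustment) */
    return r;
}
```

For example, an entry value of 0x1004 gives rbp = 0xffc, an aligned rax of 0xf80 and a final rsp of 0xea0, that is, 0xe0 below the aligned address. That appears consistent with the 0xe0 offset used by the epilogue, and it shows why the two 0x70 adjustments are candidates for combining.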

Epilogue, stack realignment case:
    7b833875:   48 8d b4 24 e0 00 00    lea    0xe0(%rsp),%rsi
    7b83387c:   00
    7b83387d:   e9 ac 79 05 00          jmpq   7b88b22e <__resms64rx_17>


Questions and (known) outstanding issues:

1. I have added the new -f optimization to common.opt, but since it
   only affects x86_64, should this be a machine-specific -m option
   instead?
2. In the prologues that realign the stack, stack pointer modifications
   aren't combining, presumably since I'm using a lea after realigning
   using rax.
3. My x86 assembly expertise is limited, so I would appreciate any
   feedback on my stubs & emitted code.
4. Documentation is still missing.
5. A Changelog entry is still missing.
6. This is my first major work on a GNU project and I have not yet
   fully reviewed all of the relevant GNU coding conventions, so I
   might still have some non-compliant code.
7. Regression tests have only been run on my old Phenom. I have not yet
   tested on an AVX CPU (which should use vmovaps instead of movaps).
8. My test program is inadequate (and is not included in this patch
   set).  During development it failed to expose many of the
   optimization errors that I got when building Wine.  I've been
   building 64-bit Wine and running Wine's tests in the meantime.
9. I need to devise a meaningful benchmarking strategy.
10. I have not yet examined how this may or may not affect -flto or
   where additional optimization opportunities in the lto driver may exist.
11. There are a few more optimization opportunities that I haven't
   attempted to exploit yet and prefer to leave for later projects.
     * In the case of stack realignment and all 17 registers being
       clobbered, I can combine the majority of the prologue
       (alignment, saving frame pointer, etc.) in the stub.
     * With these stubs being in the static portion of libgcc, each
       Wine "dll" gets a separate copy. The average number of dlls a
       Windows program loads seems to be at least 15, so a mechanism
       that lets them be linked dynamically from libwine.so could
       save a little bit more .text and icache.
     * Ultimately, good static analysis of local sysv functions can
       completely eliminate the need to save SSE registers in some cases.
12. Use of hard frame pointers disables the optimization unless we're
   also realigning the stack. I've implemented this in another (local)
   branch, but haven't tested it yet.


 gcc/common.opt                 |   7 +
 gcc/config/i386/i386.c         | 729 ++++++++++++++++++++++++++++++++++++++---
 gcc/config/i386/i386.h         |  22 +-
 gcc/config/i386/predicates.md  | 148 +++++++++
 gcc/config/i386/sse.md         |  56 ++++
 libgcc/config.host             |   2 +-
 libgcc/config/i386/i386-asm.h  |  82 +++++
 libgcc/config/i386/resms64.S   |  63 ++++
 libgcc/config/i386/resms64f.S  |  59 ++++
 libgcc/config/i386/resms64fx.S |  61 ++++
 libgcc/config/i386/resms64x.S  |  65 ++++
 libgcc/config/i386/savms64.S   |  63 ++++
 libgcc/config/i386/savms64f.S  |  64 ++++
 libgcc/config/i386/t-msabi     |   7 +
 14 files changed, 1379 insertions(+), 49 deletions(-)

Feedback and comments would be most appreciated!

Thanks,
Daniel
