[PATCH 0/9] RFC: Add optimization -foutline-msabi-xlougues (for Wine 64)

Daniel Santos Tue, 15 Nov 2016 12:01:08 -0800

Due to differences between the 64-bit Microsoft and System V ABIs, any msabi function that calls a sysv function must consider RSI, RDI and XMM6-15 as clobbered. The result is that such functions are bloated with SSE saves/restores costing as much as 106 bytes each (up to 200-ish bytes per function). This patch set targets 64-bit Wine and aims to mitigate some of those costs.
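As a rough illustration of the situation described above (not code from the patch set), the sketch below shows an ms_abi function calling a sysv function, using GCC's `ms_abi` and `sysv_abi` function attributes. The names `sysv_helper` and `msabi_entry` are hypothetical, and the attribute macros fall back to nothing on non-x86-64 targets so that the sketch still compiles there.

```c
#include <stdint.h>

/* The ABI attributes are only meaningful on x86-64 with a GNU-compatible
   compiler; elsewhere they expand to nothing so the sketch still builds. */
#if defined(__x86_64__) && defined(__GNUC__)
# define MSABI    __attribute__((ms_abi))
# define SYSV     __attribute__((sysv_abi))
# define NOINLINE __attribute__((noinline))
#else
# define MSABI
# define SYSV
# define NOINLINE
#endif

/* Hypothetical System V callee.  Under the sysv ABI it is free to
   overwrite RSI, RDI and XMM6-15 without saving them. */
SYSV NOINLINE uint64_t sysv_helper(uint64_t x)
{
    return x * 2 + 1;
}

/* Hypothetical ms_abi caller.  Its own caller expects RSI, RDI and
   XMM6-15 to be preserved, so the compiler has to save and restore them
   around the body because of the call to sysv_helper. */
MSABI NOINLINE uint64_t msabi_entry(uint64_t x)
{
    return sysv_helper(x) + 1;
}
```

Compiling this with `gcc -O2 -S` on x86-64 and reading the code generated for `msabi_entry` is one way to see the register saves and restores in question.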

A few save & restore stubs are added to the static portion of libgcc, and the pro/epilogues of such functions use these stubs instead, thus reducing .text size. While we're already tinkering with stubs, the patch set also manages the save/restore of up to 6 additional registers. Analysis of building Wine 64 demonstrates a reduction of .text by around 20%. While I haven't produced performance data yet, this is my first attempt to modify gcc, so I would rather ask for comments earlier in this process.

The basic theory is that a reduction of I-cache misses will offset the extra instructions required for implementation. In addition, since there are only a handful of stubs that will be in memory, I'm using the larger mov instructions instead of push/pop to facilitate better parallelization.


Here is a sample of what these prologues/epilogues look like:

Prologue (in this case, SP adjustment was properly combined with later stack allocation):

    7b833800:   48 8d 44 24 88          lea    -0x78(%rsp),%rax
    7b833805:   48 81 ec 58 01 00 00    sub    $0x158,%rsp
    7b83380c:   e8 95 6f 05 00          callq  7b88a7a6 <__savms64_17>

Epilogue (r10 stores the value to restore the stack pointer to):

    7b83386c:   48 8d b4 24 e0 00 00    lea    0xe0(%rsp),%rsi
    7b833873:   00
    7b833874:   4c 8d 56 78             lea    0x78(%rsi),%r10
    7b833878:   e9 c9 6f 05 00          jmpq   7b88a846 <__resms64x_17>
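
For a sense of scale, the encoded size of the sample prologue and epilogue above can be added up from the byte columns of the listing. The sketch below does only that arithmetic; the array and function names are hypothetical, and the instruction lengths are read directly off the listing. Note that the `sub` in the prologue also performs the function's own stack allocation, so not all of the prologue bytes are attributable to register saving.

```c
#include <stddef.h>

/* Encoded lengths in bytes, read off the listing above. */
static const size_t prologue_insn_len[] = { 5, 7, 5 };  /* lea, sub, callq */
static const size_t epilogue_insn_len[] = { 8, 4, 5 };  /* lea, lea, jmpq  */

/* Sum the first n entries of an instruction-length table. */
static size_t total_len(const size_t *len, size_t n)
{
    size_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += len[i];
    return sum;
}

/* Total bytes of the sample prologue that calls __savms64_17. */
size_t sample_prologue_bytes(void)
{
    return total_len(prologue_insn_len,
                     sizeof prologue_insn_len / sizeof prologue_insn_len[0]);
}

/* Total bytes of the sample epilogue that jumps to __resms64x_17. */
size_t sample_epilogue_bytes(void)
{
    return total_len(epilogue_insn_len,
                     sizeof epilogue_insn_len / sizeof epilogue_insn_len[0]);
}
```

Both totals come to 17 bytes, which can be set against the figure of as much as 106 bytes quoted earlier for each inline save or restore sequence.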

Prologue, stack realignment case (this shows the uncombined SP modifications, described below):

    7b833800:   55                      push   %rbp
    7b833801:   48 8d 44 24 90          lea    -0x70(%rsp),%rax
    7b833806:   48 89 e5                mov    %rsp,%rbp
    7b833809:   48 83 e0 f0             and    $0xfffffffffffffff0,%rax
    7b83380d:   48 8d 60 90             lea    -0x70(%rax),%rsp
    7b833811:   e8 cc 79 05 00          callq  7b88b1e2 <__savms64r_17>
    7b833816:   48 89 cb                mov    %rcx,%rbx    # reordered insn from body
    7b833819:   48 83 ec 70             sub    $0x70,%rsp
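
The address arithmetic in the realigning prologue above can be modelled in C as follows. This is only a sketch of what the `lea`/`and`/`lea` sequence computes. The function names are hypothetical, the 0x70 offsets are taken from this particular listing, and how the stub itself uses rax is not shown in the sample, so nothing is assumed about that here.

```c
#include <stdint.h>

/* Value left in rax: 0x70 below the stack pointer (as it is after
   push %rbp), rounded down to a 16-byte boundary by the and. */
uint64_t realigned_rax(uint64_t rsp_after_push)
{
    return (rsp_after_push - 0x70) & ~(uint64_t)0xf;
}

/* New stack pointer set by the second lea: another 0x70 below the
   aligned rax value, and therefore also 16-byte aligned. */
uint64_t realigned_rsp(uint64_t rsp_after_push)
{
    return realigned_rax(rsp_after_push) - 0x70;
}
```

For example, with a stack pointer of 0x7fff0008 after the push, rax becomes 0x7ffeff90 and the new stack pointer becomes 0x7ffeff20.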

Epilogue, stack realignment case:

    7b833875:   48 8d b4 24 e0 00 00    lea    0xe0(%rsp),%rsi
    7b83387c:   00
    7b83387d:   e9 ac 79 05 00          jmpq   7b88b22e <__resms64rx_17>


Questions and (known) outstanding issues:

1. I have added the new -f optimization to common.opt, but since it
   only affects x86_64, should this be a machine-specific -m option
   instead?
2. In the prologues that realign the stack, stack pointer modifications
   aren't combining, presumably since I'm using a lea after realigning
   using rax.
3. My x86 assembly expertise is limited, so I would appreciate any
   feedback on my stubs & emitted code.
4. Documentation is still missing.
5. A ChangeLog entry is still missing.
6. This is my first major work on a GNU project and I have not yet
   fully reviewed all of the relevant GNU coding conventions, so I
   might still have some non-compliant code.
7. Regression tests have only been run on my old Phenom. I have not yet
   tested on an AVX CPU (which should use vmovaps instead of movaps).
8. My test program is inadequate (and is not included in this patch
   set).  During development it failed to trigger many of the
   optimization errors that I got when building Wine.  I've been
   building 64-bit Wine and running Wine's tests in the meantime.
9. I need to devise a meaningful benchmarking strategy.
10. I have not yet examined how this may or may not affect -flto or
    where additional optimization opportunities in the lto driver may
    exist.
11. There are a few more optimization opportunities that I haven't
   attempted to exploit yet and prefer to leave for later projects.
     * In the case of stack realignment and all 17 registers being
       clobbered, I can combine the majority of the prologue
       (alignment, saving frame pointer, etc.) in the stub.
     * With these stubs being in the static portion of libgcc, each
       Wine "dll" gets a separate copy. The average number of dlls a
       Windows program loads seems to be at least 15, so a mechanism
       that lets them be linked dynamically from libwine.so could
       save a little bit more .text and icache.
     * Ultimately, good static analysis of local sysv functions can
       completely eliminate the need to save SSE registers in some cases.
12. Use of hard frame pointers disables the optimization unless we're
   also realigning the stack. I've implemented this in another (local)
   branch, but haven't tested it yet.


 gcc/common.opt                 |   7 +
 gcc/config/i386/i386.c         | 729 ++++++++++++++++++++++++++++++++++++++---
 gcc/config/i386/i386.h         |  22 +-
 gcc/config/i386/predicates.md  | 148 +++++++++
 gcc/config/i386/sse.md         |  56 ++++
 libgcc/config.host             |   2 +-
 libgcc/config/i386/i386-asm.h  |  82 +++++
 libgcc/config/i386/resms64.S   |  63 ++++
 libgcc/config/i386/resms64f.S  |  59 ++++
 libgcc/config/i386/resms64fx.S |  61 ++++
 libgcc/config/i386/resms64x.S  |  65 ++++
 libgcc/config/i386/savms64.S   |  63 ++++
 libgcc/config/i386/savms64f.S  |  64 ++++
 libgcc/config/i386/t-msabi     |   7 +
 14 files changed, 1379 insertions(+), 49 deletions(-)

Feedback and comments would be most appreciated!

Thanks,
Daniel
