Dear GCC folk,
I have to say that GCC’s -Os caught me by surprise after several years using
Apple GCC and, more recently, LLVM/Clang in Xcode. Over the last year and a half
I have been working on RISC-V development and have been exclusively using GCC
for RISC-V builds, and initially I was using -Os. After performing a
qualitative/quantitative assessment, I don’t believe GCC’s current -Os is
particularly useful, at least for my needs, as it doesn’t provide a saving in
size commensurate with the sometimes quite large drop in performance.
I’m quoting an extract from Eric’s earlier email in the "Overwhelmed by GCC
frustration" thread, as I think Apple’s documentation, which presumably
describes the Clang/LLVM -Os policy, is what I would call an ideal -Os (perhaps
using -O2 as a starting point), with the idea that the current -Os is renamed
to -Oz.
-Oz
    (APPLE ONLY) Optimize for size, regardless of performance. -Oz
    enables the same optimization flags that -Os uses, but -Oz also
    enables other optimizations intended solely to reduce code size.
    In particular, instructions that encode into fewer bytes are
    preferred over longer instructions that execute in fewer cycles.
    -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
    -Oz employs the same inlining limits and avoids string
    instructions just like -Os.

-Os
    Optimize for size, but not at the expense of speed. -Os enables
    all -O2 optimizations that do not typically increase code size.
    However, instructions are chosen for best performance, regardless
    of size. To optimize solely for size on Darwin, use -Oz
    (APPLE ONLY).
I have recently been working on a benchmark suite to test a RISC-V JIT engine.
I have performed all testing using GCC 7.1 as the baseline compiler, and during
the process I have collected several performance metrics, some of which are
independent of the JIT runtime environment. In particular, I have made
performance comparisons between -Os and -O3 on x86, along with capturing
executable file sizes, dynamic retired instruction and micro-op counts for x86,
dynamic retired instruction counts for RISC-V, as well as dynamic register and
instruction usage histograms for RISC-V, for both -Os and -O3.
See the Optimisation section for a charted performance comparison between -O3
and -Os. There are dozens of other plots that show the differences between -Os
and -O3.
- https://rv8.io/bench
The geomean on x86 shows a 19% performance hit for -Os versus -O3. The geomean
of course smooths over some pathological cases where -Os performance is
severely degraded versus -O3 without any significant, or commensurate, saving
in size.
I don’t currently have -O2 in my results; however, it seems I should add -O2
to the benchmark suite. If you take a look at the web page you’ll see that
there is already a huge amount of data, given we have captured dynamic register
frequencies and dynamic instruction frequencies for -Os and -O3. The tables and
charts are all generated by scripts, so if there is interest I could add -O2. I
can also pretty easily perform runs with new compiler versions as everything is
completely automated. The main constraint is that a full run currently takes
4 hours, as we run all of the benchmarks in a simulator to capture dynamic
register usage and dynamic instruction usage.
After looking at the results, one has to question the utility of -Os in its
present form, and indeed question how it is actually used in practice, given
how modest the savings in executable size are. After my assessment I would not
recommend that anyone use -Os, because its savings in size are not
proportionate to the loss in performance. I feel discouraged from using it
after looking at the results. I really don’t believe -Os makes the right
trade-offs; done well, reducing code size can even improve performance, because
smaller code reduces icache pressure.
I also wonder whether -O2-level optimisations might be a good starting point
for a more useful -Os, and how one would go about selecting optimisations to
add back to -Os to increase its usability; alternatively, the current -Os could
be renamed to -Oz and -Os made an alias for -O2. A profile similar to -O2 would
probably produce less of a shock for anyone who does quantitative performance
analysis of -Os.
In fact there are some interesting issues for the RISC-V backend, given that
the assembler performs RVC compression and GCC doesn’t really see the size of
the emitted instructions. It would be an interesting backend in which to
investigate improving -Os, presuming that a backend can opt in to various
optimisations for a given optimisation level. RISC-V would gain most of its
improvements in size and runtime icache pressure by getting the
highest-frequency registers allocated within the eight-register set that is
accessible to the RVC instructions. Merely steering register allocation to
favour the RVC-accessible registers would produce the largest savings in
executable size, and may indeed be good for performance due to reduced icache
pressure (a simple way to quantify how well the allocation is doing is sketched
below).
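To put a number on that, the statistic I look at in the histograms is simply
the share of dynamic integer register accesses that land in the RVC window
x8 to x15. A minimal sketch of the calculation, assuming a per-register dynamic
access count array like the one our simulator produces (the rvc_coverage helper
and the counts layout are hypothetical, for illustration only):

    #include <stdint.h>

    /* Share of dynamic integer register accesses that fall in the
       RVC-accessible window x8-x15.  counts[r] stands in for the dynamic
       access count of integer register xr captured by the simulator.  */
    static double
    rvc_coverage (const uint64_t counts[32])
    {
      uint64_t total = 0, rvc = 0;
      for (int r = 0; r < 32; r++)
        {
          total += counts[r];
          if (r >= 8 && r <= 15)
            rvc += counts[r];
        }
      return total ? (double) rvc / (double) total : 0.0;
    }

The closer the allocator pushes this towards 1.0 for the hottest functions, the
more of the instruction stream the assembler is able to compress.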
I have Dynamic Register Frequency Charts but they are not presently labelled or
coloured to show whether the registers are RVC-accessible (x8 to x15). I did
however work on some crude ASCII histograms that indicate register access
frequency and whether the register is RVC-accessible. Ideally the register
allocator would allocate the highest-frequency registers first from the RVC
set. The register order is already correctly defined in the RISC-V backend. I
have been experimenting with riscv_register_priority (sketched below) to try to
nudge LRA, but have not yet had success. riscv_register_priority currently
returns 1 for RVC registers (if the C extension is present) and 0 for regular
registers; however, either the loop frequency information is not accurate
enough or LRA does not completely honour the register order and priority. It
likely won’t make a lot of difference on platforms with very regular register
files. See this gist for one of the benchmarks’ register access frequencies,
labelled as to whether each register is accessible from compressed
instructions:
- https://gist.github.com/michaeljclark/8ba727e56084833e4f838c941eeca6be
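For reference, this is roughly the shape the hook has today and the starting
point for my experiments. It is a sketch from memory of the code in
gcc/config/riscv/riscv.c, so treat the exact macro names (TARGET_RVC,
GP_REG_FIRST, FP_REG_FIRST, IN_RANGE) and details as approximate rather than
authoritative:

    /* Implement TARGET_REGISTER_PRIORITY.  Sketch: when the C extension
       is enabled, prefer the registers reachable by the 3-bit RVC
       register fields (x8-x15 and f8-f15), nudging the allocator
       towards compressible instructions.  */
    static int
    riscv_register_priority (int regno)
    {
      if (TARGET_RVC
          && (IN_RANGE (regno, GP_REG_FIRST + 8, GP_REG_FIRST + 15)
              || IN_RANGE (regno, FP_REG_FIRST + 8, FP_REG_FIRST + 15)))
        return 1;

      return 0;
    }

Bumping the value returned for the RVC window is the obvious knob to turn, but
as noted above it has not yet produced the allocations I was hoping for.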
Question. Who uses -Os on GCC?
I have for many years used -Os on macOS for Clang builds, as it has been an
Xcode default, but I’m considering using -O2 instead of -Os with FSF GCC. I was
using FSF GCC’s -Os under the mistaken impression that it operates similarly to
-Os in Xcode, i.e. that it produces code that performs well.
In any case, despite my rant, I hope the quantitative stats in the link above
prove to be useful.
Thanks and Regards,
Michael.