Dear GCC folk,
I have to say that GCC’s -Os caught me by surprise after several years using
Apple GCC and, more recently, LLVM/Clang in Xcode. Over the last year and a half
I have been working on RISC-V development and have been exclusively using GCC
for RISC-V builds, and initially I was using -Os. After performing a
qualitative/quantitative assessment, I don’t believe GCC’s current -Os is
particularly useful, at least for my needs, as it doesn’t provide a saving in
size commensurate with the sometimes quite large drop in performance.
I’m quoting an extract from Eric’s earlier email in the "Overwhelmed by GCC
frustration" thread, as I think Apple’s documentation, which presumably
describes the Clang/LLVM -Os policy, is what I would call an ideal -Os (perhaps
using -O2 as a starting point), with the idea that the current -Os is renamed
to -Oz.
-Oz
    (APPLE ONLY) Optimize for size, regardless of performance. -Oz
    enables the same optimization flags that -Os uses, but -Oz also
    enables other optimizations intended solely to reduce code size.
    In particular, instructions that encode into fewer bytes are
    preferred over longer instructions that execute in fewer cycles.
    -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
    -Oz employs the same inlining limits and avoids string
    instructions just like -Os.

-Os
    Optimize for size, but not at the expense of speed. -Os enables
    all -O2 optimizations that do not typically increase code size.
    However, instructions are chosen for best performance, regardless
    of size. To optimize solely for size on Darwin, use -Oz
    (APPLE ONLY).
I have recently been working on a benchmark suite to test a RISC-V JIT engine.
I have performed all testing using GCC 7.1 as the baseline compiler, and during
the process I have collected several performance metrics, some of which are
independent of the JIT runtime environment. In particular, I have made
performance comparisons between -Os and -O3 on x86, along with capturing
executable file sizes, dynamic retired instruction and micro-op counts for x86,
dynamic retired instruction counts for RISC-V, as well as dynamic register and
instruction usage histograms for RISC-V, for both -Os and -O3.
See the Optimisation section for a charted performance comparison between -O3
and -Os. There are dozens of other plots that show the differences between -Os
and -O3.
- https://rv8.io/bench
The geomean on x86 shows a 19% performance hit for -Os versus -O3. The geomean
of course smooths over some pathological cases where -Os performance is
severely degraded versus -O3 without any significant, or commensurate, saving
in size.
I don’t currently have -O2 in my results; however, it seems I should add -O2
to the benchmark suite. If you take a look at the web page you’ll see that
there is already a huge amount of data, given we have captured dynamic register
frequencies and dynamic instruction frequencies for -Os and -O3. The tables and
charts are all generated by scripts, so if there is interest I could add -O2. I
can also pretty easily perform runs with new compiler versions as everything is
completely automated. The main constraint is that a full run currently takes
4 hours, as we run all of the benchmarks in a simulator to capture dynamic
register usage and dynamic instruction usage.
After looking at the results, one has to question the utility of -Os in its
present form, and indeed question how it is actually used in practice, given
how modest the savings in executable size are. After my assessment I would not
recommend that anyone use -Os, because its savings in size are not
proportionate to the loss in performance. I feel discouraged from using it
after looking at the results. I really don’t believe -Os makes the right
trade-offs; done well, reducing code size can even improve performance, because
smaller code reduces icache pressure.
I also wonder whether -O2-level optimisations might be a good starting point
for a more useful -Os, and how one would go about selecting optimisations to
add back to -Os to increase its usability; alternatively, the current -Os could
be renamed to -Oz and -Os made an alias for -O2. A profile similar to -O2 would
probably produce less of a shock for anyone who does quantitative performance
analysis of -Os.
In fact there are some interesting issues for the RISC-V backend, given that
the assembler performs RVC compression and GCC doesn’t really see the size of
the emitted instructions. It would be an interesting backend in which to
investigate improving -Os, presuming that a backend can opt in to various
optimisations for a given optimisation level. RISC-V would gain most of its
improvements in size and runtime icache pressure by getting the
highest-frequency registers allocated within the eight-register set that is
accessible to the RVC instructions. Merely steering register allocation to
favour the RVC-accessible registers would produce the largest savings in
executable size, and may indeed be good for performance due to reduced icache
pressure (a simple way to quantify how well the allocation is doing is sketched
below).
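To put a number on that, the statistic I look at in the histograms is simply
the share of dynamic integer register accesses that land in the RVC window
x8 to x15. A minimal sketch of the calculation, assuming a per-register dynamic
access count array like the one our simulator produces (the rvc_coverage helper
and the counts layout are hypothetical, for illustration only):

    #include <stdint.h>

    /* Share of dynamic integer register accesses that fall in the
       RVC-accessible window x8-x15.  counts[r] stands in for the dynamic
       access count of integer register xr captured by the simulator.  */
    static double
    rvc_coverage (const uint64_t counts[32])
    {
      uint64_t total = 0, rvc = 0;
      for (int r = 0; r < 32; r++)
        {
          total += counts[r];
          if (r >= 8 && r <= 15)
            rvc += counts[r];
        }
      return total ? (double) rvc / (double) total : 0.0;
    }

The closer the allocator pushes this towards 1.0 for the hottest functions, the
more of the instruction stream the assembler is able to compress.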
I have Dynamic Register Frequency Charts but they are not presently labelled or
coloured to show whether the registers are RVC-accessible (x8 to x15). I did
however work on some crude ASCII histograms that indicate register access
frequency and whether the register is RVC-accessible. Ideally the register
allocator would allocate the highest-frequency registers first from the RVC
set. The register order is already correctly defined in the RISC-V backend. I
have been experimenting with riscv_register_priority (sketched below) to try to
nudge LRA, but have not yet had success. riscv_register_priority currently
returns 1 for RVC registers (if the C extension is present) and 0 for regular
registers; however, either the loop frequency information is not accurate
enough or LRA does not completely honour the register order and priority. It
likely won’t make a lot of difference on platforms with very regular register
files. See this gist for one of the benchmarks’ register access frequencies,
labelled as to whether each register is accessible from compressed
instructions:
- https://gist.github.com/michaeljclark/8ba727e56084833e4f838c941eeca6be
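For reference, this is roughly the shape the hook has today and the starting
point for my experiments. It is a sketch from memory of the code in
gcc/config/riscv/riscv.c, so treat the exact macro names (TARGET_RVC,
GP_REG_FIRST, FP_REG_FIRST, IN_RANGE) and details as approximate rather than
authoritative:

    /* Implement TARGET_REGISTER_PRIORITY.  Sketch: when the C extension
       is enabled, prefer the registers reachable by the 3-bit RVC
       register fields (x8-x15 and f8-f15), nudging the allocator
       towards compressible instructions.  */
    static int
    riscv_register_priority (int regno)
    {
      if (TARGET_RVC
          && (IN_RANGE (regno, GP_REG_FIRST + 8, GP_REG_FIRST + 15)
              || IN_RANGE (regno, FP_REG_FIRST + 8, FP_REG_FIRST + 15)))
        return 1;

      return 0;
    }

Bumping the value returned for the RVC window is the obvious knob to turn, but
as noted above it has not yet produced the allocations I was hoping for.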
Question. Who uses -Os on GCC?
I have for many years used -Os on macOS for Clang builds, as it has been an
Xcode default, but I’m considering using -O2 instead of -Os with FSF GCC. I was
using FSF GCC’s -Os under the mistaken impression that it operates similarly to
-Os in Xcode, i.e. that it produces code that performs well.
In any case, despite my rant, I hope the quantitative stats in the link above
prove to be useful.
Thanks and Regards,
Michael.