Re: Quantitative analysis of -Os vs -O3

Michael Clark Sat, 26 Aug 2017 17:20:07 -0700

FYI - I’ve updated the stats to include -O2 in addition to -O3 and -Os:

- https://rv8.io/bench#optimisation


There are 57 plots and 31 tables, so it’s quite a bit of data. It will be 
interesting to run these on new GCC releases to monitor changes.

The Geomean for -O2 is 0.98 of -O3 on x86-64. I probably need to add some 
tables that show file sizes per architecture side by side, versus the current 
grouping which is by optimisation level to allow comparisons between 
architectures. If I pivot the data, we can add file size ratios by optimisation 
level per architecture. Note: these are relatively small benchmark programs, 
however the stats are still interesting. I’m most interested in RISC-V register 
allocation at present.
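
The pivot described above could be sketched roughly as follows. The file sizes are made-up placeholder numbers, not values from the rv8 tables, and the helper names `pivot_by_arch` and `size_ratios_by_arch` are hypothetical. The idea is only to regroup (optimisation level, architecture) sizes by architecture and express each level as a ratio of -O3.

```python
# Sketch only: sizes are invented placeholders, not rv8 benchmark results.
from collections import defaultdict

# Current grouping: optimisation level -> architecture -> file size in bytes.
sizes_by_level = {
    "-O3": {"x86-64": 12000, "riscv64": 10000},
    "-O2": {"x86-64": 9000, "riscv64": 8000},
    "-Os": {"x86-64": 7500, "riscv64": 7000},
}

def pivot_by_arch(by_level):
    """Regroup level -> arch -> size into arch -> level -> size."""
    by_arch = defaultdict(dict)
    for level, per_arch in by_level.items():
        for arch, size in per_arch.items():
            by_arch[arch][level] = size
    return dict(by_arch)

def size_ratios_by_arch(by_level, baseline="-O3"):
    """Per architecture, express each level's size as a ratio of the baseline."""
    ratios = {}
    for arch, per_level in pivot_by_arch(by_level).items():
        base = per_level[baseline]
        ratios[arch] = {level: size / base for level, size in per_level.items()}
    return ratios

if __name__ == "__main__":
    # One row per architecture, one column per optimisation level.
    for arch, per_level in size_ratios_by_arch(sizes_by_level).items():
        row = "  ".join(f"{lvl}={r:.2f}" for lvl, r in per_level.items())
        print(f"{arch:8s} {row}")
```

With the data in this shape, each architecture's row shows side by side how much each optimisation level shrinks the binary relative to -O3.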

-O2 does pretty well on file size compared to -O3, on all architectures. At a 
glance, the -O2 file sizes are slightly larger than the -Os file sizes, but the 
performance gain is considerably larger. I could perhaps show ratios of 
performance vs size between -O2 and -Os.
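
One way to express the performance-versus-size ratio mentioned above is to take the geometric mean of the per-benchmark runtime ratios and of the per-benchmark size ratios, then compare the two. The sketch below uses invented runtimes and sizes (not the rv8 data), and the helper names `geomean` and `tradeoff` are hypothetical.

```python
# Sketch only: all numbers are invented placeholders, not rv8 results.
import math

def geomean(values):
    """Geometric mean computed via logs, which avoids overflow on long lists."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

def tradeoff(runtime_a, runtime_b, size_a, size_b):
    """Compare configuration A against B over the same benchmarks.

    Returns (speedup, size_growth):
      speedup     = geomean of runtime_b / runtime_a (above 1 means A is faster)
      size_growth = geomean of size_a / size_b       (above 1 means A is larger)
    """
    speedup = geomean(rb / ra for ra, rb in zip(runtime_a, runtime_b))
    size_growth = geomean(sa / sb for sa, sb in zip(size_a, size_b))
    return speedup, size_growth

if __name__ == "__main__":
    # Hypothetical runtimes (seconds) and sizes (bytes) for three benchmarks.
    o2_time, os_time = [1.0, 2.0, 4.0], [1.25, 2.5, 5.0]
    o2_size, os_size = [11000, 22000, 33000], [10000, 20000, 30000]
    speedup, growth = tradeoff(o2_time, os_time, o2_size, os_size)
    print(f"-O2 vs -Os: {speedup:.2f}x faster for {growth:.2f}x the size")
```

Using the geometric mean keeps the summary consistent with the Geomean figures quoted for the benchmark suite, since ratios average multiplicatively rather than additively.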

> On 26 Aug 2017, at 10:05 PM, Michael Clark <michaeljcl...@mac.com> wrote:
> 
>> 
>> On 26 Aug 2017, at 8:39 PM, Andrew Pinski <pins...@gmail.com> wrote:
>> 
>> On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark <michaeljcl...@mac.com> wrote:
>>> Dear GCC folk,
>>> I have to say that GCC’s -Os caught me by surprise after several years 
>>> using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year 
>>> and a half I have been working on RISC-V development and have been 
>>> exclusively using GCC for RISC-V builds, and initially I was using -Os. 
>>> After performing a qualitative/quantitative assessment I don’t believe 
>>> GCC’s current -Os is particularly useful, at least for my needs as it 
>>> doesn’t provide a commensurate saving in size given the sometimes quite 
>>> huge drop in performance.
>>> 
>>> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC 
>>> frustration thread, as I think Apple’s documentation which presumably 
>>> documents Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps 
>>> using -O2 as a starting point) with the idea that the current -Os is 
>>> renamed to -Oz.
>>> 
>>>       -Oz
>>>              (APPLE ONLY) Optimize for size, regardless of performance. -Oz
>>>              enables the same optimization flags that -Os uses, but -Oz also
>>>              enables other optimizations intended solely to reduce code size.
>>>              In particular, instructions that encode into fewer bytes are
>>>              preferred over longer instructions that execute in fewer cycles.
>>>              -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
>>>              -Oz employs the same inlining limits and avoids string instructions
>>>              just like -Os.
>>> 
>>>       -Os
>>>              Optimize for size, but not at the expense of speed. -Os enables all
>>>              -O2 optimizations that do not typically increase code size.
>>>              However, instructions are chosen for best performance, regardless
>>>              of size. To optimize solely for size on Darwin, use -Oz (APPLE
>>>              ONLY).
>>> 
>>> I have recently been working on a benchmark suite to test a RISC-V JIT 
>>> engine. I have performed all testing using GCC 7.1 as the baseline 
>>> compiler, and during the process I have collected several performance 
>>> metrics, some that are neutral to the JIT runtime environment. In 
>>> particular I have made performance comparisons between -Os and -O3 on x86, 
>>> along with capturing executable file sizes, dynamic retired instruction and 
>>> micro-op counts for x86, dynamic retired instruction counts for RISC-V as 
>>> well as dynamic register and instruction usage histograms for RISC-V, for 
>>> both -Os and -O3.
>>> 
>>> See the Optimisation section for a charted performance comparison between 
>>> -O3 and -Os. There are dozens of other plots that show the differences 
>>> between -Os and -O3.
>>> 
>>>       - https://rv8.io/bench
>>> 
>>> The Geomean shows a 19% performance hit for -Os vs -O3 on x86. The 
>>> Geomean of course smooths over some pathological cases where -Os 
>>> performance is severely degraded versus -O3 without a significant or 
>>> commensurate saving in size.
>> 
>> 
>> First let me put -Os usage and some history into perspective:
>> 1) -Os is not useful for non-embedded users
>> 2) the embedded folks really need the smallest code possible and
>> usually will be willing to afford the performance hit
>> 3) -Os was a mistake for Apple to use in the first place; they used it
>> and then GCC got better for PowerPC to use the string instructions
>> which is why -Oz was added :)
>> 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
>> 
>> Comparing -O3 to -Os is not totally fair on x86 due to the many
>> different instructions and encodings.
>> Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
>> big issue.
>> I soon have a need to keep overall (bare-metal) application size down
>> to just 256k.
>> Micro-controllers are places where -Os matters the most.
> 
> Fair points.
> 
> - Size at all cost is useful for the embedded case where there is a 
> restricted footprint.
> - It’s fair to compare on RISC-V which has the RVC compressed ISA extension, 
> which is conceptually similar to Thumb-2
> - I understand that renaming -Os to -Oz would cause a few downstream issues for 
> those who expect size at all costs.
> - There is an achievable use-case for good RVC compression and good 
> performance on RISC-V
> 
> However the question remains, what options does one choose for size, but not 
> size at the expense of speed. -O2 and an -mtune?
> 
> I’m probably interested in an -O2 with an -mtune that can favour register 
> allocations that result in better RVC compression for RISC-V. Ideally the 
> dominant register set can be assigned to x8 through x15 using loop frequency 
> information and this would result in better compression and also reduce 
> dynamic icache pressure. I think I should look more closely at LRA and see 
> how it uses register_priority.
> 
> There is a use case for high performance code that also makes good use of RVC 
> on RISC-V, while there may also be a use case for the current -Os for bare 
> metal where the implementor chooses to sacrifice speed for size at all 
> costs. The problem is that GCC has only one -Os flag, whereas having both -Oz 
> and -Os makes the distinction between size at all costs and size not at the 
> expense of speed clear, i.e. the cases where reduced size improves 
> performance.
> 
> I guess an -mtune for -O2 might be something worth considering. If the C 
> extension is selected on RISC-V, the compiler should make best use of it for 
> performance reasons.
> 
> Someone has nicely summarised the Clang/LLVM flags. It seems Clang/LLVM retains a 
> distinction between -Os and -Oz (which is not just Apple; it is also Google 
> Chrome):
> 
> - https://stackoverflow.com/questions/15548023/clang-optimization-levels
> 
>       • -Os is the same as -O2
>       • -Oz is based on -Os
>               • opt drops: -slp-vectorizer
>               • clang drops: -vectorize-loops
> 
> There may be an argument for flag compatibility with Clang/LLVM. At present 
> one would need to pass -Oz to get the Clang/LLVM equivalent to GCC’s -Os.
> 
> I guess I should use -O2 (which seems to have the same meaning in GCC and 
> Clang), which I think is what musl uses. There are embedded use cases that 
> are not extremely constrained by size, but where both size and performance 
> are still an issue.
> 
>>> I don’t currently have -O2 in my results however it seems like I should add 
>>> -O2 to the benchmark suite. If you take a look at the web page you’ll see 
>>> that there is already a huge amount of data given we have captured dynamic 
>>> register frequencies and dynamic instruction frequencies for -Os and -O3. 
>>> The tables and charts are all generated by scripts so if there is interest 
>>> I could add -O2. I can also pretty easily perform runs with new compiler 
>>> versions as everything is completely automated. The biggest factor is that 
>>> it currently takes 4 hours for a full run as we run all of the benchmarks 
>>> in a simulator to capture dynamic register usage and dynamic instruction 
>>> usage.
>>> 
>>> After looking at the results, one has to question the utility of -Os in its 
>>> present form, and indeed question how it is actually used in practice, 
>>> given the proportion of savings in executable size. After my assessment I 
>>> would not recommend anyone to use -Os because its savings in size are not 
>>> proportionate to the loss in performance. I feel discouraged from using it 
>>> after looking at the results. I really don’t believe -Os makes the right 
>>> trade-offs, e.g. reducing icache pressure through reduced code size can 
>>> indeed lead to better performance.
>> 
>> This comment does not help my application usage.  It rather hurts it
>> and goes against what -Os is really about.  It is not about reducing
>> icache pressure but overall application code size.  I really need the
>> code to fit into a specific size.
>> 
>> Thanks,
>> Andrew
>> 
>>> 
>>> I also wonder whether -O2 level optimisations may be a good starting point 
>>> for a more useful -Os and how one would proceed towards selecting 
>>> optimisations to add back to -Os to increase its usability, or rename the 
>>> current -Os to -Oz and make -Os an alias for -O2. A similar profile to -O2 
>>> would probably produce less shock for anyone who does quantitative 
>>> performance analysis of -Os.
>>> 
>>> In fact there are some interesting issues for the RISC-V backend given the 
>>> assembler performs RVC compression and GCC doesn’t really see the size of 
>>> emitted instructions. It would be an interesting backend to investigate 
>>> improving -Os presuming that a backend can opt in to various optimisations 
>>> for a given optimisation level. RISC-V would gain most of its size and 
>>> runtime icache pressure reduction improvements by getting the highest 
>>> frequency registers allocated within the 8 register set that is accessible 
>>> by the RVC instructions. Merely controlling register allocation to favour 
>>> the RVC accessible registers would produce the largest savings in 
>>> executable size, and may indeed be good for performance due to reduced 
>>> icache pressure.
>>> 
>>> I have Dynamic Register Frequency Charts but they are not presently labeled 
>>> or coloured to show whether the registers are RVC accessible (x8 to x15). 
>>> I did however work on some crude ASCII histograms that indicate register 
>>> access frequency and whether the register is RVC accessible. Ideally the 
>>> register allocator would allocate highest frequency registers first from 
>>> the RVC set. The register order is already correctly defined in the RISC-V 
>>> backend. I have been experimenting with riscv_register_priority to try to 
>>> nudge LRA but have not yet had success. riscv_register_priority currently 
>>> returns 1 for RVC registers (if the C extension is present) and 0 for 
>>> regular registers however the loop frequency information is obviously not 
>>> accurate enough or LRA does not completely honour the register order and 
>>> priority. This probably makes little difference on platforms with 
>>> very regular register files. See this gist for one benchmark’s 
>>> register access frequencies, labeled by whether each register is accessible 
>>> from compressed instructions:
>>> 
>>> - https://gist.github.com/michaeljclark/8ba727e56084833e4f838c941eeca6be
>>> 
>>> Question. Who uses -Os on GCC?
>>> 
>>> I have for many years used -Os on macOS for Clang builds, as it has been an 
>>> Xcode default, but I’m considering using -O2 instead of -Os with FSF GCC. I 
>>> was using FSF GCC’s -Os under the mistaken impression that it operates 
>>> similarly to -Os in Xcode. i.e. produces code that performs well.
>>> 
>>> In any case, despite my rant, I hope the quantitative stats in the link 
>>> above prove to be useful.
>>> 
>>> Thanks and Regards,
>>> Michael.
