FYI - I’ve updated the stats to include -O2 in addition to -O3 and -Os:

- https://rv8.io/bench#optimisation
There are 57 plots and 31 tables. It’s quite a bit of data. It will be quite interesting to run these on new gcc releases to monitor changes.

The Geomean for -O2 is 0.98 of -O3 on x86-64.

I probably need to add some tables that show file sizes per architecture side by side, versus the current grouping which is by optimisation level to allow comparisons between architectures. If I pivot the data, we can add file size ratios by optimisation level per architecture.

Note: these are relatively small benchmark programs, however the stats are still interesting. I’m most interested in RISC-V register allocation at present.

-O2 does pretty well on file size compared to -O3, on all architectures. At a glance, the -O2 file sizes are slightly larger than the -Os file sizes but the performance increase is considerably more. I could perhaps show ratios of performance vs size between -O2 and -Os.

> On 26 Aug 2017, at 10:05 PM, Michael Clark <michaeljcl...@mac.com> wrote:
>
>>
>> On 26 Aug 2017, at 8:39 PM, Andrew Pinski <pins...@gmail.com> wrote:
>>
>> On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark <michaeljcl...@mac.com> wrote:
>>> Dear GCC folk,
>>> I have to say that GCC’s -Os caught me by surprise after several years
>>> using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year
>>> and a half I have been working on RISC-V development and have been
>>> exclusively using GCC for RISC-V builds, and initially I was using -Os.
>>> After performing a qualitative/quantitative assessment I don’t believe
>>> GCC’s current -Os is particularly useful, at least for my needs as it
>>> doesn’t provide a commensurate saving in size given the sometimes quite
>>> huge drop in performance.
>>>
>>> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC
>>> frustration thread, as I think Apple’s documentation which presumably
>>> documents Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps
>>> using -O2 as a starting point) with the idea that the current -Os is
>>> renamed to -Oz.
>>>
>>> -Oz
>>>     (APPLE ONLY) Optimize for size, regardless of performance. -Oz
>>>     enables the same optimization flags that -Os uses, but -Oz also
>>>     enables other optimizations intended solely to reduce code size.
>>>     In particular, instructions that encode into fewer bytes are
>>>     preferred over longer instructions that execute in fewer cycles.
>>>     -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
>>>     -Oz employs the same inlining limits and avoids string instructions
>>>     just like -Os.
>>>
>>> -Os
>>>     Optimize for size, but not at the expense of speed. -Os enables all
>>>     -O2 optimizations that do not typically increase code size.
>>>     However, instructions are chosen for best performance, regardless
>>>     of size. To optimize solely for size on Darwin, use -Oz (APPLE
>>>     ONLY).
>>>
>>> I have recently been working on a benchmark suite to test a RISC-V JIT
>>> engine. I have performed all testing using GCC 7.1 as the baseline
>>> compiler, and during the process I have collected several performance
>>> metrics, some that are neutral to the JIT runtime environment. In
>>> particular I have made performance comparisons between -Os and -O3 on x86,
>>> along with capturing executable file sizes, dynamic retired instruction and
>>> micro-op counts for x86, dynamic retired instruction counts for RISC-V as
>>> well as dynamic register and instruction usage histograms for RISC-V, for
>>> both -Os and -O3.
>>>
>>> See the Optimisation section for a charted performance comparison between
>>> -O3 and -Os. There are dozens of other plots that show the differences
>>> between -Os and -O3.
>>>
>>> - https://rv8.io/bench
>>>
>>> The Geomean on x86 shows a 19% performance hit for -Os vs -O3. The
>>> Geomean of course smooths over some pathological cases where -Os
>>> performance is severely degraded versus -O3 but not with significant, or
>>> commensurate savings in size.
>>
>>
>> First let me put -Os usage and some history into perspective:
>> 1) -Os is not useful for non-embedded users
>> 2) the embedded folks really need the smallest code possible and
>> usually will be willing to afford the performance hit
>> 3) -Os was a mistake for Apple to use in the first place; they used it
>> and then GCC got better for PowerPC to use the string instructions
>> which is why -Oz was added :)
>> 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
>>
>> Comparing -O3 to -Os is not totally fair on x86 due to the many
>> different instructions and encodings.
>> Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
>> big issue.
>> I soon have a need to keep overall (bare-metal) application size down
>> to just 256k.
>> Micro-controllers are places where -Os matters the most.
>
> Fair points.
>
> - Size at all cost is useful for the embedded case where there is a
> restricted footprint.
> - It’s fair to compare on RISC-V which has the RVC compressed ISA extension,
> which is conceptually similar to Thumb-2
> - Understand renaming -Os to -Oz would cause a few downstream issues for
> those who expect size at all costs.
> - There is an achievable use-case for good RVC compression and good
> performance on RISC-V
>
> However the question remains, what options does one choose for size, but not
> size at the expense of speed. -O2 and an -mtune?
>
> I’m probably interested in an -O2 with an -mtune that can favour register
> allocations that result in better RVC compression for RISC-V.
> Ideally the
> dominant register set can be assigned to x8 through x15 using loop frequency
> information and this would result in better compression and also reduce
> dynamic icache pressure. I think I should look more closely at LRA and see
> how it uses register_priority.
>
> There is a use case for high performance code that also makes good use of RVC
> on RISC-V, while there may also be a use case for the current -Os for bare
> metal where the implementor chooses to sacrifice speed for size at all
> costs. The problem is there is only one -Os flag, versus -Oz and -Os which
> makes the distinction between size at all costs versus size but not at the
> expense of speed clear. i.e. the cases where reduced size improves
> performance.
>
> I guess an -mtune for -O2 might be something worth considering. If the C
> extension is selected on RISC-V, the compiler should make best use of it for
> performance reasons.
>
> Someone has nicely summarised Clang/LLVM flags. It seems Clang/LLVM retains a
> distinction between -Os and -Oz (which is not just Apple, it is also Google
> Chrome)
>
> - https://stackoverflow.com/questions/15548023/clang-optimization-levels
>
> • -Os is the same as -O2
> • -Oz is based on -Os
>   • opt drops: -slp-vectorizer
>   • clang drops: -vectorize-loops
>
> There may be an argument for flag compatibility with Clang/LLVM. At present
> one would need to pass -Oz to get the Clang/LLVM equivalent to GCC’s -Os.
>
> I guess I should use -O2, which I think is what musl uses (which seems to
> have the same meaning between GCC and Clang). There are embedded use cases,
> that are not extremely constrained by size, but where size still is an issue,
> but so is performance.
>
>>> I don’t currently have -O2 in my results however it seems like I should add
>>> -O2 to the benchmark suite.
>>> If you take a look at the web page you’ll see
>>> that there is already a huge amount of data given we have captured dynamic
>>> register frequencies and dynamic instruction frequencies for -Os and -O3.
>>> The tables and charts are all generated by scripts so if there is interest
>>> I could add -O2. I can also pretty easily perform runs with new compiler
>>> versions as everything is completely automated. The biggest factor is that
>>> it currently takes 4 hours for a full run as we run all of the benchmarks
>>> in a simulator to capture dynamic register usage and dynamic instruction
>>> usage.
>>>
>>> After looking at the results, one has to question the utility of -Os in its
>>> present form, and indeed question how it is actually used in practice,
>>> given the proportion of savings in executable size. After my assessment I
>>> would not recommend anyone to use -Os because its savings in size are not
>>> proportionate to the loss in performance. I feel discouraged from using it
>>> after looking at the results. I really don’t believe -Os makes the right
>>> trades e.g. reducing icache pressure can indeed lead to better performance
>>> due to reduced code size.
>>
>> This comment does not help my application usage. It rather hurts it
>> and goes against what -Os is really about. It is not about reducing
>> icache pressure but overall application code size. I really need the
>> code to fit into a specific size.
>>
>> Thanks,
>> Andrew
>>
>>>
>>> I also wonder whether -O2 level optimisations may be a good starting point
>>> for a more useful -Os and how one would proceed towards selecting
>>> optimisations to add back to -Os to increase its usability, or rename the
>>> current -Os to -Oz and make -Os an alias for -O2. A similar profile to -O2
>>> would probably produce less shock for anyone who does quantitative
>>> performance analysis of -Os.
>>>
>>> In fact there are some interesting issues for the RISC-V backend given the
>>> assembler performs RVC compression and GCC doesn’t really see the size of
>>> emitted instructions. It would be an interesting backend to investigate
>>> improving -Os presuming that a backend can opt in to various optimisations
>>> for a given optimisation level. RISC-V would gain most of its size and
>>> runtime icache pressure reduction improvements by getting the highest
>>> frequency registers allocated within the 8 register set that is accessible
>>> by the RVC instructions. Merely controlling register allocation to favour
>>> the RVC accessible registers would produce the largest savings in
>>> executable size, and may indeed be good for performance due to reduced
>>> icache pressure.
>>>
>>> I have Dynamic Register Frequency Charts but they are not presently labeled
>>> or coloured whether the registers are RVC accessible registers (x8 to x15).
>>> I did however work on some crude ASCII histograms that indicate register
>>> access frequency and whether the register is RVC accessible. Ideally the
>>> register allocator would allocate highest frequency registers first from
>>> the RVC set. The register order is already correctly defined in the RISC-V
>>> backend. I have been experimenting with riscv_register_priority to try to
>>> nudge LRA but have not yet had success. riscv_register_priority currently
>>> returns 1 for RVC registers (if the C extension is present) and 0 for
>>> regular registers however the loop frequency information is obviously not
>>> accurate enough or LRA does not completely honour the register order and
>>> priority. It’s likely it may not make a lot of difference on platforms with
>>> very regular register files.
>>> See this gist for the register access frequency of one of the
>>> benchmarks, labeled as to whether the register is accessible
>>> from compressed instructions:
>>>
>>> - https://gist.github.com/michaeljclark/8ba727e56084833e4f838c941eeca6be
>>>
>>> Question. Who uses -Os on GCC?
>>>
>>> I have for many years used -Os on macOS for Clang builds, as it has been an
>>> Xcode default, but I’m considering using -O2 instead of -Os with FSF GCC. I
>>> was using FSF GCC’s -Os under the mistaken impression that it operates
>>> similarly to -Os in Xcode. i.e. produces code that performs well.
>>>
>>> In any case, despite my rant, I hope the quantitative stats in the link
>>> above prove to be useful.
>>>
>>> Thanks and Regards,
>>> Michael.
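
For the performance vs size ratios between -O2 and -Os mentioned at the top of this message, a minimal sketch of how such a summary could be computed is below. It is not the script that generates the rv8.io tables. The function names, the benchmark names and all of the numbers are made up for illustration, and it assumes that per-benchmark runtimes and file sizes are available as simple name-to-value mappings.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed as exp(mean(log(v)))."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs a non-empty sequence of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def per_benchmark_ratio(numer, denom):
    """Return numer[name] / denom[name] for benchmarks present in both."""
    return {name: numer[name] / denom[name] for name in numer if name in denom}


# Hypothetical numbers for illustration only; NOT taken from rv8.io/bench.
runtime_o2 = {"bench_a": 1.00, "bench_b": 0.85, "bench_c": 1.40}  # seconds
runtime_os = {"bench_a": 1.30, "bench_b": 0.90, "bench_c": 2.10}  # seconds
size_o2 = {"bench_a": 9000, "bench_b": 12000, "bench_c": 15000}   # bytes
size_os = {"bench_a": 8400, "bench_b": 11500, "bench_c": 13800}   # bytes

# A value above 1.0 means -Os runs slower than -O2 on (geometric) average.
slowdown = geomean(per_benchmark_ratio(runtime_os, runtime_o2).values())
# A value below 1.0 means -Os binaries are smaller than -O2 on average.
shrink = geomean(per_benchmark_ratio(size_os, size_o2).values())

print(f"-Os runtime relative to -O2 (geomean): {slowdown:.3f}")
print(f"-Os size relative to -O2 (geomean):    {shrink:.3f}")
```

Printing the two ratios side by side is one way to judge whether the size saving is commensurate with the slowdown. The geometric mean is used because the inputs are ratios, but as noted in the thread it smooths over pathological individual benchmarks, so the per-benchmark ratios are worth keeping as well.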