Yesterday, MIPS Tech announced the latest generation of the MIPS family of architectures called nanoMIPS [1]. As part of the development we have been designing all the open source tools necessary to support the architecture and, thanks to the speed with which we were able to prototype, we have also been using these tools to shape the architecture along the way. This has led to some really interesting improvements in the tools, which MIPS would like to contribute back to the community. While doing this work many of us have been unable to contribute to the community as actively as we would have liked, we are therefore very grateful for the community support given to the MIPS architecture over the last 18 months. This announcement has a general introduction at the start, so if you have already read it for one of the other tools, you can skip down to the information specific to GCC.
For anyone who knows the MIPS architecture you may well wonder why we are introducing another major variant and the question is perfectly valid. We do admittedly have quite a few: MIPS I through MIPS IV, MIPS32 and MIPS64 through to MIPS32R6 and MIPS64R6, MIPS16e, MIPS16e2, microMIPSR3 and microMIPSR6. Each of these serves (or served) a purpose and there is a high level of synergy between all of them. In general, they build upon the previous and there is a high level of compatibility, even when switching to a new encoding like moving from MIPS to microMIPS. The switch to MIPS32R6/MIPS64R6 was a major shift in the way the architecture innovated and drew more on the original theory of the architecture, where evolution was not expected to be limited by binary compatibility. MIPS Release 6 removed instructions and did create some very minor incompatibility but is also much cleaner to implement from a micro-architecture perspective. We have taken this idea much further with nanoMIPS and reimagined the instruction set, by drawing on all the experience gained from previous designs. Hopefully others will find it as interesting as we do. The major driving force behind the nanoMIPS architecture was to achieve outstanding code density, while also balancing out hardware and software design cost. As background MIPS has two compressed ISA variants: MIPS16e, which cannot exist without also implementing MIPS32, and microMIPS, which can exist on its own. Since MIPS16e has specific limits that cannot be engineered around, we chose to use an approach similar to the microMIPS design. nanoMIPS has a variable-length compressed instruction set that is completely standalone from the other MIPS ISAs. It is designed to compress the highest frequency instructions to 16-bits, and use 48-bit instructions to efficiently encode 32-bit constants into the instruction stream. There is also a wider range of 32-bit instructions, which merge carefully chosen high frequency instruction sequences into single operations creating more flexible addressing modes such as indexed and scaled indexed addressing, branch compare with immediate and macro style instructions. The macro like instructions compress prologue and epilogue sequences, as well as a small number of high frequency instruction pairs like two move instructions or a move and function call. nanoMIPS also totally eliminates branch delay slots which follows a precedent set by microMIPSR6. To get the best from a new ISA we also re-engineered the ABI and created a new symbiotic relationship between the ISA and ABI that pushes code density and performance further still. The ABI creates a fully link time relaxable model, which enables us to squeeze every last byte out of the code image even when deferring final addressing mode and layout decisions to link time. We have been mindful of MIPS heritage and ensured that while open to any possible change, we also have minimal impact when porting code from MIPS to nanoMIPS, and have plenty of support to achieve source compatibility between the two. The net effect of these changes leads to an average code size reduction of 20% relative to microMIPSR6. This compression could well be one of the best achieved by GNU tools for any RISC ISA. Comparing the ISA in terms of number of instructions to issue vs microMIPS we also see a reduction of between 8% and 11% of dynamic instruction count. Below we dig into some technical specifics for each of the GNU tools; we welcome any feedback and questions as we start to look at rebasing this work to the trunk/master and formally submitting it. nanoMIPS pre-built toolchains and source code tarballs are available at: http://codescape.mips.com/components/toolchain/nanomips/2018.04-02/ GCC specific details ==================== The back-end ------------ Instead of creating a new back-end for nanoMIPS, we decided to reuse the existing MIPS back-end. Starting from scratch would have required copying the majority of code but most of the logic would have remained the same. Reusing allowed us to speed up porting. Maintenance might be more difficult but a fix for nanoMIPS could automatically be a fix MIPS and vice versa. Most of the back-end is contained within a small number of files. The shared part is mostly in mips.{h,c,md,opt} files. The MIPS toolchains use mips-classic.md as the entry file (instead of mips.md) i.e. it includes shareable mips.md, processor configuration and includes all other .md files as necessary. nanomips.md is used as the entry file for nanoMIPS toolchains and similarly includes its own processors list, mips.md and other machine descriptions files as needed. Doing it in this way makes it easier to enable features which can be shared between the two. Some chunks of the back-end code had to be enabled conditionally, as the compiler would otherwise fail to build (missing patterns etc). Lastly, we needed to clean up nanoMIPS' target options by disabling them in mips.opt, and also create a number nanoMIPS specific files to keep the separation as clear as possible. The p32 ABI [2] --------------- 1. Calling convention To avoid major porting issues, the register conventions have been left mostly intact, and resemble the MIPS n32/n64 ABIs. The main difference is the removal of dedicated return registers ($2/v0, $3/v1) and using the argument registers to return values from functions. The old return registers have been re-purposed and have now become temporaries. This allowed us to achieve better code density because of more efficient data passing between functions e.g. foo(bar()). This is particularly visible in soft-float mode, where more complex expressions require multiple library calls. The nanoMIPS ABI requires the usage of either named registers, such as $a0..$a7 for arguments, $s0..$s7 for saved temporaries, etc. or of the $r0..$r31 format. Using $0..$31 is no longer supported by default, but can be re-enabled by using -mlegacyregs. 2. Stack frame organization The major change in comparison to the previous MIPS ABIs is changing the location of the frame pointer. The frame pointers now form a chain that will allow an efficient stack unwinding. Previously, in order to find the location of the frame-pointer, the instructions had to be scanned at the current program counter, going backwards. With this change, finding the location is trivial, however, it's important to point out that the frame pointer is biased by 4096 bytes i.e. logical_frame_pointer = $fp + 4096. The rationale was to enable full use of the unsigned 12-bit offsets in memory instructions when using the frame pointer as the base. Another notable difference is in the order of general-purpose registers on the stack, which now reflects the operation of the SAVE/RESTORE instructions from the nanoMIPS ISA. 3. Code and data models The automatic model (-mcmodel=auto) produces the most compact code possible by relying on the linker to do further size optimizations on the compiler-generated code. The linker will also expand the code when symbols end up being out of range. This model has been designed to keep the size difference between the intermediate objects and the fully linked object as small as possible, although having the linker perform too many expansions will widen that gap. It can be used only with a linker which is capable of performing relaxations and expansions. The medium model (-mcmodel=medium) is somewhat similar to the automatic model in terms of the range and size of the generated code, but it does not rely on linker relaxations and expansions. This lack of linker transformations makes the size of the fully linked object more predictable, even though it squanders some opportunities for further size optimization and it introduces inherent limitations in the fully linked code. The large model (-mcmodel=large) produces code which has an unlimited range by only using instructions which cover the entire address space. Because these instructions tend to be bigger, this model sacrifices code size in order to guarantee that code sequences will work regardless of where the symbol is placed in memory. The large model also does not rely on linker relaxations and expansions. In addition to the models, there are 4 addressing modes: - absolute: addresses are fixed at link-time. This mode is rarely necessary but has some potential for energy efficiency. - PC-relative: addresses appear as offsets from the PC and are used in PC-relative instructions. This mode produces position-independent code. - GP-relative: addresses appear as offsets from the GP and are used in GP-relative instructions. Symbols are placed in the small data section, also known as .sdata. This mode produces position-independent data for some or all symbols of an application. - GOT-dependent: addresses are kept in the GOT and are loaded by using offsets between the GP and a given symbol's entry in the GOT. This mode produces dynamically linkable code. 4. Thread Local Storage The nanoMIPS TLS ABI has support for both the traditional TLS models and TLS descriptors. All of the TLS models have been adapted to the nanoMIPS ISA following an approach similar to the one taken for the code and data models. The runtime TLS layout has also been redesigned to take advantage of the unsigned offset LW[U12] nanoMIPS instruction, thus extending the possible range of symbols inside the TLS block. Target-independent optimizations -------------------------------- In addition to these ABI improvements, we have also developed various target-independent and nanoMIPS-specific compiler optimizations, in order to further improve code size and performance. 1. LRA: use equivalences to help with frame pointer elimination (currently enabled by -mlra-equiv) The patch has already been posted [3] and went through some additional changes since posting. A case was found where LRA produced suboptimal code for a large frame and frame growing downward. The code size was affected particularly in cases where the offset was large and could not be used in an add operation directly introducing more instructions for a single frame access. Using the equivalences, the frame pointer gets eliminated more often, resulting in smaller code. The reasons are twofold: register pressure drops resulting in fewer spills and the offset might be smaller fitting into a single add instruction for every frame access. 2. IRA register recoloring (-fadjust-costs) The goal of register cost adjustment optimization is to make better usage of instructions that improve code density. This group of instructions includes 16-bit instructions and 32-bit nanoMIPS instructions which replace two other instructions (e.g. movep, move.balc, etc). Most of these instructions can use only a subset of all available registers and the purpose of this optimization is to increase the chances that pseudo registers used inside these instructions are assigned to the required hard registers. This is achieved by introducing a new target hook through which the cost of corresponding hard registers is modified just before allocation of a pseudo register. Cost modification is based on the properties of all instructions in which a pseudo register is used. If assigning a pseudo to some hard registers would lead to more dense code e.g. by being able to generate 16-bit instructions, then the cost of these hard registers is decreased. Otherwise, the cost of the hard registers is increased, thus improving the chances that these hard registers will be available for pseudos that are allocated later in the process. 3. Jump-table optimization (-fjump-table-clusters) The optimization enables the splitting of a single switch statement into a combination of multiple jump tables and decision trees. GCC currently emits either a single jump table or decision tree. The optimization can be enabled by the command line option -fjump-table-clusters and is target-independent. A MIPS specific option has been added (-mjump-table-density=DENSITY) to change the default density. DENSITY is the minimum percentage of non-default case indices in a jump table. If not specified, GCC will use the default density of 40%, if optimizing for size, and 10%, if optimizing for speed. The target option will be later replaced by an appropriate --param jump-table-density option or something similar. 4. Edge sorting for -Os during basic block reordering (-freorder-blocks-edge-sort=[one|two|all|default]) When reordering blocks using the `simple' algorithm edges are sorted for speed optimized functions and not sorted for size optimized functions. However, sorting the edges for size optimized functions can significantly improve performance with some code size cost. Inner loops show the greatest benefit with `level' set to `one'. Further improvement is possible by sorting one level of nested loops (`level' set to `two') with additional cost in size. Finally, all edges can be sorted (`level' set to `all'). This option overrides the normal sorting choice for both size and speed optimized functions. Target-specific optimizations ----------------------------- 1. Optimized inline memcpy (-mmemcpy-interleave=NUM/-mmulti-memcpy) These options have been introduced to control the inlined memcpy. -mmulti-memcpy attempts to exploit Load/Store Word Multiple instructions and -mmemcpy-interleave=NUM controls how loads and stores are interleaved i.e. how many NUM words are loaded first before storing them. 2. MOVEP/MOVE.BALC/RESTORE.JRC A machine-dependent hook will attempt to find opportunities in the instruction stream to combine instructions into MOVEP, MOVE.BALC or RESTORE.JRC to improve code density. MOVE.BALC can be controlled with -m[no-]opt-movebalc switch. 3. Offset shrinking pass (-m[no-]shrink-offsets) This pass processes the instruction stream, extracts offsets from memory accesses, and then tries to figure out the best offset adjustment to get the maximum potential code size savings. We take into account the cost of introducing a new add instruction that could undo the code size savings. As the pass is run before the register allocation, we can only speculate and be optimistic about the potential code size improvement. These guesses appear to be relatively good on average but might need to be considered on a case by case basis. 4. Jump-table optimization (-mjump-table-opt) This switch enables jump-tables which contain relative addresses. 5. BALC stubs (-m[no-]balc-stubs) This code size optimization is not performed by the compiler, but by the the assembler. It controls out-of-range call optimization through trampoline stubs. It is enabled by default when optimizing for size. Note that support for 64-bit and floating-point is not finalized and still unofficial. GCC contributors ================ - nanoMIPS port, ABI, code and data models, TLS, bugfixes: Robert Suchanek Toma Tabacu Matthew Fortune - IRA register recosting, edge sorting: Zoran Jovanovic - Jump-table optimization, scheduler, MOVE.BALC/MOVEP optimization: Prachi Godbole - RESTORE.JRC optimization: Robert Suchanek - Lightweight sync codes: Faraz Shahbazker - Offset shrinking pass: Robert Suchanek Steve Ellcey - Exception handling: Jack Romo - Dejagnu tests, bugfixes: Stefan Markovic Sara Popadic References: [1] https://www.mips.com/press/new-mips-i7200-processor-core-delivers-unmatched-performance-and-efficiency-for-advanced-lte5g-communications-and-networking-ic-designs/ [2] Codescape GNU tools for nanoMIPS: ELF ABI Supplement, https://codescape.mips.com/components/toolchain/nanomips/2018.04-02/docs/MIPS_nanoMIPS_ABI_supplement_01_02_DN00179.pdf [3] https://patchwork.ozlabs.org/patch/666637/