Hi Michael,

Thanks for sharing!

I will read the papers you suggested. I had asked quantum-computing 
professionals about solving the NP-complete problem 
(https://github.com/epiqc/ScaffCC/issues/14), but GCC's IRA and LRA show that 
for programs in SSA form there is a practical solution on a classical computer.

Sorting MachineBasicBlocks by frequency first is a good idea. So far I have 
only experimented with a few passes, such as LoopUnroll 
(http://lists.llvm.org/pipermail/llvm-dev/2017-October/118419.html), but I will 
try your idea in my HEA-based solution: 
https://github.com/xiangzhai/llvm/blob/avr/lib/CodeGen/RegAllocGraphColoring.cpp#L327

I have experimented with binary translation from X86 to MIPS; QEMU is expensive 
(https://www.leetcode.cn/2017/09/tcg-ir-to-llvm-ir.html). I will bookmark your 
JIT engine on my blog :)


On 2017-12-19 08:07, Michael Clark wrote:
Hi Leslie,

I suggest adding these 3 papers to your reading list.

        Register allocation for programs in SSA-form
        Sebastian Hack, Daniel Grund, and Gerhard Goos
        http://www.rw.cdl.uni-saarland.de/~grund/papers/cc06-ra_ssa.pdf

        Simple and Efficient Construction of Static Single Assignment Form
        Matthias Braun, Sebastian Buchwald, Sebastian Hack, Roland Leißa, 
Christoph Mallon, and Andreas Zwinkau
        https://www.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf

        Optimal register allocation for SSA-form programs in polynomial time
        Sebastian Hack, Gerhard Goos
        http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoos-ipl06.pdf

A lot of the earlier literature on the register allocation problem describes 
the general graph colouring problem as NP-complete; however, previous research 
in register allocation was done in the context of programs that were not in 
SSA form. E.g. the Chaitin-Briggs paper states that register allocation is 
NP-complete and proposes an iterative algorithm.

If one reads Hack and Goos, there is the discovery that the interference graphs 
of programs in SSA form are chordal, and the colouring algorithm for chordal 
graphs can be completed in polynomial time. After the cost of building the 
interference graph, it is possible to perform register allocation in a single 
pass. The key is in not modifying the graph.
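As a rough illustration of why chordality helps (not code from any of the 
papers; the graph, names, and ordering below are made up): colouring vertices 
greedily along a reverse perfect elimination order, where each vertex's 
already-coloured neighbours form a clique, never needs more colours than the 
largest clique, so it is optimal for chordal graphs.

```python
# Greedy colouring along a reverse perfect elimination order (PEO).
# For chordal graphs, such as SSA interference graphs, this is optimal:
# it uses no more colours than the size of the largest clique.
def colour_chordal(order, edges):
    """order: vertices in reverse PEO (already-coloured neighbours of
    each vertex form a clique). edges: set of frozenset({u, v}).
    Returns {vertex: colour index}."""
    colour = {}
    for v in order:
        # Colours already taken by v's coloured neighbours.
        used = {colour[u] for u in colour if frozenset((u, v)) in edges}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return colour

# Toy interference graph: a, b, c are simultaneously live (a clique),
# d only interferes with c -- so 3 colours (registers) suffice.
edges = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]}
colouring = colour_chordal(["a", "b", "c", "d"], edges)
print(colouring)  # {'a': 0, 'b': 1, 'c': 2, 'd': 0}
```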

If one has a frequency for each basic block, then one can sort basic blocks by 
frequency, allocating the highest-frequency blocks first, and where possible 
assigning the same physical register to all virtual registers in each phi node 
(to avoid copies). At program points where the live set is larger than k, the 
number of physical registers, one spills the register whose next use is 
furthest away. At least that is how I am thinking about this problem. I am 
also a novice with regards to register allocation.
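The furthest-next-use rule above is Belady's heuristic; a minimal sketch (the 
vreg names and toy next-use distances are illustrative only):

```python
# Belady-style spill choice: when the live set at a program point has
# more than k values, evict the value whose next use is furthest away,
# since it stays spilled for the longest time before a reload is needed.
def choose_spill(live, next_use):
    """live: set of virtual registers live at this point.
    next_use: {vreg: instruction index of its next use};
    values with no further use get float('inf') and go first.
    Returns the vreg to spill."""
    return max(live, key=lambda v: next_use.get(v, float("inf")))

# Toy program point with k = 2 registers and 3 live values:
live = {"v1", "v2", "v3"}
next_use = {"v1": 4, "v2": 12, "v3": 7}
victim = choose_spill(live, next_use)
print(victim)  # v2: its next use is furthest away
```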

I intend to develop a register allocator for use in this JIT engine:
        
        rv8: a high performance RISC-V to x86 binary translator
        Michael Clark, Bruce Hoult
        https://carrv.github.io/papers/clark-rv8-carrv2017.pdf

Our JIT already performs almost twice as fast as QEMU, and we are using static 
register allocation; QEMU, I believe, has a register allocator. We are 
mapping a 31-register RISC architecture to a 16-register CISC architecture. I 
have statistics on the current slow-down, and it appears to mainly be L1 
accesses to spilled registers. I believe after we have implemented a register 
allocator in our JIT we will be able to run RISC-V binaries at near native 
performance on x86-64. In fact given we are performing runtime profile guided 
optimisation, it may even be possible for some benchmarks to run faster than 
register allocations that are based on static estimates of loop frequencies.

        https://anarch128.org/~mclark/rv8-slides.pdf
        https://rv8.io/bench

We’ve still got a long way to go to get to ~1:1 performance for RISC-V on 
x86-64, but I think it is possible, at least for some codes…

Regards,
Michael.

On 15/12/2017, at 4:18 PM, Leslie Zhai <[email protected]> wrote:

Hi GCC and LLVM developers,

I am learning Register Allocation algorithms and I am clear that:

* Unlimited VirtReg (pseudo) -> limited or fixed or alias[1] PhysReg (hard)

* Memory (20 - 100 cycles) is more expensive than a register (1 cycle), but 
the allocator has to emit spill code when no PhysReg is available

* Folding spill code into instructions, handling register coalescing, 
splitting live ranges, doing rematerialization, and doing shrink wrapping are 
harder than the core allocation itself

* IRA and LRA are the default register-allocation passes in GCC:

$ /opt/gcc-git/bin/gcc hello.c
DEBUG: ../../gcc/lra.c, lra_init_once, line 2441
DEBUG: ../../gcc/ira-build.c, ira_build, line 3409

* Greedy is the default register allocator in LLVM

But I have some questions, please give me some hint, thanks a lot!

* IRA is a regional register allocator performing graph coloring on a top-down 
traversal of nested regions. Is it global, compared with the local LRA?

* Do the papers by Briggs and Chaitin contradict[2] themselves when one 
examines the text of the paper vs. the pseudocode provided?

* Why is the interference graph expensive to build[3]?

And I am first experimenting[4] with HEA, developed by Dr. Rhydian Lewis, in 
LLVM.


[1] https://reviews.llvm.org/D39712

[2] http://lists.llvm.org/pipermail/llvm-dev/2008-March/012940.html

[3] https://github.com/joaotavio/llvm-register-allocator

[4] https://github.com/xiangzhai/llvm/tree/avr/include/llvm/CodeGen/GCol

--
Regards,
Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/



