On Thu, Mar 1, 2018 at 11:26 AM, Emilio G. Cota <c...@braap.org> wrote:
> On Wed, Feb 28, 2018 at 13:09:11 +1300, Michael Clark wrote:
> > BTW somewhat coincidentally, the binary translator I wrote, RV8, which is
> > practically twice as fast as QEMU, only supports privileged ISA v1.9.1, and
> > I personally want to keep binary compatibility with it.
> > - https://rv8.io/
> > - https://rv8.io/bench
> > - https://anarch128.org/~mclark/rv8-carrv.pdf
> > - https://anarch128.org/~mclark/rv8-slides.pdf
> What QEMU versions did you use for those comparisons? I wonder if
> the recent indirect branch handling improvements were included in those
> (this work was merged in 2.10 for aarch64). Also, 2.7 is quite a bit
> faster than previous versions for user-mode due to the use of QHT,
> although you probably used a later version.
Yes, I noticed indirect branch handling was very slow in the QEMU version I
tested. I have highly optimised assembly stubs that implement a
direct-mapped translation cache; a translation cache miss falls back to C
code that does a fast hash map lookup.
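Roughly, that two-level lookup has the following shape. This is only a minimal C sketch under stated assumptions, not the actual rv8 code (the real fast path is hand-written assembly): the names `tc_lookup` and `tc_insert`, the table sizes, the hash constant and the linear-probing scheme are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes; both must be powers of two. */
#define L1_SIZE 1024u
#define L2_BITS 12
#define L2_SIZE (1u << L2_BITS)

typedef struct { uint64_t guest_pc; void *host_code; } tc_entry;

static tc_entry l1[L1_SIZE]; /* direct-mapped cache: one slot per index */
static tc_entry l2[L2_SIZE]; /* fallback hash map, open addressing */

/* RISC-V code with RVC is 2-byte aligned, so drop the low bit. */
static size_t l1_index(uint64_t pc) { return (pc >> 1) & (L1_SIZE - 1); }

/* Multiplicative hash; keeps the top L2_BITS bits of the product. */
static size_t l2_hash(uint64_t pc)
{
    return (size_t)((pc * 0x9E3779B97F4A7C15ull) >> (64 - L2_BITS));
}

/* Record a translation in the hash map. Returns 0 if the map is full. */
static int tc_insert(uint64_t pc, void *code)
{
    assert(code != NULL); /* NULL marks an empty slot */
    size_t i = l2_hash(pc);
    for (size_t n = 0; n < L2_SIZE; n++, i = (i + 1) & (L2_SIZE - 1)) {
        if (l2[i].host_code == NULL || l2[i].guest_pc == pc) {
            l2[i].guest_pc = pc;
            l2[i].host_code = code;
            return 1;
        }
    }
    return 0;
}

/* Fast path: direct-mapped probe. Slow path: hash map, then refill L1. */
static void *tc_lookup(uint64_t pc)
{
    tc_entry *e = &l1[l1_index(pc)];
    if (e->host_code != NULL && e->guest_pc == pc)
        return e->host_code;
    size_t i = l2_hash(pc);
    for (size_t n = 0; n < L2_SIZE; n++, i = (i + 1) & (L2_SIZE - 1)) {
        if (l2[i].host_code == NULL)
            return NULL;          /* not translated yet */
        if (l2[i].guest_pc == pc) {
            *e = l2[i];           /* evict whatever occupied the L1 slot */
            return e->host_code;
        }
    }
    return NULL;
}
```

An indirect branch would call `tc_lookup(target_pc)` and jump to the result, or return to the interpreter/translator on `NULL`. The point of the direct-mapped level is that the common case is one index computation, one compare and one jump.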
> BTW after the merge you might want to look into optimizing indirect
> branches (and cross-page direct jumps in softmmu) for riscv in qemu.
> See examples with
> $ git log -Stcg_gen_lookup_and_goto_ptr
It was qemu-2.7.50 (late 2016). The benchmarks were generated mid last year.
I can run the benchmarks again... Has it doubled in speed?
Note: I don't even have a register allocator. I've assigned the RISC-V RVC
registers to hard registers (the compiler optimises to choose compressible
registers first) and the remainder are in spill slots that in many cases
can be embedded as memory operands in the x86_64 translation, i.e. no
explicit reload; we let the micro-architecture crack these into micro-ops,
which helps keep code density up.
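To illustrate the memory-operand idea, here is a small C sketch that prints the x86_64 text it would emit for a RISC-V `add rd, rd, rs`. The concrete mapping is an assumption for illustration only and is not rv8's actual assignment: x8..x15 (the registers addressable by 3-bit RVC fields) are pinned to r8..r15, every other register lives in a slot at `[rbp + 8*n]`, and rax is a scratch register. The x0 zero register is ignored here.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical static assignment: RISC-V x8..x15 -> x86-64 r8..r15. */
static const char *const host_reg[8] = {
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
};

static int in_host_reg(unsigned rv) { return rv >= 8 && rv <= 15; }

/* Format the x86-64 operand that holds RISC-V register `rv`. */
static void operand(unsigned rv, char *buf, size_t n)
{
    if (in_host_reg(rv))
        snprintf(buf, n, "%s", host_reg[rv - 8]);
    else
        snprintf(buf, n, "qword [rbp + %u]", rv * 8); /* spill slot */
}

/* Emit text for RISC-V `add rd, rd, rs`. */
static void emit_add(unsigned rd, unsigned rs, char *out, size_t n)
{
    char d[32], s[32];
    operand(rd, d, sizeof d);
    operand(rs, s, sizeof s);
    if (in_host_reg(rd) || in_host_reg(rs)) {
        /* At most one memory operand: fold it into the instruction. */
        snprintf(out, n, "add %s, %s", d, s);
    } else {
        /* x86 does not allow two memory operands: load one explicitly. */
        snprintf(out, n, "mov rax, %s\nadd %s, rax", s, d);
    }
}
```

With this mapping, `emit_add(10, 5, ...)` produces `add r10, qword [rbp + 40]`: the spilled source needs no separate reload instruction. Only when both operands are spilled does an explicit `mov` appear.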
I think I can get close to double again with tiered optimization and a good
register allocator (lifting RISC-V asm to SSA form). It's also a hotspot
interpreter, which is definitely faster than compiling all code; I
benchmarked it. It profiles and only translates hot paths, so code that
only runs a few iterations is not translated. When I did eager translation
I got a slow-down.
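The hotspot decision can be sketched as a per-target execution counter. This is a minimal C illustration under assumptions, not rv8's profiler: the name `is_hot`, the threshold of 10 and the 256-slot counter table are all hypothetical, and aliasing between addresses that share a slot is accepted as an approximation.

```c
#include <stdint.h>

/* Hypothetical tuning values. */
#define HOT_THRESHOLD 10u
#define PROFILE_SLOTS 256u /* power of two */

static uint32_t hit_count[PROFILE_SLOTS];

/*
 * Called by the interpreter each time it enters the code at `pc`.
 * Returns 1 once the code has been entered HOT_THRESHOLD times,
 * meaning it is worth translating; returns 0 while it is still cold.
 * The counter saturates at the threshold, so it cannot overflow.
 */
static int is_hot(uint64_t pc)
{
    uint32_t *c = &hit_count[(pc >> 1) & (PROFILE_SLOTS - 1)];
    if (*c < HOT_THRESHOLD)
        (*c)++;
    return *c >= HOT_THRESHOLD;
}
```

The interpreter loop would keep interpreting while `is_hot(pc)` returns 0 and hand the path to the translator once it returns 1, so code that runs only a few iterations never pays the translation cost.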