On Sat, Mar 03, 2018 at 02:26:12 +1300, Michael Clark wrote:
> It was qemu-2.7.50 (late 2016). The benchmarks were generated mid last year.
> 
> I can run the benchmarks again... Has it doubled in speed?

It depends on the benchmarks. Small-ish benchmarks such as rv8-bench
show about a 1.5x speedup since QEMU v2.6.0 for AArch64:

                AArch64 rv8-bench performance under QEMU user-mode
                  Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  [ASCII bar chart: QEMU v2.8.0, v2.9.0, v2.10.0 and v2.11.0 across aes,
   bigint, dhrystone, miniz, norx, primes, qsort, sha512 and their geomean;
   see the png below for a readable rendering.]
  png: https://imgur.com/Agr5CJd

SPEC06int shows a larger improvement, up to ~2x avg speedup for the train
set:
          AArch64 SPEC06int (train set) performance under QEMU user-mode
                  Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  [ASCII bar chart: QEMU v2.8.0, v2.9.0, v2.10.0 and v2.11.0 across the
   SPEC06int benchmarks (401.bzip2 through 483.xalancbmk) and their geomean;
   see the png below for a readable rendering.]
  png: https://imgur.com/JknVT5H

Note that the test set is less sensitive to the changes:
  https://imgur.com/W7CT0eO

Running small benchmarks (such as the SPEC "test" set or rv8-bench) is
very useful for getting quick feedback on optimizations. However, some
of these runs are still dominated by code that isn't relevant to DBT
performance -- for instance, some of them run for so little time that
the major contributor to execution time is memory allocation.
Therefore, when publishing results it's best to stick with larger
benchmarks that run for longer (e.g. the SPEC "train" set), since they
are more sensitive to DBT performance.
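To make that concrete, here's a rough Amdahl-style sketch (the numbers
are made up for illustration, not measurements): when a fixed cost such
as startup or allocation dominates a short run, even a big DBT
improvement barely shows up in the total.

  # Amdahl-style estimate: only the translated-code part of a run benefits
  # from a DBT improvement; fixed costs (startup, memory allocation, ...)
  # do not. All numbers below are made up.

  def observed_speedup(total_s, fixed_s, dbt_speedup):
      translated_s = total_s - fixed_s
      return total_s / (fixed_s + translated_s / dbt_speedup)

  # Short run: 0.5s total, 0.3s of it fixed cost -> a 2x-better DBT only
  # shows up as ~1.25x overall.
  print(observed_speedup(0.5, 0.3, 2.0))
  # Long run: 60s total, same 0.3s fixed cost -> ~1.99x, close to 2x.
  print(observed_speedup(60.0, 0.3, 2.0))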

I tried running some other benchmarks, such as nbench[1], under rv-jit,
but I quickly hit a "bus error" -- I don't know whether I'm doing
something wrong, or whether binaries built with the glibc cross-compiler
I used to build RISC-V Linux just aren't supported.
I did manage to run rv8-bench on both rv-jit and QEMU (v8 patchset);
rv-jit is 1.30x faster on average for those, although note that I
dropped qsort because it wasn't working properly under rv-jit:

               rv8-bench performance under rv-jit and QEMU user-mode
                  Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
             [qsort does not finish cleanly for rv8, so I dropped it.]
  [ASCII bar chart: rv-jit vs. QEMU (one series labeled b1bae23b7c2) across
   aes, bigint, dhrystone, miniz, norx, primes, sha512 and their geomean;
   see the png below for a readable rendering.]
  png: https://imgur.com/rLmTH3L
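
By the way, the averages above and the geomean bars in the plots are
geometric means of the per-benchmark speedups. A minimal sketch of that
computation (the benchmark times below are placeholders, not the
measured values):

  # Geometric mean of per-benchmark speedups (qemu_time / rvjit_time).
  # The run times below are placeholders, not the measured values.
  from math import prod

  qemu_s  = {'aes': 10.0, 'dhrystone': 4.0, 'miniz': 8.0}   # seconds
  rvjit_s = {'aes':  7.5, 'dhrystone': 3.2, 'miniz': 6.0}

  speedups = [qemu_s[b] / rvjit_s[b] for b in qemu_s]
  geomean = prod(speedups) ** (1.0 / len(speedups))
  print('geomean speedup: %.2fx' % geomean)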

> I think I can get close to double again with tiered optimization and a good
> register allocator (lift RISC-V asm to SSA form). It's also a hotspot
> interpreter, which is definitely faster than compiling all code, as I
> benchmarked it. It profiles and only translates hot paths, so code that
> only runs a few iterations is not translated. When I did eager translation
> I got a slow-down.

Yes, hotspot is great for real-life workloads (e.g. booting a system). Note
though that most benchmarks (e.g. SPEC) don't translate code that often;
most execution time is spent in loops and therefore the quality of
the generated code does matter. Hotspot detection of TBs/traces is great
for this as well, because it allows you to spend more resources generating
higher-quality code -- for instance, see HQEMU[2].
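
As a toy illustration of the idea (and not rv8's or QEMU's actual
mechanism), a hotspot DBT only pays for translation once a block's
execution count crosses a threshold, so the effort concentrates on the
loops that matter:

  # Toy counter-based hotspot detection for a DBT. Illustration only; the
  # threshold, the block granularity and the helpers are all made up.

  HOT_THRESHOLD = 50        # executions before a block gets translated

  exec_count = {}           # guest pc -> times interpreted so far
  translated = {}           # guest pc -> callable standing in for JITed code

  def interpret(pc):
      # Placeholder: interpret one basic block, return the next pc.
      return pc + 4

  def translate(pc):
      # Placeholder: return a callable standing in for the translated block.
      return lambda: pc + 4

  def execute_block(pc):
      if pc in translated:
          return translated[pc]()            # hot path: run translated code
      exec_count[pc] = exec_count.get(pc, 0) + 1
      if exec_count[pc] >= HOT_THRESHOLD:
          translated[pc] = translate(pc)     # translate only hot blocks
          return translated[pc]()
      return interpret(pc)                   # cold code stays interpreted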

Thanks,

                Emilio

[1] https://github.com/cota/nbench
[2] http://www.iis.sinica.edu.tw/papers/dyhong/18243-F.pdf
PS: One page with all the PNGs: https://imgur.com/a/5P5zj

