Hi.

I came across a case where OVPSim shamelessly outperforms QEMU. In a test with
8 CPUs, single-threaded OVPSim is 4 times faster than QEMU tcg-single, and
~30% faster than QEMU mttcg.

I constructed a simple test case that reproduces it.
When I profiled the test, I saw that QEMU spends ~50% of its time inside the
function victim_tlb_hit (according to perf).

Setup:
1. For both QEMU and OVPSim I built a simple machine with 8 RISC-V CPUs and a
single RAM (system mode).
2. The host machine is x86 with 4 cores and only 1 thread per core, so 4 HW
threads in total.
3. The test is "bare metal", no OS.
4. All CPUs run the same program, with no explicit synchronization in the code.
5. Both QEMU and OVPSim use semihosting EXIT, and the simulation ends when the
"last" exit happens.

Test:

```
#define N (10000000ul * 60ul)
#define M (1024*1024)

int my_main(int argc, char* argv[]) {

  volatile long unsigned int a = 0;
  volatile long unsigned int b[M] = {};
  volatile long unsigned int c[M] = {};

  for (long unsigned int i = 1; i < N; i++) {
      int j = i % M;
      a += i;
      a |= (b[j] * i);
      b[j] += a & (c[j] / i);
      c[j] += i + a;
      a += b[j] - c[j];
  }

  // consume a
  return 0;
}
```
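
One thing that may be worth checking in this test is whether b[j] and c[j] compete for the same softmmu TLB slot. The sketch below is only an illustration under assumptions that are not stated in the post: a direct-mapped TLB with 256 entries, 4 KiB pages, 8-byte elements, and c placed exactly sizeof(b) = 8 MiB away from b. The names tlb_index and count_conflicts are hypothetical helpers, not QEMU functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed parameters (not taken from the post). */
#define PAGE_BITS   12u              /* 4 KiB pages */
#define TLB_ENTRIES 256u             /* direct-mapped TLB size */
#define M           (1024u * 1024u)  /* same M as in the test */

/* Slot of a virtual address in a direct-mapped TLB. */
static unsigned tlb_index(uint64_t vaddr)
{
    return (unsigned)((vaddr >> PAGE_BITS) & (TLB_ENTRIES - 1u));
}

/* Number of indices j in [0, M) for which b[j] and c[j] map to the
 * same TLB slot, given the base addresses of the two arrays. */
static size_t count_conflicts(uint64_t b_base, uint64_t c_base)
{
    size_t conflicts = 0;
    for (uint64_t j = 0; j < M; j++) {
        uint64_t off = j * sizeof(uint64_t);
        if (tlb_index(b_base + off) == tlb_index(c_base + off)) {
            conflicts++;
        }
    }
    return conflicts;
}
```

Under these assumptions, count_conflicts(base, base + 8 MiB) reports a conflict for every j, because 8 MiB is a multiple of 256 pages, while moving one array by a single page removes the conflicts. If the real layout behaves like this, each b[j]/c[j] pair would keep evicting each other from the main TLB, which would be consistent with the time going to the victim TLB path. Adding one page of padding between b and c would be a cheap way to test that guess.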

Perf report:

```
  46.78%  qemu-system-riscv64      [.] victim_tlb_hit
  23.68%  qemu-system-riscv64      [.] helper_le_ldq_mmu
   4.46%  qemu-system-riscv64      [.] helper_latch_ld_dest_reg_id
```

Annotated disassembly of victim_tlb_hit:
```
       │    jne    1f9
       │    lea    (%rax,%r9,1),%rcx
       │    add    $0x130,%rcx
  0.25 │    mov    $0x7,%edi
  0.29 │126:shl    $0x4,%rsi
  0.39 │    mov    %rdx,%r8
  1.65 │    shl    $0x5,%r8
  0.35 │    add    0x1fa8(%rax,%rsi,1),%r8
  0.32 │139:mov    $0x1,%esi
  0.37 │    xchg   %esi,(%rax)
 51.86 │    test   %esi,%esi
       │    je     150
       │    jmp    148
       │146:pause
```
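
The xchg / test / pause sequence above looks like a test-and-set spin lock. As a rough illustration of that pattern, here is a minimal sketch using C11 atomics. It is not QEMU's implementation, and spin_t, spin_init, spin_trylock, spin_lock and spin_unlock are hypothetical names. On x86, an xchg with a memory operand is implicitly locked, so it is expensive even when nobody else holds the lock, and perf often attributes the cost of an instruction to the one that follows it, which may be why the samples pile up on the test.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical test-and-set spin lock: 0 = free, 1 = held. */
typedef struct {
    atomic_int value;
} spin_t;

static inline void spin_init(spin_t *s)
{
    atomic_init(&s->value, 0);
}

/* One atomic exchange; returns true if the lock was free and is now ours. */
static inline bool spin_trylock(spin_t *s)
{
    return atomic_exchange_explicit(&s->value, 1, memory_order_acquire) == 0;
}

static inline void spin_lock(spin_t *s)
{
    /* The exchange corresponds to the xchg in the listing. */
    while (atomic_exchange_explicit(&s->value, 1, memory_order_acquire) != 0) {
        /* Wait with plain loads until the lock looks free. A real
         * implementation would issue a CPU relax hint (pause) here. */
        while (atomic_load_explicit(&s->value, memory_order_relaxed) != 0) {
        }
    }
}

static inline void spin_unlock(spin_t *s)
{
    atomic_store_explicit(&s->value, 0, memory_order_release);
}
```

Even with no contention, every spin_lock call in this sketch pays for one atomic read-modify-write, so a code path that takes such a lock on almost every guest memory access would be dominated by it.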


Results:
1. OVPSim single is 4 times faster than QEMU tcg-single.
2. OVPSim single is ~30% faster than QEMU mttcg.
3. When M is changed from 1M to 2, OVPSim single is 2 times faster than QEMU
tcg-single,
   and 2 times slower than QEMU mttcg.

Question: does anyone have an idea or intuition about how the QEMU code could
be improved to speed up the simulation in cases like this?

Thanks,
Igor
