Hi. I came across a case where OVPsim dramatically outperforms QEMU. In an 8-CPU test, single-threaded OVPsim is 4x faster than QEMU tcg-single, and ~30% faster than QEMU MTTCG.
I constructed a simple test case that reproduces it. When I profiled the test with perf, I saw that QEMU spends ~50% of all time inside the function victim_tlb_hit.

Setup:
1. For both QEMU and OVPsim I built a simple machine with 8 RISC-V CPUs and one RAM (system mode).
2. The host machine is x86 with 4 cores, 1 thread per core, so 4 HW threads total.
3. The test is bare metal, no OS.
4. All CPUs run the same program; there is no explicit synchronization in the code.
5. Both QEMU and OVPsim use semihosting EXIT, and simulation ends when the last CPU exits.

Test:
```
#define N (10000000ul * 60ul)
#define M (1024*1024)

int my_main(int argc, char* argv[])
{
    volatile long unsigned int a = 0;
    volatile long unsigned int b[M] = {};
    volatile long unsigned int c[M] = {};
    for (long unsigned int i = 1; i < N; i++) {
        int j = i % M;
        a += i;
        a |= (b[j] * i);
        b[j] += a & (c[j] / i);
        c[j] += i + a;
        a += b[j] - c[j];
    }
    return (int)a; /* consume a so the loop is not optimized away */
}
```

Perf report:
```
 46.78%  qemu-system-riscv64  [.] victim_tlb_hit
 23.68%  qemu-system-riscv64  [.] helper_le_ldq_mmu
  4.46%  qemu-system-riscv64  [.] helper_latch_ld_dest_reg_id
```

Annotated disassembly of victim_tlb_hit:
```
       │       jne    1f9
       │       lea    (%rax,%r9,1),%rcx
       │       add    $0x130,%rcx
  0.25 │       mov    $0x7,%edi
  0.29 │126:   shl    $0x4,%rsi
  0.39 │       mov    %rdx,%r8
  1.65 │       shl    $0x5,%r8
  0.35 │       add    0x1fa8(%rax,%rsi,1),%r8
  0.32 │139:   mov    $0x1,%esi
  0.37 │       xchg   %esi,(%rax)
 51.86 │       test   %esi,%esi
       │       je     150
       │       jmp    148
       │146:   pause
```

Results:
1. Single-threaded OVPsim is 4x faster than QEMU tcg-single.
2. Single-threaded OVPsim is ~30% faster than QEMU MTTCG.
3. When M is changed from 1M to 2, single-threaded OVPsim is 2x faster than QEMU tcg-single, and 2x slower than QEMU MTTCG.

Question: does anyone have an idea or intuition about how the QEMU code could be improved to speed up simulation in cases like this?

Thanks,
Igor
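For context on where the time goes: the hot xchg in the disassembly looks like the test-and-set acquire of a spin lock guarding the small victim-TLB array (the `mov $0x7,%edi` suggests a loop over 8 entries). Below is a simplified sketch of that pattern, not QEMU's actual code; all names (`vtlb_entry`, `victim_tlb_hit_sketch`, `VTLB_SIZE`) are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define VTLB_SIZE 8            /* guessed from the "mov $0x7,%edi" loop bound */

typedef struct {
    unsigned long addr;        /* guest page address tag */
    unsigned long host_off;    /* payload, e.g. host address offset */
} vtlb_entry;

typedef struct {
    atomic_int lock;           /* 0 = free, 1 = held */
    vtlb_entry vtlb[VTLB_SIZE];
} tlb_ctx;

static void spin_lock(atomic_int *l)
{
    /* xchg-style acquire: the instruction perf attributes ~52% of
     * the time to in the report above */
    while (atomic_exchange_explicit(l, 1, memory_order_acquire)) {
        /* busy-wait (real code would issue a pause hint here) */
    }
}

static void spin_unlock(atomic_int *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}

/* Linear probe of the victim TLB under the lock; on a hit, writes
 * the payload to *host_off and returns true. */
bool victim_tlb_hit_sketch(tlb_ctx *tlb, unsigned long addr,
                           unsigned long *host_off)
{
    bool hit = false;
    spin_lock(&tlb->lock);
    for (size_t i = 0; i < VTLB_SIZE; i++) {
        if (tlb->vtlb[i].addr == addr) {
            *host_off = tlb->vtlb[i].host_off;
            hit = true;
            break;
        }
    }
    spin_unlock(&tlb->lock);
    return hit;
}
```

If this reading is right, every memory access that misses the fast-path TLB pays for an atomic xchg, which would explain why the cost grows with M (more TLB pressure) and with the number of emulated CPUs bouncing the lock's cache line.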