[OpenRISC] The reason why QEMU is such a slow OpenRISC emulator.

Sebastian Macke Wed, 16 Oct 2013 20:55:44 -0700

Hi,

After playing around a little with the QEMU OpenRISC emulator, I realized that it is extremely slow in comparison to the i386 emulation. It is in fact around 45-60 times slower. I asked myself why, and I came to some conclusions which I would like to summarize here.

The first thing is of course the missing hardware TLB refill. And indeed, after applying the patches from Stefan Kristiansson, the speed gain was around a factor of 2-4.

The other part, I realized, is the tiny code generator (tcg). I made some tests, which you can find in the attachment, that involve neither TLB refill nor little-endian/big-endian swapping. In this case the speed is still lower by a factor of 10 in comparison to the i386 implementation.

I have reviewed and compared the different code generators and summarized the optimizations we can perform. Unfortunately, most of them would violate the specification, but only in rare cases. So rare that you would have to do it on purpose with hand-written assembly code. The i386 code generator is also off specification, for example by using the lazy flag optimization (http://qemu.weilnetz.de/qemu-tech.html). I am not sure whether old hand-written assembler DOS games would run in QEMU.

What you should know is that the code generator constructs small code fragments with some strict properties:

  - The code is bound to an absolute virtual pc address.
  - The code fragment is never interrupted. It usually contains 5-20 opcodes.
  - The code fragment almost always ends with a jump or something similar (bf, syscall, ...). There are no jumps in between.


Here are the findings.

1.
At the moment the overflow and carry flags are set every time.

This is one of the biggest speed issues. Instead of one converted l.addi opcode we need more than 20.

The best thing is: these flags are *never* used.

I think the main purpose of these flags is to check, after an overflow exception, what happened. In my opinion we can remove them, or set them only if the overflow exception flag is set. Additionally, I would strongly suggest that the overflow exception flag be checked statically instead of dynamically.

I know the l.addc instruction is special. This instruction is never used, by the way. But there are ways to treat it if it follows directly after an l.add instruction.

Such a patch would increase the speed by at least 30%.
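
To make the cost in finding 1 concrete, here is a minimal sketch in plain C (not QEMU code) of what an addition has to do when carry and overflow are tracked, next to the plain addition that would remain if they were dropped. The names add_result, add_with_flags and add_plain are made up for this illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Result of a 32-bit addition together with the two status flags. */
typedef struct {
    uint32_t value;
    bool carry;     /* unsigned wrap-around */
    bool overflow;  /* signed overflow */
} add_result;

/* Full version: the work needed when both flags are tracked on every add. */
static add_result add_with_flags(uint32_t a, uint32_t b)
{
    add_result r;
    r.value = a + b;
    /* Carry: the unsigned sum wrapped around. */
    r.carry = r.value < a;
    /* Overflow: both operands share a sign and the result has the other one. */
    r.overflow = ((~(a ^ b) & (a ^ r.value)) >> 31) != 0;
    return r;
}

/* Cheap version: a single host addition, no flag bookkeeping. */
static uint32_t add_plain(uint32_t a, uint32_t b)
{
    return a + b;
}
```

If the decision between the two variants is made while the code fragment is translated, the generated code for the common case contains only the single addition.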

2.
The OpenRISC architecture does not have its own move register instruction.
Instead it uses either "l.addi rd, rs, 0" or "l.ori rd, rs, 0".
We can check for these special cases and use a move instruction instead.

3.
Define r0 as hardwired to zero. Then we can additionally treat the instructions "l.or rd, rx, r0", "l.addi rd, r0, x" and "l.ori rd, r0, x" as move instructions.

This would add a little more logic to the code generator, but I am pretty sure the emulation speed would profit. The instructions above are very common.
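
The checks from findings 2 and 3 could be sketched as a small classifier that runs at translation time. This is plain C for illustration only, not the QEMU decoder: the types decoded_insn and classified and the function classify are hypothetical, the immediate is assumed to be already extended by the decoder, and r0 is assumed to always read as zero.

```c
#include <stdint.h>

typedef enum { OP_ADDI, OP_ORI, OP_OR, OP_OTHER } or1k_op;
typedef enum { KIND_GENERIC, KIND_MOVE_REG, KIND_MOVE_IMM } insn_kind;

/* Hypothetical decoder output: opcode, register numbers, immediate. */
typedef struct {
    or1k_op op;
    unsigned rd, ra, rb;
    int32_t imm;   /* assumed to be already sign- or zero-extended */
} decoded_insn;

typedef struct {
    insn_kind kind;
    unsigned src_reg;  /* valid for KIND_MOVE_REG */
    int32_t imm;       /* valid for KIND_MOVE_IMM */
} classified;

/* Recognize the move idioms; everything else stays generic. */
static classified classify(decoded_insn d)
{
    classified c = { KIND_GENERIC, 0, 0 };

    switch (d.op) {
    case OP_ADDI:
    case OP_ORI:
        if (d.ra == 0) {            /* l.addi/l.ori rd, r0, x */
            c.kind = KIND_MOVE_IMM;
            c.imm = d.imm;
        } else if (d.imm == 0) {    /* l.addi/l.ori rd, rs, 0 */
            c.kind = KIND_MOVE_REG;
            c.src_reg = d.ra;
        }
        break;
    case OP_OR:
        if (d.rb == 0) {            /* l.or rd, rx, r0 */
            c.kind = KIND_MOVE_REG;
            c.src_reg = d.ra;
        } else if (d.ra == 0) {     /* l.or rd, r0, rx */
            c.kind = KIND_MOVE_REG;
            c.src_reg = d.rb;
        }
        break;
    default:
        break;
    }
    return c;
}
```

Treating l.addi as a move skips the carry and overflow flags it would otherwise write, so this only fits together with the flag handling proposed in finding 1.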

4. The npc (next pc) and ppc (previous pc) variables are updated before every instruction.

This is extremely inefficient. What makes things worse: they are *never* (to a very close approximation of never) used. At the moment the npc is used in some way for delayed instructions, but I am sure we can get rid of them because every code fragment uses fixed virtual addresses.

There is still a way to keep the registers, with some special checks for l.mtspr and l.mfspr.
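
Because every fragment is translated for fixed addresses, the two values are already known at translation time. The sketch below is plain C with made-up helper names ppc_at and npc_at. It shows what a special case for l.mfspr could compute on demand instead of storing both values before each instruction. It only covers straight-line code and reproduces the constants visible in the OP dump at the end of this mail; delay slots would need separate handling.

```c
#include <stdint.h>

#define INSN_SIZE 4u  /* every OpenRISC instruction is 4 bytes long */

/* Previous pc for an instruction translated at address pc. */
static uint32_t ppc_at(uint32_t pc)
{
    return pc - INSN_SIZE;
}

/* Next pc for an instruction translated at address pc. */
static uint32_t npc_at(uint32_t pc)
{
    return pc + INSN_SIZE;
}
```

For the instruction at 0x664 this gives ppc 0x660 and npc 0x668, the same constants the current generator writes with two movi_i32 operations.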


5. Check and branch instructions are independent at the moment.

What makes things worse: the sf... instructions set the branch flag and the next instruction reads it, every time with lots of and and or opcodes.

Additionally, the branch flag could be separated from the SR register to speed things up.
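
The difference can be sketched in plain C (again not QEMU code, all function and type names are made up). The first pair of functions mirrors the current scheme, where l.sfeq writes bit 0x200 into SR and l.bnf masks it out again. The second variant keeps the flag in its own variable and assembles the full SR value only when it is actually read.

```c
#include <stdint.h>
#include <stdbool.h>

#define SR_F 0x200u  /* branch flag bit in SR, the mask seen in the OP dump */

/* Current scheme: l.sfeq clears the bit, then sets it again on equality. */
static uint32_t sfeq_to_sr(uint32_t sr, uint32_t ra, uint32_t rb)
{
    sr &= ~SR_F;
    if (ra == rb) {
        sr |= SR_F;
    }
    return sr;
}

/* Current scheme: l.bnf extracts the bit and branches if it is clear. */
static uint32_t bnf_from_sr(uint32_t sr, uint32_t target, uint32_t fallthrough)
{
    return (sr & SR_F) ? fallthrough : target;
}

/* Alternative: the flag lives outside SR. */
typedef struct {
    uint32_t sr_no_flag;  /* SR with the flag bit always zero */
    bool flag;            /* branch flag kept separately */
} cpu_flags;

/* The architectural SR value is rebuilt only when somebody reads it. */
static uint32_t read_sr(cpu_flags s)
{
    return s.sr_no_flag | (s.flag ? SR_F : 0u);
}

/* l.bnf on the separate flag: no masking needed. */
static uint32_t bnf_from_flag(bool flag, uint32_t target, uint32_t fallthrough)
{
    return flag ? fallthrough : target;
}
```

With the separate flag, a compare directly followed by a branch reduces to one comparison and one conditional jump in the generated code.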

6.
As far as I can see, it is almost impossible by design to get an exception during a delayed instruction, except in single-step mode.

I think it is possible to remove some logic here.

Before I start writing even one patch, I would like to know if there is a chance of getting them accepted, as they will very likely break the specification. In my opinion QEMU is for speed while or1ksim is for accuracy. At the moment QEMU is only a little bit faster than or1ksim, which makes it almost useless.


Sebastian Macke

The code:

/* Both dumps below add the loop counter (edx / r4) to the sum. */
for (int i = 0; i < a; i++)
{
    x += i;
}


----------------
i386 tcg code generator 

IN: 
0x080483f0:  add    %edx,%eax
0x080483f2:  add    $0x1,%edx
0x080483f5:  cmp    %ecx,%edx
0x080483f7:  jne    0x80483f0

OP:
 ld_i32 tmp12,env,$0xffffffbc
 movi_i32 tmp13,$0x0
 brcond_i32 tmp12,tmp13,ne,$0x0
 ---- 0x80483f0
 mov_i32 tmp1,edx
 mov_i32 tmp0,eax
 add_i32 tmp0,tmp0,tmp1
 mov_i32 eax,tmp0
 mov_i32 cc_src,tmp1
 mov_i32 cc_dst,tmp0
 discard cc_src2
 discard cc_op

 ---- 0x80483f2
 movi_i32 tmp1,$0x1
 mov_i32 tmp0,edx
 add_i32 tmp0,tmp0,tmp1
 mov_i32 edx,tmp0
 mov_i32 cc_src,tmp1
 mov_i32 cc_dst,tmp0

 ---- 0x80483f5
 mov_i32 tmp1,ecx
 mov_i32 tmp0,edx
 mov_i32 cc_src,tmp1
 mov_i32 loc11,tmp0
 sub_i32 cc_dst,tmp0,tmp1

 ---- 0x80483f7
 movi_i32 cc_op,$0x10
 discard loc11
 movi_i32 tmp12,$0x0
 brcond_i32 cc_dst,tmp12,ne,$0x1
 goto_tb $0x0
 movi_i32 tmp3,$0x80483f9
 st_i32 tmp3,env,$0x20
 exit_tb $0xabeab220
 set_label $0x1
 goto_tb $0x1
 movi_i32 tmp3,$0x80483f0
 st_i32 tmp3,env,$0x20
 exit_tb $0xabeab221
 set_label $0x0
 exit_tb $0xabeab223

OUT: [size=109]
0xb290c440:  mov    -0x44(%ebp),%ebx
0xb290c443:  test   %ebx,%ebx
0xb290c445:  jne    0xb290c4a3
0xb290c44b:  mov    0x0(%ebp),%ebx
0xb290c44e:  mov    0x8(%ebp),%esi
0xb290c451:  add    %esi,%ebx
0xb290c453:  mov    %ebx,0x0(%ebp)
0xb290c456:  lea    0x1(%esi),%ebx
0xb290c459:  mov    %ebx,0x8(%ebp)
0xb290c45c:  mov    0x4(%ebp),%esi
0xb290c45f:  mov    %esi,0x2c(%ebp)
0xb290c462:  sub    %esi,%ebx
0xb290c464:  mov    %ebx,0x28(%ebp)
0xb290c467:  mov    $0x10,%esi
0xb290c46c:  mov    %esi,0x34(%ebp)
0xb290c46f:  test   %ebx,%ebx
0xb290c471:  jne    0xb290c48d
0xb290c477:  jmp    0xb290c47c
0xb290c47c:  movl   $0x80483f9,0x20(%ebp)
0xb290c483:  mov    $0xabeab220,%eax
0xb290c488:  jmp    0xb779bc15
0xb290c48d:  jmp    0xb290c492
0xb290c492:  movl   $0x80483f0,0x20(%ebp)
0xb290c499:  mov    $0xabeab221,%eax
0xb290c49e:  jmp    0xb779bc15
0xb290c4a3:  mov    $0xabeab223,%eax
0xb290c4a8:  jmp    0xb779bc15

#-----------------------------------------------------------
#-----------------------------------------------------------
#-----------------------------------------------------------
#-----------------------------------------------------------

OpenRISC tcg code generator.
The carry and overflow flag part is removed.
Otherwise the generated code would be two or three times bigger and two times slower.

0x00000660: l.addi r4, r4, 1
0x00000664: l.sfeq  r4, r3
0x00000668: l.bnf 67108862
0x0000066C: l.add r11, r11, r4


isize=16 osize=42
OP:
 ld_i32 tmp0,env,$0xffffffc8
 movi_i32 tmp1,$float_eq_s
 brcond_i32 tmp0,tmp1,ne,$0x0
 ---- 0x660
 movi_i32 ppc,$0x65c
 movi_i32 npc,$0x664
 movi_i32 tmp0,$0x1
 add_i32 r4,r4,tmp0

 ---- 0x664
 movi_i32 ppc,$0x660
 movi_i32 npc,$0x668
 movi_i32 btaken,$float_eq_s
 setcond_i32 btaken,r4,r3,eq
 movi_i32 tmp0,$0xfffffdff
 and_i32 sr,sr,tmp0
 movi_i32 tmp0,$float_eq_s
 brcond_i32 btaken,tmp0,eq,$0x1
 movi_i32 tmp0,$0x200
 or_i32 sr,sr,tmp0
 set_label $0x1

 ---- 0x668
 movi_i32 ppc,$0x664
 movi_i32 npc,$0x66c
 movi_i32 tmp1,$0x200
 and_i32 tmp0,sr,tmp1
 movi_i32 jmp_pc,$0x670
 movi_i32 tmp1,$0x200
 brcond_i32 tmp0,tmp1,eq,$0x2
 movi_i32 jmp_pc,$0x660
 set_label $0x2
 movi_i32 flags,$0x1

 ---- 0x66c
 movi_i32 ppc,$0x668
 movi_i32 npc,$0x670
 add_i32 r11,r11,r4
 movi_i32 flags,$float_eq_s
 mov_i32 pc,jmp_pc
 mov_i32 npc,jmp_pc
 movi_i32 jmp_pc,$float_eq_s
 exit_tb $0x0
 set_label $0x0
 exit_tb $0xb68684cb

OUT: [size=242]
0x0dbb39e0:  mov    -0x38(%ebp),%ebx
0x0dbb39e3:  test   %ebx,%ebx
0x0dbb39e5:  jne    0xdbb3ac8
0x0dbb39eb:  mov    0x10(%ebp),%ebx
0x0dbb39ee:  inc    %ebx
0x0dbb39ef:  mov    %ebx,0x10(%ebp)
0x0dbb39f2:  mov    $0x660,%esi
0x0dbb39f7:  mov    %esi,0x88(%ebp)
0x0dbb39fd:  mov    $0x668,%esi
0x0dbb3a02:  mov    %esi,0x84(%ebp)
0x0dbb3a08:  mov    0xc(%ebp),%esi
0x0dbb3a0b:  cmp    %esi,%ebx
0x0dbb3a0d:  sete   %bl
0x0dbb3a10:  movzbl %bl,%ebx
0x0dbb3a13:  mov    %ebx,0xd4(%ebp)
0x0dbb3a19:  mov    0xa8(%ebp),%esi
0x0dbb3a1f:  and    $0xfffffdff,%esi
0x0dbb3a25:  mov    %esi,0xa8(%ebp)
0x0dbb3a2b:  test   %ebx,%ebx
0x0dbb3a2d:  je     0xdbb3a45
0x0dbb3a33:  mov    0xa8(%ebp),%ebx
0x0dbb3a39:  or     $0x200,%ebx
0x0dbb3a3f:  mov    %ebx,0xa8(%ebp)
0x0dbb3a45:  mov    $0x664,%ebx
0x0dbb3a4a:  mov    %ebx,0x88(%ebp)
0x0dbb3a50:  mov    $0x66c,%ebx
0x0dbb3a55:  mov    %ebx,0x84(%ebp)
0x0dbb3a5b:  mov    0xa8(%ebp),%ebx
0x0dbb3a61:  and    $0x200,%ebx
0x0dbb3a67:  mov    $0x670,%esi
0x0dbb3a6c:  mov    %esi,0x8c(%ebp)
0x0dbb3a72:  cmp    $0x200,%ebx
0x0dbb3a78:  je     0xdbb3a89
0x0dbb3a7e:  mov    $0x660,%ebx
0x0dbb3a83:  mov    %ebx,0x8c(%ebp)
0x0dbb3a89:  mov    $0x668,%ebx
0x0dbb3a8e:  mov    %ebx,0x88(%ebp)
0x0dbb3a94:  mov    0x2c(%ebp),%ebx
0x0dbb3a97:  mov    0x10(%ebp),%esi
0x0dbb3a9a:  add    %esi,%ebx
0x0dbb3a9c:  mov    %ebx,0x2c(%ebp)
0x0dbb3a9f:  xor    %ebx,%ebx
0x0dbb3aa1:  mov    %ebx,0xd0(%ebp)
0x0dbb3aa7:  mov    0x8c(%ebp),%ebx
0x0dbb3aad:  mov    %ebx,0x80(%ebp)
0x0dbb3ab3:  mov    %ebx,0x84(%ebp)
0x0dbb3ab9:  xor    %ebx,%ebx
0x0dbb3abb:  mov    %ebx,0x8c(%ebp)
0x0dbb3ac1:  xor    %eax,%eax
0x0dbb3ac3:  jmp    0xefecc15
0x0dbb3ac8:  mov    $0xb68684cb,%eax
0x0dbb3acd:  jmp    0xefecc15

_______________________________________________
OpenRISC mailing list
[email protected]
http://lists.openrisc.net/listinfo/openrisc
