Hi,
after playing around a bit with the QEMU OpenRISC emulator I realized
that it is extremely slow compared to the i386 emulation: around 45-60
times slower, in fact. I asked myself why and came to some conclusions,
which I summarize here.
The first thing is of course the missing hardware TLB refill. Indeed,
after applying the patches from Stefan Kristiansson the speed gain was
around a factor of 2-4.
The other part is the Tiny Code Generator (TCG). I ran some tests, which
you can find in the attachment, that involve neither TLB refill nor
little-endian/big-endian swapping.
Even in this case the OpenRISC emulation is slower than the i386
implementation by a factor of 10.
I have reviewed and compared the different code generators and
summarized the optimizations we can perform.
Unfortunately most of them would violate the specification, but only in
rare cases: so rare that you would have to trigger them on purpose with
hand-written assembly code.
The i386 code generator is also off specification, for example by using
the lazy flag optimization (http://qemu.weilnetz.de/qemu-tech.html). I am
not sure whether old DOS games written in hand-coded assembler would run
in QEMU.
What you should know is that the code generator constructs small code
fragments with some strict properties:
- The code is bound to an absolute virtual pc address.
- The code fragment is never interrupted. It usually contains 5-20
opcodes.
- The code fragment almost always ends with a jump or something
similar (bf, syscall, ...). There are no jumps in between.
Here are the findings.
1.
At the moment the overflow and carry flags are set every time.
This is one of the biggest speed issues: instead of one op, a translated
l.addi opcode needs more than 20.
The best part is that these flags are *never* used.
I think the main purpose of these flags is to check what happened after
an overflow exception.
In my opinion we can remove them, or set them only if the overflow
exception flag is set.
Additionally, I would strongly suggest that the overflow exception flag
is checked statically (at translation time) instead of dynamically.
I know the l.addc instruction is special. By the way, this instruction
is never used. But there are ways to treat it if it follows directly
after an l.add instruction.
Such a patch would increase the speed by at least 30%.
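To make finding 1 a bit more concrete, here is a minimal sketch in C of
the lazy flag idea borrowed from the i386 target: the add itself only
records its operands and result, and carry and overflow are derived
later, only if something asks for them. The struct and function names
(LazyCpu, lazy_add, ...) are made up for illustration and are not QEMU
code.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical CPU state; names are illustrative, not QEMU's. */
typedef struct {
    uint32_t gpr[32];
    uint32_t cc_a, cc_b, cc_res;  /* operands and result of the last add */
} LazyCpu;

/* Fast path: the add only records its inputs and its result. */
static void lazy_add(LazyCpu *cpu, int rd, int ra, uint32_t b)
{
    uint32_t a = cpu->gpr[ra];
    uint32_t res = a + b;

    cpu->cc_a = a;
    cpu->cc_b = b;
    cpu->cc_res = res;
    cpu->gpr[rd] = res;
}

/* Slow path: evaluated only when the carry flag is actually read. */
static bool lazy_carry(const LazyCpu *cpu)
{
    /* An unsigned add wrapped around iff the result is below an operand. */
    return cpu->cc_res < cpu->cc_a;
}

/* Slow path: evaluated only when the overflow flag is actually read. */
static bool lazy_overflow(const LazyCpu *cpu)
{
    /* Signed overflow: both operands share a sign and the result does not. */
    return ((~(cpu->cc_a ^ cpu->cc_b) & (cpu->cc_a ^ cpu->cc_res)) >> 31) != 0;
}
```

The sketch ignores l.addc (which needs a carry-in) and the special
handling of r0; it is only meant to show where the work moves to.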
2.
The OpenRISC architecture does not have a dedicated register move
instruction.
Instead it uses either "l.addi rd, rs, 0" or "l.ori rd, rs, 0".
We can check for these special cases and use a move instruction instead.
3.
Define r0 as hardwired to zero. Then we can additionally treat the
instructions "l.or rd, rx, r0", "l.addi rd, r0, x" and "l.ori rd, r0, x"
as move instructions.
This would add a little more logic to the code generator, but I am
pretty sure the emulation speed would profit. The instructions above are
very common.
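The checks from findings 2 and 3 could look roughly like the following C
sketch. The enums and the classify() helper are hypothetical names, only
a subset of opcodes is covered, and the sketch assumes that r0 is
hardwired to zero and that the flag side effects of l.addi can be
ignored (see finding 1).

```c
#include <stdint.h>

/* Illustrative subset of opcodes; not the decoder's real types. */
typedef enum { OP_ADDI, OP_ORI, OP_OR } Opcode;
typedef enum { EMIT_GENERIC, EMIT_MOV_REG, EMIT_MOV_IMM } EmitKind;

/*
 * Decide what to emit for one instruction.
 * For OP_OR, `operand` is the register number of the second source;
 * for OP_ADDI and OP_ORI it is the already extended immediate.
 */
static EmitKind classify(Opcode op, int ra, int32_t operand)
{
    switch (op) {
    case OP_ADDI:
    case OP_ORI:
        if (ra == 0) {
            return EMIT_MOV_IMM;    /* l.addi rd, r0, x / l.ori rd, r0, x */
        }
        if (operand == 0) {
            return EMIT_MOV_REG;    /* l.addi rd, rs, 0 / l.ori rd, rs, 0 */
        }
        break;
    case OP_OR:
        if (operand == 0) {
            return EMIT_MOV_REG;    /* l.or rd, rx, r0 */
        }
        break;
    }
    return EMIT_GENERIC;            /* fall back to the normal translation */
}
```

In the loop example at the end of this mail, "l.addi r4, r4, 1" would
still take the generic path, so the gain depends on how often the move
patterns really occur in the guest code.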
4. The npc (next pc) and ppc (previous pc) variables are updated before
every instruction.
This is extremely inefficient. What makes things worse is that they are
*never* (to a very close approximation) used.
At the moment the npc is used in some way for delayed instructions, but
I am sure we can get rid of that because every code fragment uses fixed
virtual addresses.
There is still a way to keep the registers, with some special checks for
l.mtspr and l.mfspr.
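As a rough illustration of finding 4, the following C sketch counts how
many "movi ppc / movi npc" ops a fragment would contain if they were
emitted for every instruction, versus only for instructions that
actually read these registers. The Insn struct and its reads_pc_spr
field are hypothetical; exception paths would also need some way to
recover the values, which the sketch leaves out.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical decoded instruction; only what the sketch needs. */
typedef struct {
    uint32_t pc;        /* fixed virtual address inside the fragment */
    bool reads_pc_spr;  /* e.g. an l.mfspr that reads NPC or PPC */
} Insn;

/* Count the ppc/npc update ops emitted for a fragment of n instructions. */
static int count_pc_updates(const Insn *insns, int n, bool lazy)
{
    int ops = 0;

    for (int i = 0; i < n; i++) {
        /* Eager: always update. Lazy: only when the value is observable. */
        if (!lazy || insns[i].reads_pc_spr) {
            ops += 2;   /* one movi for ppc, one for npc */
        }
    }
    return ops;
}
```

For the four-instruction loop at the end of this mail, the eager variant
gives 8 ops, which matches the eight movi_i32 ppc/npc lines in the OP
dump, while the lazy variant gives 0.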
5. Check and branch instructions are independent at the moment.
What makes things worse: the sf... instructions set the branch flag and
the next instruction reads it, every time with lots of and and or
opcodes.
Additionally the branch flag could be separated from the SR register to
speed things up.
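A minimal C sketch of the separated branch flag from finding 5 follows.
The flag lives in its own variable, so the compare and the branch touch
only that variable, and the full SR value is assembled only when SR is
read or written as a whole. FlagCpu and the helper names are
hypothetical; the 0x200 mask is the one visible in the OP dump at the
end of this mail.

```c
#include <stdint.h>

#define SR_F 0x200u  /* branch flag bit, matching the 0x200 mask in the dump */

/* Hypothetical CPU state; field names are illustrative only. */
typedef struct {
    uint32_t sr_rest;  /* all SR bits except F */
    uint32_t sr_f;     /* branch flag, always 0 or 1 */
} FlagCpu;

/* l.sfeq: a single compare and store, no masking of SR needed. */
static void sf_eq(FlagCpu *cpu, uint32_t a, uint32_t b)
{
    cpu->sr_f = (a == b);
}

/* l.bnf: branch to target if the flag is clear, else fall through. */
static uint32_t bnf_next_pc(const FlagCpu *cpu, uint32_t target,
                            uint32_t fallthrough)
{
    return cpu->sr_f ? fallthrough : target;
}

/* Only whole-register accesses pay for merging and splitting. */
static uint32_t read_sr(const FlagCpu *cpu)
{
    return cpu->sr_rest | (cpu->sr_f ? SR_F : 0u);
}

static void write_sr(FlagCpu *cpu, uint32_t value)
{
    cpu->sr_rest = value & ~SR_F;
    cpu->sr_f = (value & SR_F) != 0;
}
```

With this layout a check followed by a branch needs no and/or ops on SR
at all, which is also a first step towards fusing the two instructions.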
6.
As far as I can see, it is by design almost impossible to get an
exception during a delayed instruction, except in single-step mode.
I think it is possible to remove some logic here.
Before I start writing even one patch, I would like to know whether
there is a chance of getting them accepted, as they will very likely
break the specification. In my opinion QEMU is for speed while or1ksim
is for accuracy. At the moment QEMU is only a little bit faster than
or1ksim, which makes it almost useless.
Sebastian Macke
The code:
/* x accumulates the loop counter, as in the disassembly below */
for (int i = 0; i < a; i++)
{
    x += i;
}
----------------
i386 tcg code generator
IN:
0x080483f0: add %edx,%eax
0x080483f2: add $0x1,%edx
0x080483f5: cmp %ecx,%edx
0x080483f7: jne 0x80483f0
OP:
ld_i32 tmp12,env,$0xffffffbc
movi_i32 tmp13,$0x0
brcond_i32 tmp12,tmp13,ne,$0x0
---- 0x80483f0
mov_i32 tmp1,edx
mov_i32 tmp0,eax
add_i32 tmp0,tmp0,tmp1
mov_i32 eax,tmp0
mov_i32 cc_src,tmp1
mov_i32 cc_dst,tmp0
discard cc_src2
discard cc_op
---- 0x80483f2
movi_i32 tmp1,$0x1
mov_i32 tmp0,edx
add_i32 tmp0,tmp0,tmp1
mov_i32 edx,tmp0
mov_i32 cc_src,tmp1
mov_i32 cc_dst,tmp0
---- 0x80483f5
mov_i32 tmp1,ecx
mov_i32 tmp0,edx
mov_i32 cc_src,tmp1
mov_i32 loc11,tmp0
sub_i32 cc_dst,tmp0,tmp1
---- 0x80483f7
movi_i32 cc_op,$0x10
discard loc11
movi_i32 tmp12,$0x0
brcond_i32 cc_dst,tmp12,ne,$0x1
goto_tb $0x0
movi_i32 tmp3,$0x80483f9
st_i32 tmp3,env,$0x20
exit_tb $0xabeab220
set_label $0x1
goto_tb $0x1
movi_i32 tmp3,$0x80483f0
st_i32 tmp3,env,$0x20
exit_tb $0xabeab221
set_label $0x0
exit_tb $0xabeab223
OUT: [size=109]
0xb290c440: mov -0x44(%ebp),%ebx
0xb290c443: test %ebx,%ebx
0xb290c445: jne 0xb290c4a3
0xb290c44b: mov 0x0(%ebp),%ebx
0xb290c44e: mov 0x8(%ebp),%esi
0xb290c451: add %esi,%ebx
0xb290c453: mov %ebx,0x0(%ebp)
0xb290c456: lea 0x1(%esi),%ebx
0xb290c459: mov %ebx,0x8(%ebp)
0xb290c45c: mov 0x4(%ebp),%esi
0xb290c45f: mov %esi,0x2c(%ebp)
0xb290c462: sub %esi,%ebx
0xb290c464: mov %ebx,0x28(%ebp)
0xb290c467: mov $0x10,%esi
0xb290c46c: mov %esi,0x34(%ebp)
0xb290c46f: test %ebx,%ebx
0xb290c471: jne 0xb290c48d
0xb290c477: jmp 0xb290c47c
0xb290c47c: movl $0x80483f9,0x20(%ebp)
0xb290c483: mov $0xabeab220,%eax
0xb290c488: jmp 0xb779bc15
0xb290c48d: jmp 0xb290c492
0xb290c492: movl $0x80483f0,0x20(%ebp)
0xb290c499: mov $0xabeab221,%eax
0xb290c49e: jmp 0xb779bc15
0xb290c4a3: mov $0xabeab223,%eax
0xb290c4a8: jmp 0xb779bc15
#-----------------------------------------------------------
OpenRISC tcg code generator
The carry and overflow flag part is removed here.
Otherwise the generated code would be two to three times bigger and two
times slower.
0x00000660: l.addi r4, r4, 1
0x00000664: l.sfeq r4, r3
0x00000668: l.bnf 67108862
0x0000066C: l.add r11, r11, r4
isize=16 osize=42
OP:
ld_i32 tmp0,env,$0xffffffc8
movi_i32 tmp1,$float_eq_s
brcond_i32 tmp0,tmp1,ne,$0x0
---- 0x660
movi_i32 ppc,$0x65c
movi_i32 npc,$0x664
movi_i32 tmp0,$0x1
add_i32 r4,r4,tmp0
---- 0x664
movi_i32 ppc,$0x660
movi_i32 npc,$0x668
movi_i32 btaken,$float_eq_s
setcond_i32 btaken,r4,r3,eq
movi_i32 tmp0,$0xfffffdff
and_i32 sr,sr,tmp0
movi_i32 tmp0,$float_eq_s
brcond_i32 btaken,tmp0,eq,$0x1
movi_i32 tmp0,$0x200
or_i32 sr,sr,tmp0
set_label $0x1
---- 0x668
movi_i32 ppc,$0x664
movi_i32 npc,$0x66c
movi_i32 tmp1,$0x200
and_i32 tmp0,sr,tmp1
movi_i32 jmp_pc,$0x670
movi_i32 tmp1,$0x200
brcond_i32 tmp0,tmp1,eq,$0x2
movi_i32 jmp_pc,$0x660
set_label $0x2
movi_i32 flags,$0x1
---- 0x66c
movi_i32 ppc,$0x668
movi_i32 npc,$0x670
add_i32 r11,r11,r4
movi_i32 flags,$float_eq_s
mov_i32 pc,jmp_pc
mov_i32 npc,jmp_pc
movi_i32 jmp_pc,$float_eq_s
exit_tb $0x0
set_label $0x0
exit_tb $0xb68684cb
OUT: [size=242]
0x0dbb39e0: mov -0x38(%ebp),%ebx
0x0dbb39e3: test %ebx,%ebx
0x0dbb39e5: jne 0xdbb3ac8
0x0dbb39eb: mov 0x10(%ebp),%ebx
0x0dbb39ee: inc %ebx
0x0dbb39ef: mov %ebx,0x10(%ebp)
0x0dbb39f2: mov $0x660,%esi
0x0dbb39f7: mov %esi,0x88(%ebp)
0x0dbb39fd: mov $0x668,%esi
0x0dbb3a02: mov %esi,0x84(%ebp)
0x0dbb3a08: mov 0xc(%ebp),%esi
0x0dbb3a0b: cmp %esi,%ebx
0x0dbb3a0d: sete %bl
0x0dbb3a10: movzbl %bl,%ebx
0x0dbb3a13: mov %ebx,0xd4(%ebp)
0x0dbb3a19: mov 0xa8(%ebp),%esi
0x0dbb3a1f: and $0xfffffdff,%esi
0x0dbb3a25: mov %esi,0xa8(%ebp)
0x0dbb3a2b: test %ebx,%ebx
0x0dbb3a2d: je 0xdbb3a45
0x0dbb3a33: mov 0xa8(%ebp),%ebx
0x0dbb3a39: or $0x200,%ebx
0x0dbb3a3f: mov %ebx,0xa8(%ebp)
0x0dbb3a45: mov $0x664,%ebx
0x0dbb3a4a: mov %ebx,0x88(%ebp)
0x0dbb3a50: mov $0x66c,%ebx
0x0dbb3a55: mov %ebx,0x84(%ebp)
0x0dbb3a5b: mov 0xa8(%ebp),%ebx
0x0dbb3a61: and $0x200,%ebx
0x0dbb3a67: mov $0x670,%esi
0x0dbb3a6c: mov %esi,0x8c(%ebp)
0x0dbb3a72: cmp $0x200,%ebx
0x0dbb3a78: je 0xdbb3a89
0x0dbb3a7e: mov $0x660,%ebx
0x0dbb3a83: mov %ebx,0x8c(%ebp)
0x0dbb3a89: mov $0x668,%ebx
0x0dbb3a8e: mov %ebx,0x88(%ebp)
0x0dbb3a94: mov 0x2c(%ebp),%ebx
0x0dbb3a97: mov 0x10(%ebp),%esi
0x0dbb3a9a: add %esi,%ebx
0x0dbb3a9c: mov %ebx,0x2c(%ebp)
0x0dbb3a9f: xor %ebx,%ebx
0x0dbb3aa1: mov %ebx,0xd0(%ebp)
0x0dbb3aa7: mov 0x8c(%ebp),%ebx
0x0dbb3aad: mov %ebx,0x80(%ebp)
0x0dbb3ab3: mov %ebx,0x84(%ebp)
0x0dbb3ab9: xor %ebx,%ebx
0x0dbb3abb: mov %ebx,0x8c(%ebp)
0x0dbb3ac1: xor %eax,%eax
0x0dbb3ac3: jmp 0xefecc15
0x0dbb3ac8: mov $0xb68684cb,%eax
0x0dbb3acd: jmp 0xefecc15