Hi,
I need to correct myself, especially on point 1.
After finding some good documentation on the lazy flag optimization,
I am almost sure that my previous statement, that the i386 emulation
violates the specification, is wrong.
http://bochs.sourceforge.net/How%20the%20Bochs%20works%20under%20the%20hood%202nd%20edition.pdf
I think we can implement it in the same way. The overflow exception
flag is a small problem here, but it can be handled correctly by
invalidating all compiled code whenever this flag changes.
Points 2-6 can be implemented without violating the specification.
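As a rough illustration of the lazy flag idea, here is a minimal sketch
in Python (not QEMU's C). The class and method names are hypothetical
and only model a 32-bit add: the hot path records the operands and the
result, and carry/overflow are derived only when something reads them.

```python
MASK32 = 0xFFFFFFFF


class LazyFlags:
    """Remembers the last flag-setting add instead of the flags themselves."""

    def __init__(self):
        self.a = 0
        self.b = 0
        self.result = 0

    def add(self, a, b):
        # Hot path: one addition and three stores, no flag arithmetic.
        self.a = a & MASK32
        self.b = b & MASK32
        self.result = (self.a + self.b) & MASK32
        return self.result

    def carry(self):
        # Unsigned overflow: the wrapped result is smaller than an operand.
        return self.result < self.a

    def overflow(self):
        # Signed overflow: both operands share a sign the result lacks.
        sign_change = (self.a ^ self.result) & (self.b ^ self.result)
        return (sign_change & 0x80000000) != 0
```

If nothing ever reads the flags, carry() and overflow() are never
called and their cost disappears entirely.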
Sebastian
On 16/10/2013 8:54 PM, Sebastian Macke wrote:
Hi,
after playing around a little with the QEMU OpenRISC emulator, I
realized that it is extremely slow compared to the i386 emulation.
It is in fact around 45-60 times slower. I asked myself why and came to
some conclusions, which I summarize here.
The first thing is of course the missing hardware TLB refill. Indeed,
after applying the patches from Stefan Kristiansson, the speed
gain was around a factor of 2-4.
The other part is the Tiny Code Generator (TCG). I ran some tests,
which you can find in the attachment, that involve neither TLB refill
nor little-endian/big-endian byte swapping.
In this case the speed is lower by a factor of 10 compared to
the i386 implementation.
I have reviewed and compared the different code generators and
summarized the optimizations we can perform.
Unfortunately, most of them would violate the specification, but only
in rare cases: so rare that you would have to trigger them on purpose
with hand-written assembly code.
The i386 code generator is also off specification, for example by
using the lazy flag optimization (http://qemu.weilnetz.de/qemu-tech.html).
I am not sure whether old hand-written assembler DOS games would run in
QEMU.
What you should know is that the code generator constructs small code
fragments with some strict properties:
- The code is bound to an absolute virtual pc address.
- The code fragment is never interrupted. It usually contains 5-20
opcodes.
- The code fragment almost always ends with a jump or something
similar (bf, syscall, ...). There are no jumps in between.
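The fragment properties above can be sketched as follows. This is a
simplified Python model, not QEMU code: the function name, the two
mnemonic sets and the list-of-mnemonics input are all assumptions made
for illustration. It assumes 4-byte instructions and that jumps and
branches carry one delay slot that stays in the same fragment.

```python
# Assumed classification of fragment-ending instructions.
HAS_DELAY_SLOT = {"l.j", "l.jal", "l.jr", "l.jalr", "l.bf", "l.bnf"}
ENDS_IMMEDIATELY = {"l.sys", "l.rfe"}


def split_into_fragments(program, start_pc=0):
    """Split a list of mnemonics into (start_pc, [mnemonics]) fragments."""
    fragments = []
    current = []
    current_pc = start_pc
    in_delay_slot = False
    for index, mnemonic in enumerate(program):
        if not current:
            # Each fragment is bound to the absolute pc of its first opcode.
            current_pc = start_pc + 4 * index
        current.append(mnemonic)
        if in_delay_slot or mnemonic in ENDS_IMMEDIATELY:
            # The fragment ends here; there is never a jump in the middle.
            fragments.append((current_pc, current))
            current = []
            in_delay_slot = False
        elif mnemonic in HAS_DELAY_SLOT:
            in_delay_slot = True
    if current:
        fragments.append((current_pc, current))
    return fragments
```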
Here are the findings.
1.
At the moment the overflow and carry flags are updated every time.
This is one of the biggest speed issues. Instead of one generated
opcode for l.addi we need more than 20.
The best part is that these flags are *never* used.
I think the main purpose of these flags is to check, after an overflow
exception, what happened.
In my opinion we can remove them, or set them only if the overflow
exception flag is set.
Additionally, I would strongly suggest that the overflow exception flag
be checked statically instead of dynamically.
I know the l.addc instruction is special. This instruction is never
used, by the way, but there are ways
to treat it if it follows directly after an l.add instruction.
Such a patch would increase the speed by at least 30%.
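To make "checked statically" concrete, here is a small Python sketch.
It is only a model under assumptions: the class, the op names and the
SR_OVE bit position (bit 12) are illustrative and not taken from the
QEMU source. The overflow exception flag is tested once when a fragment
is translated, and all translated fragments are dropped when the flag
changes.

```python
SR_OVE = 1 << 12  # assumed position of the overflow exception enable bit


class TranslationCache:
    def __init__(self):
        self.sr = 0
        self.blocks = {}       # pc -> list of generated op names
        self.translations = 0  # counts how often we had to translate

    def translate(self, pc):
        # The OVE test happens here, once, not on every executed add.
        ops = ["add"]
        if self.sr & SR_OVE:
            ops.append("check_overflow_and_raise")
        self.translations += 1
        return ops

    def lookup(self, pc):
        if pc not in self.blocks:
            self.blocks[pc] = self.translate(pc)
        return self.blocks[pc]

    def write_sr(self, value):
        # Existing fragments were specialised on the old OVE value,
        # so they are all invalidated when that bit flips.
        if (self.sr ^ value) & SR_OVE:
            self.blocks.clear()
        self.sr = value
```

With OVE clear, which is the common case, the generated code contains
no overflow handling at all.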
2.
The OpenRISC architecture does not have a dedicated register move
instruction.
Instead it uses either "l.addi rd, rs, 0" or "l.ori rd, rs, 0".
We can check for these special cases and use a move instruction instead.
3.
Define r0 as hardwired to zero. Then we can additionally treat the
instructions "l.or rd, rx, r0", "l.addi rd, r0, x" and "l.ori rd, r0, x"
as move instructions.
This would add a little more logic to the code generator, but I am
pretty sure the
emulation speed would benefit. The instructions above are very common.
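Points 2 and 3 together amount to a small peephole check in the
decoder. A minimal Python sketch, assuming instructions are already
decoded into (mnemonic, rd, ra, operand) tuples and that r0 always
reads as zero; the "mov"/"movi" pseudo-ops and the function name are
made up for this example. The l.addi immediate is sign-extended and
the l.ori immediate is zero-extended.

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to an unsigned 32-bit value."""
    imm &= 0xFFFF
    return (imm | 0xFFFF0000) if imm & 0x8000 else imm


def simplify(insn):
    """Rewrite move-like instructions; return anything else unchanged."""
    op, rd, ra, x = insn
    if op in ("l.addi", "l.ori"):
        if x == 0:
            return ("mov", rd, ra)             # rd = ra (point 2)
        if ra == 0:
            value = sign_extend16(x) if op == "l.addi" else x & 0xFFFF
            return ("movi", rd, value)         # rd = constant (point 3)
    elif op == "l.or":
        if x == 0:
            return ("mov", rd, ra)             # rd = ra | r0
        if ra == 0:
            return ("mov", rd, x)              # rd = r0 | rb
    return insn
```

For example, simplify(("l.addi", 3, 4, 0)) yields ("mov", 3, 4), while
("l.addi", 3, 4, 1) is passed through untouched.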
4. The npc (next pc) and ppc (previous pc) variables are updated
before every instruction.
This is extremely inefficient. What makes things worse is that they are
*never* (to a very close approximation) used.
At the moment the npc is used in some way for delayed instructions, but
I am sure we can get rid of them because every code
fragment uses fixed virtual addresses.
The registers can still be kept, with some special checks
for l.mtspr and l.mfspr.
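A sketch of how the fixed fragment addresses make this possible, in
Python and heavily simplified: it ignores delay slots and l.mtspr, the
pseudo-op names are invented, and the SPR numbers 16 (NPC) and 18 (PPC)
are assumptions. Because the translator knows the pc of every
instruction, a read of npc or ppc becomes a constant, and ppc is stored
only once, at the end of the fragment.

```python
SPR_NPC = 16  # assumed SPR number of the next-pc register
SPR_PPC = 18  # assumed SPR number of the previous-pc register


def translate_fragment(start_pc, insns):
    """Translate decoded tuples into pseudo-ops without per-insn pc updates.

    l.mfspr is modelled as ("l.mfspr", rd, spr); everything else is opaque.
    """
    ops = []
    for index, insn in enumerate(insns):
        pc = start_pc + 4 * index
        if insn[0] == "l.mfspr" and insn[2] == SPR_NPC:
            ops.append(("movi", insn[1], pc + 4))
        elif insn[0] == "l.mfspr" and insn[2] == SPR_PPC:
            if index > 0:
                ops.append(("movi", insn[1], pc - 4))
            else:
                # The predecessor of the first instruction is not known
                # statically; read the value saved by the previous fragment.
                ops.append(("load_saved_ppc", insn[1]))
        else:
            ops.append(("exec", insn))
    if insns:
        # One store per fragment instead of two per instruction.
        ops.append(("save_ppc", start_pc + 4 * (len(insns) - 1)))
    return ops
```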
5. Compare and branch instructions are independent at the moment.
What makes things worse is that the sf... instructions set the branch
flag and the next instruction reads it, every time with lots
of AND and OR opcodes.
Additionally, the branch flag could be separated from the SR register
to speed things up.
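Separating the branch flag from SR could look roughly like this. It is
a Python model with hypothetical names, and it assumes the flag sits at
bit 9 of SR: the flag lives in its own variable, so a compare is a
plain store and a branch is a plain load, and the bit is merged into SR
only when SR is actually read or written.

```python
SR_F = 1 << 9  # assumed position of the branch flag inside SR
MASK32 = 0xFFFFFFFF


class Cpu:
    def __init__(self):
        self.sr_rest = 0   # every SR bit except the branch flag
        self.flag = False  # the branch flag, kept on its own

    def sfeq(self, a, b):
        # No masking of SR: just store the comparison result.
        self.flag = (a == b)

    def bf_taken(self):
        # l.bf branches if the flag is set; again no bit fiddling.
        return self.flag

    def read_sr(self):
        # Slow path: rebuild the architectural SR value on demand.
        return self.sr_rest | (SR_F if self.flag else 0)

    def write_sr(self, value):
        self.flag = bool(value & SR_F)
        self.sr_rest = value & MASK32 & ~SR_F
```

The AND/OR work moves from every compare and branch to the rare SR
accesses.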
6.
As far as I can see, it is almost impossible by design to get an
exception during a delayed instruction, except in single-step mode.
I think it is possible to remove some logic here.
Before I start writing even one patch, I would like to know if there is
a chance of getting them accepted, as they will very likely break the
specification. In my opinion QEMU is for speed while or1ksim is for
accuracy. At the moment QEMU is only a little faster than or1ksim,
which makes it almost useless.
Sebastian Macke
_______________________________________________
OpenRISC mailing list
[email protected]
http://lists.openrisc.net/listinfo/openrisc