Hi,

I have to clarify myself, especially on point no. 1.

After finding some nice documentation about the lazy flag optimization, I am almost sure that my previous statement that the i386 emulation violates the specification is wrong.
http://bochs.sourceforge.net/How%20the%20Bochs%20works%20under%20the%20hood%202nd%20edition.pdf

I think we can implement it in the same way. The overflow exception flag is a small problem here, but it can be handled correctly by invalidating all compiled code whenever this flag is changed.
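
To illustrate the general idea of lazy flags, here is a minimal sketch in C. It is not taken from QEMU or Bochs; the LazyFlags struct and the function names are hypothetical. Instead of computing carry and overflow after every add, it only records the operands and the result, and derives a flag when something actually asks for it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lazy-flag state: remember the last add, not its flags. */
typedef struct {
    uint32_t src1;   /* first operand of the last add */
    uint32_t src2;   /* second operand of the last add */
    uint32_t result; /* 32-bit result of the last add */
} LazyFlags;

/* Perform the add and record only what is needed to derive flags later. */
uint32_t lazy_add(LazyFlags *lf, uint32_t a, uint32_t b)
{
    lf->src1 = a;
    lf->src2 = b;
    lf->result = a + b;
    return lf->result;
}

/* Carry: the unsigned sum wrapped around. */
bool lazy_carry(const LazyFlags *lf)
{
    return lf->result < lf->src1;
}

/* Overflow: both operands share a sign that differs from the result's. */
bool lazy_overflow(const LazyFlags *lf)
{
    uint32_t same_sign = ~(lf->src1 ^ lf->src2);
    uint32_t sign_changed = lf->src1 ^ lf->result;
    return ((same_sign & sign_changed) >> 31) != 0;
}
```

The add itself stays at three stores plus the addition; the flag logic only runs in the rare case that the flags are read.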

Points 2-6 can be implemented without violating the specification.

Sebastian



On 16/10/2013 8:54 PM, Sebastian Macke wrote:
Hi,

after playing around a little with the QEMU OpenRISC emulator I realized that it is extremely slow in comparison to the i386 emulation. It is in fact around 45-60 times slower. I asked myself why, and I came to some conclusions which I would like to summarize here.

The first thing is of course the missing hardware TLB refill. Indeed, after applying the patches from Stefan Kristiansson the speed gain was around a factor of 2-4.

The other part I noticed is the tiny code generator (TCG). I made some tests, which you can find in the attachment, that involve neither TLB refill nor little-endian/big-endian swapping. For this case the speed is lower by a factor of 10 in comparison to the i386 implementation.

I have reviewed and compared the different code generators and summarized the optimizations we can perform. Unfortunately, most of them would violate the specification, but only in cases so rare that you would have to trigger them on purpose with hand-written assembly code. The i386 code generator is also off specification, for example by using the lazy flag optimization (http://qemu.weilnetz.de/qemu-tech.html). I am not sure whether old hand-written assembler DOS games would run in QEMU.

What you should know is that the code generator constructs small code fragments with some strict properties:
- The code is bound to an absolute virtual pc address.
- The code fragment is never interrupted. It usually contains 5-20 opcodes.
- The code fragment almost always ends with a jump or something similar (bf, syscall, ...). There are no jumps in between.

Here are the findings.

1.
At the moment the overflow and carry flags are set every time.
This is one of the biggest speed issues. Instead of one converted l.addi opcode we need more than 20.
The best thing is that these flags are *never* used.
I think the main purpose of these flags is to check what happened after an overflow exception. In my opinion we can remove them or only set them if the overflow exception flag is set. Additionally, I would strongly suggest that the overflow exception flag is checked statically instead of dynamically.

I know the l.addc instruction is special. By the way, this instruction is never used. But there are ways
to treat it if it follows directly after an l.add instruction.

Such a patch would increase the speed by at least 30%.
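
As a rough illustration of checking the overflow exception flag statically, the following C sketch picks between two variants of an add-immediate once, at "translation time", instead of testing the flag on every execution. It is not QEMU code: OveCpu, addi_fast, addi_checked and translate_addi are hypothetical names, and the `ove` field merely stands in for the overflow exception flag. A variant chosen this way is only valid as long as the flag keeps its value, so changing the flag would have to discard everything translated so far.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CPU state; `ove` stands in for the overflow exception flag. */
typedef struct {
    uint32_t gpr[32];
    bool overflow; /* stands in for the overflow flag */
    bool ove;
} OveCpu;

typedef void (*AddiFn)(OveCpu *cpu, unsigned rd, unsigned ra, int32_t imm);

/* Cheap variant: no flag bookkeeping at all. */
void addi_fast(OveCpu *cpu, unsigned rd, unsigned ra, int32_t imm)
{
    assert(rd < 32 && ra < 32);
    cpu->gpr[rd] = cpu->gpr[ra] + (uint32_t)imm;
}

/* Full variant: additionally records signed overflow. */
void addi_checked(OveCpu *cpu, unsigned rd, unsigned ra, int32_t imm)
{
    assert(rd < 32 && ra < 32);
    uint32_t a = cpu->gpr[ra];
    uint32_t b = (uint32_t)imm;
    uint32_t r = a + b;
    cpu->overflow = ((~(a ^ b) & (a ^ r)) >> 31) != 0;
    cpu->gpr[rd] = r;
}

/* The decision is made once, when the code is translated. */
AddiFn translate_addi(const OveCpu *cpu)
{
    return cpu->ove ? addi_checked : addi_fast;
}
```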

2.
The OpenRISC architecture does not have its own move register instruction.
Instead it uses either "l.addi rd, rs, 0" or "l.ori rd, rs, 0".
We can check for these special cases and use a move instruction instead.

3.
Define r0 as hardwired. Then we can additionally treat the instructions "l.or rd, rx, r0", "l.addi rd, r0, x" and "l.ori rd, r0, x"
as move instructions.
This would add a little bit more logic to the code generator, but I am pretty sure the
emulation speed would profit. The instructions above are very common.
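
The special cases from points 2 and 3 could be detected roughly as in the following C sketch. It works on an already decoded instruction and does not use the real OpenRISC encodings or any QEMU API; the Insn and Emit types and classify() are hypothetical. It assumes that r0 always reads as zero and that the decoder has already extended the immediate (sign-extended for l.addi, zero-extended for l.ori).

```c
#include <assert.h>
#include <stdint.h>

typedef enum { OP_ADDI, OP_ORI, OP_OR } Opcode;
typedef enum { EMIT_GENERIC, EMIT_MOV_REG, EMIT_MOV_IMM } EmitKind;

/* Hypothetical decoded instruction: rd = ra OP (rb or imm). */
typedef struct {
    Opcode op;
    unsigned rd, ra, rb;
    int32_t imm; /* already extended by the decoder */
} Insn;

/* What the code generator should emit for the instruction. */
typedef struct {
    EmitKind kind;
    unsigned dst;
    unsigned src_reg; /* valid for EMIT_MOV_REG */
    int32_t imm;      /* valid for EMIT_MOV_IMM */
} Emit;

Emit classify(const Insn *in)
{
    assert(in->rd < 32 && in->ra < 32 && in->rb < 32);
    Emit e = { EMIT_GENERIC, in->rd, 0, 0 };

    switch (in->op) {
    case OP_ADDI:
    case OP_ORI:
        if (in->ra == 0) {          /* l.addi/l.ori rd, r0, x */
            e.kind = EMIT_MOV_IMM;
            e.imm = in->imm;
        } else if (in->imm == 0) {  /* l.addi/l.ori rd, rs, 0 */
            e.kind = EMIT_MOV_REG;
            e.src_reg = in->ra;
        }
        break;
    case OP_OR:
        if (in->rb == 0) {          /* l.or rd, rx, r0 */
            e.kind = EMIT_MOV_REG;
            e.src_reg = in->ra;
        } else if (in->ra == 0) {   /* l.or rd, r0, rx */
            e.kind = EMIT_MOV_REG;
            e.src_reg = in->rb;
        }
        break;
    }
    return e;
}
```

Everything that is not recognized falls through to EMIT_GENERIC, so the existing translation path stays untouched for all other instructions.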

4. The npc (next pc) and ppc (previous pc) variables are updated before every instruction. This is extremely inefficient. What makes things worse is that they are *never* (to a very close approximation) used. At the moment the npc is used in some way for delayed instructions, but I am sure we can get rid of them because every code
fragment uses fixed virtual addresses.

There is still a way to keep the registers with some special checks for l.mtspr and l.mfspr.
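
Because a fragment is bound to a fixed virtual address and OpenRISC instructions are 4 bytes long, the pc values can be derived from the fragment's base address on demand instead of being stored before every instruction. The C sketch below shows this for straight-line code only. The Fragment type and the function names are hypothetical, delay slots are ignored, and the ppc of the first instruction is assumed to be recorded once when the fragment is entered.

```c
#include <stdint.h>

/* Hypothetical description of one translated code fragment. */
typedef struct {
    uint32_t base_pc;   /* virtual address of the first instruction */
    uint32_t entry_ppc; /* pc of the instruction executed before entry */
} Fragment;

/* pc of the i-th instruction in the fragment (4-byte instructions). */
uint32_t frag_pc(const Fragment *f, unsigned i)
{
    return f->base_pc + 4u * i;
}

/* npc for sequential execution; a delay slot would need the branch target. */
uint32_t frag_npc(const Fragment *f, unsigned i)
{
    return frag_pc(f, i) + 4u;
}

/* ppc: the previous instruction, or the recorded value at fragment entry. */
uint32_t frag_ppc(const Fragment *f, unsigned i)
{
    return i == 0 ? f->entry_ppc : frag_pc(f, i - 1);
}
```

With something like this, only l.mfspr, l.mtspr and exception entry would need to materialize the values.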

5. Check and branch instructions are independent at the moment.
What makes things worse is that the sf... instructions set the branch flag and the next instruction reads it, every time with lots
of and and or opcodes.
Additionally, the branch flag could be separated from the SR register to speed things up.
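
A minimal C sketch of keeping the branch flag outside of SR could look as follows. The names are hypothetical and this is not the QEMU data layout; the sketch assumes the flag is bit 9 of SR. The compare and branch paths then touch only a single boolean, and the full SR value is assembled only when it is actually read or written.

```c
#include <stdbool.h>
#include <stdint.h>

#define SR_F (1u << 9) /* assumed position of the branch flag in SR */

/* Hypothetical CPU state with the branch flag split out of SR. */
typedef struct {
    uint32_t sr_rest; /* all SR bits except F */
    bool flag;        /* the branch flag on its own */
} FlagCpu;

/* Something like l.sfeq: a single store, no masking of SR. */
void set_flag_eq(FlagCpu *c, uint32_t a, uint32_t b)
{
    c->flag = (a == b);
}

/* Something like l.bf: a single load decides the branch. */
bool branch_taken_bf(const FlagCpu *c)
{
    return c->flag;
}

/* Only an explicit read of SR pays for merging the flag back in. */
uint32_t read_sr(const FlagCpu *c)
{
    return (c->sr_rest & ~SR_F) | (c->flag ? SR_F : 0u);
}

/* A write to SR splits the value again. */
void write_sr(FlagCpu *c, uint32_t value)
{
    c->flag = (value & SR_F) != 0;
    c->sr_rest = value & ~SR_F;
}
```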

6.
As far as I can see, it is by design almost impossible to get an exception during a delayed instruction, except in single-step mode.
I think it is possible to remove some logic here.


Before I start writing even one patch I would like to know if there is a chance to get them accepted, as they will very likely break the specification. In my opinion QEMU is for speed while or1ksim is for accuracy. At the moment QEMU is only a little bit faster than or1ksim, which makes it almost useless.

Sebastian Macke









_______________________________________________
OpenRISC mailing list
[email protected]
http://lists.openrisc.net/listinfo/openrisc
