Re: [OpenRISC] The reason why QEMU is such a slow OpenRISC emulator.

Sebastian Macke Thu, 17 Oct 2013 09:44:23 -0700

Hi,

I have to correct myself, especially on point no. 1.

After finding some nice documentation about the lazy flag optimization, I am almost sure that my previous statement that the i386 emulation violates the specification is wrong.

http://bochs.sourceforge.net/How%20the%20Bochs%20works%20under%20the%20hood%202nd%20edition.pdf

I think we can implement it in the same way. But the overflow exception flag is a small problem here. There are ways to implement it correctly by invalidating all compiled code if this flag is changed.
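
As a rough illustration of the lazy flag idea (a minimal sketch, not the Bochs or QEMU code): the emulator only remembers the operands and result of the last addition, and derives carry and overflow when something actually asks for them. The class name `LazyFlags` and its methods are made up for this example, and only 32-bit addition is modelled.

```python
MASK32 = 0xFFFFFFFF


class LazyFlags:
    """Hypothetical model: remember the last add, derive flags on demand."""

    def __init__(self):
        self.a = 0
        self.b = 0
        self.result = 0

    def record_add(self, a, b):
        # The hot path only stores three values; no flag is computed here.
        self.a = a & MASK32
        self.b = b & MASK32
        self.result = (self.a + self.b) & MASK32
        return self.result

    def carry(self):
        # Unsigned wrap-around: the truncated sum is smaller than an operand.
        return self.result < self.a

    def overflow(self):
        # Signed overflow: both operands share a sign that the result lacks.
        sign_change = ~(self.a ^ self.b) & (self.a ^ self.result)
        return bool(sign_change & 0x80000000)
```

Since most additions are never followed by a flag read, the cost of `carry()` and `overflow()` is paid only in the rare case.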


Points 2-6 can be implemented without violating the specification.

Sebastian



On 16/10/2013 8:54 PM, Sebastian Macke wrote:

Hi,
after I played around a little with the QEMU OpenRISC emulator, I realized that it is extremely slow in comparison to the i386 emulation. It is in fact around 45-60 times slower. I asked myself why and came to some conclusions, which I summarize here.
The first thing is of course the missing hardware TLB refill. And indeed, after applying the patches from Stefan Kristiansson, the speed gain was around a factor of 2-4.
The other part I noticed is the Tiny Code Generator (TCG). I made some tests, which you can find in the attachment, that involve neither TLB refill nor little-endian/big-endian swapping. For this case the speed is reduced by a factor of 10 in comparison to the i386 implementation.
I have reviewed and compared the different code generators and summarized the optimizations we can perform. Unfortunately most of them would violate the specification, but only in rare cases. So rare that you would have to do it on purpose with hand-written assembly code. The i386 code generator is also off specification by using the lazy flag optimization, for example (http://qemu.weilnetz.de/qemu-tech.html). I am not sure if old hand-written assembler DOS games would run in QEMU.
What you should know is that the code generator constructs small code fragments with some strict properties:
  - The code is bound to an absolute virtual pc address.
  - This code fragment is never interrupted. It usually contains 5-20 opcodes.
  - The code fragment almost always ends with a jump or something similar (bf, syscall, ...). There are no jumps in between.
Here are the findings.

1.
At the moment the overflow and carry flags are set every time.
This is one of the biggest speed issues. Instead of one opcode for a converted l.addi we need more than 20.
The best thing is: these flags are *never* used.
I think the main purpose of these flags is to check, after the overflow exception, what happened. In my opinion we can remove them or only set them if the overflow exception flag is set. Additionally, I would strongly suggest that the overflow exception flag is checked statically instead of dynamically.
I know the l.addc instruction is special. This instruction is never used, by the way. But there are ways to treat it if it follows directly after an l.add instruction.

Such a patch would increase the speed by at least 30%.
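
Checking the overflow exception flag statically could look roughly like the sketch below: the translator reads the flag once, at translation time, and emits the overflow check only if it is set. Translated code then depends on that value, so changing the flag has to throw the existing translations away. Everything here (`TranslationCache`, the pseudo-op names) is invented for illustration and is not the QEMU API.

```python
class TranslationCache:
    """Hypothetical cache of translated fragments, keyed by virtual pc."""

    def __init__(self):
        self.ove = False        # assumed mirror of the overflow exception flag
        self.blocks = {}        # pc -> list of emitted pseudo-ops
        self.translations = 0   # how often we actually had to translate

    def set_ove(self, value):
        if value != self.ove:
            self.ove = value
            # Existing fragments were specialised for the old value.
            self.blocks.clear()

    def translate_add(self, pc):
        if pc not in self.blocks:
            ops = ["add"]
            if self.ove:
                # Only emitted when the flag was set at translation time.
                ops.append("check_overflow_and_raise")
            self.blocks[pc] = ops
            self.translations += 1
        return self.blocks[pc]
```

The trade-off is that toggling the flag becomes expensive, which seems acceptable if it changes rarely.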

2.
The OpenRISC architecture does not have its own move register instruction.
Instead it uses either "l.addi rd, rs, 0" or "l.ori rd, rs, 0".
We can check for these special cases and use a move instruction instead.

3.
Define r0 as hardwired to zero. Then we can additionally treat the instructions "l.or rd, rx, r0", "l.addi rd, r0, x" and "l.ori rd, r0, x" as move instructions.
This would add a little more logic to the code generator, but I am pretty sure the emulation speed would profit. The instructions above are very common.
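
The idiom checks from points 2 and 3 might be sketched as a small classifier that runs before the generic translation path. The function name `classify` and the pseudo-ops "mov" and "movi" are assumptions for this example. It also assumes that r0 always reads as zero and that the immediate has already been sign- or zero-extended by the decoder.

```python
def classify(op, rd, ra, operand):
    """Return a cheaper pseudo-op for a known move idiom, else None.

    op      -- mnemonic, e.g. "l.addi"
    rd, ra  -- destination and first source register numbers
    operand -- the immediate for l.addi/l.ori, the register number rb for l.or
    """
    if op in ("l.addi", "l.ori"):
        if operand == 0:
            # l.addi rd, rs, 0 / l.ori rd, rs, 0: plain register move.
            return ("mov", rd, ra)
        if ra == 0:
            # l.addi rd, r0, x / l.ori rd, r0, x: load a constant.
            return ("movi", rd, operand)
    if op == "l.or" and operand == 0:
        # l.or rd, rx, r0: plain register move.
        return ("mov", rd, ra)
    return None  # not an idiom, use the generic translation
```

For example, `classify("l.ori", 3, 0, 0x1234)` yields `("movi", 3, 0x1234)`, while `classify("l.addi", 3, 4, 8)` falls through to the generic path.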

4.
The npc (next pc) and ppc (previous pc) variables are updated before every instruction.
This is extremely inefficient. And what makes things worse: they are *never* (to a very close approximation of never) used. At the moment the npc is used in some way for delayed instructions, but I am sure we can get rid of them because every code fragment uses fixed virtual addresses.
There is still a way to keep the registers with some special checks for l.mtspr and l.mfspr.
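
Because a fragment is bound to a fixed virtual address, the address of every instruction in it is known at translation time. A sketch of what that allows: instead of emitting a pc update before every instruction, emit one only in front of instructions that might read it. The function `translate_block` and the pseudo-op "sync_pc" are hypothetical, every l.mfspr is conservatively treated as a possible pc read, and the ppc value at fragment entry (which depends on the previous fragment) is not modelled.

```python
def translate_block(start_pc, mnemonics, eager):
    """Emit pseudo-ops for one fragment starting at start_pc.

    eager=True mimics the current behaviour (sync before every instruction),
    eager=False syncs only where the pc might be observed.
    """
    ops = []
    for i, mnemonic in enumerate(mnemonics):
        pc = start_pc + 4 * i              # known constant at translation time
        may_read_pc = mnemonic == "l.mfspr"
        if eager or may_read_pc:
            ops.append(("sync_pc", pc))
        ops.append(("exec", mnemonic))
    return ops
```

For a three-instruction fragment containing one l.mfspr, the eager variant emits six pseudo-ops and the lazy one four.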

5.
Check and branch instructions are independent at the moment.
What makes things worse: the sf... instructions set the branch flag and the next instruction reads it, every time with lots of and and or opcodes.
Additionally, the branch flag could be separated from the SR register to speed things up.
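
Separating the branch flag from SR could be sketched as follows: the flag lives in its own variable, so a compare just stores a boolean and a branch just tests it, and the full SR value is only assembled when SR itself is read or written. The class `CpuState` and its field names are invented for this example. The flag is taken to be bit 9 of SR, which matches my reading of the architecture manual but should be checked against it.

```python
SR_F = 1 << 9  # assumed position of the branch flag F inside SR


class CpuState:
    """Hypothetical CPU state with the branch flag kept outside SR."""

    def __init__(self):
        self.sr_rest = 0     # all SR bits except F
        self.flag = False    # the branch flag on its own

    def set_flag(self, condition):
        # What an l.sf* instruction would do: no masking of SR needed.
        self.flag = bool(condition)

    def read_sr(self):
        # Only an explicit SR read pays for merging the flag back in.
        return self.sr_rest | (SR_F if self.flag else 0)

    def write_sr(self, value):
        self.flag = bool(value & SR_F)
        self.sr_rest = value & ~SR_F & 0xFFFFFFFF
```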

6.
As far as I can see, it is almost impossible by design to get an exception during a delayed instruction. Only in single-step mode.
I think it is possible to remove some logic here.

Before I start writing even one patch, I would like to know if there is a chance to get them accepted, as they will highly likely break the specification. In my opinion QEMU is for speed while or1ksim is for accuracy. At the moment QEMU is only a little bit faster than or1ksim, which makes it almost useless.

Sebastian Macke

_______________________________________________
OpenRISC mailing list
[email protected]
http://lists.openrisc.net/listinfo/openrisc
