On 17/10/2013 9:49 AM, Julius Baxter wrote:
On Thu, Oct 17, 2013 at 4:54 AM, Sebastian Macke <[email protected]> wrote:
Hi,
after playing around a little with the QEMU OpenRISC emulator, I
realized that it is extremely slow in comparison to the i386 emulation.
It is in fact around 45-60 times slower. I asked myself why and came to
some conclusions, which I summarize here.
The first thing is of course the missing hardware TLB refill. And indeed,
after applying the patches from Stefan Kristiansson, the speed gain was
around a factor of 2-4.
The other part is the Tiny Code Generator (TCG). I made some tests, which
you can find in the attachment, that involve neither TLB refill nor
little-endian/big-endian byte swapping.
Even in this case the speed is lower by a factor of 10 in comparison to
the i386 implementation.
I have reviewed and compared the different code generators and summarized
the optimizations we can perform.
Unfortunately most of them would violate the specification, but only in
rare cases: so rare that you would have to trigger them on purpose with
hand-written assembly code.
The i386 code generator is also off specification, for example by using
the lazy flag optimization (http://qemu.weilnetz.de/qemu-tech.html). I am
not sure if old hand-written assembler DOS games would run in QEMU.
What you should know is that the code generator constructs small code
fragments with some strict properties:
- The code is bound to an absolute virtual pc address.
- The code fragment is never interrupted. It usually contains 5-20
opcodes.
- The code fragment almost always ends with a jump or something similar
(bf, syscall, ...). There are no jumps in between.
Here are the findings.
1.
At the moment the overflow and carry flags are set every time.
This is one of the biggest speed issues. Instead of one generated opcode
for an l.addi, we need more than 20.
The best part is that these flags are *never* used.
I think the main purpose of these flags is to check what happened after
an overflow exception.
In my opinion we can remove them, or only set them if the overflow
exception flag is set.
Additionally, I would strongly suggest that the overflow exception flag is
checked statically instead of dynamically.
I know the l.addc instruction is special. This instruction is never used,
by the way. But there are ways to treat it if it follows directly after an
l.add instruction.
Such a patch would increase the speed by at least 30%.
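To make the idea in item 1 more concrete, here is a minimal sketch in plain C (not actual QEMU/TCG code) of computing carry and overflow lazily. The LazyCpu struct and all function names are hypothetical and only serve as illustration: the add itself just remembers its operands and result, and the flags are derived only if somebody actually asks for them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CPU state: instead of carry/overflow bits, remember the
 * operands and result of the last flag-setting addition. */
typedef struct {
    uint32_t gpr[32];
    uint32_t last_a, last_b, last_res;
} LazyCpu;

/* Emulated l.add / l.addi: a few stores, no flag arithmetic. */
static inline void lazy_add(LazyCpu *cpu, int rd, uint32_t a, uint32_t b)
{
    assert(rd >= 0 && rd < 32);
    uint32_t res = a + b;
    cpu->last_a = a;
    cpu->last_b = b;
    cpu->last_res = res;
    cpu->gpr[rd] = res;
}

/* Carry: the unsigned sum wrapped around. */
static inline bool lazy_carry(const LazyCpu *cpu)
{
    return cpu->last_res < cpu->last_a;
}

/* Overflow: both operands share a sign and the result has the other one. */
static inline bool lazy_overflow(const LazyCpu *cpu)
{
    uint32_t a = cpu->last_a, b = cpu->last_b, r = cpu->last_res;
    return ((~(a ^ b) & (a ^ r)) >> 31) != 0;
}
```

In such a scheme only the rare readers of the flags (for example l.addc or a read of SR) would call lazy_carry() or lazy_overflow(); the common l.add/l.addi path stays short.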
2.
The OpenRISC architecture does not have its own register move instruction.
Instead it uses either "l.addi rd, rs, 0" or "l.ori rd, rs, 0".
We can check for these special cases and use a move instruction instead.
3.
Define r0 as hardwired to zero. Then we can additionally treat the
instructions "l.or rd, rx, r0", "l.addi rd, r0, x" and "l.ori rd, r0, x"
as move instructions.
This would add a little more logic to the code generator, but I am
pretty sure the emulation speed would profit. The instructions above are
very common.
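Items 2 and 3 could be sketched roughly as follows. This is plain C over a hypothetical, already decoded instruction (the Insn and Lowered types and the lower() function are my own invention, not QEMU structures). It assumes that imm has already been sign- or zero-extended by the decoder, that r0 always reads as zero, and that the flags of l.addi are not needed (see item 1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded instruction; not a real QEMU type. */
typedef enum { OP_ADDI, OP_ORI, OP_OR, OP_OTHER } Op;

typedef struct {
    Op op;
    int rd, ra, rb;
    int32_t imm;    /* assumed to be already extended by the decoder */
} Insn;

typedef enum { K_GENERIC, K_MOVE_REG, K_MOVE_IMM } Kind;

typedef struct {
    Kind kind;
    int rd;
    int rs;         /* valid for K_MOVE_REG */
    int32_t value;  /* valid for K_MOVE_IMM */
} Lowered;

/* Classify an instruction as a plain move where possible. */
static Lowered lower(Insn in)
{
    Lowered out = { K_GENERIC, in.rd, 0, 0 };

    switch (in.op) {
    case OP_ADDI:
    case OP_ORI:
        if (in.ra == 0) {            /* l.addi/l.ori rd, r0, x */
            out.kind = K_MOVE_IMM;
            out.value = in.imm;
        } else if (in.imm == 0) {    /* l.addi/l.ori rd, rs, 0 */
            out.kind = K_MOVE_REG;
            out.rs = in.ra;
        }
        break;
    case OP_OR:
        if (in.rb == 0) {            /* l.or rd, rx, r0 */
            out.kind = K_MOVE_REG;
            out.rs = in.ra;
        } else if (in.ra == 0) {     /* l.or rd, r0, rx */
            out.kind = K_MOVE_REG;
            out.rs = in.rb;
        }
        break;
    default:
        break;
    }
    return out;
}
```

The translator would then emit a single register or immediate move for K_MOVE_REG and K_MOVE_IMM and fall back to the normal path for K_GENERIC.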
4. The npc (next pc) and ppc (previous pc) variables are updated before
every instruction.
This is extremely inefficient. And what makes things worse: they are
*never* (to a very close approximation) used.
At the moment the npc is used in some way for delayed instructions, but I
am sure we can get rid of them because every code fragment uses fixed
virtual addresses.
There is still a way to keep the registers with some special checks for
l.mtspr and l.mfspr.
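A small sketch of why item 4 should be safe for straight-line code. It is plain C with a hypothetical Cpu struct (the stores counter exists only to show the saved work) and ignores branches and delay slots. Because the start address and length of a fragment are known at translation time, the final pc/npc/ppc values can be written once at the end instead of before every instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU state; `stores` only counts bookkeeping writes. */
typedef struct {
    uint32_t pc, npc, ppc;
    unsigned stores;
} Cpu;

/* Current behaviour: update pc/npc/ppc before every instruction. */
static void run_block_naive(Cpu *cpu, uint32_t start, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        uint32_t pc = start + 4u * i;
        cpu->ppc = cpu->pc;
        cpu->pc = pc;
        cpu->npc = pc + 4u;
        cpu->stores += 3;
        /* ... body of instruction i would run here ... */
    }
}

/* Proposed behaviour: start and n are translation-time constants, so the
 * final values can be stored once when the fragment is left. */
static void run_block_lazy(Cpu *cpu, uint32_t start, unsigned n)
{
    assert(n >= 1);
    /* ... bodies of all n instructions would run here ... */
    uint32_t last = start + 4u * (n - 1u);
    cpu->ppc = (n > 1) ? last - 4u : cpu->pc;
    cpu->pc = last;
    cpu->npc = last + 4u;
    cpu->stores += 3;
}
```

Both functions leave the same values in pc, npc and ppc; the lazy one needs 3 stores per fragment instead of 3 per instruction. An l.mfspr that reads npc or ppc in the middle of a fragment could get the value as a constant computed by the translator.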
5. Check and branch instructions are independent at the moment.
What makes things worse: the sf... instructions set the branch flag and
the next instruction reads it, every time with lots of and/or opcodes.
Additionally the branch flag could be separated from the SR register to
speed things up.
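The separation of the branch flag from SR in item 5 could look roughly like this. Again this is plain C with hypothetical names, not QEMU code. I assume here that the flag is SR bit 9, which is how I read the architecture manual; please treat that as an assumption. The flag lives in its own variable, and SR is only assembled or split when it is read or written as a whole.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SR_F (1u << 9)  /* assumed position of the branch flag in SR */

/* Hypothetical CPU state: branch flag kept outside of SR. */
typedef struct {
    uint32_t sr_rest;   /* all SR bits except F */
    bool flag;          /* the branch flag on its own */
} Cpu;

/* l.sfeq: one compare and one store, no masking of SR. */
static inline void op_sfeq(Cpu *cpu, uint32_t a, uint32_t b)
{
    cpu->flag = (a == b);
}

/* l.bf: choose the next pc directly from the separate flag. */
static inline uint32_t op_bf(const Cpu *cpu, uint32_t taken,
                             uint32_t not_taken)
{
    return cpu->flag ? taken : not_taken;
}

/* Reading SR as a whole (e.g. l.mfspr): merge the flag back in. */
static inline uint32_t read_sr(const Cpu *cpu)
{
    return (cpu->sr_rest & ~SR_F) | (cpu->flag ? SR_F : 0u);
}

/* Writing SR as a whole (e.g. l.mtspr): split the flag out again. */
static inline void write_sr(Cpu *cpu, uint32_t value)
{
    cpu->flag = (value & SR_F) != 0;
    cpu->sr_rest = value & ~SR_F;
}
```

With this layout the and/or work moves from every sf.../bf pair to the comparatively rare full reads and writes of SR.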
6.
As far as I can see, it is by design almost impossible to get an exception
during a delayed instruction, except in single-step mode.
I think it is possible to remove some logic here.
Before I start writing even one patch, I would like to know if there is a
chance to get them accepted, as they will very likely break the
specification. In my opinion QEMU is for speed while or1ksim is for
accuracy. At the moment QEMU is only a little bit faster than or1ksim,
which makes it almost useless.
Hi Sebastian
Great detective work here.
I wouldn't mind if QEMU didn't support a particular subset of the
or1k instruction set, assuming it supports whatever the compiler is
kicking out, and a big warning is included with it. Do you think,
though, over time an efficient implementation could be created?
What I expect QEMU could be good for is in its Linux user mode where,
correct me if I'm wrong here, it's possible to run or1k-linux ELFs on
any system capable of running QEMU. As well as full system simulation
to run a full Linux-based system. Is this the goal?
Cheers
Julius
Hi Julius,
you are right. At the moment you can execute or1k-linux ELFs on any
system. I have tried it with static binaries, e.g. busybox.
It runs without problems as far as I can see. Would be nice to use it
together with the gcc testsuite.
My original goal was to build an image with the compiler tools to get rid
of the cross-compiling problems and to provide others with such an image
for easy development. Unfortunately that is impossible right now because
of the speed. It seems that at the moment jor1k is the fastest emulator,
which is not acceptable.
So yes, I will try to understand the tiny code generator from QEMU and
try to optimize it a little bit.
Sebastian