On Thu, Oct 17, 2013 at 4:54 AM, Sebastian Macke <[email protected]> wrote:
> Hi,
>
> After playing around a little with the QEMU OpenRISC emulator, I
> realized that it is extremely slow in comparison to the i386 emulation.
> It is in fact around 45-60 times slower. I asked myself why and came to
> some conclusions, which I summarize here.
>
> The first thing is of course the missing hardware tlb refill. And indeed
> after applying the patches from Stefan Kristiansson the speed gain was
> around a factor of 2-4.
>
> The other part, I realized, is the Tiny Code Generator (TCG). I ran some
> tests, which you can find in the attachment, that involve neither TLB
> refill nor little-endian/big-endian byte swapping.
> In this case the emulation is slower by a factor of 10 in comparison to the
> i386 implementation.
>
> I have reviewed and compared the different code generators and summarized
> the optimizations we can perform.
> Unfortunately most of them would violate the specification, but only in
> rare cases: so rare that you would have to trigger them on purpose with
> hand-written assembly code.
> The i386 code generator is also off specification, for example by using the
> lazy flag optimization (http://qemu.weilnetz.de/qemu-tech.html). I am not
> sure whether old hand-written assembler DOS games would run in QEMU.
>
> What you should know is that the code generator constructs small code
> fragments with some strict properties:
>   - The code is bound to an absolute virtual pc address.
>   - The code fragment is never interrupted. It usually contains 5-20
> opcodes.
>   - The code fragment almost always ends with a jump or something similar
> (bf, syscall, ...). There are no jumps in between.
>
> Here are the findings.
>
> 1.
> At the moment the overflow and carry flags are set every time.
> This is one of the biggest speed issues. A single l.addi is translated
> into more than 20 opcodes instead of one.
> The best part is that these flags are *never* used.
> I think the main purpose of these flags is to check what happened after an
> overflow exception.
> In my opinion we can remove them or only set them if the overflow exception
> flag is set.
> Additionally, I would strongly suggest that the overflow exception flag be
> checked statically instead of dynamically.
>
> I know the l.addc instruction is special. This instruction is never used,
> by the way. But there are ways to treat it if it follows directly after an
> l.add instruction.
>
> Such a patch would increase the speed by at least 30%.
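
To make finding 1 concrete, here is a rough sketch of one way the flag work could be deferred, borrowing the lazy-flag idea mentioned above for i386. It is plain C rather than TCG code, and the struct and function names (LazyCpu, lazy_add, lazy_carry, lazy_overflow) are made up for illustration; they are not QEMU's. The add only records its operands, and carry/overflow are computed when something actually asks for them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CPU state for illustration only, not QEMU's real struct. */
typedef struct {
    uint32_t gpr[32];
    /* Operands and result of the last flag-setting add, kept for later. */
    uint32_t last_a, last_b, last_res;
} LazyCpu;

/* Emulate l.add / l.addi: three extra stores, no flag arithmetic. */
static void lazy_add(LazyCpu *cpu, unsigned rd, uint32_t a, uint32_t b)
{
    assert(rd < 32);
    uint32_t res = a + b;
    cpu->last_a = a;
    cpu->last_b = b;
    cpu->last_res = res;
    cpu->gpr[rd] = res;
}

/* Carry: the unsigned sum wrapped around. */
static bool lazy_carry(const LazyCpu *cpu)
{
    return cpu->last_res < cpu->last_a;
}

/* Overflow: both operands have the same sign and the result's sign differs. */
static bool lazy_overflow(const LazyCpu *cpu)
{
    uint32_t a = cpu->last_a, b = cpu->last_b, r = cpu->last_res;
    return ((~(a ^ b)) & (a ^ r)) >> 31;
}
```

With this shape the common path pays for three stores, and only an l.addc or a read of SR would call lazy_carry() or lazy_overflow(). Whether that is acceptable with respect to the specification is exactly the open question.
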
>
> 2.
> The OpenRISC architecture does not have a dedicated register move
> instruction. Instead it uses either "l.addi rd, rs, 0" or "l.ori rd, rs, 0".
> We can check for these special cases and use a move instruction instead.
>
> 3.
> Define r0 as hardwired to zero. Then we can additionally treat the
> instructions "l.or rd, rx, r0", "l.addi rd, r0, x" and "l.ori rd, r0, x"
> as move instructions.
> This would add a little more logic to the code generator, but I am pretty
> sure the emulation speed would profit. The instructions above are very
> common.
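
For findings 2 and 3, the check in the decoder could look roughly like the sketch below. It is standalone C, not a patch against the QEMU translator. The opcode values 0x27 (l.addi) and 0x2a (l.ori) and the field positions are from my reading of the architecture manual and should be double-checked, and the names InsnKind, Decoded and classify are hypothetical. The l.or case is left out to keep it short.

```c
#include <assert.h>
#include <stdint.h>

/* Major opcodes (bits 31..26); values assumed from the OR1K manual. */
enum { OP_ADDI = 0x27, OP_ORI = 0x2a };

typedef enum { K_OTHER, K_MOV_REG, K_MOV_IMM } InsnKind;

typedef struct {
    InsnKind kind;
    unsigned rd, rs;   /* destination and source register numbers */
    uint32_t imm;      /* extended immediate, valid for K_MOV_IMM */
} Decoded;

/* Classify one instruction word; anything unrecognized stays K_OTHER. */
static InsnKind classify(uint32_t insn, Decoded *out)
{
    assert(out);
    unsigned op = insn >> 26;
    uint32_t imm16 = insn & 0xffffu;

    out->kind = K_OTHER;
    out->rd = (insn >> 21) & 0x1f;
    out->rs = (insn >> 16) & 0x1f;
    out->imm = 0;

    if (op != OP_ADDI && op != OP_ORI) {
        return out->kind;
    }
    if (imm16 == 0) {
        /* "l.addi rd, rs, 0" or "l.ori rd, rs, 0": plain register move. */
        out->kind = K_MOV_REG;
    } else if (out->rs == 0) {
        /* Source is r0 (assumed hardwired to zero): load the immediate.
         * l.addi sign-extends its immediate, l.ori zero-extends it. */
        out->kind = K_MOV_IMM;
        out->imm = (op == OP_ADDI) ? (imm16 ^ 0x8000u) - 0x8000u : imm16;
    }
    return out->kind;
}
```

The translator would then emit a single move for K_MOV_REG and K_MOV_IMM and fall back to the normal add/or path for K_OTHER.
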
>
> 4. The npc (next pc) and ppc (previous pc) variables are updated before
> every instruction.
> This is extremely inefficient. What makes things worse is that they are
> (to a very close approximation) *never* used.
> At the moment the npc is used in some way for delayed instructions, but I
> am sure we can get rid of them because every code fragment uses fixed
> virtual addresses.
>
> There is still a way to keep the registers with some special checks for
> l.mtspr and l.mfspr.
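
A small sketch of what finding 4 might amount to: since the pc of every instruction in a fragment is known when the fragment is translated, the stores to pc/npc can be limited to the instructions that can observe them, plus one at the end of the fragment. This is a toy model in plain C under simplifying assumptions (4-byte instructions, no delay slot handling, ppc omitted), and PcCpu, sync_pc and run_block are invented names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state: just the pc registers plus a store counter. */
typedef struct {
    uint32_t pc, npc;
    unsigned pc_stores;   /* how often pc/npc were written back */
} PcCpu;

/* Write the statically known pc back into the CPU state. */
static void sync_pc(PcCpu *cpu, uint32_t pc)
{
    cpu->pc = pc;
    cpu->npc = pc + 4;
    cpu->pc_stores++;
}

/* Run a straight-line fragment of n instructions starting at start_pc.
 * needs_pc[i] marks instructions that may read the pc registers or raise
 * an exception (for example l.mfspr); only those trigger a write-back. */
static void run_block(PcCpu *cpu, uint32_t start_pc,
                      const bool *needs_pc, unsigned n)
{
    assert(cpu && needs_pc);
    for (unsigned i = 0; i < n; i++) {
        uint32_t pc = start_pc + 4u * i;   /* constant at translation time */
        if (needs_pc[i]) {
            sync_pc(cpu, pc);
        }
        /* ... the instruction itself would execute here ... */
    }
    sync_pc(cpu, start_pc + 4u * n);       /* state is exact at block end */
}
```

For a four-instruction fragment with one l.mfspr this does two write-backs instead of four or more.
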
>
> 5. Check and branch instructions are independent at the moment.
> What makes things worse is that the sf... instructions set the branch flag
> and the next instruction reads it, every time with lots of and/or opcodes.
> Additionally the branch flag could be separated from the SR register to
> speed things up.
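
The second half of finding 5 could be sketched like this: keep the branch flag in its own variable so that sf... and l.bf touch one boolean, and fold it back into SR only when SR is actually read or written. Again this is illustrative C with invented names (FlagCpu, do_sfeq, do_bf, read_sr, write_sr); the position of the F bit (bit 9) is my understanding of the SR layout and should be checked against the manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SR_F (1u << 9)   /* branch flag bit in SR; position assumed */

/* Hypothetical state: SR is stored without its F bit, F lives separately. */
typedef struct {
    uint32_t sr_noflag;  /* SR with the F bit always clear */
    bool     flag;       /* the branch flag on its own */
} FlagCpu;

/* l.sfeq: a single store, no masking of SR. */
static void do_sfeq(FlagCpu *cpu, uint32_t a, uint32_t b)
{
    cpu->flag = (a == b);
}

/* l.bf: pick the next pc from one boolean test. */
static uint32_t do_bf(const FlagCpu *cpu, uint32_t taken, uint32_t fallthrough)
{
    return cpu->flag ? taken : fallthrough;
}

/* Reading SR (l.mfspr) reassembles the architectural value. */
static uint32_t read_sr(const FlagCpu *cpu)
{
    assert((cpu->sr_noflag & SR_F) == 0);
    return cpu->sr_noflag | (cpu->flag ? SR_F : 0u);
}

/* Writing SR (l.mtspr) splits the value again. */
static void write_sr(FlagCpu *cpu, uint32_t value)
{
    cpu->flag = (value & SR_F) != 0;
    cpu->sr_noflag = value & ~SR_F;
}
```

The and/or work moves from every compare and branch to the comparatively rare SR accesses.
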
>
> 6.
> As far as I can see, it is by design almost impossible to get an exception
> during a delayed instruction, except in single-step mode.
> I think it is possible to remove some logic here.
>
>
> Before I start writing even one patch I would like to know if there is a
> chance of getting them accepted, as they will very likely break the
> specification. In my opinion QEMU is for speed while or1ksim is for
> accuracy. At the moment QEMU is only a little bit faster than or1ksim, which
> makes it almost useless.

Hi Sebastian

Great detective work here.

I wouldn't mind if QEMU didn't support a particular subset of the
or1k instruction set, assuming it supports whatever the compiler is
kicking out, and a big warning is included with it. Do you think,
though, that over time an efficient implementation could be created?

What I expect QEMU could be good for is its Linux user mode where,
correct me if I'm wrong here, it's possible to run or1k-linux ELFs on
any system capable of running QEMU, as well as full system simulation
to run a full Linux-based system. Is this the goal?

Cheers

Julius
_______________________________________________
OpenRISC mailing list
[email protected]
http://lists.openrisc.net/listinfo/openrisc
