On Thu, Oct 17, 2013 at 4:54 AM, Sebastian Macke <[email protected]> wrote:
> Hi,
>
> after I played around a little bit with the QEMU OpenRISC emulator I
> realized that it is extremely slow in comparison to the i386 emulation.
> It is in fact around 45-60 times slower. I asked myself why and came to
> some conclusions which I summarize here.
>
> The first thing is of course the missing hardware tlb refill. And indeed
> after applying the patches from Stefan Kristiansson the speed gain was
> around a factor of 2-4.
>
> The other part I noticed is the tiny-code-generator (tcg). I made some
> tests which you can find in the attachment which do not involve tlb refill
> and do not involve little-endian big-endian swappings.
> For this case the speed is reduced by a factor of 10 in comparison to the
> i386 implementation.
>
> I have reviewed and compared the different code generators and summarized
> the optimizations we can perform.
> Unfortunately most of them would violate the specification. But only in rare
> cases. So rare that you have to do it on purpose with hand-written assembly
> code.
> The i386 code generator is also off specification by using the lazy flag
> optimization for example (http://qemu.weilnetz.de/qemu-tech.html). I am not
> sure if old hand-written assembler DOS games would run in QEMU.
>
> What you should know is that the code generator constructs small code
> fragments with some strict properties
> - The code is bound to an absolute virtual pc address.
> - This code fragment is never interrupted. It usually contains 5-20
>   opcodes.
> - The code fragment almost always ends with a jump or something similar
>   (bf, syscall, ...). There are no jumps in between.
>
> Here are the findings.
>
> 1.
> At the moment the overflow and carry flags are set every time.
> This is one of the biggest speed issues. Instead of one converted l.addi
> opcode we need more than 20.
> The best thing is: these flags are *never* used.
> I think the main purpose of these flags is to check after the overflow
> exception what happened.
> In my opinion we can remove them or only set them if the overflow exception
> flag is set.
> Additionally, I would strongly suggest that the overflow exception flag is
> checked statically instead of dynamically.
>
> I know the l.addc instruction is special. This instruction is never used, by
> the way. But there are ways to treat it if it follows directly after an
> l.add instruction.
>
> Such a patch would increase the speed by at least 30%.
>
> 2.
> The OpenRISC architecture does not have its own move register instruction.
> Instead it uses either "l.addi rd, rs, 0" or "l.ori rd, rs, 0".
> We can check for these special cases and use a move instruction instead.
>
> 3.
> Define r0 as hardwired to zero. Then we can define the instructions
> "l.or rd, rx, r0", "l.addi rd, r0, x" or "l.ori rd, r0, x"
> additionally as move instructions.
> This would add a little bit more logic to the code generator. But I am
> pretty sure the emulation speed would profit. The instructions above are
> very common.
>
> 4. The npc (next pc) and ppc (previous pc) variables are updated before
> every instruction
> This is extremely inefficient. And what makes things worse: they are *never*
> (to a very close approximation of never) used.
> At the moment the npc is used in some way for delayed instructions but I am
> sure we can get rid of them because every code fragment uses fixed virtual
> addresses.
>
> There is still a way to keep the registers with some special checks for
> l.mtspr and l.mfspr.
>
> 5. Check and branch instructions are independent at the moment
> What makes things worse: the sf... instructions set the branch flag and the
> next instruction reads it, every time with lots of and/or opcodes.
> Additionally the branch flag could be separated from the SR register to
> speed things up.
>
> 6.
> As far as I can see it is almost impossible to get an exception during a
> delayed instruction by design. Only in single-step mode.
> I think it is possible to remove some logic here.
>
>
> Before I start writing even one patch I would like to know if there is a
> chance to get them accepted, as they will very likely break the
> specification. In my opinion QEMU is for speed while or1ksim is for
> accuracy. At the moment QEMU is only a little bit faster than or1ksim, which
> makes it almost useless.
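
To make findings 2 and 3 in the quoted message concrete, here is a minimal
sketch of how a translator front end might recognise the move idioms before
emitting code. It is not QEMU code: the Insn and Move types, the Op and
MoveKind enums and classify_move() are all hypothetical names made up for
this illustration, and the sketch assumes the decoder has already extended
the immediate and that r0 always reads as zero. It also ignores any flag
side effects of l.addi, in line with finding 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded instruction; not QEMU's real data structures. */
typedef enum { OP_ADDI, OP_ORI, OP_OR, OP_OTHER } Op;
typedef enum { K_NONE, K_MOV_REG, K_MOV_IMM } MoveKind;

typedef struct {
    Op op;
    unsigned rd, ra, rb;   /* rb is only meaningful for OP_OR */
    int32_t imm;           /* assumed already extended by the decoder */
} Insn;

typedef struct {
    MoveKind kind;
    unsigned src;          /* source register for K_MOV_REG */
    int32_t imm;           /* constant for K_MOV_IMM */
} Move;

/* Classify an instruction as a register move, a constant load, or neither.
 * Assumes r0 is hardwired to zero (finding 3). */
static Move classify_move(const Insn *i)
{
    Move m = { K_NONE, 0, 0 };

    switch (i->op) {
    case OP_ADDI:
    case OP_ORI:
        if (i->ra == 0) {
            /* l.addi / l.ori rd, r0, x: rd = x */
            m.kind = K_MOV_IMM;
            m.imm = i->imm;
        } else if (i->imm == 0) {
            /* l.addi / l.ori rd, rs, 0: rd = rs (finding 2) */
            m.kind = K_MOV_REG;
            m.src = i->ra;
        }
        break;
    case OP_OR:
        if (i->rb == 0) {
            /* l.or rd, rx, r0: rd = rx */
            m.kind = K_MOV_REG;
            m.src = i->ra;
        } else if (i->ra == 0) {
            /* l.or rd, r0, rx: rd = rx */
            m.kind = K_MOV_REG;
            m.src = i->rb;
        }
        break;
    default:
        break;
    }
    return m;
}
```

A code generator could call classify_move() first and emit a single
register copy or constant load when the result is not K_NONE, falling back
to the general add/or path otherwise. Whether this pays off in practice
would still have to be measured.
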
Hi Sebastian

Great detective work here.

I wouldn't mind if QEMU didn't support a particular subset of the or1k
instruction set, assuming it supports whatever the compiler is kicking out,
and a big warning is included with it.

Do you think, though, that over time an efficient implementation could be
created?

What I expect QEMU could be good for is its Linux user mode where, correct
me if I'm wrong here, it's possible to run or1k-linux ELFs on any system
capable of running QEMU, as well as full-system simulation to run a complete
Linux-based system. Is this the goal?

Cheers

Julius
_______________________________________________
OpenRISC mailing list
[email protected]
http://lists.openrisc.net/listinfo/openrisc
