On 2015-09-10 19:48, Aurelien Jarno wrote:
> On 2015-09-01 22:51, Richard Henderson wrote:
> > I've been looking at this problem off and on for the last week or so,
> > prompted by the sparc performance work.  Although I haven't been able
> > to get a proper sparc64 guest install working, I see the exact same
> > problem with a mips guest.
> >
> > On alpha or x86, which seem to perform well, perf numbers for the
> > executable have about 30% of the execution time spent in cpu_exec.
> > For mips, on the other hand, we spend about 30% of the time in
> > routines related to tcg (re-)translation.
>
> Indeed the problem happens on CPUs which implement the MMU as a
> "software assisted TLB" (or any other marketing name), as opposed to
> a hardware page walk MMU. They can hold a limited number of TLB entries
> at a given time, and require the OS to do the page walk to refill the
> TLB. For that an exception is generated, and the faulting address has
> to be determined. That's where the TB retranslation takes place, and
> that's why it happens a lot more on these CPUs.
>
> A few years ago, I measured about 45% of the TB translation actually
> being retranslation for mips and 60% for SH4 for a standard workload.
> For comparison, these values are around 1% on i386 and around 5% on ARM.
>
> That's why each time we add an optimization to the optimizer, we get
> faster code, but we might lose because it takes longer to generate.
>
> > Aurelien has a patch in his own branches that attempts to mitigate this
> > on mips by shadow caching more tlb entries.  While this does improve
> > performance a bit, it employs a linear search through a large buffer,
> > with the effect of 30-ish % perf numbers for r4k_map_address.
> > (One could probably improve things by hashing the data in that array,
> > rather than a linear search, but...)
>
> Yes, that is just a workaround and probably highly workload dependent,
> that's why I never submitted it.
>
> > In the past we've talked about getting rid of retranslation entirely.
> > It's clever, but it certainly has its share of problems.  I gave it
> > a go this weekend.
>
> Really great that you have been able to implement that.
>
> > The following isn't quite right.  It fails to boot on sparc even with
> > our tiny test kernel.  It also triggers an abort on mips, eventually.
> > But it's able to get all the way through to a prompt, and in the
> > process I can see that perf results are quite different -- much more
> > like results I see for alpha.
> >
> > Thoughts on the approach?
>
> It looks like the approach we discussed with Paolo back in June:
>
>   http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg04885.html
>
> To me it looks like the right way to proceed, we just have to take care
> that the information to store does not take too much space compared to
> the actual translated code.
>
> I'll have a look and give it a test asap.

I haven't really reviewed the code yet, but I have been able to test
your tcg-search-2 branch.

First of all, I have tested half of the targets (alpha, arm, cris, i386,
mips, ppc, s390x, sh4 and sparc), and I haven't noticed any regression.
They now have more than 50 hours of uptime, and some of them have been
building stuff most of that time, so they are quite stable.

That said, I have only tested your branch on an x86-64 host, and it
might be a good idea to test it on one or two different host
architectures (I put that on my todo list, but no promises there).

On the performance side, I have done real measurements only on i386 and
mips. On i386, I haven't seen any measurable difference. On mips, the
boot time is unchanged, but some workloads are noticeably faster. The
best I have measured is on perl code, with a 2.4x improvement, while on
an average workload the gain is around 1.5x.

With all that said, you can get:

Tested-by: Aurelien Jarno <aurel...@aurel32.net>

I hope to give you the corresponding Reviewed-by in the next few days.

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurel...@aurel32.net                 http://www.aurel32.net