On 05/09/2014 14:11, Paolo Bonzini wrote:
>
>
> ----- Original Message -----
>> From: "Alexander Graf" <ag...@suse.de>
>> To: "Paolo Bonzini" <pbonz...@redhat.com>, qemu-devel@nongnu.org
>> Cc: dgib...@redhat.com, qemu-...@nongnu.org, tommu...@gmail.com
>> Sent: Friday, September 5, 2014 9:10:01
>> Subject: Re: [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
>>
>>
>>
>> On 28.08.14 19:14, Paolo Bonzini wrote:
>>> PowerPC TCG flushes the TLB on every IR/DR change, which basically
>>> means on every user<->kernel context switch.  Use the 6-element
>>> TLB array as a cache, where each MMU index is mapped to a different
>>> state of the IR/DR/PR/HV bits.
>>>
>>> This brings the number of TLB flushes down from ~900000 to ~50000
>>> for starting up the Debian installer, which is in line with x86
>>> and gives a ~10% performance improvement.
>>>
>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>>> ---
>>>  cputlb.c                    | 19 +++++++++++++++++
>>>  hw/ppc/spapr_hcall.c        |  6 +++++-
>>>  include/exec/exec-all.h     |  5 +++++
>>>  target-ppc/cpu.h            |  4 +++-
>>>  target-ppc/excp_helper.c    |  6 +-----
>>>  target-ppc/helper_regs.h    | 52 +++++++++++++++++++++++++++++++--------------
>>>  target-ppc/translate_init.c |  5 +++++
>>>  7 files changed, 74 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/cputlb.c b/cputlb.c
>>> index afd3705..17e1b03 100644
>>> --- a/cputlb.c
>>> +++ b/cputlb.c
>>> @@ -67,6 +67,25 @@ void tlb_flush(CPUState *cpu, int flush_global)
>>>      tlb_flush_count++;
>>>  }
>>>
>>> +void tlb_flush_idx(CPUState *cpu, int mmu_idx)
>>> +{
>>> +    CPUArchState *env = cpu->env_ptr;
>>> +
>>> +#if defined(DEBUG_TLB)
>>> +    printf("tlb_flush_idx %d:\n", mmu_idx);
>>> +#endif
>>> +    /* must reset current TB so that interrupts cannot modify the
>>> +       links while we are modifying them */
>>> +    cpu->current_tb = NULL;
>>> +
>>> +    memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[mmu_idx]));
>>> +    memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
>>> +
>>> +    env->tlb_flush_addr = -1;
>>> +    env->tlb_flush_mask = 0;
>>> +    tlb_flush_count++;
>>> +}
>>> +
>>>  static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
>>>  {
>>>      if (addr == (tlb_entry->addr_read &
>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>> index 467858c..b95961c 100644
>>> --- a/hw/ppc/spapr_hcall.c
>>> +++ b/hw/ppc/spapr_hcall.c
>>> @@ -556,13 +556,17 @@ static target_ulong h_cede(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>  {
>>>      CPUPPCState *env = &cpu->env;
>>>      CPUState *cs = CPU(cpu);
>>> +    bool flush;
>>>
>>>      env->msr |= (1ULL << MSR_EE);
>>> -    hreg_compute_hflags(env);
>>> +    flush = hreg_compute_hflags(env);
>>>      if (!cpu_has_work(cs)) {
>>>          cs->halted = 1;
>>>          cs->exception_index = EXCP_HLT;
>>>          cs->exit_request = 1;
>>> +    } else if (flush) {
>>> +        cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
>>> +        cs->exit_request = 1;
>>
>> Can this ever happen?
>
> No, I think it can't.
>
>> Ok, so this basically changes the semantics of mmu_idx from a static
>> array with predefined meanings to a dynamic array with runtime changing
>> semantics.
>>
>> The first thing that comes to mind here is why we're not just extending
>> the existing array? After all, we have 4 bits -> 16 states minus one for
>> PR+HV. Can our existing logic not deal with this?
>
> Yeah, that would require 12 MMU indices.  Right now, include/exec/cpu_ldst.h
> only supports 6, but that's easy to extend.
>
> tlb_flush becomes progressively more expensive as you add more MMU modes,
> but it may work.  This patch removes 98.8% of the TLB flushes, makes the
> remaining ones twice as slow (NB_MMU_MODES goes from 3 to 6), and speeds
> up QEMU by 10%.  You can solve this:
>
>     0.9 = 0.988 * 0 + 0.012 * tlb_time * 2 + (1 - tlb_time) * 1
>     tlb_time = 0.1 / 0.976 = 0.102
>
> to compute that the time spent in TLB flushes before the patch is 10.2% of
> the whole emulation time.
>
> Doubling the NB_MMU_MODES further from 6 to 12 would still save 98.8% of
> the TLB flushes, while making the remaining ones even more expensive.  The
> savings will be smaller, but actually not by much:
>
>     0.988 * 0 + 0.012 * tlb_time * 4 + (1 - tlb_time) * 1 = 0.903
>
> i.e. what you propose would still save 9.7%.  Still, having 12 modes seemed
> like a waste, since only 4 or 5 are used in practice...
The 12 MMU modes work just fine.  Thanks for the suggestion!

Paolo