> On 09.09.2014 at 18:42, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
> On 05/09/2014 14:11, Paolo Bonzini wrote:
>>
>>
>> ----- Original message -----
>>> From: "Alexander Graf" <ag...@suse.de>
>>> To: "Paolo Bonzini" <pbonz...@redhat.com>, qemu-devel@nongnu.org
>>> Cc: dgib...@redhat.com, qemu-...@nongnu.org, tommu...@gmail.com
>>> Sent: Friday, 5 September 2014 9:10:01
>>> Subject: Re: [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
>>>
>>>
>>>
>>>> On 28.08.14 19:14, Paolo Bonzini wrote:
>>>> PowerPC TCG flushes the TLB on every IR/DR change, which basically
>>>> means on every user<->kernel context switch.  Use the 6-element
>>>> TLB array as a cache, where each MMU index is mapped to a different
>>>> state of the IR/DR/PR/HV bits.
>>>>
>>>> This brings the number of TLB flushes down from ~900000 to ~50000
>>>> for starting up the Debian installer, which is in line with x86
>>>> and gives a ~10% performance improvement.
>>>>
>>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>>>> ---
>>>>  cputlb.c                    | 19 +++++++++++++++++
>>>>  hw/ppc/spapr_hcall.c        |  6 +++++-
>>>>  include/exec/exec-all.h     |  5 +++++
>>>>  target-ppc/cpu.h            |  4 +++-
>>>>  target-ppc/excp_helper.c    |  6 +-----
>>>>  target-ppc/helper_regs.h    | 52 +++++++++++++++++++++++++++++++--------------
>>>>  target-ppc/translate_init.c |  5 +++++
>>>>  7 files changed, 74 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/cputlb.c b/cputlb.c
>>>> index afd3705..17e1b03 100644
>>>> --- a/cputlb.c
>>>> +++ b/cputlb.c
>>>> @@ -67,6 +67,25 @@ void tlb_flush(CPUState *cpu, int flush_global)
>>>>      tlb_flush_count++;
>>>>  }
>>>>
>>>> +void tlb_flush_idx(CPUState *cpu, int mmu_idx)
>>>> +{
>>>> +    CPUArchState *env = cpu->env_ptr;
>>>> +
>>>> +#if defined(DEBUG_TLB)
>>>> +    printf("tlb_flush_idx %d:\n", mmu_idx);
>>>> +#endif
>>>> +    /* must reset current TB so that interrupts cannot modify the
>>>> +       links while we are modifying them */
>>>> +    cpu->current_tb = NULL;
>>>> +
>>>> +    memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[mmu_idx]));
>>>> +    memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
>>>> +
>>>> +    env->tlb_flush_addr = -1;
>>>> +    env->tlb_flush_mask = 0;
>>>> +    tlb_flush_count++;
>>>> +}
>>>> +
>>>>  static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
>>>>  {
>>>>      if (addr == (tlb_entry->addr_read &
>>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>>> index 467858c..b95961c 100644
>>>> --- a/hw/ppc/spapr_hcall.c
>>>> +++ b/hw/ppc/spapr_hcall.c
>>>> @@ -556,13 +556,17 @@ static target_ulong h_cede(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>>>  {
>>>>      CPUPPCState *env = &cpu->env;
>>>>      CPUState *cs = CPU(cpu);
>>>> +    bool flush;
>>>>
>>>>      env->msr |= (1ULL << MSR_EE);
>>>> -    hreg_compute_hflags(env);
>>>> +    flush = hreg_compute_hflags(env);
>>>>      if (!cpu_has_work(cs)) {
>>>>          cs->halted = 1;
>>>>          cs->exception_index = EXCP_HLT;
>>>>          cs->exit_request = 1;
>>>> +    } else if (flush) {
>>>> +        cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
>>>> +        cs->exit_request = 1;
>>>
>>> Can this ever happen?
>>
>> No, I think it can't.
>>
>>> Ok, so this basically changes the semantics of mmu_idx from a static
>>> array with predefined meanings to a dynamic array with runtime-changing
>>> semantics.
>>>
>>> The first thing that comes to mind here is why we're not just extending
>>> the existing array?  After all, we have 4 bits -> 16 states minus one for
>>> PR+HV.  Can our existing logic not deal with this?
>>
>> Yeah, that would require 12 MMU indices.
>> Right now, include/exec/cpu_ldst.h only supports 6, but that's easy
>> to extend.
>>
>> tlb_flush becomes progressively more expensive as you add more MMU modes,
>> but it may work.  This patch removes 98.8% of the TLB flushes, makes the
>> remaining ones twice as slow (NB_MMU_MODES goes from 3 to 6), and speeds
>> up QEMU by 10%.  You can solve this:
>>
>>     0.9 = 0.988 * 0 + 0.012 * tlb_time * 2 + (1 - tlb_time) * 1
>>     tlb_time = 0.1 / 0.976 = 0.102
>>
>> to compute that the time spent in TLB flushes before the patch is 10.2%
>> of the whole emulation time.
>>
>> Doubling NB_MMU_MODES again, from 6 to 12, would still save 98.8% of the
>> TLB flushes, while making the remaining ones even more expensive.  The
>> savings will be smaller, but actually not by much:
>>
>>     0.988 * 0 + 0.012 * tlb_time * 4 + (1 - tlb_time) * 1 = 0.903
>>
>> i.e. what you propose would still save 9.7%.  Still, having 12 modes
>> seemed like a waste, since only 4 or 5 are used in practice...
>
> The 12 MMU modes work just fine.  Thanks for the suggestion!
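
For anyone who wants to sanity-check the arithmetic above, a small
standalone C program (purely illustrative, not part of QEMU) reproduces
Paolo's numbers:

    #include <stdio.h>

    int main(void)
    {
        /* The patch removes 98.8% of TLB flushes, doubles the cost of
         * the remaining 1.2%, and the measured runtime drops to 0.9 of
         * the original:  0.9 = 0.012 * t * 2 + (1 - t).
         * Solve for t, the fraction of time spent flushing the TLB. */
        double t = 0.1 / (1.0 - 2 * 0.012);
        printf("tlb_time = %.3f\n", t);                    /* ~0.102 */

        /* With 12 MMU modes, each remaining flush is ~4x as expensive. */
        double runtime = 0.012 * t * 4 + (1.0 - t);
        printf("runtime with 12 modes = %.3f\n", runtime); /* ~0.902 */
        return 0;
    }

Compiled with any C compiler, it prints tlb_time = 0.102 and a 12-mode
runtime of about 0.90, in line with the 0.903 figure quoted above (the
small difference is rounding).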
Awesome! While slightly more wasteful than the dynamic approach, I'm sure
it's a lot easier to understand and debug :).


Alex

>
> Paolo
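
For reference, here is a sketch of how the static mapping Alex suggested
could look: the four MSR bits become a fixed MMU index into a 12-entry
mode array, so the index for a given MSR value never changes and no TLB
flush is needed on IR/DR transitions at all.  The helper name and exact
encoding are illustrative assumptions, not the code that was actually
merged into QEMU; only the MSR_* bit positions are taken from
target-ppc/cpu.h.

    #include <stdint.h>

    /* MSR bit positions, matching the definitions in target-ppc/cpu.h. */
    enum { MSR_DR = 4, MSR_IR = 5, MSR_PR = 14, MSR_HV = 60 };

    /*
     * Illustrative sketch: map the four MSR bits to a static MMU index.
     * Of the 16 raw IR/DR/PR/HV states, the four with PR=1 and HV=1 are
     * excluded, leaving 4 (IR/DR) * 3 (PR/HV) = 12 indices, so
     * NB_MMU_MODES would be 12.
     */
    static int mmu_idx_from_msr(uint64_t msr)
    {
        int ir = (msr >> MSR_IR) & 1;    /* instruction relocation */
        int dr = (msr >> MSR_DR) & 1;    /* data relocation */
        int pr = (msr >> MSR_PR) & 1;    /* problem (user) state */
        int hv = (msr >> MSR_HV) & 1;    /* hypervisor state */

        /* 0 = kernel, 1 = user, 2 = hypervisor; PR+HV never occurs here. */
        int priv = pr ? 1 : (hv ? 2 : 0);

        return (ir | (dr << 1)) * 3 + priv;    /* yields 0..11 */
    }

The trade-off is the larger NB_MMU_MODES making each remaining tlb_flush
more expensive, which by Paolo's arithmetic above (0.903 vs. 0.9) costs
only about 0.3% of total runtime.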