Emilio G. Cota <c...@braap.org> writes:

> On Mon, Jul 04, 2016 at 12:45:52 +0100, Alex Bennée wrote:
>>
>> Emilio G. Cota <c...@braap.org> writes:
>>
>> > On Fri, Jul 01, 2016 at 17:16:10 +0100, Alex Bennée wrote:
>> >> Lock contention in the hot path of moving between existing patched
>> >> TranslationBlocks is the main drag in multithreaded performance. This
>> >> patch pushes the tb_lock() usage down to the two places that really
>> >> need it:
>> >>
>> >>   - code generation (tb_gen_code)
>> >>   - jump patching (tb_add_jump)
>> >>
>> >> The rest of the code doesn't really need to hold a lock as it is
>> >> either using per-CPU structures, atomically updated or designed to be
>> >> used in concurrent read situations (qht_lookup).
>> >>
>> >> To keep things simple I removed the #ifdef CONFIG_USER_ONLY stuff as
>> >> the locks become NOPs anyway until the MTTCG work is completed.
>> >
>> > From a scalability point of view it would be better to have a single
>> > critical section.
>>
>> You mean merge the critical region for patching and code-generation?
>
> Yes, I'd keep the lock held and drop it (if it was held) after the
> patching is done, like IIRC we used to do:
> (snip)
>> > I propose to just extend the critical section, like we used to
>> > do with tcg_lock_reset.
Hmm, so I came up with this:

/*
 * Patch the last TB with a jump to the current TB.
 *
 * Modification of the TB has to be protected with tb_lock which can
 * either be already held or taken here.
 */
static inline void maybe_patch_last_tb(CPUState *cpu,
                                       TranslationBlock *tb,
                                       TranslationBlock **last_tb,
                                       int tb_exit, bool locked)
{
    if (cpu->tb_flushed) {
        /* Ensure that no TB jump will be modified as the
         * translation buffer has been flushed. */
        *last_tb = NULL;
        cpu->tb_flushed = false;
    }
#ifndef CONFIG_USER_ONLY
    /* We don't take care of direct jumps when address mapping changes in
     * system emulation. So it's not safe to make a direct jump to a TB
     * spanning two pages because the mapping for the second page can
     * change. */
    if (tb->page_addr[1] != -1) {
        *last_tb = NULL;
    }
#endif
    /* See if we can patch the calling TB. */
    if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
        if (!locked) {
            tb_lock();
        }
        tb_add_jump(*last_tb, tb_exit, tb);
        if (!locked) {
            tb_unlock();
        }
    }
}

/*
 * tb_find - find next TB, possibly generating it
 *
 * There is a multi-level lookup for finding the next TB which avoids
 * locks unless generation is required.
 *
 * 1. Lookup via the per-vcpu tb_jmp_cache
 * 2. Lookup via tb_find_physical (using QHT)
 *
 * If both of those fail then we need to grab the mmap_lock and
 * tb_lock and do code generation.
 *
 * As the jump patching of code also needs to be protected by locks we
 * have multiple paths into maybe_patch_last_tb taking advantage of
 * the fact we may already have locks held for code generation.
 */
static TranslationBlock *tb_find(CPUState *cpu,
                                 TranslationBlock **last_tb,
                                 int tb_exit)
{
    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
    TranslationBlock *tb;
    target_ulong cs_base, pc;
    unsigned int h;
    uint32_t flags;

    /* We record a subset of the CPU state. It will always be the same
     * before a given translated block is executed. */
    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
    h = tb_jmp_cache_hash_func(pc);
    tb = atomic_read(&cpu->tb_jmp_cache[h]);
    if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                 tb->flags != flags)) {
        /* Ensure that we won't find a TB in the shared hash table
         * if it is being invalidated by some other thread.
         * Otherwise we'd put it back to CPU's local cache.
         * Pairs with smp_wmb() in tb_phys_invalidate(). */
        smp_rmb();
        tb = tb_find_physical(cpu, pc, cs_base, flags);
        if (!tb) {
            /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
             * taken outside tb_lock. As system emulation is currently
             * single threaded the locks are NOPs. */
            mmap_lock();
            tb_lock();

            /* There's a chance that our desired tb has been translated
             * while taking the locks so we check again inside the lock. */
            tb = tb_find_physical(cpu, pc, cs_base, flags);
            if (!tb) {
                /* if no translated code available, then translate it now */
                tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
            }

            maybe_patch_last_tb(cpu, tb, last_tb, tb_exit, true);

            tb_unlock();
            mmap_unlock();
        } else {
            maybe_patch_last_tb(cpu, tb, last_tb, tb_exit, false);
        }

        /* We update the TB in the virtual pc hash table for fast lookup */
        atomic_set(&cpu->tb_jmp_cache[h], tb);
    } else {
        maybe_patch_last_tb(cpu, tb, last_tb, tb_exit, false);
    }

    return tb;
}

But it doesn't seem to make much difference to the microbenchmark and I
think it makes the code a lot trickier to follow. Is it worth it?

--
Alex Bennée