On Wed, Aug 17, 2016 at 13:58:00 -0400, Emilio G. Cota wrote:
> due to my glaring lack of TCG competence.

A related note that might be of interest.

I benchmarked an alternative implementation that *does* instrument
stores. I wrapped every tcg_gen_qemu_st_i64 with a pre/post pair of
helpers. (Those are enough, right? tcg_gen_st_i64 calls are stores to
host memory, which I presume are not "explicit" guest stores and
therefore would not go through the soft TLB.)

These helpers first check a bitmap indexed by a masked subset of the
physical address of the access and, if the bit is set, then check a QHT
with the full physaddr. If an entry exists, they lock/unlock the entry's
spinlock around the store, so that no race is possible with an ongoing
atomic (atomics always take their corresponding lock).

Overhead is not too bad over cmpxchg, but most of it comes from the
helpers; see these numbers for SPEC:
(NB. the "QEMU" baseline does *not* include QHT for tb_htable and
therefore takes tb_lock around tb_find_fast; that's why it's so slow.)

  http://imgur.com/a/SoSHQ

"QHT only" means a QHT lookup is performed on every guest store. The win
of having the bitmap before hitting the QHT is quite large.

I wonder if things could be sped up further by performing the bitmap
check in TCG code. Would that be worth exploring? If so, any help on
that would be appreciated (i386 host at least). I tried, but I'm way out
of my element.

		E.
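
For readers following along, the pre/post helper scheme described above can be sketched roughly as below. This is a minimal, self-contained illustration and not the benchmarked patch: every name (filter_test, track_atomic, store_pre, store_post, guarded_store64) is hypothetical, the bitmap size and the choice of address bits are assumptions, and a toy linear-probing table stands in for the QHT.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes; the real bitmap/QHT parameters are not in the mail. */
#define FILTER_BITS  16
#define FILTER_SLOTS (1u << FILTER_BITS)
#define TABLE_SIZE   256              /* toy stand-in for the QHT */

typedef struct {
    uint64_t    paddr;                /* full physical address (lookup key) */
    bool        used;
    atomic_flag lock;                 /* taken by atomics and tracked stores */
} AtomicEntry;

static uint8_t     filter[FILTER_SLOTS / 8];
static AtomicEntry table[TABLE_SIZE];

/* Masked subset of the address; dropping the low 3 bits is an assumption. */
static uint32_t filter_slot(uint64_t paddr)
{
    return (uint32_t)(paddr >> 3) & (FILTER_SLOTS - 1);
}

static bool filter_test(uint64_t paddr)
{
    uint32_t s = filter_slot(paddr);
    return (filter[s / 8] >> (s % 8)) & 1;
}

/* Second-level lookup keyed on the full physaddr (linear probing). */
static AtomicEntry *table_lookup(uint64_t paddr)
{
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        AtomicEntry *e = &table[((paddr >> 3) + i) % TABLE_SIZE];
        if (!e->used) {
            return NULL;
        }
        if (e->paddr == paddr) {
            return e;
        }
    }
    return NULL;
}

/* Register an address used by an atomic. Not thread-safe in this sketch. */
static AtomicEntry *track_atomic(uint64_t paddr)
{
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        AtomicEntry *e = &table[((paddr >> 3) + i) % TABLE_SIZE];
        if (e->used && e->paddr == paddr) {
            return e;
        }
        if (!e->used) {
            uint32_t s = filter_slot(paddr);
            e->paddr = paddr;
            e->used = true;
            atomic_flag_clear(&e->lock);
            filter[s / 8] |= (uint8_t)(1u << (s % 8));
            return e;
        }
    }
    return NULL;                      /* table full */
}

/* Pre-store helper: cheap bitmap test first, table lookup only on a hit. */
static AtomicEntry *store_pre(uint64_t paddr)
{
    if (!filter_test(paddr)) {
        return NULL;                  /* common case: nothing to lock */
    }
    AtomicEntry *e = table_lookup(paddr);
    if (e) {
        while (atomic_flag_test_and_set_explicit(&e->lock,
                                                 memory_order_acquire)) {
            /* spin until the ongoing atomic releases the entry */
        }
    }
    return e;
}

/* Post-store helper: release the lock if store_pre took one. */
static void store_post(AtomicEntry *e)
{
    if (e) {
        atomic_flag_clear_explicit(&e->lock, memory_order_release);
    }
}

/* The store itself, bracketed by the helper pair. */
static void guarded_store64(uint64_t *host, uint64_t paddr, uint64_t val)
{
    AtomicEntry *e = store_pre(paddr);
    *host = val;
    store_post(e);
}
```

In this sketch an untracked address costs one bitmap test; only a bitmap hit pays for the table lookup, and a bitmap false positive (two addresses sharing a slot) is resolved by the full-address comparison in the table.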