On Fri, Jun 15, 2012 at 12:30 AM, Lluís Vilanova <vilan...@ac.upc.edu> wrote:
[...]
> Now that I think of it, you will have problems generating code to surround each
> qemu_ld/st with a lightweight mechanism to get the time. In x86 it would be
> rdtsc, but you want to generate a host rdtsc instruction inside the code
> generated by QEMU's TCG, so you should also have to hack TCG (or the code
> generation pointers) to issue an rdtsc instruction.

Even rdtsc would introduce enough noise that it wouldn't be reliable for such
a micro-measurement: as far as I understand it, this instruction can be
reordered by the CPU, so you need to serialize the pipeline before issuing it.
Intel has a document about that:

download.intel.com/embedded/software/IA/324264.pdf

The overhead of their proposed method is so high that it would likely take
longer than the execution of the fast path itself.

IMHO a mix of YeongKyoon Lee's way of counting ld/st and a comparison between
user mode and softmmu still seems to be the best approach (well, unless you
have access to a cycle-accurate simulator :-).

Laurent
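
For illustration, here is a minimal sketch of the kind of serialized
measurement the Intel document describes (CPUID before RDTSC at the start,
RDTSCP followed by CPUID at the end), assuming an x86-64 host and GCC or Clang
inline assembly. The helper names tsc_begin, tsc_end and min_timing_overhead
are hypothetical, and zeroing EAX before CPUID is an assumption of this sketch;
none of this is QEMU or TCG code. Timing an empty region gives a rough idea of
the cost of the measurement itself:

```c
#include <stdint.h>

/* Start timestamp: CPUID serializes, so earlier instructions cannot
 * drift past the RDTSC that follows it. */
static inline uint64_t tsc_begin(void)
{
    uint32_t hi, lo;
    __asm__ volatile("xor %%eax, %%eax\n\t"   /* CPUID leaf 0 (assumption) */
                     "cpuid\n\t"
                     "rdtsc\n\t"
                     "mov %%edx, %0\n\t"
                     "mov %%eax, %1\n\t"
                     : "=r"(hi), "=r"(lo)
                     :
                     : "rax", "rbx", "rcx", "rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* End timestamp: RDTSCP waits for earlier instructions to execute, and
 * the trailing CPUID keeps later ones from starting before the read. */
static inline uint64_t tsc_end(void)
{
    uint32_t hi, lo;
    __asm__ volatile("rdtscp\n\t"
                     "mov %%edx, %0\n\t"
                     "mov %%eax, %1\n\t"
                     "xor %%eax, %%eax\n\t"   /* CPUID leaf 0 (assumption) */
                     "cpuid\n\t"
                     : "=r"(hi), "=r"(lo)
                     :
                     : "rax", "rbx", "rcx", "rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Smallest cycle count observed for an empty timed region, i.e. a
 * lower bound on the overhead of the measurement itself. */
uint64_t min_timing_overhead(int iterations)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < iterations; i++) {
        uint64_t start = tsc_begin();
        /* nothing to measure: only the timing overhead remains */
        uint64_t end = tsc_end();
        uint64_t delta = end - start;
        if (delta < best)
            best = delta;
    }
    return best;
}
```

Calling min_timing_overhead(1000) and comparing the result with the handful of
instructions in a qemu_ld/st fast path shows why this kind of per-access
timing is dominated by its own overhead. The minimum is used rather than the
mean so that interrupts and thread migration do not inflate the figure.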