On 01/16/2011 10:01 AM, Raphaël Lefèvre wrote:
> On Sun, Jan 16, 2011 at 11:21 PM, Stefano Bonifazi
> <stefboombas...@gmail.com> wrote:
> 2. "how can I check the number of target cpu cycles or target
> instructions executed inside qemu-user (i.e. qemu-ppc)?
> Is there any variable I can inspect for such informations?" at Dec, 2010

Keep in mind I'm a bit rusty and not an expert, but I'll take a stab at answering:

You can't, because QEMU doesn't work that way. QEMU isn't an instruction-level emulator; it's closer to a Java JIT. It doesn't translate one instruction at a time but instead translates large blocks of code all at once, and keeps a cache of translated blocks around. Execution jumps into each block and either waits for it to exit again (meaning it jumped out of that page and QEMU's main execution loop has to look up what page to execute next, possibly translating it first if it's not in the cache yet), or else QEMU interrupts it after a while to fake an IRQ of some kind (such as a timer interrupt).

You may want to read Fabrice Bellard's original paper on the QEMU design:

http://www.usenix.org/event/usenix05/tech/freenix/full_papers/bellard/bellard.pdf

Since that was written, dyngen was replaced with tcg, but that does the same thing in a slightly different way.

Building a QEMU with dyngen support used the host compiler to compile chunks of code corresponding to the target operations it would see at runtime, then stripped the machine language out of the resulting .o files and saved it in a table. At runtime, dyngen could then generate translated pages by gluing together the saved machine language snippets the host compiler had produced when qemu was built.

The problem was, beating the right kind of machine language snippets out of the .o files the compiler produced from the example code turned out to be VERY COMPILER DEPENDENT. This is why you couldn't build qemu with gcc 4.x for the longest time: gcc's code generator and the layout of the .o files changed in a bunch of subtle ways, which broke dyngen's ability to extract usable machine code snippets and put 'em into the table so it could translate pages at runtime.

TCG stands for "Tiny Code Generator". It just hardwires a code generator into QEMU.
They wrote a mini-compiler in C, which knows what instructions to output for each host qemu supports. If QEMU understands target instructions well enough to _read_ them, it's not a big stretch to be able to _write_ them when running on that kind of host. (It's more or less the same operation in reverse.) This means that QEMU can no longer run on a type of host it can't execute target code for, but the solution is to just add support for all the interesting machines out there, on both sides.

So, when QEMU executes code, the virtual MMU faults a new page into the virtual TLB, and goes "I can't execute this, fix it up!" The fixup handler looks for a translation of the page in the cache of translated pages, and if it can't find one it calls the translator to convert the target code into a page of corresponding host code. That may involve discarding an existing entry out of the cache, but this is how instruction caches work on real hardware too, so the delays in QEMU are where they'd be on real hardware, and optimizing for one is pretty close to optimizing for the other, so life is good.

The chunk you found earlier is a function pointer typecast:

    #define tcg_qemu_tb_exec(tb_ptr) \
        ((long REGPARM (*)(void *))code_gen_prologue)(tb_ptr)

Which casts code_gen_prologue to a pointer to a function taking a void * and returning a long, and then calls it with tb_ptr as its argument. That calls a translated page, and when the function returns, that means the page of code needs to jump to code somewhere outside of that page, and we go back to the main loop to figure out where to go next.

QEMU is as fast as it is because once it has a page of translated code, actually _running_ it is entirely native. It jumps into the page, and executes natively until it leaves the page. Control only goes back to QEMU to switch pages or to handle I/O and interrupts and such.

So when you ask "how many clock cycles did that instruction take", the answer is "it doesn't work that way".
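
To make the shape of that main loop concrete, here is a toy sketch in C. Everything in it is made up for illustration: toy_cpu, translate(), run_guest(), tb_exec() and the two hand-written "translated" blocks are hypothetical names, not QEMU's actual code, and the "translator" just hands back pre-written C functions instead of generating machine code. The only point is that the loop sees block entries and exits, so blocks are the thing it could naturally count, not instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: all names here are hypothetical, not QEMU's API. */

#define CACHE_SLOTS 16
#define GUEST_HALT  UINT32_MAX

/* Generic function pointer, cast to the real signature at call time,
 * in the same spirit as the tcg_qemu_tb_exec() macro. */
typedef void (*generic_fn)(void);
#define tb_exec(code, cpu) (((uint32_t (*)(void *))(code))(cpu))

struct toy_cpu { uint32_t counter; uint32_t limit; };

struct cache_entry { uint32_t guest_pc; generic_fn code; };

struct stats { unsigned long translations; unsigned long block_executions; };

/* "Translated" block for guest address 0: bump a counter and loop.
 * Returns the guest address to continue at. */
static uint32_t block_loop(void *cpu_state)
{
    struct toy_cpu *cpu = cpu_state;
    cpu->counter++;
    return cpu->counter < cpu->limit ? 0u : 4u;
}

/* "Translated" block for guest address 4: stop the guest. */
static uint32_t block_exit(void *cpu_state)
{
    (void)cpu_state;
    return GUEST_HALT;
}

/* Stand-in for the translator: maps a guest address to host code. */
static generic_fn translate(uint32_t guest_pc)
{
    switch (guest_pc) {
    case 0:  return (generic_fn)block_loop;
    case 4:  return (generic_fn)block_exit;
    default: return NULL;
    }
}

/* Main loop: look up the block in the cache, translate on a miss,
 * jump in, and only regain control when the block returns. */
struct stats run_guest(struct toy_cpu *cpu, uint32_t pc)
{
    struct cache_entry cache[CACHE_SLOTS] = {{0, NULL}};
    struct stats st = {0, 0};

    while (pc != GUEST_HALT) {
        struct cache_entry *e = &cache[(pc / 4) % CACHE_SLOTS];
        if (e->code == NULL || e->guest_pc != pc) {
            e->guest_pc = pc;            /* evicts whatever was in the slot */
            e->code = translate(pc);
            assert(e->code != NULL);     /* toy guest has only two blocks */
            st.translations++;
        }
        pc = tb_exec(e->code, cpu);      /* runs "natively" until it exits */
        st.block_executions++;
    }
    return st;
}
```

Calling run_guest() on a toy_cpu with counter 0 and limit 5, starting at address 0, translates two blocks and then executes six, no matter how many guest instructions each block happened to contain.
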
QEMU emulates at the memory page level (generally 4k of target code), not at the individual instruction level.

(Oh, and the worst thing you can do to QEMU from a performance perspective is self-modifying code, because the virtual MMU has to strip the executable bit off the TLB entry and re-translate the entire page next time something tries to execute it. It _works_, it's just slow. But again, real hardware can hiccup a bit on this too.)

Does that answer your question?

Rob