Re: [Qemu-devel] TCG flow vs dyngen

Rob Landley Sun, 23 Jan 2011 13:50:40 -0800

On 01/16/2011 10:01 AM, Raphaël Lefèvre wrote:
> On Sun, Jan 16, 2011 at 11:21 PM, Stefano Bonifazi
> <stefboombas...@gmail.com> wrote:
> 2. "how can I check the number of target cpu cycles or target
> instructions executed inside qemu-user (i.e. qemu-ppc)?
> Is there any variable I can inspect for such informations?" at Dec, 2010


Keep in mind I'm a bit rusty and not an expert, but I'll give a stab at
answering:

You can't, because QEMU doesn't work that way.  QEMU isn't an
instruction level emulator, it's closer to a Java JIT.  It doesn't
translate one instruction at a time but instead translates large blocks
of code all at once, and keeps a cache of translated blocks around.
Execution jumps into each block and either waits for it to exit again
(meaning it jumped out of that page and QEMU's main execution loop has
to look up what page to execute next, possibly translating it first if
it's not in the cache yet), or else QEMU interrupts it after a while to
fake an IRQ of some kind (such as a timer interrupt).
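
A toy sketch of that loop in C might make it more concrete.  Everything
here is made up for illustration: the three-instruction guest ISA, the
struct names, and the direct-mapped cache are assumptions, not QEMU's
actual data structures, and a "translated block" is just a precomputed
summary rather than real host machine code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy guest ISA, not a real QEMU target. */
enum op { OP_ADD, OP_JMP, OP_HALT };
struct insn { enum op op; int arg; };

/* Toy stand-in for a chunk of generated host code. */
struct tblock {
    int valid;
    size_t guest_pc;  /* guest address this block was translated from */
    int acc_delta;    /* net effect of every ADD in the block */
    long next_pc;     /* guest address to run next, or -1 to halt */
};

#define CACHE_SLOTS 8
static struct tblock cache[CACHE_SLOTS];
static int translations;  /* counts how often the translator ran */

/* Translate from pc up to the next JMP or HALT.  Assumes the guest
 * code always ends a block with one of those two. */
static struct tblock *translate(const struct insn *code, size_t pc)
{
    /* Direct-mapped: this may throw away an older translation. */
    struct tblock *tb = &cache[pc % CACHE_SLOTS];

    tb->valid = 1;
    tb->guest_pc = pc;
    tb->acc_delta = 0;
    for (;; pc++) {
        if (code[pc].op == OP_ADD) {
            tb->acc_delta += code[pc].arg;
        } else if (code[pc].op == OP_JMP) {
            tb->next_pc = code[pc].arg;
            break;
        } else {
            tb->next_pc = -1;
            break;
        }
    }
    translations++;
    return tb;
}

/* Main loop: look up the block, translate on a miss, run it, repeat.
 * Stops on HALT or after max_blocks block executions. */
int run_guest(const struct insn *code, size_t entry_pc, int max_blocks)
{
    int acc = 0;
    long pc = (long)entry_pc;

    assert(code != NULL);
    while (pc >= 0 && max_blocks-- > 0) {
        struct tblock *tb = &cache[(size_t)pc % CACHE_SLOTS];

        if (!tb->valid || tb->guest_pc != (size_t)pc)
            tb = translate(code, (size_t)pc);
        acc += tb->acc_delta;  /* "execute" the whole block at once */
        pc = tb->next_pc;
    }
    return acc;
}
```

Running a two-block guest loop for four block executions only calls the
translator twice; the other two executions hit the cache.  Note that the
loop never sees individual guest instructions go by, only whole blocks.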

You may want to read Fabrice Bellard's original paper on the QEMU design:

http://www.usenix.org/event/usenix05/tech/freenix/full_papers/bellard/bellard.pdf

Since that was written, dyngen was replaced with tcg, but that does the
same thing in a slightly different way.

Building a QEMU with dyngen support used to use the host compiler to
compile chunks of code corresponding to the target operations it would
see at runtime, and then strip the machine language out of the resulting
.o files and save them in a table.  Then at runtime dyngen could
generate translated pages by gluing together the resulting saved machine
language snippets the host compiler had produced when qemu was built.
The problem was, beating the right kind of machine language snippets out
of the .o files the compiler produced from the example code turned out
to be VERY COMPILER DEPENDENT.  This is why you couldn't build qemu with
gcc 4.x for the longest time: gcc's code generator and the layout of the
.o files changed in a bunch of subtle ways which broke dyngen's ability
to extract usable machine code snippets to put 'em into the table so it
could translate pages at runtime.
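
Very roughly, the "gluing" step looks like the sketch below.  This is
not dyngen's code: the op names, the snippet table, and glue_snippets()
are all hypothetical, and the snippet bytes are placeholders rather than
real machine code.  As far as I remember the real thing also had to
patch operands and relocations into each copy, which this leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical target operations; real dyngen had many more. */
enum toy_op { TOY_OP_A, TOY_OP_B, TOY_OP_COUNT };

struct snippet {
    const unsigned char *bytes;
    size_t len;
};

/* Placeholder bytes standing in for what the host compiler emitted
 * at build time.  They are NOT valid machine code. */
static const unsigned char snip_a[] = { 0xA1, 0xA2 };
static const unsigned char snip_b[] = { 0xB1, 0xB2, 0xB3 };

static const struct snippet snippet_table[TOY_OP_COUNT] = {
    [TOY_OP_A] = { snip_a, sizeof snip_a },
    [TOY_OP_B] = { snip_b, sizeof snip_b },
};

/* Concatenate the saved snippet for each op into out.
 * Returns the number of bytes written, or 0 if out is too small. */
size_t glue_snippets(const enum toy_op *ops, size_t n_ops,
                     unsigned char *out, size_t out_cap)
{
    size_t used = 0;

    assert(ops != NULL && out != NULL);
    for (size_t i = 0; i < n_ops; i++) {
        const struct snippet *s = &snippet_table[ops[i]];

        if (s->len > out_cap - used)
            return 0;
        memcpy(out + used, s->bytes, s->len);
        used += s->len;
    }
    return used;
}
```

The translator itself stays trivial (a table lookup and a memcpy), which
is why all the pain moved into getting clean snippets out of the
compiler in the first place.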

TCG stands for "Tiny Code Generator".  It just hardwires a code
generator into QEMU.  They wrote a mini-compiler in C, which knows what
instructions to output for each host qemu supports.  If QEMU understands
target instructions well enough to _read_ them, it's not a big stretch
to be able to _write_ them when running on that kind of host.  (It's
more or less the same operation in reverse.)  This means that QEMU can
no longer run on a type of host it can't execute target code for, but
the solution is to just add support for all the interesting machines out
there, on both sides.

So, when QEMU executes code, the virtual MMU faults a new page into the
virtual TLB, and goes "I can't execute this, fix it up!"  And the fixup
handler looks for a translation of the page in the cache of translated
pages, and if it can't find it it calls the translator to convert the
target code into a page of corresponding host code.  Which may involve
discarding an existing entry out of the cache, but this is how
instruction caches work on real hardware too, so the delays in QEMU
are roughly where they'd be on real hardware, and optimizing for one is
pretty close to optimizing for the other, so life is good.

The chunk you found earlier is a function pointer typecast:

#define tcg_qemu_tb_exec(tb_ptr) \
  ((long REGPARM (*)(void *))code_gen_prologue)(tb_ptr)

It casts code_gen_prologue to a pointer to a function that takes a
void * and returns a long, then calls it with tb_ptr as the argument.
That calls a translated page, and when the function returns that means
the page of code needs to jump to code somewhere outside of that page,
and we go back to the main loop to figure out where to go next.
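
If the cast syntax is the confusing part, here is the same trick in a
standalone form.  The names (generic_fn, toy_block, code_entry,
toy_tb_exec, run_toy_block) are all made up for this example, and I've
left REGPARM out.  In QEMU the thing being cast is a buffer of generated
machine code; here it's an ordinary C function stored behind a generic
function pointer type, so the example stays portable.

```c
#include <assert.h>
#include <stddef.h>

/* Generic function pointer type used only for storage. */
typedef void (*generic_fn)(void);

/* Stand-in for a translated block: bumps a counter and returns it. */
static long toy_block(void *tb_ptr)
{
    long *counter = tb_ptr;

    *counter += 1;
    return *counter;
}

/* The entry point is stored without its real signature... */
static generic_fn code_entry = (generic_fn)toy_block;

/* ...so the macro casts it back to "function taking void *,
 * returning long" and then calls it with tb_ptr. */
#define toy_tb_exec(tb_ptr) \
  (((long (*)(void *))code_entry)(tb_ptr))

/* Small wrapper showing the macro in use. */
long run_toy_block(long *counter)
{
    assert(counter != NULL);
    return toy_tb_exec(counter);
}
```

Calling run_toy_block() on a counter holding 41 returns 42 and leaves 42
in the counter.  The cast only changes how the compiler treats the
pointer; nothing is converted at runtime.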

QEMU is as fast as it is because once it has a page of
translated code, actually _running_ it is entirely native.  It jumps
into the page, and executes natively until it leaves the page.   Control
only goes back to QEMU to switch pages or to handle I/O and interrupts
and such.  So when you ask "how many clock cycles did that instruction
take", the answer is "it doesn't work that way".  QEMU emulates at
memory page level (generally 4k of target code), not at individual
instruction level.

(Oh, and the worst thing you can do to QEMU from a performance
perspective is self-modifying code.  Because the virtual MMU has to
strip the executable bit off the TLB entry and re-translate the entire
page next time something tries to execute it.  It _works_, it's just
slow.  But again, real hardware can hiccup a bit on this too.)
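
A minimal sketch of why that hurts, assuming page-granular invalidation
as described above.  The arrays and function names are hypothetical, and
a flag per page stands in for the real translation cache:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u
#define NUM_PAGES 4u

static unsigned char guest_mem[PAGE_SIZE * NUM_PAGES];
static int page_translated[NUM_PAGES];  /* 1 if a translation is cached */
static int translations;                /* how many times we translated */

/* Pretend to execute at addr: translate the page first on a miss. */
void guest_execute(size_t addr)
{
    size_t page = addr / PAGE_SIZE;

    assert(page < NUM_PAGES);
    if (!page_translated[page]) {
        page_translated[page] = 1;  /* the expensive step would go here */
        translations++;
    }
}

/* A guest store: writing into a page drops that page's translation,
 * so the next execute from it pays for translation again. */
void guest_store(size_t addr, unsigned char value)
{
    size_t page = addr / PAGE_SIZE;

    assert(page < NUM_PAGES);
    guest_mem[addr] = value;
    page_translated[page] = 0;
}
```

Executing twice from page 0 translates once.  A store anywhere in page 0
forces a second translation on the next execute, while a store to a
different page costs nothing.  Code that keeps writing to the page it is
running from pays the translation cost over and over.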

Does that answer your question?

Rob
