Here's some more text for the virtualization paper. It talks about virtualizing pages in the guest address space. Jens and others, you were interested in virtualizing the page tables. I'd be interested in hearing any feedback on this stuff. -Kevin
Virtualizing code/data pages:
=============================

This section talks about virtualizing one page's worth of code or data in the guest OS space, by mapping it into the monitor space.

To help visualize things, let's assume we start out in the following state. We have allocated space for, and set up, all of the necessary monitor system structures; for example, the TSS, page directory, page tables, GDT, IDT, etc. To be clear, these structures belong to and are used by the monitor to implement an environment where the guest can run, but are not accessed directly by the guest OS. The pages used by these structures are marked with supervisor privilege, so any access to them by the guest (all rings are pushed down to ring3) will generate a page fault. All of the monitor structures are mapped into an address space spanned by one page table (a 4MByte span). This is done for convenience, so we can migrate these structures within the address space efficiently and easily if the guest ever requires the use of a linear address within that range. Likely we have placed the monitor in a portion of the address space that the guest doesn't use.

We have also allocated memory for the guest's physical memory, though for now let's say we have not yet mapped any of it into the monitor's address space.

The guest is to begin execution at a given address in its linear address space. As the guest begins executing at this address, which is not yet mapped into the monitor, the monitor will receive a page fault. This works because we mark all unused address space with page table entries that force a page fault. The monitor uses the page fault as an opportunity to map the needed page into memory, at the actual linear address expected by the guest, but at the physical address of the page of memory allocated for the guest by the host. This mapping takes place in the monitor's page tables, which are the ones really used by the CPU.
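The demand-mapping step can be sketched in C, using flat arrays in place of real two-level x86 page tables. The frame numbers, the `guest_linear_to_host_frame()` placement policy, and the tiny 16-page address space are all illustrative assumptions, not the monitor's actual layout:

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define NPAGES 16   /* toy: a 16-page guest linear space */

/* Flat stand-in for the monitor's page tables: one host frame number
 * per linear page, 0 = not present.  Real code would edit the
 * two-level x86 structures the CPU actually walks. */
static uint32_t monitor_pt[NPAGES];

/* Illustrative placement policy: pretend the host put guest physical
 * frame n at host frame n + 100 (a made-up number). */
static uint32_t guest_linear_to_host_frame(uint32_t lpage)
{
    return lpage + 100;
}

/* Monitor page-fault path: map the faulting guest linear page on
 * demand, and return the host frame now backing it. */
uint32_t demand_map(uint32_t linear)
{
    uint32_t lpage = (linear >> PAGE_SHIFT) % NPAGES;
    if (monitor_pt[lpage] == 0)                 /* not yet mapped */
        monitor_pt[lpage] = guest_linear_to_host_frame(lpage);
    return monitor_pt[lpage];
}
```

Repeated faults on the same page simply find the mapping already present, which is why this only costs us once per page.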
The guest page tables are used only for reference, so the monitor knows which physical guest page to map to. We could continue this process, mapping in new pages on demand, only when we encounter guest execution in new pages.

As the guest executes, it will emit many data accesses to various memory pages. As these pages have not yet been mapped into the monitor's memory space (we are starting with a blank slate), they will generate page faults, in much the same way as accesses to code pages did above. So we can map these data pages into the monitor's space on demand, as we encounter their use in the guest.

To recap: other than some pages which hold the monitor's data structures, we have started out with a blank address space from the point of view of the guest OS, and dynamically created a page table as the guest executes. There are a couple of points to make here. First, we have to rebuild the page table this way upon every implicit or explicit change to the PDBR (CR3) register. (Perhaps there is some room to optimize here, but for now...) Second, we don't necessarily have to build the page tables one page at a time. We could map in bigger chunks, or whole address spaces at one time, depending upon other considerations.

Virtualizing guest system data structures:
==========================================

Previously, we talked about how to virtualize code/data pages, mapping them into the monitor's address space. Now let's look at how to virtualize important guest OS data structures such as the GDT, IDT, page tables, etc. It's important to keep in mind that the ones really used by the CPU are the monitor structures, which are stored in supervisor-permission pages and are thus inaccessible from the guest running at ring3.
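As a rough illustration of this shadow arrangement, here is a C sketch in which the guest's GDT is just data in guest memory, while the monitor keeps the copy the CPU would actually use; a guest write to the protected region gets emulated by the monitor and propagated into the shadow copy. All names and sizes here (`emulate_guest_write64`, an 8-entry GDT, the descriptor value) are hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical shadow-GDT bookkeeping.  The guest's GDT is plain data
 * in guest memory; the monitor keeps the copy the CPU actually loads. */
#define GDT_ENTRIES 8

static uint64_t guest_gdt[GDT_ENTRIES];       /* guest's in-memory GDT */
static uint64_t shadow_gdt[GDT_ENTRIES];      /* monitor's copy */
static uintptr_t gdt_base  = 0;               /* set in demo below */
static uint32_t  gdt_limit = sizeof guest_gdt - 1;

/* Fault-handler path for a guest write to a protected page: perform
 * the write on the guest's behalf, then refresh any shadow entry the
 * write touched.  (A real monitor would also adjust DPL bits, etc.) */
static void emulate_guest_write64(uintptr_t addr, uint64_t value)
{
    memcpy((void *)addr, &value, sizeof value);          /* the access */
    if (addr >= gdt_base && addr <= gdt_base + gdt_limit) {
        size_t idx = (addr - gdt_base) / sizeof(uint64_t);
        shadow_gdt[idx] = guest_gdt[idx];                /* propagate */
    }
}

/* Demo: the guest rewrites descriptor slot 2; the monitor's shadow
 * copy ends up holding the same (made-up) descriptor value. */
uint64_t demo_guest_writes_descriptor(void)
{
    gdt_base = (uintptr_t)guest_gdt;
    emulate_guest_write64(gdt_base + 2 * sizeof(uint64_t),
                          0x00cf9a000000ffffULL);
    return shadow_gdt[2];
}
```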
As the guest makes a mode transition (for example, into protected mode), or attempts to change the value of a register which points to these structures (for example, via the LGDT instruction), the monitor will receive an exception, since these instructions are all protected from being executed in ring3. (We can also virtualize arbitrary instructions with the SBE logic.) The monitor uses the exception as a chance to emulate the offending instruction. At this point, we can see where the new value in the register points. By examining the data at that address in the guest address space, we can build 'virtualized' values in the corresponding monitor structures. As we know the size of a given data structure, we can also determine the range of guest address space occupied, and thus the pages which are spanned.

Since the monitor needs to be aware of any read or write accesses to such regions to virtualize the guest GDT, IDT, etc., it must mark these pages as inaccessible by the guest running at ring3. The monitor will then receive a page fault any time the guest attempts an access to a protected region. The fault handler in the monitor will have to carry out the access on behalf of the guest, and then update its corresponding entries in whatever structure was modified, knowing the affected addresses.

Virtualizing a real page fault in the guest:
============================================

As page faults are a normal part of an OS's protection mechanisms and paging strategy, they will occur naturally in an OS. Since our virtualization strategies rely heavily on the paging protections, our page fault handler needs to discern between valid page faults and ones generated for virtualization purposes. Fortunately, this is not difficult, since we have access to the guest page tables and our monitor data. We can simply examine the guest page tables to determine if a page fault should have been generated naturally by the guest.
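Discerning a real guest fault from a virtualization-induced one amounts to walking the guest's own page tables for the faulting address. A simplified C sketch, assuming 32-bit two-level 4KB paging, with a toy guest-physical memory array (pre-loaded with made-up directory and table entries) standing in for the memory the host actually allocated:

```c
#include <stdint.h>

#define P_PRESENT 0x1u

/* Toy guest-physical memory (read word-at-a-time below), pre-loaded
 * with a made-up page directory at guest-phys 0 and one page table at
 * 0x1000: linear 0x1000..0x1fff maps to guest-phys frame 0x5000. */
static uint32_t guest_mem[2048] = {
    [0]    = 0x1000 | P_PRESENT,   /* PDE 0 -> page table at 0x1000 */
    [1025] = 0x5000 | P_PRESENT,   /* PTE 1 -> guest-phys 0x5000 */
};

static uint32_t guest_phys_read(uint32_t paddr)
{
    return guest_mem[paddr >> 2];
}

/* Walk the guest's two-level page tables for 'linear'.  Returns the
 * guest-physical address, or -1 if the guest tables mark the page not
 * present -- i.e. this is a *real* guest page fault we must reflect
 * back to the guest, not one of our virtualization faults. */
int64_t guest_walk(uint32_t guest_cr3, uint32_t linear)
{
    uint32_t pde = guest_phys_read((guest_cr3 & ~0xfffu)
                                   + ((linear >> 22) << 2));
    if (!(pde & P_PRESENT))
        return -1;
    uint32_t pte = guest_phys_read((pde & ~0xfffu)
                                   + (((linear >> 12) & 0x3ffu) << 2));
    if (!(pte & P_PRESENT))
        return -1;
    return (int64_t)((pte & ~0xfffu) | (linear & 0xfffu));
}
```

A mapped address that nonetheless faulted in the monitor must have faulted for our own virtualization reasons; a not-present result means the fault belongs to the guest.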
If this is the case, we have to effect (emulate) a page fault for the guest.

Virtualization of non-existent physical guest memory:
=====================================================

For situations where the guest OS gropes physical memory to determine the amount of memory installed, we must handle this in such a way that the guest determines that memory beyond the amount we have allocated does not exist. One approach would be to make sure that all such accesses are covered by protections in the page tables and result in a page fault. We would then have to virtualize the access on behalf of the guest, and continue: for writes, we would ignore the access; for reads, we would return a value representative of non-existent memory. The problem with this approach is that it involves heavy overhead to handle the execution of such guest code.

A different and more efficient approach would be to find a truly unused physical memory region (spanning 1 aligned page) in the host, and then map this page-size region into the address space wherever the guest page tables point to non-existent physical memory. Or, for guest code running in non-paged mode, map this page to all of the linear address space above the size of physical guest memory. Then we could let the guest access such non-existent memory from then on, and it will truly be accessing non-existent memory.

Any comments/caveats/warnings here?
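The scratch-page approach reduces to a simple mapping policy in the fault handler; the frame numbers and the 4-page guest size below are made-up values for illustration:

```c
#include <stdint.h>

/* Scratch-page policy: guest-physical frames below the allocated size
 * map to real host frames; everything above aliases one truly unused
 * host page, so guest probes of non-existent memory run at full speed
 * instead of faulting into the monitor on every access. */
#define GUEST_FRAMES  4u       /* pretend the guest got 4 pages of RAM */
#define SCRATCH_FRAME 0xffu    /* the one unused host page (made up) */

/* Host frame to install in the monitor page tables for a given
 * guest-physical frame. */
uint32_t frame_for_guest_phys(uint32_t gframe)
{
    if (gframe < GUEST_FRAMES)
        return gframe + 100;   /* hypothetical host placement */
    return SCRATCH_FRAME;      /* all non-existent memory lands here */
}
```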
