I am sponsoring the following fast-track for Vikram Hegde. This case presents an overview and architectural details of an IOMMU for AMD CPUs. The information is presented for discussion and future reference only. As there is nothing to approve, I believe Closed Approved Automatic is appropriate.
-jg

AMD IOMMU
=========

Introduction
============

A Memory Management Unit (CPU or I/O) is hardware that translates virtual memory addresses into "real" or physical memory addresses. A physical memory address typically reflects the actual memory installed in the system. Virtual memory, on the other hand, is a complete fabrication, valid only for the process or other entity (such as a kernel) on the system. Using virtual memory has several benefits, including providing isolation, providing the illusion of contiguous flat memory to a process, providing a large address space (which may or may not be backed by actual physical memory), and providing a process complete freedom to load code and data anywhere in the flat virtual address space.

MMUs for CPUs are available in almost all modern general-purpose CPUs, and almost all non-embedded operating systems support them. However, IOMMUs, i.e. MMUs for I/O devices, are not yet common on most OSs. The one notable exception is SPARC CPUs and SPARC Solaris, for which an IOMMU has been available for quite some time now. IOMMUs are only now making their appearance on x86 (Intel and AMD) CPUs, and this PSARC case discusses providing Solaris x86 support for the AMD IOMMU.

Background
==========

An I/O MMU, or IOMMU as it is commonly called, translates device virtual addresses to system physical addresses. Until now on Solaris x86, a device DMA engine was programmed with physical memory addresses, so that when it performed DMA reads and writes it was directly accessing physical memory. With an IOMMU, a device DMA engine is instead programmed with device- or domain-specific virtual addresses, and DMA accesses by the device use these virtual addresses. The IOMMU intercepts these accesses and directs them to the correct physical addresses. Using an IOMMU provides several benefits:
1. The ability to isolate a device's memory accesses to certain limited areas of system physical memory, preventing device hardware or driver software from corrupting memory belonging to the kernel or to other I/O devices.

2. The ability to provide the illusion of a flat, contiguous virtual address space for device DMA when the backing physical memory is in fact "scattered" all over system physical memory. This is useful for devices that lack scatter-gather capability and cannot deal with discontiguous memory.

3. For certain legacy devices that are restricted in the memory they can access (such as only low memory), the virtual address space can maintain that illusion while the mappings actually point to high physical addresses. This allows better use of 64-bit address spaces without expensive copying through "bounce buffers".

4. For virtualization software, the ability to isolate devices belonging to different virtual machines, so that a malicious OS cannot bring down the entire system.

Technical Details
=================

On AMD CPUs that have IOMMU support, the IOMMU is integrated into the I/O hub. Each AMD IOMMU is implemented as a standard PCI function, and a system may contain more than one IOMMU. The IOMMUs have the following capabilities:

1. The ability to remap DMA accesses (reads and writes) from virtual to system physical addresses.

2. The ability to remap interrupts, routing them as desired (such as to the VMs that control those devices).

3. The ability to record and report faults encountered during the above remapping steps.

4. The ability to parcel out devices to various VMs.

5. The ability to virtualize the IOMMU for use by VMs and their OSes.

Of these features, the initial implementation in Solaris will enable only 1 and 3, i.e. DMA remapping and the ability to report faults encountered during DMA remapping.

The following hardware and software elements are used for DMA remapping:
1. Capability Registers - A set of capability registers implemented in the IOMMU's (PCI function) configuration space. These registers hold the base address of the IOMMU's memory-mapped control registers.

2. Control Registers - A set of memory-mapped registers, which include:

   a. Device Table Base Address Register - The Device Table is the primary software data structure used for DMA and interrupt remapping. This register contains the location and size of the Device Table.

   b. Command Buffer Base Address Register - This register contains the base address and size of the Command Buffer, a circular buffer in system memory used to send commands to the IOMMU.

   c. Event Log Base Address Register - This register contains the base address and size of the Event Log, a circular buffer in system memory used by the IOMMU to report and record faults.

   d. Control Register - Used by software to send control commands to the IOMMU.

   e. Status Register - Used by the IOMMU to report status information.

3. Device Table - A table set up in main memory by the OS for the IOMMU. The Device Table is indexed by the DeviceID, a 16-bit device identifier. Each entry in the Device Table includes the following information:

   a. The Page Table Root Pointer - A pointer to the page table for that device.

   b. The Interrupt Table Root Pointer - A pointer to the interrupt mapping table.

   c. A mode field - The number of levels in the page table.

   d. A DomainID field - The domain which contains the device. Two devices in the same domain share the same page tables.

   e. Read/write permission bits for the translation.

   f. Fields which indicate whether the interrupt table, the translation, and the Device Table entry are valid.
4. Page Tables - Tables in memory that map specific bits of the virtual address either to the next-level page table, if the entry is a Page Directory Entry (PDE), or to the physical page frame, if it is a Page Table Entry (PTE). Each level of the walk takes as input the base address of a page table and 9 bits from the virtual address (consumed from high to low). The 9 bits index into the page table to yield the physical address of the next lower-level page table or of the final physical page frame. With 64 bits of virtual address this yields up to a 6-level page table, with the lowest 12 bits used as an offset into the final 4KB physical page frame. Each PDE has a next-level field: if it is 0, the entry is a PTE and translation ends; otherwise it indicates the level of the next table. Terminating with 0 before the lowest level allows the use of large pages (similar to the Super Pages field in the Intel IOMMU).

5. Interrupt Remapping Table - A table in physical memory indexed by bits from the MSI interrupt data. Since this feature will not be enabled in the first phase of this project, it is not discussed further here.

6. Command Buffer - A circular buffer in memory that is written by the OS/driver and read by the IOMMU. The IOMMU uses a head pointer register to find the next location to read, while system software uses a tail pointer register to determine the next location to write. The IOMMU provides a completion-wait command that lets system software wait, on an interrupt, until all commands prior to the completion-wait command have completed. The Command Buffer is architecturally similar to the Queued Invalidation interface used by Intel IOMMUs.

7. Event Log - A circular buffer in system memory that is written by the IOMMU to report faults encountered during remapping of DMA and interrupts.
The IOMMU uses a tail pointer register to indicate the next location it will write, and system software uses a head pointer register to locate the next event to read. The IOMMU can be programmed to generate an interrupt when an event occurs and the Event Log is updated. The AMD Event Log is architecturally similar to the Advanced Fault Logging capability provided by Intel IOMMUs.

Operation of the IOMMU
======================

An IOMMU driver initially sets up the various data structures in memory, including the page tables, Device Table, Event Log, and Command Buffer, and then starts the IOMMU. When a device driver sets up DMA, it calls ddi_dma_addr_bind_handle() or ddi_dma_buf_bind_handle(). The address passed to these routines is typically a kernel virtual address. The DDI framework translates these virtual addresses to physical addresses and passes them on to the IOMMU driver. The IOMMU driver maps them to device virtual addresses, updates the device's I/O page tables, and then sends a command through the Command Buffer telling the IOMMU to invalidate any internal TLB entries it may have cached. The OS then returns the device virtual addresses, in the form of DMA cookies, to the DMA requester (the device driver). The device driver programs the device's DMA engine and starts the DMA. The device issues accesses to the device virtual addresses programmed into it, and the IOMMU translates these by walking the device's I/O page tables. Any errors encountered during this process are recorded in the Event Log, and an interrupt is generated to notify the IOMMU driver.