I am sponsoring the following fast-track for Vikram Hegde. This case presents an overview and architectural details of an IOMMU for Intel CPUs. The information is presented for discussion and future reference only. As there is nothing to approve, I believe Closed Approved Automatic is appropriate.
-jg

Intel IOMMU
===========

Introduction
============

A Memory Management Unit (CPU or I/O) is hardware that translates virtual
memory addresses into "real" or physical memory addresses. A physical memory
address typically reflects the actual memory installed on the system. Virtual
memory, on the other hand, is a complete fabrication, valid only for the
process or other entity (such as a kernel) using it. Virtual memory has
several benefits: it provides isolation, presents the illusion of contiguous
flat memory to a process, offers a large address space (which may or may not
be backed by actual physical memory), and gives a process complete freedom to
load code and data anywhere in its flat virtual address space.

MMUs for CPUs are present in almost all modern general-purpose CPUs, and
almost all non-embedded operating systems support them. However, IOMMUs,
i.e. MMUs for I/O devices, are not yet common in most operating systems. One
notable exception is SPARC: SPARC Solaris has supported an IOMMU for quite
some time now. IOMMUs are only now making their appearance on x86 (Intel and
AMD) CPUs, and this PSARC case discusses providing Solaris x86 support for
the Intel IOMMU.

Background
==========

An I/O MMU, or IOMMU as it is commonly called, translates device virtual
addresses to system physical addresses. Until now on Solaris x86, a device
DMA engine was programmed with physical memory addresses, so its DMA reads
and writes accessed physical memory directly. With an IOMMU, a device DMA
engine is instead programmed with device- or domain-specific virtual
addresses, and DMA accesses by the device use these virtual addresses. The
IOMMU intercepts these accesses and directs them to the correct physical
addresses. Using an IOMMU provides several benefits, including:

1. The ability to restrict a device's memory accesses to certain limited
   areas of system physical memory, preventing device hardware or driver
   software from corrupting memory belonging to the kernel or to other I/O
   devices.

2. The ability to provide the illusion of a flat contiguous virtual address
   space for device DMA when the backing physical memory is in fact
   "scattered" all over system physical memory. This is useful for devices
   that lack scatter-gather capability and cannot deal with discontiguous
   memory.

3. For certain legacy devices that are restricted to a subset of memory,
   such as low memory, the virtual address space can preserve that illusion
   while mapping to high physical addresses. This allows better use of
   64-bit address spaces without expensive copying through "bounce buffers".

4. For virtualization software, the ability to isolate devices belonging to
   different virtual machines, so that a malicious OS cannot bring down the
   entire system.

Technical Details
=================

On Intel CPUs with IOMMU support, the IOMMU is typically integrated into the
Memory Controller Hub. The Intel IOMMU is not a PCI device. The IOMMUs have
the following capabilities:

1. The ability to remap DMA accesses (reads and writes) from virtual to
   system physical addresses.

2. The ability to remap interrupts, routing them as desired (such as to the
   VMs that control those devices).

3. The ability to record and report faults encountered during the above
   remapping steps. There are two modes of fault reporting - Primary Fault
   Logging and Advanced Fault Logging.

4. The ability to parcel out devices to various VMs.

Of these features, the initial implementation in Solaris will enable only 1
and 3, i.e. DMA remapping and the ability to report faults encountered
during DMA remapping.

The following hardware and software elements are used for DMA remapping:

1. ACPI tables - The remapping (i.e. IOMMU) hardware in a system is reported
   through the DMAR (DMA Remapping Reporting) ACPI table. Each DMAR table
   has a header followed by one or more DRHD (DMA Remapping Hardware Unit
   Definition) structures and zero or more other structures. These
   structures start with a Type field followed by a Length field in bytes.
   The DRHD structure includes the following information:
   a. The PCI segment number associated with the unit.
   b. The base address of the registers associated with the unit.
   c. A Device Scope structure - indicates the devices that come under the
      scope of the unit.

2. IOMMU registers - A set of registers located at the base address reported
   in the DMAR ACPI table. These include:
   a. Capability Register - contains, among other things, the following
      fields:
      i. Number of Fault Recording Registers
      ii. HW support for Page Selective Invalidation
      iii. Super Page support
      iv. Advanced Fault Logging support
      v. Number of domains supported
   b. Extended Capability Register - includes:
      i. IOTLB Register offset - offset of the IOTLB registers from the
         register base address
      ii. Interrupt Remapping support
      iii. Device IOTLB support
      iv. Queued Invalidation support
   c. Global Command Register - controls the remapping hardware and
      includes:
      i. Translation enable - enables DMA remapping
      ii. Enable advanced fault logging
      iii. Queued invalidation enable
      iv. Interrupt remapping enable
   d. Global Status Register - reports status information and includes:
      i. Translation enable status - indicates if DMA remapping is enabled
      ii. Fault Log status - indicates that the fault log is enabled
      iii. Advanced Fault Log status - indicates that Advanced Fault
           Logging is enabled
      iv. Queued invalidation enable status - set if queued invalidation is
          enabled
      v. Interrupt remapping enable status
   e. Root-Entry Table Address Register - used to set the address of the
      Root Entry Table.
   f. Context Cache Command Register - used to invalidate the context
      cache.
   g. IOTLB Invalidation Registers - used to invalidate the IOMMU TLB
      cache.
   h. Fault Status Register - includes:
      i. Fault Record Index - index of the first pending fault
      ii. Invalidation error indications
      iii. Pending primary fault
      iv. Pending advanced fault
   i. Fault Event Control Register - includes:
      i. Mask fault interrupts
      ii. Fault interrupt pending
   j. Fault Recording Registers - used to record fault information.
   k. Advanced Fault Log Register - the base address and size of the
      Advanced Fault Log in system memory.

3. Root Entry Table - The Root Entry Table is indexed by PCI bus number,
   making for a total of 256 root entries. Each root entry contains the
   following fields:
   a. Context Entry Table Pointer - a pointer to the Context Entry Table
      for the PCI bus.
   b. Present field - indicates whether the entry is valid.

4. Context Entry Table - These are tables in memory indexed by the
   combination of device# and function#, so each entry corresponds to an
   individual PCI function. There are 256 context entries in each table
   (32 devices x 8 functions per PCI bus). Each context entry has the
   following fields:
   a. Domain id - the unique identifier for a domain. Two devices in the
      same domain share the same address translation structures.
   b. Present flag - indicates whether the context entry is valid.
   c. Address translation root - a pointer to the address translation
      structures (I/O page tables) for the device.

5. I/O Page Tables - The Intel IOMMU uses multi-level page tables to
   translate virtual memory to physical memory. The normal page size is
   4 KB. At each level, 9 bits of the virtual address are combined with a
   base pointer taken from the PDE (Page Directory Entry) of the previous
   level to locate the next PDE/PTE. For example, the address translation
   root in the device's context entry is used to locate a page table, and
   the highest 9 bits of the virtual address index into that page table to
   locate a PDE or PTE.
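   The indexing just described can be sketched in C as follows. This is a
   sketch only: the helper names and the example bus/device/function and
   address values are hypothetical, while the 256-entry tables and the
   9-bits-per-level split come from the description above.

   ```c
   /* How a device's bus/dev/func and a device virtual address (DVA) select
    * entries in the remapping structures. Helper names are hypothetical;
    * the constants follow the description in the text. */
   #include <stdio.h>
   #include <stdint.h>

   /* Root Entry Table: one entry per PCI bus number (256 entries). */
   static unsigned root_index(uint8_t bus)
   {
       return bus;
   }

   /* Context Entry Table: indexed by device# and function#
    * (32 devices x 8 functions = 256 entries). */
   static unsigned context_index(uint8_t dev, uint8_t func)
   {
       return ((unsigned)(dev & 0x1f) << 3) | (func & 0x7);
   }

   /* 9-bit page-table index for a given level (level 0 selects the 4 KB
    * leaf page; the low 12 bits of the DVA are the offset in the page). */
   static unsigned pt_index(uint64_t dva, int level)
   {
       return (unsigned)((dva >> (12 + 9 * level)) & 0x1ff);
   }

   int main(void)
   {
       /* e.g. a device at bus 0x02, device 0x1f, function 3 (made up) */
       printf("root entry %u, context entry %u\n",
           root_index(0x02), context_index(0x1f, 0x3));

       uint64_t dva = 0x12345678000ULL;   /* a made-up DVA */
       for (int level = 3; level >= 0; level--)
           printf("level %d index %u\n", level, pt_index(dva, level));
       return 0;
   }
   ```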
   The PDE/PTE contains a pointer to the next lower level page table or to
   the translated physical page. A PDE/PTE also contains a Super Page field
   which, if set, indicates that this is a PTE and that the remaining
   virtual address bits are used to index into a "large" page whose base
   address is given by the PTE. Super Pages will not be enabled in this
   initial deliverable of the project.

6. Interrupt Remapping Table - This is a single-level table in physical
   memory set up by system software. The table's base address and size are
   specified through the Interrupt Remap Address Register. Each IRTE
   (Interrupt Remapping Table Entry) is 128 bits in size. The interrupt
   table index is computed using interrupt address and data information.
   Since this feature will not be enabled in the first phase of the
   project, it is not discussed further.

7. Context Command Register - The Context Command Register provides the
   ability to invalidate the context cache in the IOMMU. The following
   invalidation granularities are supported:
   a. Global invalidation - invalidate all context cache entries.
   b. Domain selective invalidation - invalidate all context cache entries
      for a domain (specified via domain id).
   c. Device selective invalidation - invalidate all context cache entries
      for a specific device within a domain.

8. IOTLB Invalidation Register - The IOTLB Invalidation Register is used to
   invalidate translation entries in the TLB within the IOMMU (not the
   IOTLB within a device). The following granularities are supported:
   a. Global invalidation - invalidate all entries in the IOTLB.
   b. Domain selective invalidation - invalidate all IOTLB entries for a
      specific domain.
   c. Page selective invalidation - invalidate the IOTLB entries for
      specified DMA virtual address(es).

9. Invalidation Queue Interface - For batching invalidation commands. This
   uses an architecture similar to the AMD IOMMU command buffer: a circular
   buffer of invalidation commands in system memory indexed by head and
   tail pointers. The Queued Invalidation interface will not be enabled in
   the first delivery of Intel IOMMU support and is not discussed further.

10. Fault Logging - Faults generated by the IOMMU can be classified into
    two types - DMA remapping faults and interrupt remapping faults. Two
    fault logging facilities are available:
    a. Primary Fault Logging - This mechanism uses an array of Fault
       Recording Registers. The array has an index that points to the
       register where the next fault will be recorded; the index wraps
       around when it reaches the end of the array. When the circular
       array overflows, fault recording is suspended until the faults in
       the array are cleared.
    b. Advanced Fault Logging - A circular buffer in system memory is used
       to record faults (similar to the Event Log in the AMD IOMMU). The
       base and size of the buffer are specified via the Advanced Fault
       Log Register.
    In either mechanism, when a fault is recorded, the PPF (Primary
    Pending Fault) or APF (Advanced Pending Fault) field in the Fault
    Status Register is set and an interrupt is generated.

Operation of the IOMMU
======================

The IOMMU driver initially uses ACPI to discover the number and scope of
the IOMMU units via the DMAR ACPI table. It then sets up the various data
structures in memory, including the page tables, the Root Entry Table and
the Context Entry Tables. The IOMMU is then started by programming the
appropriate control registers.

When a device driver sets up DMA, it calls ddi_dma_addr_bind_handle() or
ddi_dma_buf_bind_handle(). The address passed into these routines is
typically a kernel virtual address. The DDI framework translates these
virtual addresses to physical addresses and passes them on to the IOMMU
driver.
The IOMMU driver maps these into device virtual addresses, updates the
device's I/O page tables, and then programs the Context Cache Command
Register and the IOTLB Invalidation Registers to invalidate any cached
context entries and IOTLB entries. The OS then passes the device virtual
addresses, in the form of DMA cookies, back to the DMA requester (the
device driver). The device driver programs the device's DMA engine and
starts the DMA. The device issues its accesses to the device virtual
addresses programmed into it, and these are translated by the IOMMU by
walking the device's I/O page tables or, more often, by consulting its
IOTLB cache. Any errors encountered during this process are recorded either
in the Fault Recording Registers (if Primary Fault Logging is used) or in
the Advanced Fault Log buffer in system memory (if Advanced Fault Logging
is used), and an interrupt is generated to notify the IOMMU driver.