I am sponsoring the following fast-track for Vikram Hegde.  This
case presents an overview and architectural details on an IOMMU
for Intel CPUs.  This information is presented for discussion and
future reference only.  As there's nothing to approve, I believe
Closed Approved Automatic is appropriate.

-jg


                                Intel IOMMU
                                ===========

Introduction
============
A Memory Management Unit (CPU or I/O) is hardware that translates virtual
memory addresses into "real" or physical memory addresses. A physical memory
address typically reflects the actual memory installed on the system. Virtual
memory on the other hand is a complete fabrication valid only for the process
or other entity (such as a kernel) on the system. Using virtual memory has
several benefits: it provides isolation, gives a process the illusion of
contiguous flat memory, supplies a large address space (that may or may not be
backed by actual physical memory), and gives a process complete freedom to
load code and data anywhere in that flat virtual address space. MMUs for
CPUs are present in almost all modern general-purpose CPUs, and almost all
non-embedded operating systems support them. However, IOMMUs, i.e. MMUs for
I/O devices, are not yet common on most OSs. One notable exception is SPARC,
where SPARC Solaris has supported an IOMMU for quite some time now.
IOMMUs are only now making their appearance on x86 (Intel and AMD) systems,
and this PSARC case discusses providing Solaris x86 support for the Intel
IOMMU.

Background
==========
An I/O MMU, or IOMMU as it is commonly called, translates device
virtual addresses into system physical addresses. Until now on Solaris x86,
a device DMA engine was programmed with physical memory addresses, so that
its DMA reads and writes accessed physical memory directly. With an IOMMU,
a device DMA engine is instead programmed with device- or domain-specific
virtual addresses, and the device performs its DMA accesses with these
virtual addresses. The IOMMU intercepts these accesses and directs them to
the correct physical addresses.

Using an IOMMU provides several benefits including

1. The ability to restrict a device's memory accesses to certain limited areas
of system physical memory, preventing device hardware or driver software from
corrupting memory belonging to the kernel or to other I/O devices.

2. The ability to provide the illusion of a flat contiguous virtual address
space for device DMA when in fact the backing physical memory is "scattered"
all over the system physical memory. This is useful for devices that don't
have scatter-gather capability and cannot deal with discontiguous memory.

3. For certain legacy devices that are restricted in the memory they can
access (such as low memory only), the virtual address space can provide that
illusion while the mapping actually targets high addresses in physical memory.
This allows better use of 64-bit address spaces without expensive copying
through "bounce buffers".

4. For virtualization software, the ability to isolate devices belonging to
different virtual machines, so that a malicious guest OS cannot bring down
the entire system.


Technical Details
=================
On Intel CPUs that have IOMMU support, the IOMMU is typically integrated into
the Memory Controller hub. The Intel IOMMU is not a PCI device.
The IOMMUs have the following capabilities:

1. The ability to remap DMA accesses (Read and Write) from virtual to
system physical addresses

2. The ability to remap interrupts, routing them as desired (such as to the
VMs that control those devices)

3. The ability to record and report faults encountered during the above
remapping steps. There are two modes of fault reporting - Primary Fault Logging
and Advanced Fault Logging.

4. The ability to parcel out devices to various VMs.

Of these features, the initial implementation in Solaris will enable only
1 and 3, i.e. DMA remapping and the ability to report faults encountered
during DMA remapping.

The following hardware and software elements are used for DMA remapping

1. ACPI tables - The remapping (i.e. IOMMU) hardware in a system is reported
through the DMAR (DMA Remapping Reporting) ACPI table. Each DMAR table has a
header followed by one or more DRHD (DMA Remapping Hardware Unit Definition)
structures and zero or more other structures. These structures start with
a Type field followed by a Length field giving the structure size in bytes.

The DRHD structure includes the following information

a. PCI segment number associated with the unit.

b. The base address of the registers associated with this unit

c. Device scope structure - indicates devices coming under the scope of the
   unit.
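The DMAR header and DRHD layout described above can be sketched as C
structures. Field widths follow the ACPI DMAR definitions; this is an
illustration, not the Solaris implementation.

```c
#include <stdint.h>

#pragma pack(1)
typedef struct dmar_remap_hdr {		/* common header for DRHD etc. */
	uint16_t type;			/* 0 == DRHD */
	uint16_t length;		/* structure length in bytes */
} dmar_remap_hdr_t;

typedef struct dmar_drhd {
	dmar_remap_hdr_t hdr;
	uint8_t  flags;			/* bit 0: INCLUDE_PCI_ALL */
	uint8_t  reserved;
	uint16_t segment;		/* PCI segment number */
	uint64_t register_base;		/* base address of unit registers */
	/* variable-length device scope structures follow */
} dmar_drhd_t;
#pragma pack()
```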

2. IOMMU Registers - These are memory-mapped registers located at the
register base address reported in the DMAR ACPI table. They include

a. Capability Register - This register contains among other things the
following fields:
        i.   Number of Fault Recording Registers
        ii.  HW support for Page Selective Invalidation
        iii. Super Page support
        iv.  Advanced Fault Logging Support
        v.   Number of Domains supported

b. Extended capability Register - This register includes
        i.   IOTLB Register offset - offset to IOTLB registers from register
             base address.
        ii.  Interrupt Remapping Support
        iii. Device IOTLB support
        iv.  Queued Invalidation Support

c. Global Command Register - This register controls remapping hardware and
   includes
        i.   Translation enable - Enables DMA remapping
        ii.  Enable advanced fault logging
        iii. Queued invalidation enable
        iv.  Interrupt Remapping enable

d. Global Status Register - This register reports status information and
   includes
        i.   Translation enable Status - indicates if DMA remapping is enabled
        ii.  Fault Log Status - indicates that fault log is enabled
        iii. Advanced Fault Log status - indicates that Advanced Fault
             Logging is enabled
        iv.  Queued invalidation enable status - set if this is enabled
        v.   Interrupt remapping enable status

e. Root-Entry Table Address register - used to set address of Root Table

f. Context cache Command Register - used to invalidate the context cache

g. IOTLB Invalidation registers - used to invalidate the IOMMU TLB cache

h. Fault status registers - includes
        i.   Fault record Index - index of first pending fault
        ii.  Indicates invalidation errors.
        iii. Pending primary fault
        iv.  Pending advanced fault

i. Fault Event control register - includes
        i.  Mask fault interrupts
        ii. Fault interrupt pending

j. Fault Recording register - used to record fault information

k. Advanced fault Log register - Base address and size of Advanced Fault Log
   in system memory.
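As a sketch, the Capability Register fields listed above might be decoded as
follows. The bit positions are taken from the Intel VT-d specification and
should be verified against the spec revision for the target hardware; the
helper names are illustrative only.

```c
#include <stdint.h>

#define CAP_ND(cap)	((uint32_t)((cap) & 0x7))	   /* domains supported */
#define CAP_AFL(cap)	((uint32_t)(((cap) >> 3) & 0x1))   /* adv. fault logging */
#define CAP_SPS(cap)	((uint32_t)(((cap) >> 34) & 0xf))  /* super page support */
#define CAP_PSI(cap)	((uint32_t)(((cap) >> 39) & 0x1))  /* page sel. inval. */
#define CAP_NFR(cap)	((uint32_t)(((cap) >> 40) & 0xff)) /* fault rec. regs */

/* ND encodes the number of supported domains as 2^(4 + 2*ND). */
static uint32_t
cap_num_domains(uint64_t cap)
{
	return (1u << (4 + 2 * CAP_ND(cap)));
}

/* NFR is the number of fault recording registers minus one. */
static uint32_t
cap_num_fault_regs(uint64_t cap)
{
	return (CAP_NFR(cap) + 1);
}
```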
        
3.  Root Entry Table - The root entry table is indexed by PCI bus number
making for a total of 256 Root Entries. Each Root Entry Table entry contains
the following fields

a. The Context Entry Table Pointer - A pointer to the context entry table for
the PCI bus.

b. The Present field - which is used to indicate if an entry is valid or not.

4. Context Entry Table - These are tables in memory that are indexed by the
combination of device number and function number, so each entry corresponds
to an individual PCI function. There are 256 context entries in each table
(32 devices with 8 functions each on a PCI bus). Each context table entry
has the following fields.

a. Domainid - The unique identifier for a domain. Two devices in the same
domain share the same address translation structures.

b. Present flag - indicates if the context entry is valid or not.

c. Address translation root - Pointer to the address translation structures
(I/O page Tables) for this device.
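The root- and context-entry indexing described above (the bus number selects
the root entry; the device and function numbers together select the context
entry) can be sketched as:

```c
#include <stdint.h>

static inline unsigned
root_index(uint8_t bus)
{
	return (bus);				/* one root entry per bus */
}

static inline unsigned
context_index(uint8_t dev, uint8_t func)
{
	/* 32 devices x 8 functions = 256 context entries */
	return (((dev & 0x1f) << 3) | (func & 0x7));
}
```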

5. I/O Page Tables - The Intel IOMMU uses multi-level page tables to translate
virtual memory to physical memory. The normal page size is 4 KB. Each level
consumes 9 bits of the virtual address, which index into the page table
located by the PDE (Page Directory Entry) of the previous level to yield the
next PDE/PTE. For example, the Address Space Root in the context entry for
the device locates the top-level page table. The highest 9 bits of the
virtual address are used to index into that table to locate a PDE or PTE.
The PDE/PTE contains a pointer to the next lower-level page table or to the
translated physical page. A PDE/PTE also contains a Super Page field which,
if set, indicates that this is a PTE and that the remaining virtual address
bits index into a "large" page whose base address the PTE points to. In this
initial deliverable of the project Super Pages will not be enabled.
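For a 48-bit address width (a four-level walk), the 9-bits-per-level scheme
above amounts to the following index extraction. This is a sketch; the names
and the assumption of a 48-bit, 4 KB-page walk are illustrative.

```c
#include <stdint.h>

#define IOMMU_PAGESHIFT		12	/* 4 KB pages */
#define IOMMU_LEVEL_BITS	9	/* 512 entries per page table */

/*
 * Level 4 is the top of a 48-bit walk; level 1 maps the 4 KB page.
 * Returns the 9-bit index into the page table at the given level.
 */
static inline unsigned
pgtable_index(uint64_t dva, int level)
{
	return ((dva >> (IOMMU_PAGESHIFT +
	    (level - 1) * IOMMU_LEVEL_BITS)) & 0x1ff);
}
```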

6. Interrupt Remapping Table - This is a single-level table in physical memory
set up by system software. The table's base address and size are specified
through the Interrupt Remap Address Register. Each IRTE (Interrupt Remapping
Table Entry) is 128 bits in size. The interrupt table index is computed using
interrupt address and data information. Since this feature will not be enabled
in the first phase of this project, it is not discussed further.

7. Context Command Register - The context command register provides the
ability to invalidate the context cache in the IOMMU. The following
invalidation granularities are supported

a. Global invalidation - Invalidate all context cache entries
b. Domain Selective invalidation - Invalidate all context cache entries
   for a domain (specified via domainid)
c. Device Selective invalidation - Invalidate all context cache entries
   for a specific device within a domain.

8. IOTLB invalidation register - The IOTLB invalidation register is used to
invalidate translation entries in the TLB within the IOMMU (not the
IOTLB within a device). The following granularities are supported:

a. Global invalidation - Invalidate all entries in the IOTLB

b. Domain selective invalidation - Invalidate all IOTLB entries for a specific
   domain

c. Page selective invalidation - Invalidate IOTLB entries for specified
   DMA virtual address(es).

9. Queued Invalidation Interface - For batching invalidation commands. This
uses an architecture similar to the AMD IOMMU command buffer: a circular
buffer of invalidation commands in system memory indexed by head and tail
pointers. The Queued Invalidation interface will not be enabled in the first
delivery of the Intel IOMMU support and is not discussed further.

10. Fault Logging - Faults generated by the IOMMU can be classified into two
types - DMA remapping faults and Interrupt remapping faults. Two types of
fault logging facilities are available

a. Primary Fault Logging - This mechanism uses an array of Fault Recording
Registers. The register array has an index that points to the register where
the next fault is recorded. The index wraps around when it reaches the end of
the array. When the circular array overflows, fault recording is suspended
until the faults in the array are cleared.

b. Advanced Fault Logging - In advanced fault logging a circular buffer
in memory is used to record faults (similar to the Event Log in AMD IOMMU).
The base address and size of the buffer are specified via the Advanced Fault
Log register.

In either mechanism, when a fault is recorded, the PPF (Primary Pending Fault)
or APF (Advanced Pending Fault) field in the Fault Status Register is set and
an interrupt is generated.
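The primary fault logging behaviour (a circular array with a wrap-around
index, and recording suspended when the array overflows) can be sketched as
follows. The register count and record format here are purely illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define NFAULT_REGS	8		/* assume 8 recording registers */

typedef struct fault_log {
	uint64_t regs[NFAULT_REGS];	/* fault records (0 == cleared) */
	unsigned index;			/* where the next fault goes */
	bool	 overflow;		/* recording suspended */
} fault_log_t;

/* Record one fault; returns false if the array is full. */
static bool
fault_record(fault_log_t *fl, uint64_t rec)
{
	if (fl->regs[fl->index] != 0) {	/* next slot still pending */
		fl->overflow = true;	/* suspend until faults cleared */
		return (false);
	}
	fl->regs[fl->index] = rec;
	fl->index = (fl->index + 1) % NFAULT_REGS;	/* wrap around */
	return (true);
}
```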

Operation of the IOMMU
======================
An IOMMU driver first discovers the number and scope of the IOMMU units in
the system from the DMAR ACPI table. It then sets up the various data structures
in memory including page tables, root-entry table and context entry table. The
IOMMU is then started by programming the appropriate control registers.
When a device driver sets up DMA it calls ddi_dma_addr_bind_handle() or
ddi_dma_buf_bind_handle(). The address passed into these routines is typically
a kernel virtual address. The DDI framework translates these virtual
addresses to physical addresses and passes them on to the IOMMU driver.
The IOMMU driver maps these into device virtual addresses, updates the
device's I/O page tables and then programs the context cache invalidation
register and the IOTLB invalidation registers to invalidate cached context
entries and IOTLB entries.
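The bind step described above, in which scattered physical pages are given a
contiguous range of device virtual addresses, can be sketched as follows.
All names here are hypothetical; real drivers go through the DDI DMA
interfaces and real page tables.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGESIZE	4096
#define MAX_PAGES	64

typedef struct domain {
	uint64_t next_dva;		/* simple bump allocator for DVAs */
	uint64_t pgtable[MAX_PAGES];	/* dva page# -> physical address */
} domain_t;

/* Map npages scattered physical pages at a contiguous DVA; return its base. */
static uint64_t
iommu_map(domain_t *dom, const uint64_t *paddrs, size_t npages)
{
	uint64_t base = dom->next_dva;

	for (size_t i = 0; i < npages; i++)
		dom->pgtable[(base / PAGESIZE) + i] = paddrs[i];
	dom->next_dva += npages * PAGESIZE;
	return (base);
}

/* Translate a DVA the way the IOMMU would after a page-table walk. */
static uint64_t
iommu_xlate(const domain_t *dom, uint64_t dva)
{
	return (dom->pgtable[dva / PAGESIZE] + (dva % PAGESIZE));
}
```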

The OS then passes the device virtual addresses in the form of DMA cookies
back to the DMA requester (the device driver). The device driver programs the
device's DMA engine and starts the DMA. The device issues accesses to the
device virtual addresses programmed into it, and the IOMMU translates these
by walking the device's I/O page tables or, more often, by consulting its
IOTLB cache.
Any errors encountered during this process are recorded either in Fault
Recording registers (if Primary Fault Logging is used) or in Advanced Fault
Log Buffer in system memory (if Advanced Fault Logging is used) and an
interrupt is generated to notify the IOMMU driver.

