I am sponsoring the following fast-track for Vikram Hegde.  This
case presents an overview and architectural details on an IOMMU
for AMD CPUs.  This information is being presented for information
only, for discussion and future reference.  As there's nothing to
approve, I believe Closed Approved Automatic is appropriate.


-jg



                                AMD IOMMU
                                =========

Introduction
============
A Memory Management Unit (CPU or I/O) is hardware that translates virtual
memory addresses into "real" or physical memory addresses. A physical memory
address typically reflects the actual memory installed on the system. Virtual
memory, on the other hand, is a complete fabrication valid only for the process
or other entity (such as a kernel) on the system. Using virtual memory has
several benefits, including providing isolation, providing the illusion of
contiguous flat memory to a process, providing a large address space (that
may or may not be backed by actual physical memory), and providing a process
complete freedom to load code and data anywhere in the flat virtual address
space. MMUs for CPUs are available in almost all modern general purpose CPUs
and almost all non-embedded operating systems support them. However, IOMMUs,
i.e. MMUs for I/O devices, are not yet common on most OSs. The one notable
exception is SPARC, for which an IOMMU has been available in SPARC Solaris
for quite some time now. IOMMUs are only now making their appearance on
x86 (Intel and AMD) CPUs, and this PSARC case discusses providing Solaris
x86 support for the AMD IOMMU.

Background
==========
An I/O MMU, or IOMMU as it is commonly called, translates device virtual
addresses into system physical addresses. Until now on Solaris x86,
a device DMA engine was programmed with physical memory addresses, so that
when it performed DMA reads and writes it was directly accessing physical
memory. With an IOMMU, a device DMA engine is instead programmed with device-
or domain-specific virtual addresses, and DMA accesses by the device are done
with these virtual addresses. The IOMMU intercepts these accesses and directs
them to the correct physical addresses.

Using an IOMMU provides several benefits, including:

1. The ability to restrict a device's memory accesses to certain limited areas
of system physical memory, preventing device hardware or driver software from
corrupting memory belonging to the kernel or other I/O devices.

2. The ability to provide the illusion of a flat contiguous virtual address
space for device DMA when in fact the backing physical memory is "scattered"
all over the system physical memory. This is useful for devices that don't
have scatter-gather capability and cannot deal with discontiguous memory.

3. For certain legacy devices which have restrictions on the memory they can
access (such as only low memory), the virtual address space can be used to
provide that illusion while mapping to high addresses in physical memory.
This allows for better use of 64-bit address spaces without expensive
copying through "bounce buffers".

4. For virtualization software it provides the ability to isolate devices
belonging to different virtual machines so that a malicious OS cannot bring
down the entire system.

Technical Details
=================
On AMD CPUs that have IOMMU support, the IOMMU is integrated into the I/O hub.
The AMD IOMMU is exposed as a capability of a standard PCI function, and there
may be more than one IOMMU in a system. The IOMMUs have the following
capabilities:

1. The ability to remap DMA accesses (read and write) from virtual to
system physical addresses.

2. The ability to remap interrupts, routing them as desired (such as to the VMs
that control those devices).

3. The ability to record and report faults encountered during the above
remapping steps.

4. The ability to parcel out devices to various VMs.

5. The ability to virtualize the IOMMU for use by VMs and their OSes.

Of these features, the initial implementation in Solaris will enable only
1 and 3, i.e. DMA remapping and the ability to report faults encountered
during DMA remapping.

The following hardware and software elements are used for DMA remapping:

1. Capability registers - A set of capability registers implemented in the
IOMMU's (PCI function) configuration space. These registers report the base
address of the memory-mapped control registers of the IOMMU.

2. Control Registers - A set of memory-mapped registers, located at the base
address reported above, which include:

a. Device Table Base Address Register - The Device Table is the primary
software data structure used for DMA and interrupt remapping. The Device Table
Base Address Register contains the location and size of the Device Table.

b. Command Buffer Base Address Register - This register contains the base
address and size of the Command Buffer, a circular buffer in system memory
used to send commands to the IOMMU.

c. Event Log Base Address Register - This register contains the base address
and size of the Event Log, a circular buffer in system memory used by the
IOMMU to report and record faults.

d. Control Register - A control register used to send control commands to the
IOMMU.

e. Status Register - A status register used by the IOMMU to report status
information.
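
As an illustration, the offsets below are one way an IOMMU driver might
describe this register file relative to the base address reported in the
capability registers. The offsets reflect my reading of the AMD IOMMU
specification and should be verified against it; the macro names themselves
are hypothetical.

    /*
     * Illustrative offsets of the memory-mapped IOMMU control registers,
     * relative to the base address from the capability registers.  Verify
     * against the AMD IOMMU specification; names are hypothetical.
     */
    #define AMD_IOMMU_DEVTBL_BASE_OFF   0x0000  /* Device Table Base Address */
    #define AMD_IOMMU_CMDBUF_BASE_OFF   0x0008  /* Command Buffer Base Address */
    #define AMD_IOMMU_EVENTLOG_BASE_OFF 0x0010  /* Event Log Base Address */
    #define AMD_IOMMU_CONTROL_OFF       0x0018  /* Control Register */
    #define AMD_IOMMU_CMDBUF_HEAD_OFF   0x2000  /* Command Buffer Head Pointer */
    #define AMD_IOMMU_CMDBUF_TAIL_OFF   0x2008  /* Command Buffer Tail Pointer */
    #define AMD_IOMMU_EVENTLOG_HEAD_OFF 0x2010  /* Event Log Head Pointer */
    #define AMD_IOMMU_EVENTLOG_TAIL_OFF 0x2018  /* Event Log Tail Pointer */
    #define AMD_IOMMU_STATUS_OFF        0x2020  /* Status Register */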

3. Device Table - A software table set up in main memory by the OS for the
IOMMU. The Device Table is indexed by the DeviceID, a 16-bit device identifier
(for PCI devices, derived from the requester's bus/device/function; see the
sketch following this list). Each entry in the Device Table includes the
following information:

a. The Page Table Root Pointer - A pointer to the page table for that device.

b. The Interrupt Table Root Pointer - A pointer to the interrupt mapping table.

c. A Mode field - This indicates the number of levels in the page table.

d. A DomainID field - This contains the DomainID, i.e. the domain to which the
device belongs. Two devices in the same domain share the same page tables.

e. Read/Write permission bits for this translation.

f. Fields which indicate whether the interrupt table, the translation, and the
Device Table entry itself are valid.
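
As a sketch of how the Device Table is consulted (the structure below is a
hypothetical, simplified view; the real entry is a 256-bit structure whose
exact bit layout is defined by the AMD IOMMU specification):

    #include <stdint.h>

    /* Hypothetical, simplified view of a Device Table entry. */
    typedef struct hypothetical_devtbl_entry {
        uint64_t pgtbl_root_pa;    /* Page Table Root Pointer (physical) */
        uint64_t intr_tbl_root_pa; /* Interrupt Table Root Pointer */
        uint16_t domain_id;        /* devices sharing it share page tables */
        uint8_t  mode;             /* number of page table levels */
        uint8_t  valid;            /* entry/translation/interrupt valid flags */
    } devtbl_entry_t;

    /* DeviceID for a PCI requester: 8-bit bus, 5-bit device, 3-bit function. */
    static inline uint16_t
    pci_deviceid(uint8_t bus, uint8_t dev, uint8_t func)
    {
        return ((uint16_t)bus << 8) | ((dev & 0x1F) << 3) | (func & 0x07);
    }

    /* The Device Table is simply indexed by DeviceID. */
    static inline devtbl_entry_t *
    devtbl_lookup(devtbl_entry_t *devtbl, uint16_t deviceid)
    {
        return (&devtbl[deviceid]);
    }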

4. Page Tables - These are tables in memory in which each entry maps specific
bits of the virtual address either to the next-level page table, if the entry
is a Page Directory Entry (PDE), or to the physical page frame, if the entry
is a Page Table Entry (PTE).

Each level of the page table walk takes as input the base address of a page
table and 9 bits from the virtual address (consumed from HI to LO). The 9 bits
are used as an index into the page table to get the physical address of the
next lower level page table or of the final physical page frame. With 64 bits
in the virtual address this yields up to a 6-level page table, with the lowest
12 bits used as an offset into the final 4KB physical page frame. Each entry
has a next-level field: if it is 0, the entry is a PTE and translation ends
there; otherwise it gives the level of the next table. Ending the walk early
with a next level of 0 allows the use of large pages (similar to the Super
Pages capability of the Intel IOMMU). A sketch of this walk follows.
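
The following is a minimal sketch of the walk described above, assuming
hypothetical pte_next_level(), pte_addr() and table_va() accessors; the actual
PDE/PTE bit layout comes from the AMD IOMMU specification, and validity and
permission checks are omitted for brevity.

    #include <stdint.h>

    #define IOMMU_PAGE_SHIFT 12    /* 4KB base page size */
    #define IOMMU_LEVEL_BITS 9     /* VA bits consumed per level */

    extern unsigned int pte_next_level(uint64_t pte); /* 0 => final translation */
    extern uint64_t     pte_addr(uint64_t pte);       /* next table or page frame */
    extern uint64_t    *table_va(uint64_t table_pa);  /* map a table for CPU use */

    /*
     * Translate a device virtual address, starting from the Device Table
     * entry's Page Table Root Pointer and Mode (number of levels).
     */
    uint64_t
    iommu_va_to_pa(uint64_t root_pa, unsigned int root_level, uint64_t dva)
    {
        uint64_t *table = table_va(root_pa);
        unsigned int level = root_level;      /* e.g. 6 for a full walk */

        for (;;) {
            /* 9-bit index for this level, taken from HI to LO. */
            unsigned int shift = IOMMU_PAGE_SHIFT +
                (level - 1) * IOMMU_LEVEL_BITS;
            uint64_t entry = table[(dva >> shift) & 0x1FF];

            if (pte_next_level(entry) == 0) {
                /*
                 * PTE: translation ends; the remaining low bits are the
                 * offset into the (possibly large) page frame.
                 */
                return (pte_addr(entry) | (dva & ((1ULL << shift) - 1)));
            }

            /* PDE: descend to the indicated next-level table. */
            level = pte_next_level(entry);
            table = table_va(pte_addr(entry));
        }
    }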

5. Interrupt Remapping Table - This is a table in physical memory that is
indexed using bits from the MSI interrupt data. Since this feature will not
be enabled in the first phase of this project, it is not discussed further
here.

6. Command Buffer - The command buffer is a circular buffer in memory that is
written to by the OS/driver and read by the IOMMU. The IOMMU uses a head
pointer register to get the next location to read, while system software uses
a tail pointer register to determine the next location to write to. The IOMMU
provides a completion-wait command that allows system software to wait (on an
interrupt) until all commands issued before the completion-wait command have
completed. The command buffer is architecturally similar to the Queued
Invalidation interface used by Intel IOMMUs. A sketch of command submission
follows.
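
As a hedged sketch of this producer/consumer scheme, using the register
offsets sketched earlier under the Control Registers: the 16-byte command size
reflects my reading of the AMD IOMMU specification, and the reg_read64(),
reg_write64() and memory-barrier helpers are hypothetical stand-ins for the
driver's register access routines. A full-buffer check against the head
pointer is omitted for brevity.

    #include <stdint.h>
    #include <string.h>

    #define CMD_ENTRY_SIZE 16    /* each command is 128 bits */

    extern uint64_t reg_read64(uintptr_t regbase, uint32_t off);
    extern void     reg_write64(uintptr_t regbase, uint32_t off, uint64_t val);
    extern void     membar_producer_hypothetical(void);

    void
    iommu_cmd_submit(uintptr_t regbase, uint8_t *cmdbuf, uint32_t cmdbuf_size,
        const uint8_t cmd[CMD_ENTRY_SIZE])
    {
        /* The tail register holds the byte offset of the next slot to write. */
        uint32_t tail =
            (uint32_t)reg_read64(regbase, AMD_IOMMU_CMDBUF_TAIL_OFF);

        /* Copy the command into the circular buffer at the tail. */
        memcpy(cmdbuf + tail, cmd, CMD_ENTRY_SIZE);
        membar_producer_hypothetical(); /* command visible before tail bump */

        /* Advance the tail (wrapping) to tell the IOMMU new work is present. */
        tail = (tail + CMD_ENTRY_SIZE) % cmdbuf_size;
        reg_write64(regbase, AMD_IOMMU_CMDBUF_TAIL_OFF, tail);
    }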

7. Event Log - The event log is a circular buffer in system memory that is
written to by the IOMMU to report faults encountered during remapping of
DMA and interrupts. A tail pointer register points to the next location for
the IOMMU to write to, and a head pointer register is used by system software
to locate the next event to read. The IOMMU can be programmed to generate an
interrupt when an event occurs and the Event Log is updated. The AMD Event Log
is architecturally similar to the Advanced Fault Logging capability provided
by Intel IOMMUs. A sketch of draining the log follows.
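
A minimal sketch of draining the Event Log, e.g. from the event interrupt
handler, again using the register offsets sketched earlier: the 16-byte event
entry size reflects my reading of the AMD IOMMU specification, and
reg_read64(), reg_write64() and iommu_report_event() are hypothetical driver
helpers.

    #include <stdint.h>

    #define EVENT_ENTRY_SIZE 16    /* each event log entry is 128 bits */

    extern uint64_t reg_read64(uintptr_t regbase, uint32_t off);
    extern void     reg_write64(uintptr_t regbase, uint32_t off, uint64_t val);
    extern void     iommu_report_event(const uint8_t event[EVENT_ENTRY_SIZE]);

    void
    iommu_eventlog_drain(uintptr_t regbase, uint8_t *evtlog, uint32_t evtlog_size)
    {
        uint32_t head =
            (uint32_t)reg_read64(regbase, AMD_IOMMU_EVENTLOG_HEAD_OFF);
        uint32_t tail =
            (uint32_t)reg_read64(regbase, AMD_IOMMU_EVENTLOG_TAIL_OFF);

        /* Consume events until we catch up with the IOMMU's tail pointer. */
        while (head != tail) {
            iommu_report_event(evtlog + head);
            head = (head + EVENT_ENTRY_SIZE) % evtlog_size;
        }

        /* Write back the head pointer to free the consumed entries. */
        reg_write64(regbase, AMD_IOMMU_EVENTLOG_HEAD_OFF, head);
    }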

Operation of the IOMMU
======================
An IOMMU driver initially sets up the various data structures in memory,
including the page tables, Device Table, Event Log and Command Buffer. The
IOMMU is then started. When a device driver sets up DMA, it calls
ddi_dma_addr_bind_handle() or ddi_dma_buf_bind_handle(). The address passed
into these routines is typically a kernel virtual address. The DDI framework
translates these virtual addresses to physical addresses and passes them on to
the IOMMU driver. The IOMMU driver maps these into device virtual addresses,
updates the device's I/O page tables and then sends a command via the
Command Buffer to the IOMMU to invalidate any internal TLB entries it may
hold. The OS then passes the device virtual addresses, in the form of DMA
cookies, back to the DMA requester (the device driver). The device driver
programs the device's DMA engine and starts the DMA. The device issues
accesses to the device virtual addresses programmed into it, and these are
translated by the IOMMU after walking the device's I/O page tables. Any errors
encountered during this process are recorded in the Event Log and an interrupt
is generated to notify the IOMMU driver. A sketch of the corresponding DDI
usage, as seen by a leaf device driver, follows.
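
The DDI calls below are the standard Solaris interfaces named above; the
device-specific pieces (mydev_dma_attr, mydev_program_sgl_entry()) and the
error handling are hypothetical placeholders. The point of the sketch is that
the driver-visible flow is unchanged: with the IOMMU enabled, the returned
cookies simply carry device virtual addresses rather than physical addresses.

    #include <sys/types.h>
    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    extern ddi_dma_attr_t mydev_dma_attr;  /* hypothetical DMA attributes */
    extern void mydev_program_sgl_entry(uint64_t addr, size_t len); /* hypothetical */

    int
    mydev_bind_buf(dev_info_t *dip, caddr_t kaddr, size_t len)
    {
        ddi_dma_handle_t handle;
        ddi_dma_cookie_t cookie;
        uint_t ccount, i;

        if (ddi_dma_alloc_handle(dip, &mydev_dma_attr, DDI_DMA_SLEEP, NULL,
            &handle) != DDI_SUCCESS)
            return (DDI_FAILURE);

        /*
         * Bind the kernel virtual address.  With the IOMMU enabled, the
         * returned cookies contain device virtual addresses that the IOMMU
         * translates; without it, they contain physical addresses.
         */
        if (ddi_dma_addr_bind_handle(handle, NULL, kaddr, len,
            DDI_DMA_READ | DDI_DMA_STREAMING, DDI_DMA_SLEEP, NULL,
            &cookie, &ccount) != DDI_DMA_MAPPED) {
            ddi_dma_free_handle(&handle);
            return (DDI_FAILURE);
        }

        /* Program each cookie into the device's DMA engine. */
        for (i = 0; i < ccount; i++) {
            mydev_program_sgl_entry(cookie.dmac_laddress, cookie.dmac_size);
            if (i + 1 < ccount)
                ddi_dma_nextcookie(handle, &cookie);
        }

        /* ... start the DMA; later, ddi_dma_unbind_handle() and free. */
        return (DDI_SUCCESS);
    }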

