CS 473 - IO
Some History
The earliest computers were very pure von Neumann machines: all IO had
to go through the CPU. There was no notion of DMA, etc.
IBM introduced the idea of hardware ``channels'' to manage IO,
switching data between the CPU, devices, and memory. This was probably
the earliest example of parallel processing.
For a long time, the most reasonable way to distinguish between a
``minicomputer'' and a ``mainframe'' was by whether or not there were
dedicated IO and memory busses, or if everything plugged into a single
bus. Advantage of former system is speed; memory bus doesn't have to
worry about arbitration, so memory accesses can be faster. Advantages
of latter system are cost and uniformity.
Today, the old minicomputer architecture is pretty much completely
obsolete. If we look at a modern desktop computer, we can see that
its bus structure looks a lot like the old-time mainframes.

This figure (which comes from an AMD document from 2001) is pretty
obsolete today, but I like it because it does a good job showing the
interrelationships between all the various busses that are in use
today, even though the functions are migrating between the components
over time. Let's go over some of the items in the figure...
- Northbridge and Southbridge
- The modern PC architecture uses two ICs to manage the CPU's
communication with the rest of the system. These have historically been
known as the Northbridge, which provides the actual communication
between the CPU, the memory, the video, and the PCI bus; and the
Southbridge, which connects the PCI bus to everything else. In this
figure, the AMD-762 "System Controller" is the Northbridge, and the
AMD-768 "Peripheral Bus Controller" is the Southbridge.
- Processors
- This figure shows two processors connected to the Northbridge. In
Intel designs these are on a bus called the "Front Side Bus"; AMD uses
a point-to-point connection with each processor directly connected to
the Northbridge (this interconnect is frequently called an FSB, also,
even though that's not correct).
- AGP
- AGP is the "accelerated graphics port" -- this is the
communication between the Northbridge and the video card. Video cards
require huge communication bandwidths, so they are given much more
direct access than other peripherals. AGP is being replaced by
PCI-Express.
- DDR Memory
- More correctly, "Double Data Rate Synchronous Dynamic RAM".
- PCI Bus
- PCI is the "Peripheral Component Interconnect", the first
successful attempt to define a bus standard that could be used by
different vendors and radically different computer architectures (I
should note that it was most certainly not the first attempt; other
busses such as Multibus, VME, Nubus, and Futurebus come to mind as
earlier attempts. I expect you could argue with me about whether some
of these -- especially Multibus and VME -- should be regarded as
successes. It's inarguable, however, that PCI is more
successful than any of these were.) The intent is that high-speed
devices will communicate using PCI; lower-speed devices will use USB.
Notice there are two of them: a 66 MHz, 64-bit PCI between the
North- and Southbridge, and a 33 MHz, 32-bit PCI hanging off the
Southbridge. PCI is being replaced by PCI-Express. The original PCI
specification was for a 33 MHz, 32-bit bus; this has been extended by
doubling both the speed and the width. This chipset supports both the
older, slower PCI and the newer, faster one.
In general, the extended PCIs haven't really taken off. While
faster, they weren't enough faster to make switching compelling.
At this point, PCI is being replaced by PCI-Express, a scalable serial
interconnect.
- AC '97
- Audio Codec '97 -- communication with a "coder-decoder", or
codec.
- GPIO
- This one really surprised me: it turns out that there are 32 pins
on the Southbridge that can simply be programmed as digital input and
output: you can read their state, you can set their state. I don't know
that I've ever seen a motherboard that permitted access to them.
- LPC bus
- Low Pin Count Bus. Legacy devices we all hope will go away soon
are sitting out there.
- USB
- USB is the "Universal Serial Bus", a standard for low-speed
device communication. Like PCI, USB is very nice in terms of allowing
devices to identify themselves so the OS can properly configure a
driver. Unlike PCI, USB is intended to be hot-pluggable: you can plug a
device in or unplug a device with the system running (a hot-pluggable
PCI standard has been defined, primarily for servers. I don't think
I've ever actually seen a hot-pluggable PCI card).
- Enhanced Integrated Drive Electronics
- aka Advanced Technology Attachment. This is the standard
interconnect for disk drives; being replaced by serial ATA.
- SMBus
- Low-speed System Management bus. This lets the system do things
like query the memory as to how big it is.
- Legacy Peripherals
- A number of "legacy" peripherals are shown on the Super-IO chip
on the lower left. These range from standards that were very good 50
years ago but are now obsolete (like the serial port) to standards that
never should have been unleashed on an unsuspecting public (like the
parallel port). The intent is that these, already obsolete, will be
disappearing from future systems.
Memory-Mapped IO
Putting everything on a single bus leads immediately to a uniform
model of accessing memory and IO devices: just put devices in the
memory space. Question: are you better off losing opcodes to IO
instructions, or address space to IO devices? When minicomputers had
16-bit address spaces, this was a valid question! Today, with memory
spaces which are huge in comparison to the number of IO ports required
to handle devices, it only makes sense to take advantage of the
richness of the regular instruction set for device access.
I used to argue at this point that using memory-mapped IO made it
easier to write device drivers in C, since you could map devices to C
structures. Unfortunately, modern C compilers pad structures for
performance reasons, and trying to coerce the compiler into producing
the memory layout you really want is non-portable and deprecated. So
I'm leaving the argument in place, but it's really of historical
interest at this point. The comments below about trying to generate
in and out instructions in C do remain valid, however.
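For what it's worth, here's a minimal sketch of the style you're more
likely to see these days: explicit volatile casts at fixed addresses,
with no reliance on how the compiler lays out a structure. The
addresses are the HC11 A/D registers used in the example that follows.
/* Explicit volatile casts: no struct layout for the compiler to pad,
   and the register accesses can't be optimized away. */
#define AD_REG(off) (*(volatile unsigned char *)(0x1030 + (off)))
#define ADCTL AD_REG(0)    /* A/D control/status register */
#define ADR1  AD_REG(1)    /* result registers            */
#define ADR2  AD_REG(2)
#define ADR3  AD_REG(3)
#define ADR4  AD_REG(4)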
Also, it's a lot easier to work with devices in C when they are in
the memory space. Let's suppose you have a simple device (for
concreteness, let's look at the A/D converter on an HC11). It's
controlled by five registers (not counting the OPTION register),
located at addresses $1030-$1034. We can define a struct that looks
like this:
struct ad {
    volatile unsigned char adctl;   /* control/status register ($1030) */
    volatile unsigned char adr1;    /* result registers ($1031-$1034)  */
    volatile unsigned char adr2;
    volatile unsigned char adr3;
    volatile unsigned char adr4;
} __attribute__ ((packed)) *adcon;
and define some macros:
#define CCF 0x80
#define SCAN 0x20
#define MULT 0x10
Now, in our code, we can say
adcon = (struct ad *) 0x1030;
and we can control the device by saying things like
adcon->adctl = SCAN | 3;
and look at the state of the device by saying things like
while (!(adcon->adctl & CCF));
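Putting those pieces together, here's a minimal sketch of a complete
polled read of one A/D channel. It assumes adcon has already been
pointed at the device as above, that the A/D was powered up through
the OPTION register, and that we're in single-channel, non-scan mode
(so ADR1 through ADR4 hold four successive conversions of the same
channel).
/* Read one conversion from A/D channel 'chan' (0-7) by polling CCF. */
unsigned char ad_read(unsigned char chan)
{
    adcon->adctl = chan & 0x07;     /* select channel; no SCAN, no MULT */
    while (!(adcon->adctl & CCF))   /* spin until conversion complete   */
        ;
    return adcon->adr1;             /* first of the four results        */
}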
IO Instructions in C
On the other hand, trying to generate IO instructions for Intel in C
is... bizarre. The macros doing it for Linux are in a file called
/usr/include/asm/io.h; you're welcome to take a look at it if you want
to figure this stuff out. Here's a relevant comment from the file:
* This file is not meant to be obfuscating: it's just complicated
* to (a) handle it all in a way that makes gcc able to optimize it
* as well as possible and (b) trying to avoid writing the same thing
* over and over again with slight variations and possibly making a
* mistake somewhere.
And an old comment from an older version of the code:
/*
* Talk about misusing macros..
*/
Just as a sample of what they're talking about, here's a macro
definition from the file:
#define __BUILDIO(bwl,bw,type) \
static inline void out##bwl##_quad(unsigned type value, int port, int quad) { \
        if (xquad_portio) \
                write##bwl(value, XQUAD_PORT_ADDR(port, quad)); \
        else \
                out##bwl##_local(value, port); \
} \
static inline void out##bwl(unsigned type value, int port) { \
        out##bwl##_quad(value, port, 0); \
} \
static inline unsigned type in##bwl##_quad(int port, int quad) { \
        if (xquad_portio) \
                return read##bwl(XQUAD_PORT_ADDR(port, quad)); \
        else \
                return in##bwl##_local(port); \
} \
static inline unsigned type in##bwl(int port) { \
        return in##bwl##_quad(port, 0); \
}
Good luck.
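If you just want to poke at a port yourself rather than decipher the
kernel's macros, glibc exposes inb() and outb() to user-space programs
on x86 Linux. Here's a minimal sketch: it is x86-Linux-specific, has to
run as root (for ioperm()), and uses the legacy parallel-port data
register at 0x378 purely as an illustration.
#include <stdio.h>
#include <sys/io.h>

#define LPT_DATA 0x378            /* legacy parallel-port data register */

int main(void)
{
    /* ask the kernel for permission to touch one port; requires root */
    if (ioperm(LPT_DATA, 1, 1) < 0) {
        perror("ioperm");
        return 1;
    }
    outb(0x55, LPT_DATA);         /* drive a pattern onto the data pins */
    printf("read back 0x%02x\n", inb(LPT_DATA));
    return 0;
}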
IO Programming
We can classify IO devices, and IO programming techniques, according
to the degree to which we can off-load the IO processing to the
device:
- Sampling
- Always valid data, CPU can read whenever it wants
- Polling
- CPU must query device to see whether it is ready
- Interrupts
- Device informs CPU that it is ready
- DMA
- Device is able to control transfer of data to/from memory;
requests interrupt when it's done
- IO Controllers
- Device performs a series of IO operations without intervention,
requests interrupt when it's done
- IO coprocessors
- Device is a separate, fully programmable computer
As we move down the list, we have progressively less work for the
CPU to do, and more sophistication required of the device (with a
correspondingly greater level of difficulty for the programmer). The
correct tradeoff varies by device.
With the exception of sampling, these forms of IO are typically
supersets of the ones above them (so an IO controller will also use
DMA; a device that does DMA also requests interrupts; you can poll a
device that is capable of doing interrupts).
Examples of sampling, polling, and interrupts are present on the HC11.
- Sampling: digital input port, motor port
- In these simple devices, the device is always ready to accept or
to provide data, as appropriate. The interface is extremely simple,
consisting of just a data register.
- Polling: analog port
- (though the analog port can be programmed into a mode such that,
once the data has gone valid, you can simply sample it). Polling requires that
you not only have a way to read and write data (the data register), but
also ways to control the device and to determine its status. These are
provided by command and status registers; frequently, they are combined
into a single command/status register: when you read it, you get the
status register; when you write it, you write to the command register.
Frequently in memory-mapped systems, the CSR is implemented so the bits
are compatible; you can do operations such as ``oring'' a bit in with
the CSR contents and have the result be something meaningful.
In the case of the HC11 analog port, the CSR is at address
0x1030. Here, the CCF flag is a ``done'' bit. CPU can keep checking
CCF; when it goes to 1, CPU knows that valid data is available. No
reason to bother with interrupts on this device, since it takes exactly
32 cycles to complete a conversion; starting the process actually
starts 4 conversions, so CCF always goes to 1 in 128 cycles. Too fast
for interrupts to help us. For that matter, since the time is
deterministic, there's no particular reason it should have given us the
CCF flag (except that it's easier than counting).
- Interrupts: serial port
- A flag to tell us when data is available in the input port, and a
flag to tell us when the output port can take data. If either of these
flags goes true, the device signals an interrupt. Notice that input and
output are logically separate devices that share an address; which of
these devices is responsible for an interrupt is up to us to discover
(we'll sketch a service routine that does exactly this below).
The new wrinkle here is that the device can request service
from the CPU when it's ready. Extra bits required here are normally
some way of globally controlling whether interrupts are enabled (or
controlling interrupts for sets of devices determined by priority), and
individual control of whether a specific device can request an
interrupt.
When an interrupt occurs, the necessary steps are:
-
The device finishes some task, and requires CPU service. If
its interrupts are enabled it will go on to the next step, otherwise it
will stop here (and, normally, not request an interrupt. Though some
devices will remember they want an interrupt, and request it if their
interrupts ever become enabled).
-
The device requests service from the CPU. There is
typically some handshaking during which the CPU determines whether
interrupts are globally enabled, the device identifies itself, and the
CPU determines whether the device is permitted to request an interrupt
at the moment. The details of these tests vary widely. If the
device is permitted to interrupt, we go on to the next step; if not, we
wait here. In this case, if interrupts from the device class are ever
enabled, the pending interrupt will be serviced.
-
The CPU saves enough of its prior state to recover the
former computation, changes to kernel mode, and branches to a location
determined by the interrupt. The interrupt service routine is located
at this location.
This is the last step in the interrupt request/service
operation. At this point, the problem is turned over to software.
Return from the interrupt service routine is normally performed by some
sort of ``return from interrupt'' instruction, which restores the
previous state of the computation.
It's important to understand just what's meant by the
``previous state of the computation'' -- it must be possible for the
process that was running at the time of the interrupt to resume with no
impact on the process. You have to be able to interrupt between the
setting of condition codes and the execution of a branch instruction
that makes use of them, for instance. Occasionally, processors have
instructions that take so long to execute that it's necessary to be
able to interrupt the instruction itself, and then resume that same
instruction later (block move instructions, which move a large amount
of data from one location to another in memory, tend to be in this
category). These instructions typically maintain their intermediate
states in registers as they proceed.
As you can imagine, this is a particular problem with
processors that use out-of-order execution. Intel has devoted a lot of
resources in their processors to it; it's the whole reason for the
in-order retirement buffer. IBM used a scheme they called ``imprecise
interrupts,'' which meant that the saved PC would be ``near'' where the
interrupt happened. This was acceptable for device interrupts, but made
debugging program exceptions very difficult. CDC's CPU didn't do
interrupts (IO was handled by peripheral processors, to be described
later), but faced much the same problem in their context swap
instruction.
One last thing to notice is that interrupts become a very
expensive operation for deeply pipelined and out-of-order processors.
It's substantially worse than a branch penalty; fortunately, interrupts
are much less common than branches.
The key feature that makes interrupts the desired solution for
a device is for an operation performed by the device to take long
enough that requiring the software to check on it periodically would
result in an unacceptable overhead.
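As a concrete (if sketchy) example of the serial port case described
above, here's roughly what such a service routine might look like in C.
The register addresses and flag bits (SCSR at $102E, SCDR at $102F,
RDRF, TDRE) are as I recall them from the HC11 reference manual, so
check the data sheet; the buffer helpers rx_put(), tx_pending(), and
tx_next() are hypothetical; and how the routine gets attached to the
interrupt vector, and how the return-from-interrupt gets generated, are
compiler- and linker-specific.
#define SCSR (*(volatile unsigned char *)0x102E)  /* SCI status register */
#define SCDR (*(volatile unsigned char *)0x102F)  /* SCI data register   */
#define RDRF 0x20   /* receive data register full   */
#define TDRE 0x80   /* transmit data register empty */

extern void rx_put(unsigned char c);      /* hypothetical buffer helpers */
extern int tx_pending(void);
extern unsigned char tx_next(void);

void sci_isr(void)
{
    unsigned char status = SCSR;          /* one read covers both checks */

    if (status & RDRF)                    /* input side needs service?   */
        rx_put(SCDR);                     /* reading SCDR clears RDRF    */

    if ((status & TDRE) && tx_pending())  /* output side needs service?  */
        SCDR = tx_next();                 /* writing SCDR clears TDRE    */
}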
More sophisticated IO mechanisms are present on other systems.
- DMA
- A system with DMA will normally have, in addition to the command
and status registers, an address and a byte-count register. This is
appropriate for situations in which relatively large blocks of data
must be transferred between the memory and the device. Consider
transferring a buffer from the memory to a disk drive: the CPU must
inform the disk as to where on the disk to put the data, and then load
the DMA controller with the starting address of the data to be sent and
the count of the number of bytes. Now the DMA controller can pull the
data out of memory without interfering with the CPU; after the transfer
is done, it requests an interrupt.
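The programming model is correspondingly simple. Here's a sketch of
starting such a transfer on an invented memory-mapped DMA controller;
the register names and addresses are made up for illustration, but the
shape -- an address register, a count register, a go bit, and an
interrupt-enable bit -- is typical.
/* Invented register layout for a hypothetical memory-mapped DMA engine. */
#define DMA_ADDR  (*(volatile unsigned long *)0xFFFF0000)  /* buffer address */
#define DMA_COUNT (*(volatile unsigned long *)0xFFFF0004)  /* bytes to move  */
#define DMA_CSR   (*(volatile unsigned long *)0xFFFF0008)  /* command/status */
#define DMA_GO    0x01   /* start the transfer           */
#define DMA_IE    0x02   /* interrupt when transfer done */

void dma_write(const void *buf, unsigned long nbytes)
{
    DMA_ADDR  = (unsigned long) buf;  /* where the data lives           */
    DMA_COUNT = nbytes;               /* how much to transfer           */
    DMA_CSR   = DMA_GO | DMA_IE;      /* start; interrupt on completion */
    /* the CPU is now free; the device's interrupt handler runs when the
       transfer completes */
}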
- IO Processors
-
IBM is the company most commonly associated with IO processors,
which they call ``channels.'' A channel is a very simple computer,
capable of executing a small set of general-purpose instructions along
with many special-purpose IO instructions such as ``read a track from
the disk'' or ``skip forward two blocks on the tape.'' The CPU would
construct a sequence of instructions for the channel to perform (a
channel control program) in main memory, and would send the channel a
start signal. The channel would execute the entire channel control
program before interrupting the CPU.
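To give the flavor of a channel control program, here's a sketch of one
written down as C data. It's loosely modeled on IBM's channel command
words, but the field layout and opcodes are invented for illustration
rather than IBM's actual encoding.
/* An invented, simplified channel command word; real CCWs pack their
   fields into 8 bytes with IBM's own encoding. */
struct ccw {
    unsigned char opcode;     /* e.g. SEEK, READ                  */
    unsigned char chain;      /* nonzero: another command follows */
    unsigned short count;     /* bytes to transfer                */
    void *addr;               /* memory buffer for the transfer   */
};

#define OP_SEEK 0x07          /* invented opcodes */
#define OP_READ 0x02

extern unsigned char seek_arg[6];   /* hypothetical seek argument */
extern unsigned char buf[4096];     /* destination for the read   */

/* "Seek to the track described by seek_arg, then read 4096 bytes into
   buf" -- built in main memory, then the channel is told to start.   */
struct ccw program[] = {
    { OP_SEEK, 1, sizeof seek_arg, seek_arg },
    { OP_READ, 0, sizeof buf,      buf      },
};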
IBM mainframe CPUs would frequently show utilization that would
seem completely unacceptable to us today -- half of their time, or
more, in the OS. But that was really OK, because what they were doing
was more managing the IO than executing user code: doing a corporate
payroll takes remarkably little processing in comparison to the huge
amount of IO involved, and IO is what mainframes are all about.
- Peripheral Processing Units
-
The most extreme case of off-loading the IO task from the CPU
could be found in CDC and Cray computers. Here, all IO is performed by
a front-end computer, and none by the main CPU. In the case of the CDC
6600, there were actually 10 IO computers (CDC referred to these as
Peripheral Processing Units, in contrast to the Central Processing
Unit); a program would request service by placing a code word in a
known location. The IO computers performed all IO, and were also
capable of executing a special instruction (called an exchange jump)
that would cause the CPU to save its entire state and then load up a
new state - effectively, causing it to perform a context switch. CDC
actually ran the operating system itself on one of the IO computers.
In the case of the Cray 1, the front-end computers were
purchased from DEC or Data General.
One last thing: in the current environment, attempting to classify
devices according to this crisp scheme is frequently very difficult.
Probably the best example is current disk drives, which appear to the
CPU as simple DMA-driven devices: The CPU tells the drive which
logical block to write to or read from, the drive does it, and the
drive requests an interrupt. But (1) decoding the logical block
address into an actual location on the disk is quite complex, and (2)
the disk actually caches reads and writes so that a read occurring
shortly after a write doesn't actually require a disk access, and the
disk drive does its own scheduling of the reads and writes. So it can
almost be classified as an IO processor.
Likewise, modern graphics cards do far more of the rendering, hidden
surface calculation, and other graphics operations than the CPU does
(in fact, last I heard typical GPUs had more transistors than typical
CPUs!). The CPU hands the list of polygons and information about
their characteristics to the graphics card, and just lets 'er rip.
Last modified: Mon Apr 24 12:46:29 MDT 2006