Linux Kernel Power Management

29 April 2003

Patrick Mochel


Abstract

Power management is the process by which the overall consumption of
power by a computer is limited based on user requirements and
policy. Power management has become a hot topic in the computer world
in recent years, as laptops have become more commonplace and users
have become more conscious of the environmental and financial effects
of limited power resources.

While there is no such thing as perfect power management, since all
computers must use some amount of power to run, there have been many
advances in system and software architectures to conserve the amount
of power being used. Exploiting these features is key to providing
good system- and device-level power management.

This paper discusses recent advances in the power management
infrastructure of the Linux kernel that will allow Linux to fully
exploit the power management capabilities of the various platforms
that it runs on. These advances will allow the kernel to provide
equally great power management, using a simple interface, regardless
of the underlying architecture.

This paper covers the two broad areas of power management - System
Power Management (SPM) and Device Power Management (DPM). It describes
the major concepts behind both subjects and describes the new kernel
infrastructure for implementing both. It also discusses the mechanism for
implementing hibernation, otherwise known as suspend-to-disk, support
for Linux.


Overview


Benefits of Power Management

A sane power management infrastructure provides many benefits to the
kernel, and not only in the obvious areas.

Battery-powered devices, such as embedded devices, handhelds, and
laptops reap most of the rewards of power management, since the more
conservative the draw on the battery is, the longer it will last.

System power management decreases boot time of a system, by restoring
previously saved state instead of reinitializing the entire
system. This conserves battery life on mobile devices and spares users
the annoying wait for the computer to boot into a usable state.

Recently, power management concepts have begun to filter into less
obvious places, like the enterprise. In a rack of servers, some
servers may power down during idle times, and power back up when
needed again to fulfill network requests. While the power consumption
of a single server is but a drop in the bucket, being able to conserve
the power draw of dozens or hundreds of computers could save a company
a significant amount of money.

Also, at a lower level, power management may be used to provide
emergency reaction to a critical system state, such as crossing a
pre-defined thermal threshold or reaching a critically low battery
state. The same concept can be applied when triggering a critical
software state, like an Oops or a BUG() in the kernel.



System and Device Power Management

There are two types of power management that the OS must handle -
System Power Management and Device Power Management.

Device Power Management deals with the process of placing individual
devices into low-power states while the system is running. This allows
a user to conserve power on devices that are not currently being used,
such as the sound device in my laptop while I write this paper.

Individual device power management may be invoked explicitly on
devices, or may happen automatically after a device has been idle for
a set amount of time. Not all devices support run-time power
management, but those that do must export some mechanism for
controlling it in order to execute the user's policy decisions.


System Power Management is the process by which the entire system is
placed into a low-power state. There are several power states that a
system may enter, depending on the platform it is running on. Many are
similar across platforms, and will be discussed in detail later. The
general concept is that the state of the running system is saved
before the system is powered down, and restored once the system has
regained power. This prevents the system from performing an entire
shutdown and startup sequence.

System power management may be invoked for a number of reasons. It may
automatically enter a low-power state after it has been idle for some
amount of time, after a user closes a lid on a laptop, or when some
critical state has been reached. These are also policy decisions that
are up to the user to configure and require some global mechanism for
controlling.



Device Power Management


Device power management in the kernel is made possible by the new
driver model in the 2.5 kernel. In fact, the driver model was inspired
by the requirement to implement decent power management in the kernel.
The new driver model allows generic kernel code to communicate with every
device in the system, regardless of the bus the device resides on, or
the class it belongs to.

The driver model also provides a hierarchical representation of the
devices in the system. This is key to power management, since the
kernel cannot power down a device that another device, that isn't
powered down, relies on for power. For example, the system cannot
power down a parent device whose children are still powered up and
depend on their parent for power.


In its simplest form, device power management consists of a
description of the state a device is in, and a mechanism for
controlling those states. Device power states are described as 'D'
states, and consist of states D0-D3, inclusive. This device state
representation is inspired by the PCI device specification and the
ACPI specification [ACPI]. Though not all device types define power
states in this way, this representation can map on to all known
device types.

Each D state represents a tradeoff between the amount of power a
device is consuming and how functional a device is. In a lower power
state (represented by a higher digit following D), some amount of
power to a device is lost. This means that some of the device's
operating state is lost, and must be restored by its driver when
returning to the D0 state.

D0 represents the state when the device is fully powered on and ready
for, or in, use. This state is implicitly supported by every device,
since every device may be powered on at some point while the system is
running. In this state, all units of a device are powered on, and no
device state is lost.

D3 represents the state when the device is off. This state is also
implicitly supported by every device, since every device is implicitly
powered off when the system is powered off. In this state, all device
context is lost and must be restored before using the device
again. This usually means the device must also be completely
reinitialized.

The PCI Power Management spec goes on to define D3hot as a D3 state
that is entered via driver control and D3cold as one that is entered
when the entire system is powered down. In D3hot, the device may not
lose all operating power, so less restoration may be needed. This is,
however, device-dependent. The kernel does not distinguish between
the two, though a driver theoretically could take extra steps to do
so.

D1 and D2 are intermediate power states that are optionally supported
by a device. In each case, the device is not functional, but not
entirely powered off. In order to bring the device back to an
operating state, less work is required than reviving the device from
D3. In D1, more power is consumed than in D2, but more device context
is preserved.

A device's power management information is stored in struct
device_pm:

struct device_pm {
#ifdef CONFIG_PM
        dev_power_t     power_state;
        u8              * saved_state;
        atomic_t        depend;
        atomic_t        disable;
        struct kobject  kobj;
#endif
};

struct device contains a statically allocated device_pm object. The
configuration dependency on CONFIG_PM guarantees the overhead for the
structure is nil when power management support is not compiled in.

The kernel defines the following power states in include/linux/pm.h:

typedef enum {
        DEVICE_PM_ON,
        DEVICE_PM_INT1,
        DEVICE_PM_INT2,
        DEVICE_PM_OFF,
        DEVICE_PM_UNKNOWN,
} dev_power_t;

When a device is registered, its initial power state is set to
DEVICE_PM_UNKNOWN. The device driver may query the device and
initialize the known power state using

void device_pm_init_power_state(struct device * dev, dev_power_t state);
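
As a rough illustration of the registration-time behaviour described
above, the following user-space sketch models a device whose power
state starts out unknown and is then initialized by its driver. The
structures are cut down to the one relevant field. The member name
'pm', the function model_device_register(), the helper
d_state_to_power_state(), and the direct mapping of D0-D3 onto the
enum are all assumptions made for this example, and the body of
device_pm_init_power_state() is only a guess at its effect.

```c
#include <assert.h>

/* User-space model; the real definitions live in the kernel headers. */
typedef enum {
        DEVICE_PM_ON,
        DEVICE_PM_INT1,
        DEVICE_PM_INT2,
        DEVICE_PM_OFF,
        DEVICE_PM_UNKNOWN,
} dev_power_t;

struct device_pm {
        dev_power_t     power_state;
};

struct device {
        struct device_pm pm;    /* assumed name for the embedded object */
};

/* Model of registration: the core does not yet know the device's state. */
void model_device_register(struct device *dev)
{
        assert(dev);
        dev->pm.power_state = DEVICE_PM_UNKNOWN;
}

/* Same signature as in the text; the body is an assumption. */
void device_pm_init_power_state(struct device *dev, dev_power_t state)
{
        assert(dev);
        dev->pm.power_state = state;
}

/* Hypothetical helper: map a D state number (0-3) onto dev_power_t. */
dev_power_t d_state_to_power_state(int d)
{
        switch (d) {
        case 0:  return DEVICE_PM_ON;
        case 1:  return DEVICE_PM_INT1;
        case 2:  return DEVICE_PM_INT2;
        case 3:  return DEVICE_PM_OFF;
        default: return DEVICE_PM_UNKNOWN;
        }
}
```

A driver that reads, say, D3 from its hardware at probe time would
call device_pm_init_power_state(dev, d_state_to_power_state(3)) to
replace the unknown state with DEVICE_PM_OFF.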


Controlling a Device's State

A device's power state may be controlled by the suspend() and resume()
methods in struct device_driver:

  int     (*suspend)      (struct device * dev, u32 state, u32 level);
  int     (*resume)       (struct device * dev, u32 level);

These methods may be initialized by the low-level device driver,
though they are typically initialized at registration time by the bus
driver that the driver belongs to. The bus's functions should forward
power management requests to the bus-specific driver, modifying the
semantics where necessary.

This model is used to provide the easiest route when converting to the
new driver model. However, a device driver's explicit initialization
of these methods will be honored.

The same methods are called during individual device power management
transitions and system power management transitions.


There are two steps to suspending a device and two steps to resume
it. In order to suspend a device, two separate calls are made to the
suspend() method - one to save state, and another to power the device
down. Conversely, one call is made to the resume() method to power the
device up, and another to restore device state.

These steps are encoded thusly:

enum {
        SUSPEND_SAVE_STATE,
        SUSPEND_POWER_DOWN,
};

enum {
        RESUME_POWER_ON,
        RESUME_RESTORE_STATE,
};

and are passed as the 'level' parameter to each method.

During the SUSPEND_SAVE_STATE call, the driver is expected to stop all
device requests and save all relevant device context based on the
state the device is entering.

This call is made in process context, so the driver may sleep and
allocate memory to save state. However, during system suspend, backing
swap devices may have already been powered down, so drivers should
use GFP_ATOMIC when allocating memory.

SUSPEND_POWER_DOWN is used only to physically power the device
down. This call has some caveats, and drivers must be aware of
them. Interrupts will be disabled when this device is called. However,
during run-time device power management, interrupts will be re-enabled
once the call returns. Some devices are known to cause problems once
they are powered down and interrupts reenabled - e.g. flooding the
system with interrupts. Drivers should be careful not to service power
management requests for devices known to be buggy.

During system power management, interrupts are disabled and remain
disabled while powering down all devices in the system.

The resume sequence is identical to the suspend sequence, though
reversed. The RESUME_POWER_ON stage is performed first, with interrupts
disabled. The driver is expected to power the device on. Interrupts
are then enabled and the RESUME_RESTORE_STATE is performed, and the
driver is expected to restore device state and free memory that was
previously allocated.
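
The two-step sequence can be sketched in user space as follows. This
is only a model under stated assumptions: the device is a hypothetical
block of four registers, model_suspend() and model_resume() are
invented names that mimic the suspend() and resume() signatures, and
malloc() stands in for a kernel allocation (where, as noted above, a
driver would use GFP_ATOMIC).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t u32;
typedef uint8_t u8;

enum { SUSPEND_SAVE_STATE, SUSPEND_POWER_DOWN };
enum { RESUME_POWER_ON, RESUME_RESTORE_STATE };

#define MODEL_NREGS 4

/* Hypothetical device model: a few registers and a power switch. */
struct device {
        u8      regs[MODEL_NREGS];      /* context held in the hardware */
        int     powered;
        u8      *saved_state;           /* mirrors device_pm::saved_state */
};

int model_suspend(struct device *dev, u32 state, u32 level)
{
        (void)state;    /* a real driver would save per target state */

        switch (level) {
        case SUSPEND_SAVE_STATE:
                /* Stopping device requests is not modelled here. */
                dev->saved_state = malloc(MODEL_NREGS);
                if (!dev->saved_state)
                        return -ENOMEM;
                memcpy(dev->saved_state, dev->regs, MODEL_NREGS);
                return 0;
        case SUSPEND_POWER_DOWN:
                /* Only cut power; context held in the device is lost. */
                memset(dev->regs, 0, MODEL_NREGS);
                dev->powered = 0;
                return 0;
        }
        return -EINVAL;
}

int model_resume(struct device *dev, u32 level)
{
        switch (level) {
        case RESUME_POWER_ON:
                dev->powered = 1;
                return 0;
        case RESUME_RESTORE_STATE:
                /* Restoring requires power and a previous save. */
                assert(dev->powered && dev->saved_state);
                memcpy(dev->regs, dev->saved_state, MODEL_NREGS);
                free(dev->saved_state);
                dev->saved_state = NULL;
                return 0;
        }
        return -EINVAL;
}
```

Keeping the save step separate from the power-down step means the only
work done with interrupts disabled is the power switch itself; the
allocation and copying happen in the earlier call.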

A driver may use the struct device_pm::saved_state field to store a
pointer to device state when the device is powered down.


Power Dependencies

Devices that are children of other devices (e.g. devices behind a PCI
bridge) depend on their parent devices to be powered up to either
provide power to them and/or provide I/O transactions.

The system must respect the power dependencies of devices and must not
attempt to power down a device which another device depends on being
on. Put another way, all children devices must be powered down before
their parent can be powered down. Conversely, the parent device must
be powered up before any children devices may be accessed.
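
The ordering rule can be sketched as a depth-first walk over a device
tree. The node layout, the fixed-size child array, the power_log
structure, and both function names below are hypothetical and exist
only for this illustration; the driver model keeps its own hierarchy.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_CHILDREN 4
#define MODEL_LOG_SIZE 16

/* Hypothetical tree node standing in for a device in the hierarchy. */
struct device {
        const char      *name;
        int             powered;
        struct device   *children[MODEL_MAX_CHILDREN];
        int             nr_children;
};

/* Records the order in which devices changed power state. */
struct power_log {
        const char      *names[MODEL_LOG_SIZE];
        int             count;
};

/* Power down every child (depth first) before the device itself. */
void model_power_down_tree(struct device *dev, struct power_log *log)
{
        int i;

        for (i = 0; i < dev->nr_children; i++)
                model_power_down_tree(dev->children[i], log);

        dev->powered = 0;
        assert(log->count < MODEL_LOG_SIZE);
        log->names[log->count++] = dev->name;
}

/* Power up the device first, then everything below it. */
void model_power_up_tree(struct device *dev, struct power_log *log)
{
        int i;

        dev->powered = 1;
        assert(log->count < MODEL_LOG_SIZE);
        log->names[log->count++] = dev->name;

        for (i = 0; i < dev->nr_children; i++)
                model_power_up_tree(dev->children[i], log);
}
```

For a bridge with one device behind it, the power-down log lists the
device before the bridge, and the power-up log lists them in the
opposite order.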

Expressing this type of dependency is simple, since it is easy to
determine whether a device has any children. But, there
are more interesting power dependencies that are more difficult to
express.

On a PCI Hotplug system, the hotplug controller that controls power to
a range of slots may reside on the primary PCI bus. However, the slots
it controls may reside behind a PCI-PCI bridge that is a peer of the
hotplug controller. The devices in the slots depend on the hotplug
controller being on to operate, but it is not the devices' parent.
There are similar transversal relationships on some embedded platforms
in which some I/O controller resides near the system root that some
PCI devices, several layers deep, may depend on to communicate
properly.

Both types of power dependencies are represented using the struct
device_pm::depend field. Implicit dependencies, like parent-child
relationships, are handled by the depend count being incremented when
a child is registered with the PM core. When that child device is
powered down or removed, its parent's depend count is decremented.
Only when a device's depend count is 0 may it be powered down.

Explicit power dependencies can be imposed on devices using

int device_pm_get(struct device *);
void device_pm_put(struct device *);

device_pm_get() will increment a device's dependency count, and
device_pm_put() will decrement it. It is up to the driver to properly
manage the dependency counts on device discovery, removal, and power
management requests.
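
A minimal user-space sketch of the counting scheme follows. The
function names device_pm_get() and device_pm_put() come from the text,
but their bodies here are assumptions, as are model_register_child()
and model_power_down(). A plain int is used where the kernel structure
uses atomic_t.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct device {
        struct device   *parent;
        int             depend;         /* atomic_t in the kernel */
        int             powered;
};

int device_pm_get(struct device *dev)
{
        dev->depend++;
        return 0;
}

void device_pm_put(struct device *dev)
{
        assert(dev->depend > 0);
        dev->depend--;
}

/* Model of registering a child: the parent gains an implicit dependent. */
void model_register_child(struct device *child, struct device *parent)
{
        child->parent = parent;
        child->depend = 0;
        child->powered = 1;
        device_pm_get(parent);
}

/* Refuse to power down while anything still depends on the device. */
int model_power_down(struct device *dev)
{
        if (!dev->powered)
                return 0;       /* already down; do not drop the count twice */
        if (dev->depend > 0)
                return -EBUSY;
        dev->powered = 0;
        if (dev->parent)
                device_pm_put(dev->parent);
        return 0;
}
```

In the hotplug case described above, the driver for a device in a slot
would call device_pm_get() on the hotplug controller when the device
is discovered and device_pm_put() once the device has been powered
down or removed, so the controller cannot be powered down in between.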


Disabling Power Management

There are circumstances in which a driver must refuse a power
management request. This is usually because the driver author does not
know the proper reinitialization sequence, or because the user is
performing an uninterruptible operation like burning a CD.

It is valid for a driver to return an error from a suspend() method
call. However, a driver may know a priori that it can't handle the
request. Declaring this in advance works to the system's benefit,
since the PM core can check if any devices have disabled power
management before starting a suspend transition.

To disable power management, a device may call

int device_pm_disable(struct device *);
void device_pm_enable(struct device *);

The former increments the struct device_pm::disable count, and the
latter decrements it. If the count is positive, system power management
is disabled completely, as is device power management on that
device.

These calls should be used judiciously, since they have a global impact
on system power management.
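
The effect of the disable count on a system transition can be sketched
as below. The bodies of device_pm_disable() and device_pm_enable() are
assumptions, and model_system_suspend_allowed() is a hypothetical
stand-in for whatever check the PM core performs before starting a
suspend; a plain int replaces atomic_t.

```c
#include <assert.h>
#include <errno.h>

struct device {
        int     disable;        /* atomic_t in the kernel */
};

int device_pm_disable(struct device *dev)
{
        dev->disable++;
        return 0;
}

void device_pm_enable(struct device *dev)
{
        assert(dev->disable > 0);
        dev->disable--;
}

/*
 * Hypothetical pre-flight check: scan all devices before starting a
 * system suspend, and give up early if any of them has disabled PM.
 */
int model_system_suspend_allowed(struct device *devs, int n)
{
        int i;

        for (i = 0; i < n; i++)
                if (devs[i].disable > 0)
                        return -EBUSY;
        return 0;
}
```

Because the count nests, a driver can bracket an uninterruptible
operation with a disable/enable pair without disturbing any other
code that has disabled power management on the same device.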


System Power Management

System power management is the concept of putting an entire computer
into a state in which it is consuming a relatively small amount of
power while maintaining a relatively low response latency.

In a system power management (SPM) state, the system is not running
and no processes are being executed. Typically, all of the devices in
the system are also in a low-power state that corresponds to the
tradeoff between power consumption and response latency of the entire
system, which will be explained shortly.

SPM details are dependent on the underlying platform. The amount of
power the system consumes, the response latency, and even the
canonical name for the states a system can enter are dependent on the
architecture; sometimes even the generation of the architecture.

The new power management subsystem defines an abstract interface to
control SPM states. It provides an interface via sysfs from which a
user can trigger SPM transitions. Internally, the PM infrastructure
performs platform-agnostic actions to quiesce the system, then calls
down to a dynamically registered PM driver, which performs the
platform-specific steps to transition the hardware into a low-
power state.
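
The split between generic and platform-specific work might look
roughly like the following user-space sketch. Every name in it
(model_pm_driver, model_pm_register(), model_pm_transition(), and the
sample platform driver) is hypothetical; the text does not give the
actual interface, so this shows only the shape of the control flow.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical platform driver object; not the kernel's definition. */
struct model_pm_driver {
        const char      *name;
        int             (*enter)(int state);    /* platform-specific step */
};

static struct model_pm_driver *model_pm_current;

/* Dynamically register the one platform driver used for transitions. */
int model_pm_register(struct model_pm_driver *drv)
{
        assert(drv && drv->enter);
        if (model_pm_current)
                return -EBUSY;
        model_pm_current = drv;
        return 0;
}

/* Counts how often each generic step has run. */
struct model_pm_trace {
        int     quiesced;
        int     resumed;
};

/* Generic path: quiesce, call down to the platform, then undo. */
int model_pm_transition(int state, struct model_pm_trace *t)
{
        int ret;

        if (!model_pm_current)
                return -ENODEV;

        t->quiesced++;  /* stand-in for stopping tasks and devices */
        ret = model_pm_current->enter(state);
        t->resumed++;   /* runs whether or not enter() succeeded */
        return ret;
}

/* Sample platform driver that merely records the requested state. */
static int model_last_state = -1;

static int model_platform_enter(int state)
{
        model_last_state = state;
        return 0;
}

static struct model_pm_driver model_platform = {
        "model-platform", model_platform_enter,
};
```

The generic code never needs to know how the hardware is put to
sleep; it only needs a registered driver whose enter() hook performs
that final step.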


Power Transitions



PM States

The power states that a platform can enter are defined by the
underlying hardware and the firmware that runs on the hardware. Though
the name of each state and the mechanism for entering it differ
across platforms, most platforms support three nearly identical
states.

To the PM subsystem, a power state is an abstract object, defined by
struct pm_state. struct pm_state includes both the power state the
system is to enter as well as the lowest power state every device
in the system is to enter.



PM Drivers

The PM subsystem defines a simple object registration model that
platform-specific code can use to register objects that can
communicate with the hardware.


Other Power Management Options


So far, a lot of talk has been dedicated to describing the internals
of the new power management subsystem, but little has been given to
describe how the new infrastructure interacts with current power
management options. This section describes those relationships, and
although it focuses on options specific to ia32 platforms, the
relationships should be extendable to other platforms.


ACPI

In terms of system power management, ACPI fits nicely into the new PM
infrastructure. It behaves as a PM driver, and provides
platform-specific hooks to transition the system into a low-power
state. At the basic SPM level, this is all that is required, though
ACPI offers a potentially much more powerful solution, since it
exposes more intimate knowledge of the platform power requirements
than has ever been available on ia32 platforms (e.g. response
latencies, power consumption etc.). It is up to the ACPI platform
driver to exploit this knowledge and expose these attributes via sysfs.

ACPI offers similar potential for device power management. Devices
that appear in the firmware's DSDT (Differentiated System Description
Table) may expose a very fine-grained level of detail about the
devices' power requirements and capabilities.

ACPI also stresses the capabilities of device Performance States.
A performance state is a power state that describes a trade-off
between the capabilities of a device and the power consumption of
the device. In each performance state that a device supports, the
device is fully running, but different functional hardware units may
be powered off to conserve power. The driver model does not explicitly
recognize performance states, though the new PM extensions to the driver
model provide a framework that could easily be extended to recognize
performance states.


APM

APM power management does not appear on very many new systems, but the
current Linux install base includes a large number of APM-capable
computers. The new PM model was not developed with APM, or any
firmware-driven PM model, in mind. However, care was taken to ensure
that it conceptually made sense to use such mechanisms as low-level
platform drivers for the PM model. No work has been done, however, to
convert APM to act as a PM driver for the new model.


pm_* infrastructure

The original PM infrastructure was developed by Andrew Henroid and was
very ground-breaking, since nothing like it had been done for the
Linux kernel before. It exists in its entirety in:

      kernel/pm.c
      include/linux/pm.h

The general idea is that drivers can declare and register an object
with the pm infrastructure that is accessed during a power state
transition. The idea is very similar to what we have now, though the
registration now is implicit when a device is registered with the
system. And, based on the implementation, we can guarantee that each
device is notified in proper ancestral order, which the old model
cannot do.

Because the new model is far superior to the old-style pm infrastructure,
the latter is declared deprecated. All drivers that implement pm callbacks
should be converted to use the hooks provided by the new driver model.


swsusp

swsusp is a mechanism for doing suspend-to-disk by saving kernel state
to unused swap space. It was also a ground-breaking feature, as it was
the first true suspend-to-disk implementation for Linux. There are
some characteristics of swsusp that many people consider questionable,
though the maintainers of swsusp counter that these are frivolous
concerns, and it currently exists as an alternative to the new PM
model. However, the author has withdrawn any philosophical objections
to swsusp. It can be, and should be, ported for use as a backend
driver for the generic Hibernate mechanism. The current code base
could be reduced to a
fraction of its current complexity.
 
Acknowledgements

Many people have contributed to this document, both explicitly and
implicitly. First, Linus deserves a mention for encouraging me to look at
implementing ACPI suspend-to-ram as my first kernel project. Andy
Grover and Paul Diefenbaugh of Intel for many things - contributing
ACPI to the kernel, for talking with me, for always arguing with and
motivating me internally to do things better, and for pushing me over
the edge to write the finest OS driver model in existence. Andy
Henroid for writing the first open-source power management model and
providing a great base -- despite its shortcomings -- to learn and
build from. Pavel Machek for constantly providing code and being
energetic about the project. All the swsusp people for doing it in the
first place and keeping it up, no matter how much I gripe about it.
