Linux Kernel Power Management

29 April 2003

Patrick Mochel

Abstract

Power management is the process by which the overall consumption of power by a computer is limited based on user requirements and policy. Power management has become a hot topic in the computer world in recent years, as laptops have become more commonplace and users have become more conscious of the environmental and financial effects of limited power resources.

While there is no such thing as perfect power management, since all computers must use some amount of power to run, there have been many advances in system and software architectures to reduce the amount of power being used. Exploiting these features is key to providing good system- and device-level power management.

This paper discusses recent advances in the power management infrastructure of the Linux kernel that will allow Linux to fully exploit the power management capabilities of the various platforms that it runs on. These advances will allow the kernel to provide equally good power management, using a simple interface, regardless of the underlying architecture.

This paper covers the two broad areas of power management: System Power Management (SPM) and Device Power Management (DPM). It describes the major concepts behind both subjects and the new kernel infrastructure for implementing both. It also discusses the mechanism for implementing hibernation support, otherwise known as suspend-to-disk, for Linux.

Overview

Benefits of Power Management

A sane power management infrastructure provides many benefits to the kernel, and not only in the obvious areas.

Battery-powered devices, such as embedded devices, handhelds, and laptops, reap most of the rewards of power management, since the more conservative the draw on the battery is, the longer it will last.

System power management decreases the boot time of a system by restoring previously saved state instead of reinitializing the entire system.
This conserves battery life on mobile devices and eliminates the annoying wait for the computer to boot into a usable state.

Recently, power management concepts have begun to filter into less obvious places, like the enterprise. In a rack of servers, some servers may power down during idle times, and power back up when needed again to fulfill network requests. While the power consumption of a single server is but a drop in the bucket, being able to reduce the power draw of dozens or hundreds of computers could save a company a significant amount of money.

Also, at a lower level, power management may be used to provide an emergency reaction to a critical system state, such as crossing a pre-defined thermal threshold or reaching a critically low battery state. The same concept can be applied when triggering a critical software state, like an Oops or a BUG() in the kernel.

System and Device Power Management

There are two types of power management that the OS must handle: System Power Management and Device Power Management.

Device Power Management deals with the process of placing individual devices into low-power states while the system is running. This allows a user to conserve power on devices that are not currently being used, such as the sound device in my laptop while I write this paper. Individual device power management may be invoked explicitly on devices, or may happen automatically after a device has been idle for a set amount of time. Not all devices support run-time power management, but those that do must export some mechanism for controlling it in order to execute the user's policy decisions.

System Power Management is the process by which the entire system is placed into a low-power state. There are several power states that a system may enter, depending on the platform it is running on. Many are similar across platforms, and will be discussed in detail later.
The general concept is that the state of the running system is saved before the system is powered down, and restored once the system has regained power. This avoids performing an entire shutdown and startup sequence.

System power management may be invoked for a number of reasons. The system may automatically enter a low-power state after it has been idle for some amount of time, after a user closes the lid on a laptop, or when some critical state has been reached. These are also policy decisions that are up to the user to configure and require some global mechanism for controlling.

Device Power Management

Device power management in the kernel is made possible by the new driver model in the 2.5 kernel. In fact, the driver model was inspired by the requirement to implement decent power management in the kernel. The new driver model allows generic kernel code to communicate with every device in the system, regardless of the bus the device resides on, or the class it belongs to.

The driver model also provides a hierarchical representation of the devices in the system. This is key to power management, since the kernel cannot power down a device while another device that is still powered up relies on it for power. For example, the system cannot power down a parent device whose children are still powered up and depend on their parent for power.

In its simplest form, device power management consists of a description of the state a device is in, and a mechanism for controlling those states. Device power states are described as 'D' states, and consist of states D0-D3, inclusive. This device state representation is inspired by the PCI device specification and the ACPI specification [ACPI]. Though not all device types define power states in this way, this representation can map on to all known device types.

Each D state represents a tradeoff between the amount of power a device is consuming and how functional a device is.
In a lower power state (represented by a higher digit following D), some amount of power to a device is lost. This means that some of the device's operating state is lost, and must be restored by its driver when returning to the D0 state.

D0 represents the state when the device is fully powered on and ready for, or in, use. This state is implicitly supported by every device, since every device may be powered on at some point while the system is running. In this state, all units of a device are powered on, and no device state is lost.

D3 represents the state when the device is off. This state is also implicitly supported by every device, since every device is implicitly powered off when the system is powered off. In this state, all device context is lost and must be restored before using the device again. This usually means the device must also be completely reinitialized.

The PCI Power Management spec goes on to define D3hot as a D3 state that is entered via driver control and D3cold as one that is entered when the entire system is powered down. In D3hot, the device may not lose all operating power, so less restoration is required. This is, however, device-dependent. The kernel does not distinguish between the two, though a driver theoretically could take extra steps to do so.

D1 and D2 are intermediate power states that are optionally supported by a device. In each case, the device is not functional, but not entirely powered off. Bringing the device back to an operating state requires less work than reviving the device from D3. In D1, more power is consumed than in D2, but more device context is preserved.

A device's power management information is stored in struct device_pm:

    struct device_pm {
    #ifdef CONFIG_PM
            dev_power_t     power_state;
            u8              * saved_state;
            atomic_t        depend;
            atomic_t        disable;
            struct kobject  kobj;
    #endif
    };

struct device contains a statically allocated device_pm object.
The configuration dependency on CONFIG_PM guarantees the overhead for the structure is nil when power management support is not compiled in.

The kernel defines the following power states in include/linux/pm.h:

    typedef enum {
            DEVICE_PM_ON,
            DEVICE_PM_INT1,
            DEVICE_PM_INT2,
            DEVICE_PM_OFF,
            DEVICE_PM_UNKNOWN,
    } dev_power_t;

When a device is registered, its initial power state is set to DEVICE_PM_UNKNOWN. The device driver may query the device and initialize the known power state using

    void device_pm_init_power_state(struct device * dev, dev_power_t state);

Controlling a Device's State

A device's power state may be controlled by the suspend() and resume() methods in struct device_driver:

    int (*suspend) (struct device * dev, u32 state, u32 level);
    int (*resume) (struct device * dev, u32 level);

These methods may be initialized by the low-level device driver, though they are typically initialized at registration time by the bus driver that the driver belongs to. The bus's functions should forward power management requests to the bus-specific driver, modifying the semantics where necessary.

This model is used to provide the easiest route when converting to the new driver model. However, a device driver's explicit initialization of these methods will be honored.

The same methods are called during individual device power management transitions and system power management transitions.

There are two steps to suspending a device and two steps to resuming it. In order to suspend a device, two separate calls are made to the suspend() method: one to save state, and another to power the device down. Conversely, one call is made to the resume() method to power the device up, and another to restore device state.

These steps are encoded as:

    enum {
            SUSPEND_SAVE_STATE,
            SUSPEND_POWER_DOWN,
    };

    enum {
            RESUME_POWER_ON,
            RESUME_RESTORE_STATE,
    };

and are passed as the 'level' parameter to each method.
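
The two-step protocol can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: struct toy_device, toy_suspend(), toy_resume() and toy_cycle() are hypothetical names invented for illustration, and the "registers" are an ordinary array. Only the level constants and the shape of the suspend()/resume() arguments come from the text above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Level values as listed in the paper. */
enum { SUSPEND_SAVE_STATE, SUSPEND_POWER_DOWN };
enum { RESUME_POWER_ON, RESUME_RESTORE_STATE };

/* Hypothetical stand-in for a device; NOT the kernel's struct device. */
struct toy_device {
    uint8_t regs[4];       /* pretend hardware registers */
    int powered;           /* 1 = on, 0 = off */
    uint8_t *saved_state;  /* plays the role of device_pm::saved_state */
};

/* Called twice per suspend: once to save context, once to cut power. */
int toy_suspend(struct toy_device *dev, uint32_t state, uint32_t level)
{
    (void)state;  /* a real driver would save more or less per D state */
    switch (level) {
    case SUSPEND_SAVE_STATE:
        if (dev->saved_state)
            return -1;  /* already saved; refuse to leak the old copy */
        dev->saved_state = malloc(sizeof(dev->regs));
        if (!dev->saved_state)
            return -1;
        memcpy(dev->saved_state, dev->regs, sizeof(dev->regs));
        return 0;
    case SUSPEND_POWER_DOWN:
        memset(dev->regs, 0, sizeof(dev->regs));  /* context is lost */
        dev->powered = 0;
        return 0;
    default:
        return -1;
    }
}

/* Called twice per resume: once to restore power, once to restore context. */
int toy_resume(struct toy_device *dev, uint32_t level)
{
    switch (level) {
    case RESUME_POWER_ON:
        dev->powered = 1;
        return 0;
    case RESUME_RESTORE_STATE:
        if (!dev->saved_state)
            return -1;
        memcpy(dev->regs, dev->saved_state, sizeof(dev->regs));
        free(dev->saved_state);  /* free what the save step allocated */
        dev->saved_state = NULL;
        return 0;
    default:
        return -1;
    }
}

/* Runs one full suspend/resume cycle; returns 0 if the registers survive. */
int toy_cycle(void)
{
    struct toy_device dev = { {0xde, 0xad, 0xbe, 0xef}, 1, NULL };
    const uint8_t expect[4] = {0xde, 0xad, 0xbe, 0xef};

    if (toy_suspend(&dev, 3, SUSPEND_SAVE_STATE) ||
        toy_suspend(&dev, 3, SUSPEND_POWER_DOWN) ||
        toy_resume(&dev, RESUME_POWER_ON) ||
        toy_resume(&dev, RESUME_RESTORE_STATE))
        return -1;
    return (dev.powered && !memcmp(dev.regs, expect, sizeof(expect))) ? 0 : -1;
}
```

Splitting each direction into two calls keeps the work that may sleep or allocate (saving and restoring context) separate from the step that actually touches the power rails.
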
During the SUSPEND_SAVE_STATE call, the driver is expected to stop all device requests and save all relevant device context based on the state the device is entering. This call is made in process context, so the driver may sleep and allocate memory to save state. However, during system suspend, backing swap devices may have already been powered down, so drivers should use GFP_ATOMIC when allocating memory.

SUSPEND_POWER_DOWN is used only to physically power the device down. This call has some caveats, and drivers must be aware of them. Interrupts will be disabled when this method is called. However, during run-time device power management, interrupts will be re-enabled once the call returns. Some devices are known to cause problems once they are powered down and interrupts are re-enabled, e.g. flooding the system with interrupts. Drivers should be careful not to service power management requests for devices known to be buggy.

During system power management, interrupts are disabled and remain disabled while powering down all devices in the system.

The resume sequence is identical to the suspend sequence, though reversed. The RESUME_POWER_ON stage is performed first, with interrupts disabled. The driver is expected to power the device on. Interrupts are then enabled and the RESUME_RESTORE_STATE stage is performed, in which the driver is expected to restore device state and free memory that was previously allocated.

A driver may use the struct device_pm::saved_state field to store a pointer to device state when the device is powered down.

Power Dependencies

Devices that are children of other devices (e.g. devices behind a PCI bridge) depend on their parent devices being powered up, either to provide power to them or to carry their I/O transactions, or both. The system must respect the power dependencies of devices and must not attempt to power down a device which another device depends on being on. Put another way, all child devices must be powered down before their parent can be powered down.
Conversely, the parent device must be powered up before any child devices may be accessed.

Expressing this type of dependency is simple, since it is easy to determine whether or not a device has any children. But there are more interesting power dependencies that are more difficult to express.

On a PCI Hotplug system, the hotplug controller that controls power to a range of slots may reside on the primary PCI bus. However, the slots it controls may reside behind a PCI-PCI bridge that is a peer of the hotplug controller. The devices in the slots depend on the hotplug controller being on to operate, but it is not the devices' parent.

There are similar transversal relationships on some embedded platforms, in which an I/O controller residing near the system root is needed by PCI devices several layers deep in order to communicate properly.

Both types of power dependencies are represented using the struct device_pm::depend field. Implicit dependencies, like parent-child relationships, are handled by the depend count being incremented when a child is registered with the PM core. When that child device is powered down or removed, its parent's depend count is decremented. Only when a device's depend count is 0 may it be powered down.

Explicit power dependencies can be imposed on devices using

    int device_pm_get(struct device *);
    void device_pm_put(struct device *);

device_pm_get() will increment a device's dependency count, and device_pm_put() will decrement it. It is up to the driver to properly manage the dependency counts on device discovery, removal, and power management requests.

Disabling Power Management

There are circumstances in which a driver must refuse a power management request. This is usually because the driver author does not know the proper reinitialization sequence, or because the user is performing an uninterruptible operation like burning a CD.

It is valid for a driver to return an error from a suspend() method call.
However, a driver may know a priori that it can't handle the request. This works to the system's benefit, since the PM core can check if any devices have disabled power management before starting a suspend transition.

To disable power management, a device may call

    int device_pm_disable(struct device *);
    void device_pm_enable(struct device *);

The former increments the struct device_pm::disable count, and the latter decrements it. If the count is positive, system power management will be disabled completely, as will device power management on that device.

These calls should be used judiciously, since they have a global impact on system power management.

System Power Management

System power management is the concept of putting an entire computer into a state in which it is consuming a relatively small amount of power while maintaining a relatively low response latency.

In a system power management (SPM) state, the system is not running and no processes are being executed. Typically, all of the devices in the system are also in a low-power state that corresponds to the tradeoff between power consumption and response latency of the entire system, which will be explained shortly.

SPM details are dependent on the underlying platform. The amount of power the system consumes, the response latency, and even the canonical name for the states a system can enter are dependent on the architecture, and sometimes even on the generation of the architecture.

The new power management subsystem defines an abstract interface to control SPM states. It provides an interface via sysfs from which a user can trigger SPM transitions. Internally, the PM infrastructure performs platform-agnostic actions to quiesce the system, then calls down to a dynamically registered PM driver, which performs the platform-specific steps to transition the hardware into a low-power state.
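
The overall flow described above (check for disabled devices, quiesce, then call down to the platform) can be modeled in a few lines of user-space C. This is a rough sketch under stated assumptions, not the kernel's implementation: every name here (struct toy_dev, struct toy_pm_driver, toy_pm_register(), toy_enter_state(), toy_platform_enter()) is hypothetical, the return codes are arbitrary, and the device array is assumed to list parents before their children.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; none of these names are kernel API. */
struct toy_dev {
    const char *name;
    int disable;    /* models the device_pm::disable count */
    int suspended;  /* 1 while the device is quiesced */
};

struct toy_pm_driver {
    int (*enter)(int state);  /* platform-specific transition hook */
};

struct toy_pm_driver *toy_registered_driver;  /* set by toy_pm_register() */
int toy_enter_calls;                          /* bookkeeping for the demo */
int toy_last_state = -1;

/* Example platform hook: only records that it was asked to enter a state. */
int toy_platform_enter(int state)
{
    toy_enter_calls++;
    toy_last_state = state;
    return 0;
}

/* Models dynamic registration of the platform PM driver. */
int toy_pm_register(struct toy_pm_driver *drv)
{
    if (!drv || !drv->enter)
        return -1;
    toy_registered_driver = drv;
    return 0;
}

/*
 * Models one system transition. Returns -1 if no driver is registered,
 * -2 if any device has disabled power management, otherwise whatever
 * the platform hook returns.
 */
int toy_enter_state(struct toy_dev *devs, size_t n, int state)
{
    size_t i;
    int err;

    if (!toy_registered_driver)
        return -1;

    /* Refuse up front, before touching any device. */
    for (i = 0; i < n; i++)
        if (devs[i].disable > 0)
            return -2;

    /* Platform-agnostic step: quiesce children before parents. */
    for (i = n; i-- > 0;)
        devs[i].suspended = 1;

    /* Platform-specific step. */
    err = toy_registered_driver->enter(state);

    /* On wakeup (or failure), bring parents back before children. */
    for (i = 0; i < n; i++)
        devs[i].suspended = 0;

    return err;
}
```

Checking the disable counts before quiescing anything means a refused transition leaves every device untouched, which is the benefit the preceding section attributes to knowing about the refusal a priori.
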
Power Transitions

PM States

The power states that a platform can enter are defined by the underlying hardware and the firmware that runs on the hardware. Though the name of each state and the mechanism for entering it differ across platforms, most platforms support three nearly identical states.

To the PM subsystem, a power state is an abstract object, defined by struct pm_state. struct pm_state includes both the power state the system is to enter and the lowest power state every device in the system is to enter.

PM Drivers

The PM subsystem defines a simple object registration model that platform-specific code can use to register objects that can communicate with the hardware.

Other Power Management Options

So far, a lot of space has been dedicated to describing the internals of the new power management subsystem, but little to describing how the new infrastructure interacts with current power management options. This section describes those relationships, and although it focuses on options specific to ia32 platforms, the relationships should be extendable to other platforms.

ACPI

In terms of system power management, ACPI fits nicely into the new PM infrastructure. It behaves as a PM driver, and provides platform-specific hooks to transition the system into a low-power state.

At the basic SPM level, this is all that is required, though ACPI offers a potentially much more powerful solution, since it exposes more intimate knowledge of the platform's power requirements than has ever been available on ia32 platforms (e.g. response latencies, power consumption, etc.). Exploiting this knowledge requires the ACPI platform driver to expose these attributes via sysfs.

ACPI offers similar potential for device power management. Devices that appear in the firmware's DSDT (Differentiated System Description Table) may expose a very fine-grained level of detail about the devices' power requirements and capabilities.
ACPI also stresses the capabilities of device Performance States. A performance state is a power state that describes a trade-off between the capabilities of a device and the power consumption of the device. In each performance state that a device supports, the device is fully running, but different functional hardware units may be powered off to conserve power.

The driver model does not explicitly recognize performance states, though the new PM extensions to the driver model provide a framework that could easily be extended to recognize them.

APM

APM power management does not appear on very many new systems, but the current Linux install base includes a large number of APM-capable computers.

The new PM model was not developed with APM, or any firmware-driven PM model, in mind. However, care was taken to ensure that it conceptually made sense to use such mechanisms as low-level platform drivers for the PM model.

No work has been done, however, to convert APM to act as a PM driver for the new model.

pm_* infrastructure

The original PM infrastructure was developed by Andrew Henroid and was very ground-breaking, since nothing like it had been done for the Linux kernel before. It exists in its entirety in:

    kernel/pm.c
    include/linux/pm.h

The general idea is that drivers can declare and register an object with the pm infrastructure that is accessed during a power state transition.

The idea is very similar to what we have now, though the registration now is implicit when a device is registered with the system. And, based on the implementation, we can guarantee that each device is notified in proper ancestral order, which the old model cannot do.

Because the new model is far superior to the old-style pm infrastructure, the latter is declared deprecated. All drivers that implement pm callbacks should be converted to use the hooks provided by the new driver model.

swsusp

swsusp is a mechanism for doing suspend-to-disk by saving kernel state to unused swap space.
It was also a ground-breaking feature, as it was the first true suspend-to-disk implementation for Linux.

Many people have raised concerns about some questionable characteristics of swsusp, which its maintainers counter are frivolous, and it currently exists as an alternative to the new PM model.

However, the author has revoked any philosophical issues with swsusp. It can be, and should be, ported to serve as a backend driver for the generic Hibernate mechanism. The current code base could be reduced to a fraction of its current complexity.

Acknowledgements

Many people have contributed to this document, both explicitly and implicitly.

First, Linus deserves a mention for encouraging me to look at implementing ACPI suspend-to-ram as my first kernel project.

Andy Grover and Paul Diefenbaugh of Intel for many things: contributing ACPI to the kernel, talking with me, always arguing with and motivating me internally to do things better, and pushing me over the edge to write the finest OS driver model in existence.

Andy Henroid for writing the first open-source power management model and providing a great base -- despite its shortcomings -- to learn and build from.

Pavel Machek for constantly providing code and being energetic about the project.

All the swsusp people for doing it in the first place and keeping it up, no matter how much I gripe about it.