The attached is a draft of the case that introduces the current
engineering program for Power Management in Solaris.
Note that it is intentionally high-level and does not itself present
any specific interfaces. It attempts to provide an overview of what the
program has in mind, with a partial outline of some of the initially
intended projects. Specific component design and interfaces will be the
purview of the one-pager for each succeeding project.
Comments are invited.
-db
--
; David J. Brown Ph.D. (cantab.)
; Principal Engineer
; Solaris Engineering
; Oracle
; --
; Postal Address: Telephone: (650) 786-5558
; 4150 Network Circle, UMPK17-307 FAX: (650) 786-5734
; Santa Clara, CA 95054 e-mail: [email protected]
Template Version: @(#)onepager.txt 1.35 07/11/07 SMI
Copyright 2007 Sun Microsystems
1. Introduction
1.1. Project/Component Working Name:
Power Management 2.0 Umbrella Case
1.2. Name of Document Author/Supplier:
David J. Brown
1.3. Date of This Document:
06/15/2010
1.3.1. Date this project was conceived:
June 2009 (This umbrella case is derived and extended
from earlier work to support suspend/resume and CPU
power management on Sun's x64 hardware platforms).
1.4. Name of Major Document Customer(s)/Consumer(s):
1.4.1. The PAC or CPT you expect to review your project:
Systems PAC
1.4.2. The ARC(s) you expect to review your project:
PSARC
1.4.3. The Director/VP who is "Sponsoring" this project:
[email protected], [email protected]
1.4.4. The name of your business unit:
x64 Platform Software
1.5. Email Aliases:
1.5.1. Responsible Manager: [email protected]
1.5.2. Responsible Engineer: [email protected]
1.5.3. Marketing Manager: [email protected]
1.5.4. Interest List: [email protected]
2. Project Summary
2.1. Project Description:
The existing power management in Solaris dates from over 17 years
ago (April 1993), when the original effort to implement checkpoint/resume
(CPR) for the "Voyager" product took place.
Recently, there has been a great deal of activity in the industry related
to energy efficiency, and hence many new power management facilities have
appeared, ranging from individual hardware components to entire
hardware platforms.
Over the past four years specific work has been done to support
these features on the Intel-architecture platforms (both
Sun's AMD- and Intel-based systems). The principal focus of these
projects has been to implement Suspend-to-RAM (ACPI S3), and to
support CPU power-management features (P-states, C-states,
and T-states) on current AMD and Intel processors.
A range of modern facilities for power management are emerging.
The system's earlier conceptions for power management need to be
revised to support these properly, and to pursue the end-objective of
energy-efficient computing.
This case introduces the Program, and a number of the initial projects
that constitute the next generation of power management facilities
within Solaris.
2.2. Risks and Assumptions:
The program's broad goal is to provide a practical solution to the
energy-efficient computing problem. This requires the ability to
construct a comprehensive system power model (for each platform upon
which it runs), and will also ultimately require improved knowledge of
the dynamic resource use of both individual applications and workloads.
We expect to gain better informational interfaces from both hardware
component vendors (e.g. Intel) and the hardware platforms teams to
address the first point.
Improved knowledge of applications and workloads is a great opportunity
now that Sun has been integrated with Oracle. This relies on the
participation of the various software product groups in question.
3. Business Summary
3.1. Problem Area:
Support for contemporary component- and platform-level power management
facilities provides the basis for the more energy-efficient operation
of the computing hardware Sun/Oracle sells.
A number of rudimentary facilities are coming to be in place in the
hardware components and platforms, and the remainder of the systems
stack above must be improved to exploit these. Particular attention
is needed at the levels of the firmware, the system virtualization layer
(hypervisor) and the operating system (whether virtualized or running
natively).
The focus of this program is on facilities implemented within the
Solaris operating system. The initial attention will be to the system
design and implementation required when the OS is run on bare iron (i.e.
the non-virtualized case). These techniques will then be considered
within the context where Solaris is a virtualized guest and extended as
appropriate. The specific virtualized settings of interest are Solaris
under both the Sun4v and OVM hypervisors (i.e. these two para-virtualized
cases).
The Program's primary focus will be on Server power management (with
particular attention to the company's volume servers to begin with).
3.2. Market/Requester:
All customers are now attentive to the energy consequences of
the systems they operate. This is of particular concern in
data centers, where the energy costs to operate equipment
can now be expected to meet or exceed the capital cost of
the equipment's acquisition.
All federal government customers (as well, possibly, as state
and local government ones, and others) will be required to purchase
equipment that meets the EPA's Energy-star guidelines. The EPA already
have a specification for consumer equipment, and one for mobile and
workstation class computer equipment. They are presently developing
the Energy-star specifications for servers, storage, and data
centers.
3.3. Business Justification:
Energy management is now a principal concern for all computer
equipment purchasers - pointedly so for those in the
enterprise/commercial and government sectors.
3.4. Competitive Analysis:
This feature is needed for Solaris to be competitive with operating
systems from other vendors, and perhaps even to provide advantage over
them.
3.5. Opportunity Window/Exposure:
Exposure is immediate. All major hardware component vendors are driving
and delivering these features, and we must support and exploit them to
keep pace. Red Hat, Microsoft Windows (both client and server), SuSE
and others are all following these features and working to support them.
Power management work in Solaris is publicly visible via development
projects in OpenSolaris.
3.6. How will you know when you are done?:
This Power Management Program can be considered done when the system has
a power management facility that meets the following criteria:
- It can be enabled or disabled. When enabled, the system strives to be
"energy efficient" by making dynamic changes to the hardware platform's
provisioning and/or performance levels in order to minimize the amount
of energy required to perform any workload run on the system.
- When it is enabled, the system administrator can express bounds on the
degradation in performance and/or responsiveness to dynamic changes in
load that the system will stay within.
- In addition (whether dynamic power management is enabled or not), the
system is able to restrict itself to operate within a specified portion
of the hardware platform's full capacity when the environment it is
operating in requires that. This may occur when either power or reserved
energy is limited; when the system is running as a virtualized guest; or
when otherwise specified by the system's administrator.
Power Management will be enabled by default. It is a goal that no
administrator would wish to disable it, except in a small number of
special or unusual circumstances:
a. Certain mission-critical or pseudo real-time deployments where static
provisioning for the worst case is required.
b. Pathological workloads whose dynamic behaviors grossly violate the
assumptions of the energy-efficiency algorithms used by the PM system.
4. Technical Description:
Power management refers to the system's dynamic adjustment of a
platform's hardware resources. This may be achieved by adjusting the
performance levels of particular resources, and/or what is presently
provisioned (available), in order to achieve the best possible energy
efficiency while running computational tasks.
At the highest level, the simple purpose of power management on Oracle's
platforms is to maximize energy-efficiency at all times. That is, the
objective is to minimize the total energy required to complete any
computational task, and/or to operate any service. The following basic
operating conceptions are illustrative, and may be helpful to a more
detailed understanding.
- "Performance-only"
The system does not perform any dynamic power management. It operates as
systems software has traditionally, using a statically defined set of
resources whose performance levels and available capacity are not
adjusted according to the workload's requirements while the system runs.
- Energy-efficient at maximum performance (elsewhere called "Adaptive
Performance")
The system does perform dynamic power management, but must still achieve
the maximum possible performance for sustained workloads. Energy is saved
where possible by dynamically adjusting the platform's provisioning and
resource performance levels, but only for those hardware assets that do
not improve the performance of what is running. The simplest way to think
of this is that the system will eliminate any gratuitous
over-provisioning: all resources which do not affect performance of the
current workload are appropriately adjusted.
The key desideratum for this choice is that there should be no practical
performance difference between workloads run in this way and the same
workloads run with no power management at all (i.e. when the system's
power management function is disabled). In practice this may mean that we
designate a certain small "principled amount" of allowed performance
regression as assessed under certain standard benchmarks.
- Energy-efficient with tolerated performance regression (elsewhere
called "Elastic")
The general case of energy-efficiency is that in which the constraint of
maximum possible performance for the workload can be relaxed. The
system's objective is still to minimize the total energy to perform any
computational task run on the platform.
While the best achievable performance is not required, some bounds are
established in order to limit the degradation of performance and/or
responsiveness to changing load.
The system may adjust provisioning dynamically so long as it can still
increase provisioning to full capacity, within a specified
responsiveness bound, when transient load requires it.
- Power-constrained or Energy-constrained operation (elsewhere called
"Power-saving")
This operating constraint applies when there is a practical limit on
power (rate of energy delivery) or a limited [reserve] supply of energy
available. System capacity must be reduced to remain within these limits.
In these cases, the system is prepared to degrade the quality of one or
more services, and/or to reduce resourcing in ways that may reduce their
service level. In addition, it may decide that certain tasks or services
are not to be run at all, in favor of others that are deemed to be more
critical.
The system might choose to reduce capacity and use that more limited
resource in such a way that all tasks are affected equally (in their
performance or responsiveness to changing demand).
In the ideal, different tasks or services running on the system might be
degraded non-uniformly, with the objective of keeping critical services
running at required throughput, either to stay within available power
limits or so that an energy reserve can be made to last for an
appropriate duration.
One usage case for this operating condition is a limited energy reserve,
such as for mobile systems whilst operating on battery power, or for
tethered systems that find themselves running on a UPS, battery backup,
or generator backup power source due to a power outage.
Another usage case is when instantaneous *power* availability is limited,
such as in a power utility brownout or any other power distribution
situation that might cause this.
In each of these cases, load must be shed and/or service quality reduced
to stay within these limits.
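As an illustration only of the non-uniform degradation described for the
power-constrained case, the following sketch sheds the least critical
services first until the estimated draw fits under a power cap. The names
(Service, shed_load), the criticality ranking and the wattage figures are
all hypothetical; this case does not specify any such interface.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    criticality: int   # hypothetical ranking: higher means more critical
    watts: float       # assumed estimate of power attributable to the service


def shed_load(services, power_cap_watts):
    """Return (kept, shed): drop least-critical services until the cap is met."""
    kept = sorted(services, key=lambda s: s.criticality, reverse=True)
    shed = []
    while kept and sum(s.watts for s in kept) > power_cap_watts:
        shed.append(kept.pop())   # last element is the least critical
    return kept, shed


services = [
    Service("database", criticality=3, watts=120.0),
    Service("web", criticality=2, watts=80.0),
    Service("batch-report", criticality=1, watts=100.0),
]
kept, shed = shed_load(services, power_cap_watts=220.0)
print([s.name for s in kept], [s.name for s in shed])
```

A real policy could instead lower service quality gradually rather than
stopping services outright; whole-service shedding is used here only
because it is the simplest case to show.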
4.1. Details
This umbrella case provides context and scope for the Program. Specific
design is to be provided in the projects under it.
A number of projects are expected. The following provides a partial
outline:
1. The system's power management facility will be implemented as a
Solaris service, with new high-level administrative controls expressed
in SMF (the Solaris Service Management Facility). These controls will
be abstracted from the hardware platform and instruction-set
architecture. The primary method of administration is SMF.
2. Aspects of the system's earlier power management facility which are
obsolete or inadequate to address the above-described dynamic power
management solution will be removed. This includes a number of low-level
implementation-specific controls (hardware platform and/or ISA-specific)
which were exposed earlier.
3. The usability of the service will be improved. Both a command-line and
a programmatic interface to the PM service and its facilities will be
provided. Appropriate exposure of the administrative controls to the
service is another consideration. An appropriate means to access aspects
of the service's SMF description will be provided.
4. An improved framework for resource-centric power management will be
provided. The initial work done with the power-aware dispatcher (as that
relates to the CPU resources) will be used as the example to extend
similar dynamic power management capabilities to other hardware devices
on the platform.
5. A new device driver interface for power-relevant operation will be
introduced, so that devices can describe their PM-relevant capabilities
to the system: for example, the various power and performance states
they offer, their ability to perform software actions such as
suspend/resume, and the interfaces required to operate those controls.
6. Development practices for power management and energy-aware modules
(with particular attention to device drivers) within the Solaris OS will
be defined and codified.
7. Observability and debuggability of the system's power-aware and
energy-efficient facilities and their operation will be improved. DTrace
probes seem one likely avenue.
8. The system's Suspend/Resume capability will be expanded to encompass a
broader range of system-level Power states. The suspend/resume facility
will be improved to encompass and unify a more complete range of system
suspend types (from power-on suspend, best-available-suspend,
suspend-to-RAM, suspend-to-disk [non-volatile storage], hybrid-suspend,
to soft-off).
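To make item 5 of the outline more concrete, here is a minimal sketch of
how a device might describe its power states so that the system can choose
among them. It is not the proposed driver interface, which is left to the
follow-on projects; the type names, fields and figures below are
assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PowerState:
    name: str
    watts: float            # assumed typical draw in this state
    wake_latency_us: int    # assumed time to return to full service


@dataclass
class DeviceCapabilities:
    device: str
    states: list = field(default_factory=list)
    supports_suspend: bool = False

    def lowest_power_state(self, max_latency_us):
        """Pick the lowest-power state whose wake latency fits the budget."""
        eligible = [s for s in self.states
                    if s.wake_latency_us <= max_latency_us]
        return min(eligible, key=lambda s: s.watts) if eligible else None


# Hypothetical device description; the numbers are invented.
disk = DeviceCapabilities(
    device="disk0",
    states=[
        PowerState("active", watts=8.0, wake_latency_us=0),
        PowerState("idle", watts=5.0, wake_latency_us=1_000),
        PowerState("standby", watts=1.0, wake_latency_us=5_000_000),
    ],
    supports_suspend=True,
)
print(disk.lowest_power_state(max_latency_us=10_000).name)
```

The latency budget stands in for the administrator's bound on
responsiveness: a tighter budget rules out deeper states even though they
would save more energy.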
4.2. Bug/RFE Number(s):
N/A
4.3. In Scope:
The framework and system implementation needed to offer a
power-management service that operates according to the aforementioned
approach to energy-efficiency.
This encompasses power management facilities both for active
(while-running) and inactive (non-running) operation of the platform or
its individual components.
4.4. Out of Scope:
Near-term, we will give much less attention to non-server systems
(desktops and laptops).
4.5. Interfaces:
Interfaces will be specified on a per-project basis.
4.6. Doc Impact:
Manual pages, developer docs and administration guides will be impacted
by the individual projects on this roadmap.
4.7. Admin/Config Impact:
The default installation will enable the system to perform dynamic power
management, and its default configuration shall be to perform
energy-efficient operation described by the "adaptive performance"
conception above.
The administrator may configure the system to perform energy-efficient
operation under more relaxed performance constraints.
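The defaulting behavior described in this section could be modeled roughly
as follows. The four mode names come from the operating conceptions in
section 4; the PowerConfig type, its max_perf_loss_pct field and the
validation rule are hypothetical, and the real controls are to be defined
by the SMF power service project.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    PERFORMANCE_ONLY = "performance-only"
    ADAPTIVE_PERFORMANCE = "adaptive-performance"
    ELASTIC = "elastic"
    POWER_SAVING = "power-saving"


@dataclass
class PowerConfig:
    # Default follows section 4.7: adaptive performance out of the box.
    mode: Mode = Mode.ADAPTIVE_PERFORMANCE
    # Hypothetical administrator bound on tolerated performance loss.
    max_perf_loss_pct: Optional[float] = None

    def validate(self):
        """Reject an elastic configuration that has no stated bound."""
        if self.mode is Mode.ELASTIC and self.max_perf_loss_pct is None:
            raise ValueError("elastic mode needs a performance-loss bound")
        return self


default = PowerConfig().validate()
relaxed = PowerConfig(Mode.ELASTIC, max_perf_loss_pct=10.0).validate()
print(default.mode.value, relaxed.mode.value)
```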
4.8. HA Impact:
More rapid availability (activation) of non-provisioned (idle/suspended)
resources is one expected outcome of this program.
4.9. I18N/L10N Impact:
Limited to the addition of tens of messages as exposed by the new SMF
power service.
4.10. Packaging & Delivery:
Part of the core OS facilities delivered in the Solaris OS/Net
consolidation.
4.11. Security Impact:
This project introduces system-level facilities that require an
appropriate level of authorization to configure and/or enact, but this
is not in any way extraordinary.
Configuration and enactment of power-management actions will be
auditable.
4.12. Dependencies:
The capacity and utilization abstractions presently underlying the
implementation of the Power-aware dispatcher, and the energy-efficiency
heuristic it presently uses, are things this Program expects to sustain.
5. Reference Documents:
5.1. Design/Specification documents
To be provided by each individual project under this umbrella.
5.2. Related documents
Brown, David J. and Charles Reams, "Toward Energy-efficient Computing,"
Communications of the ACM, Vol. 53, No. 3, pp. 50-58, March 2010,
http://cacm.acm.org/magazines/2010/3/76284-toward-energy-efficient-computing/fulltext
Saxe, Eric, "Power-efficient Software," Communications of the ACM,
Vol. 53, No. 2, pp. 44-48, Feb 2010,
http://cacm.acm.org/magazines/2010/2/69355-power-efficient-software/fulltext
Power management community on OpenSolaris:
http://hub.opensolaris.org/bin/view/Community+Group+pm/
Recent PM-related ARC cases
PSARC 2009/396 Tickless Kernel Architecture / lbolt decoupling
PSARC 2009/289 FBDIMM Idle Power Enhancement (FIPE) driver
PSARC 2009/283 Default enabling of CPU power management in S10U8 for x86 systems
PSARC 2009/112 sys-suspend(1)
PSARC 2009/101 Turbo mode observability
PSARC 2009/086 PowerTOP --cpu option
PSARC 2008/777 cpupm keyword mode extensions
PSARC 2008/742 SDcard Framework Suspend & Resume
PSARC 2008/376 PowerTOP for OpenSolaris
PSARC 2008/291 Power Management Core-disable for n2/vf CPUs
PSARC 2008/091 Libtopo enumeration of fans and power supplies via IPMI
PSARC 2008/021 HAL Power Management Support
PSARC 2007/679 CPUFreq HAL
PSARC 2006/273 Rage XL Framebuffer Driver
PSARC 2006/132 Wake On LAN
PSARC 2005/469 X86 Energy Star compliance
6. Resources and Schedule:
6.1. Projected Availability:
CY2010Q3 Suspend-to-disk
CY2010Q4 Power service (initial PM 2.0) SMF facility
CY2010Q4 libpower - C language bindings (programmatic interface)
CY2010Q4 Suspend/Resume for initial reference set of x64 volume
server products (e.g. x4170, x4270, x4275, Lynx+)
CY2011Q4 Energy-star compliance for certain volume servers
6.2. Cost of Effort:
To be defined by the follow-on projects.
6.3. Cost of Capital Resources:
Existing lab clients and servers will be used.
6.4. Product Approval Committee requested information:
6.4.1. Consolidation or Component Name:
ON
6.4.3. Type of CPT Review and Approval expected:
Standard
6.4.4. Project Boundary Conditions:
To be defined by the follow-on projects.
6.4.5. Is this a necessary project for OEM agreements:
No
6.4.6. Notes:
// See dependencies section above.
6.4.7. Target RTI Date/Release:
To be defined by the follow-on projects.
6.4.8. Target Code Design Review Date:
To be defined by the follow-on projects.
6.4.9. Update approval addition:
N/A
6.5. ARC review type:
Standard
6.6. ARC Exposure:
open
6.6.1. Rationale:
N/A
7. Prototype Availability:
7.1. Prototype Availability:
To be defined by the follow-on projects.
7.2. Prototype Cost:
To be defined by the follow-on projects.