Hi Aubrey,

+1. This looks great to me. We need it for many projects, to allow
the admin to specify that the system should run more energy-efficiently.

We discussed this with the power architects, and they agree it looks good.
One suggestion was to change "power" to "energy". For example,
they would like to see the "power-bias" name changed to "energy-bias",
and the policy called something like "system energy policy".
That would help identify that this knob is for system *energy* efficiency.

As a side note, saving energy is different from saving power.
Power Capping (an externally imposed policy) takes effect when
power has an increased cost or increased marginal cost, for example
when (1) the power grid fails or is over-budgeted, or (2) the server
room has exceeded its cooling capacity. Power Capping is a different
mechanism and has higher precedence than this system-level energy
efficiency policy.

I think I covered the points the pm architects were concerned with?
Sarito or Julia etc can comment if I left something out. :-)

Regards,
Bill


On 04/01/10 01:43, Li, Aubrey wrote:
Randy Fishel wrote:
 This might be a bit contentious: not only is there an effort to
migrate the configuration to SMF, there is also a consideration to define
something similar to system-pm-policy.  On the other hand, the
architecture is also lacking, and there doesn't seem to be much momentum
in providing it.

 I am also leaving for vacation on Friday morning.  I will take a
printout with me in hopes of maybe reviewing it over the next week.
It may also give others the opportunity to see how this might fit into
the "new" architecture.

 Cheers!

      ---- Randy

This was intended as cpu-pm-policy, a mechanism that gives the user a knob
to tune, on the fly, the pm policy introduced by the Intel Energy_Perf_Bias
feature. Currently Energy_Perf_Bias is set to performance bias by default,
which means the power control unit in the processor drives the processor
to peak performance at any energy cost. This feature can, for example,
throttle the turbo performance boost when the MSR is set to power bias. The
trend in silicon design is to do more and more in hardware, so in the near
future package/core C-state auto promotion and demotion, QPI link state,
DRAM refresh, etc. will all accept the hint from this feature.

Beyond this, for the CPU we have no option to let the processor run at
the lowest frequency, or to always enter the deepest supported idle state
when idle. The CMT_COALESCE dispatching policy is disabled in the kernel
because it hurts peak performance. But this policy helps to group the
utilization onto one package, or even one core where possible. If we could
group the utilization onto one package when load is light, the other
packages could sleep longer and deeper, and hence save more energy. These
cases should provide the momentum: prolonging battery life, or saving
energy on a server outside its rush hour.

Besides the CPU, memory and other devices are in the same situation. In the
current kernel, the memory power management driver FIPE has the default
policy setting
                fipe_pm_policy = FIPE_PM_POLICY_BALANCE
From the source, I think the FIPE_PM_POLICY_POWERSAVE policy could save more
power. Sooner or later, DDR3 could have the same requirement if we implement
power management for it.

Recently, while doing a power characterization analysis, I found that the
USB EHCI driver is not friendly to idle power. The EHCI driver keeps polling
and makes the host controller issue DMA read and write operations when there
are no USB-related operations, or even when no USB device is connected. This
problem limits the package C-state and creates a big gap between Solaris and
other OSes. This might not depend on the power/perf profile, but a profile
could make the solution easy.

I believe there are a few other cases I missed that give more momentum to
introducing a user profile for power/performance bias. :)

Thanks,
-Aubrey
On Thu, 1 Apr 2010, Li, Aubrey wrote:

Just want to move this work forward. Here is a PSARC onepager; any input
is really appreciated!

Thanks,
-Aubrey

======== system-pm-policy_onepager_v1.txt =================================
Template Version: @(#)onepager.txt 1.35 07/11/07 SMI

1. Introduction
   1.1. Project/Component Working Name:
        system-pm-policy keyword

   1.2. Name of Document Author/Supplier:
        Author: Aubrey Li <[email protected]>

   1.3. Date of This Document:
        April 28, 2010

2. Project Summary
   2.1. Project Description:
        Solaris support for the system-pm-policy keyword in power.conf(4).
        A mechanism is desired to set the system-wide power/performance bias.
   2.2. Risks and Assumptions:
        Very few customers will use this keyword. Most customers will want
        the balanced power/performance policy to be the default.

4. Technical Description:
    4.1. Details:

        pmconfig(1M) parses /etc/power.conf. If the system-pm-policy keyword
        is in power.conf(4), pmconfig passes the user-preferred policy to the
        kernel through pm_ioctl() with the command PM_SET_SYSTEM_POLICY.
        pm_ioctl() then calls pm_set_system_policy() to set the global policy
        variable and calls the power-manageable modules to pass the policy
        down.

        Currently pm_set_system_policy() only sets the CPU power management
        policy; it could set the power management policy for memory and other
        devices in future. The CPU pm policy setting is machine specific.

        The CPU has a few power management features, such as C-states,
        P-states, energy performance bias, etc. Every CPU pm feature that
        wants to inherit the system-pm-policy registers its callback function
        on a list. When pmconfig passes the policy to the kernel, the kernel
        walks the list and calls each callback function, and hence sets the
        user-preferred policy in the different modules.

        /etc/power.conf may have [system-pm-policy <value>]
          |
          v
        pmconfig
          |
          v
        pm_ioctl(PM_SET_SYSTEM_POLICY, policy)
          |
          v
        pm_set_system_policy(policy)
          |
          ----> CPU pm policy callback
          |     |
          |     ----> registered CPU pm feature 1 callback(ENERGY_PERF_BIAS)
          |     |
          |     ----> ...
          |
          ----> Memory pm policy callback in future
          |
          ----> ...
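
The register-then-walk flow in the diagram can be sketched in plain C. This is only an illustration under stated assumptions: the enum values, the fixed-size table, pm_policy_register() and the example callback are hypothetical names made up here, the function signatures are guesses, and the real proposal uses the existing kernel callb mechanism rather than an array. Locking is omitted.

```c
#include <stddef.h>

/* Hypothetical policy values; the one-pager names the policies, not an enum. */
typedef enum {
	PM_POLICY_PERF_BIAS,
	PM_POLICY_BALANCED,
	PM_POLICY_POWER_BIAS
} pm_policy_t;

/* Each power-manageable feature supplies one callback of this shape. */
typedef void (*pm_policy_cb_t)(pm_policy_t policy, void *arg);

#define	PM_POLICY_MAX_CB	8	/* arbitrary size for the sketch */

static struct {
	pm_policy_cb_t	func;
	void		*arg;
} pm_policy_cbs[PM_POLICY_MAX_CB];
static size_t pm_policy_ncbs;
static pm_policy_t pm_system_policy = PM_POLICY_BALANCED;

/* Register a feature callback; 0 on success, -1 if rejected or table full. */
int
pm_policy_register(pm_policy_cb_t func, void *arg)
{
	if (func == NULL || pm_policy_ncbs == PM_POLICY_MAX_CB)
		return (-1);
	pm_policy_cbs[pm_policy_ncbs].func = func;
	pm_policy_cbs[pm_policy_ncbs].arg = arg;
	pm_policy_ncbs++;
	return (0);
}

/*
 * Record the global policy, then hand it to every registered feature.
 * Returns the number of callbacks that were invoked.
 */
size_t
pm_set_system_policy(pm_policy_t policy)
{
	size_t i;

	pm_system_policy = policy;
	for (i = 0; i < pm_policy_ncbs; i++)
		pm_policy_cbs[i].func(policy, pm_policy_cbs[i].arg);
	return (pm_policy_ncbs);
}

/* Read back the global policy variable. */
pm_policy_t
pm_get_system_policy(void)
{
	return (pm_system_policy);
}

/* Example feature callback: just remembers the policy it was handed. */
void
epb_policy_cb(pm_policy_t policy, void *arg)
{
	*(pm_policy_t *)arg = policy;
}
```

A feature such as energy performance bias would register once at attach time; every later policy change then reaches it without pmconfig knowing which features exist.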


        The balanced power/performance policy will be set by default, which
        keeps the current out-of-box setting unchanged. A system with extreme
        performance requirements could disable the power management features
        with the performance bias setting. If a laptop runs on battery, or a
        system at low utilization prefers power savings over performance,
        system-pm-policy could be set to power bias to save more power; this
        could lead to the lowest CPU clock and always the deepest idle state.

        Different power-manageable devices could inherit the system-wide
        policy completely, or they can maintain a specific pm policy of their
        own, but the system-wide policy must carry the biggest weight
        coefficient in their own mechanism.


    4.2. Bug/RFE Number(s): xxxxxxx

    4.5. Interfaces:
        This project will import these existing interfaces.
        Interface stability will be "committed".

        Import:
                power.conf(4) (PSARC/1992/202)
                pmconfig(1M)

        Export:
                system-pm-policy

        system-pm-policy keyword.
        A system-pm-policy entry can be added to power.conf(4) to set the
        system-wide power policy. If this entry is present and set to default,
        or if it is not present, then the default balanced policy is used,
        which keeps the current behavior unchanged. The other options tune
        the policy toward power bias or performance bias.

        power.conf(4) man page addition:

        A system-pm-policy entry may be used to set the system-wide power
        policy. The format of the system-pm-policy entry is:
        system-pm-policy policy

     Acceptable policy values are:

     default    Power performance balanced policy.

     perf-bias  The system drives to maximum performance at any energy
                cost.

     balanced   Balanced performance vs. power and energy.

     power-bias Maximum energy efficiency.

     absent     If the system-pm-policy keyword is absent from
                power.conf(4), the behavior is the same as the default
                case.
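
As an illustration of the value table above, here is a minimal sketch of how the keyword's value might be mapped to a policy. The enum, the function name sys_pm_policy_parse() and its return convention are hypothetical and are not taken from pmconfig; a NULL value is used here to model the keyword being absent.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical policy enum for the sketch. */
typedef enum {
	SYS_PM_POLICY_BALANCED,
	SYS_PM_POLICY_PERF_BIAS,
	SYS_PM_POLICY_POWER_BIAS
} sys_pm_policy_t;

/*
 * Map the value of a system-pm-policy entry to a policy.
 * value == NULL models an absent keyword, which behaves like "default".
 * Returns 0 on success, -1 for an unrecognized value (*out is untouched).
 */
int
sys_pm_policy_parse(const char *value, sys_pm_policy_t *out)
{
	static const struct {
		const char	*name;
		sys_pm_policy_t	policy;
	} table[] = {
		{ "default",	SYS_PM_POLICY_BALANCED },
		{ "balanced",	SYS_PM_POLICY_BALANCED },
		{ "perf-bias",	SYS_PM_POLICY_PERF_BIAS },
		{ "power-bias",	SYS_PM_POLICY_POWER_BIAS }
	};
	size_t i;

	if (value == NULL) {
		*out = SYS_PM_POLICY_BALANCED;
		return (0);
	}
	for (i = 0; i < sizeof (table) / sizeof (table[0]); i++) {
		if (strcmp(value, table[i].name) == 0) {
			*out = table[i].policy;
			return (0);
		}
	}
	return (-1);
}
```

With this mapping, a power.conf(4) line reading "system-pm-policy power-bias" selects the power-bias policy, while "default", "balanced" and a missing entry all select the balanced policy.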

    4.6. Doc Impact:
        power.conf man page.  See above.

    4.7. Admin/Config Impact:
        Administrators can use this option to match different
        power/performance requirements.

    4.8. HA Impact: None.

    4.9. I18N/L10N Impact: No.

    4.10. Packaging & Delivery:
        This change will be delivered as part of the Deep C-State RFE.
        These changes will be made at the same time:
                kernel package
                power.conf package
                pmconfig package

    4.11. Security Impact: None.

    4.12. Dependencies: power.conf, pmconfig(1M)

6. Resources and Schedule:
   6.1. Projected Availability: April 2010

   6.4. Product Approval Committee requested information:
        6.4.1. Consolidation C-team Name:
                ON
   6.5. ARC review type: FastTrack
   6.6. ARC Exposure:   open

7. Prototype Availability:
   7.1. Prototype Availability:
        Prototype available on OpenSolaris in April 2010.

===================================================================================
Li, Aubrey wrote:
Hi Bill,

Here I made a change to propose system-wide policy support.
http://cr.opensolaris.org/~aubrey/sys_pm_policy_v1/
The user profile from /etc/power.conf is still passed to the kernel
through pm_ioctl, which then calls pm_set_system_policy(). Currently only
the cpu pm policy is set there; if memory or other devices need a bias as
well, they can also be added to that function.
The cpu pm policy implementation has a minor change against the last
webrev: the mcpu_pm_policy pointer has been moved from machcpu to the
mcpu_pm_mach_state structure, according to your suggestion.

Any comments and suggestions are highly appreciated.

Thanks,
-Aubrey

Li, Aubrey wrote:
It looks like memory PM needs such a bias as well. So I'd like to change
the proposal to use the keyword "sys-pm-policy" instead. The mechanism
will use the existing callb implementation to pass the user policy from
/etc/power.conf to the kernel, and walk the list of registered modules,
calling each module's hook function to set its pm policy individually.

I'm not sure whether any other device driver needs, or would be happy with,
this proposal. It would be great if device driver developers could share
some thoughts here.

Thanks,
-Aubrey

Julia.Harper wrote:
I assume that this knob (profile), when turned way down, would basically
put the system into "power savings" mode -- where the set of power states
is restricted. That is, no matter how long the utilization level demands
more power, the highest power states (for the cpus, memory, whatever) will
never be entered.  We should probably use terminology that makes this clear.

-- jdh


Liu, Jiang wrote:
I prefer the solution of introducing a global power profile for all
devices. Currently we need such a profile for CPUPM. In future, when
supporting memory power management, we may need a similar profile for
memory PM. And users won't like two variables/profiles for the same
objective.

Li, Aubrey <> wrote:
Bill Holler wrote:
Hi,

I forgot to mention that cpu_pm_policy is just a policy.
There is no guarantee it maps to a specific MSR or hardware
implementation.
Yes, I would like to propose a new option for CPU power management
policy. This policy is a CPU bias between performance and power; future
CPU power management enhancement work can be based on this policy.
- The default policy should keep the current "out of the box" behavior
  unchanged; we'll try to save more power without hurting performance.
- More power management features are coming on future processors, like
  ENERGY_PERFORMANCE_BIAS. We can register these new features under the
  policy framework and offer the user a knob to change these settings on
  the fly.
- Laptop users who want longer battery life, less heat and less fan noise
  may want the system to work in some edge situations. For example,
  currently the CPU can run at the highest clock if cpupm is disabled,
  but there is no way to make the CPU always run at the lowest clock.
  Similarly, always entering the deepest c-state is another way to save
  more power. What's more, a power-aware dispatcher could be more
  flexible in picking a CPU and dispatching threads if there is a policy
  indicator.
- Some users don't care about power. Yes, we already have options that
  let them set ENERGY_PERFORMANCE_BIAS to performance bias, turn off
  c-states/p-states, and so on. But it's friendlier to the user to change
  just one option.
Here, the policy only focuses on the CPU. If you think we should have a
policy for memory, for devices, or a system-wide policy, let's do this.
cpu_pm_policy can be one part of a system-wide policy.
If nobody has thoughts on it, I'll continue to prepare a PSARC file
to add the cpu_pm_policy keyword.

For example Solaris could be dynamically setting the
ENERGY_PERFORMANCE_BIAS register to different settings depending on
things such as system-load,
Yes, such settings can be changed dynamically if we see the benefit.

the priority of the application being scheduled, a power policy of
the application,
Making threads power aware needs another bunch of interfaces, I
think. For example, cmt_balance() could choose a different processor
group according to the perf/power bias of the thread.

or power policy of the zone.
Zone policy is an interesting topic. Different zones could have
different CPU resources, or share the global CPU resource; different
zones could have different power policies, or inherit the global
cpu_pm_policy setting. There can be many virtual containers, but the
hardware resource is unique. I think this can be enhanced in zone
management, which will not be covered in my proposal. :)

Thanks,
-Aubrey

Regards,
Bill


On 03/03/10 16:21, Bill Holler wrote:
+1.

Hi Aubrey,

I also think it is time to move forward with this proposal.
Generally we want the system to work best "out of the box"
with no tuning.  On the other hand, vendors will keep improving
products with new features, and there will always be some specific
applications where custom settings may be better.  I feel this
proposal supports innovation and application-specific customization
in line with the OpenSolaris community goals.

This proposal applies to all types of CPUs.  It uses "cpu_pm_policy"
instead of, for example, mentioning a specific CPU's MSR.  ;-)  This
proposal will be useful with other CPUs if/when they have hardware
mechanisms for tuning power / performance.


In the ARC case we want to mention that there could be a policy
conflict between this component setting and a system-power-policy,
external Power Capping, etc. Generally we want users to use the
default or a higher-level policy such as the system power policy.
Unfortunately the system power policy may not be fine-grained or
diverse enough for some applications to specify cpu power policy.
In that case cpu_pm_policy will be useful.  My thought is: the user
must really know what they want if they specify a component policy
such as cpu_pm_policy instead of just using the system power
policy.  For that reason I feel cpu_pm_policy should override the
system-power-policy at the cpupm level.

Power Capping is different.  Power Capping is an external policy.
It is currently "owned" by the SP external to the OS.  Power Capping
should override a local cpu_pm_policy.
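
The precedence described here (Power Capping over cpu_pm_policy over the system power policy) can be written down as a small resolver. This is only a sketch: all type and function names are hypothetical, and modelling an active cap as "force power bias" is a deliberate simplification, since a real cap imposes a power limit rather than a bias.

```c
#include <stdbool.h>

/* Hypothetical bias values for the sketch. */
typedef enum {
	PM_BIAS_PERF,
	PM_BIAS_BALANCED,
	PM_BIAS_POWER
} pm_bias_t;

/* The three policy sources discussed above (field names are made up). */
typedef struct {
	bool		cap_active;	/* external Power Capping in force */
	bool		cpu_policy_set;	/* admin gave an explicit cpu_pm_policy */
	pm_bias_t	cpu_policy;	/* valid only if cpu_policy_set */
	pm_bias_t	system_policy;	/* system power policy, always valid */
} pm_inputs_t;

/*
 * Resolve the bias that cpupm should act on:
 *   1. an active external cap wins (assumption: modelled as power bias),
 *   2. otherwise an explicit component policy overrides the system policy,
 *   3. otherwise the system policy applies.
 */
pm_bias_t
cpupm_effective_bias(const pm_inputs_t *in)
{
	if (in->cap_active)
		return (PM_BIAS_POWER);
	if (in->cpu_policy_set)
		return (in->cpu_policy);
	return (in->system_policy);
}
```

Keeping the resolution in one function means the override order is stated in exactly one place, which is the point the ARC case would need to document.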


Implementation comments:
IMHO the mcpu_pm_policy pointer should be in the mcpu_pm_mach_state
structure instead of in the machcpu.
We may want to allow the user to specify a number instead of just
Perf, Balanced, Power, Default?

Regards,
Bill


On 02/20/10 18:43, Li, Aubrey wrote:
Hi Bill,

I think it's time to continue this proposal, since b134 is closed
and the build is no longer restricted. The power/perf bias setting is a
starting point for future power-related work. I'll prepare a PSARC
file for the new option if this is acceptable. No is also a good
answer, given a good reason.

Thanks,
-Aubrey


Bill.Holler Wrote:

Hi,

This proposal is for a mechanism to set the new MSR
IA32_ENERGY_PERF_BIAS_MSR.   This is a new hardware
feature.  The MSR affects overall power/performance.
It gives a hint to the processor & package for desired
power/performance characteristics.  It is related to p-states and
c-states (and may affect these features), but this feature can
have other socket/system-level effects as well.
The programmers guides do not go into detail about what the other
effects can be.  :-(

The perf and power impact of this MSR is model specific.
It's able to throttle turbo on WSM and will probably help the hardware
make more decisions in future. For example, when a short interrupt
storm is detected, it can demote a CC6 request to CC3.
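
For readers unfamiliar with the register, here is a minimal sketch of how a policy could be turned into an MSR value. Per Intel's manuals, support is enumerated by CPUID.06H:ECX bit 3, the MSR is at address 0x1B0, and bits 3:0 hold the hint (0 = highest performance, 15 = maximum energy saving). The enum, the function names and the choice of 6 as the "balanced" hint are assumptions made for illustration; the code only computes values and does not execute rdmsr/wrmsr.

```c
#include <stdint.h>

#define	MSR_IA32_ENERGY_PERF_BIAS	0x1B0u	/* address, shown for reference */
#define	EPB_HINT_MASK			0xFull	/* bits 3:0 hold the hint */
#define	CPUID6_ECX_EPB			(1u << 3) /* CPUID.06H:ECX[3] */

/* Hypothetical policy enum for the sketch. */
typedef enum {
	CPU_PM_POLICY_PERF_BIAS,
	CPU_PM_POLICY_BALANCED,
	CPU_PM_POLICY_POWER_BIAS
} cpu_pm_policy_t;

/* Non-zero if the ECX value from CPUID leaf 6 advertises the MSR. */
int
epb_supported(uint32_t cpuid6_ecx)
{
	return ((cpuid6_ecx & CPUID6_ECX_EPB) != 0);
}

/* Map a policy to a 4-bit hint; 6 for "balanced" is an illustrative choice. */
uint64_t
epb_hint_for_policy(cpu_pm_policy_t policy)
{
	switch (policy) {
	case CPU_PM_POLICY_PERF_BIAS:
		return (0);
	case CPU_PM_POLICY_POWER_BIAS:
		return (15);
	case CPU_PM_POLICY_BALANCED:
	default:
		return (6);
	}
}

/* New MSR contents: replace the hint field, keep all other bits as read. */
uint64_t
epb_msr_value(uint64_t old_msr, cpu_pm_policy_t policy)
{
	return ((old_msr & ~EPB_HINT_MASK) | epb_hint_for_policy(policy));
}
```

In a kernel, the value returned by epb_msr_value() would be written back on each CPU after reading the current contents; how the hint is acted on remains model specific, as noted above.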


On 11/05/09 05:15, minskey guo wrote:

Jedy Wang wrote:

Hi Li,

As far as I know, gnome-power-manager has removed the support
for changing the governor, which is the same as a profile, I think.
I remember someone wrote a blog explaining the reason, but I
cannot find it now. I wonder what makes us still need to implement
this feature.

In the Linux world, there is the ondemand governor in the kernel. It
sets the cpu frequency according to the cpu's current load. So some
people consider that everybody should use that governor, and let CPUs
finish their jobs asap and then enter C-states for power saving.
Compared to P-states, C-states do save more power. That's why gnome
removed it.

This is also model specific and depends on whether the relationship
between frequency, voltage and power is linear. That's true on the
latest processors but not on earlier processors.

I'm not sure why gnome removed it, but it doesn't seem like a good idea
to me. Some users want max perf and others want longer battery life.
Yes, a good p-state + c-state implementation is not easy to tune
for more power savings.  Running in lower p-states when a CPU is
busy burns more power due to shorter time in deeper C-states.
Entering deeper C-states too aggressively also burns more power
(on both an idle and busy system) due to unnecessary wakeup
latency.  ;-)  Without knowing the details, it seems likely that
the gnome-power-manager setting was removed because setting it made
worse decisions than a runtime prediction.
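
The P-state versus C-state trade-off above can be put into numbers with a simple energy model: energy = active power x busy time + idle power x idle time over a fixed window. The function below is a sketch of that arithmetic only; the function name is hypothetical and the wattages used in the note after it are invented for illustration, not measurements of any processor. Wakeup latency and transition costs are ignored.

```c
/*
 * Energy in joules spent over a window of window_s seconds when `work`
 * units of CPU work are executed at `rate` units per second (active_w
 * watts while busy) and the CPU then idles at idle_w watts for the rest
 * of the window. If the work does not fit, the CPU is busy throughout.
 */
double
window_energy_j(double work, double rate, double active_w, double idle_w,
    double window_s)
{
	double busy_s = work / rate;

	if (busy_s > window_s)
		busy_s = window_s;
	return (active_w * busy_s + idle_w * (window_s - busy_s));
}
```

With made-up numbers, one unit of work in a 10 s window and a 2 W deep idle state: at full speed and 30 W the cost is 30*1 + 2*9 = 48 J; at half speed and 20 W it is 20*2 + 2*8 = 56 J, so the lower P-state loses. If the half-speed state drew only 12 W, the result would be 12*2 + 2*8 = 40 J and the lower P-state would win, which is why the answer is model specific.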


Solaris currently has mechanisms to turn P-state and deeper
C-state support on/off.

A requirement is that the Energy Perf Bias MSR can be set on
systems not running a GUI.  We would like to support a possible
future Gnome interface to set this MSR if/when it exists.  The
proposal provides a mechanism that works on systems without Gnome.

Right, most servers do not run gnome. I don't expect gnome
support, but it would be great if it happens. :-)

IMHO, we should use this global cpu power policy setting instead
of "cpupm" and "cpu-deep-idle"; this is friendlier to the
user. Users just want more perf or more power savings; I think they
don't care whether the system supports p-states and c-states at the
same time. "cpupm" is a confusing name that covers only p-states; we
named it "cpupm" before we had deep idle support. Actually
cpu-deep-idle is also one part of cpu power management. :)

But some people don't care about power saving as much as other
factors. For example, if you are plagued by the noise of the CPU fan
and want it quiet, you can lower the cpu frequency, which results
in less heat, and then the fan can stop.

Personally, I vote +1 for this project if I could vote, but I don't
like the names "perf-bias" etc. :)


Besides, can somebody tell me where IA32_ENERGY_PERF_BIAS_MSR
comes from? Is it part of the IPS feature?

Intel's Software Developer's Manual volume 2A describes CPUID detection
of IA32_ENERGY_PERF_BIAS_MSR, and volume 3A describes the MSR.
http://www.intel.com/products/processor/manuals/
Sorry, I do not know what IPS stands for?

cough, cough, IPS is not a released feature and should not be
discussed here, ;p

Thanks,
-Aubrey


Regards,
Bill



-minskey




I remember we already support 2 profiles through gnome-power-manager
on Solaris. What's the difference between them?

I do not understand the exact meaning of perf-bias, balanced and
power-bias either. Doesn't perf-bias mean the cpu frequency will
always be at the highest level?

Regards,

Jedy
On Wed, 2009-11-04 at 08:47 +0800, Li, Aubrey wrote:


Hi,

When we enabled the Intel energy performance bias feature, we
found that a power profile implementation is necessary. Here I
did a draft for a cpu-level power policy.
http://cr.opensolaris.org/~aubrey/cpu_power_policy_v1/

The proposal adds a new keyword to /etc/power.conf,
"cpu-power-policy", and we have 4 options for this new keyword:
1) perf-bias
2) balanced
3) power-bias
4) default, the same as perf-bias.

/etc/power.conf accepts the user input, and the preferred policy is
passed to the kernel through ioctl. Then pm_ioctl calls the callback
to walk a cpu power policy list. Every cpu pm feature that wants to be
adjusted by this option, and is verified to be supported, registers
its callback function on the list, so that it can be called and
adjusted by pmconfig.
    --------------------------------------------------------
    /etc/power.conf
    |
    pm_ioctl(cpu_power_policy, policy)
    |
    cpu_power_policy_callb (policy)
    |
    ----> registered pm feature callback 1 (ENERGY_PERF_BIAS)
    |
    ----> registered pm feature callback 2
    ...
    --------------------------------------------------------
Currently, only the energy_perf_bias feature is registered,
because my intention is to support adjusting the energy_perf_bias
MSR without reboot. I guess we can probably add p/t/c-state support
later. When we add p/t/c-state support, my quick thought is that this
option will override the "cpupm" and "cpu-deep-idle" settings.

Any comments and suggestions are welcome.

Thanks,
-Aubrey
_______________________________________________
pm-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pm-discuss


_______________________________________________
pm-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pm-discuss



_______________________________________________
pm-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pm-discuss

_______________________________________________
pm-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pm-discuss

_______________________________________________
pm-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pm-discuss

_______________________________________________
pm-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pm-discuss
_______________________________________________
pm-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pm-discuss
_______________________________________________
tesla-dev mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/tesla-dev
Liu Jiang (Gerry)
OpenSolaris, OTC, SSG, Intel
--

---------------------
    Julia Harper, [email protected]