On 08/18/2014 09:24 PM, Nicolas Pitre wrote:
> On Mon, 11 Aug 2014, Preeti U Murthy wrote:
>
>> The goal of the power aware scheduling design is to integrate all
>> policy, metrics and averaging into the scheduler. Today the
>> cpu power management is fragmented and hence inconsistent.
>>
>> As a first step towards this integration, rid the cpuidle state management
>> of the governors. Retain only the cpuidle driver in the cpu idle
>> subsystem, which acts as an interface between the scheduler and the low
>> level platform specific cpuidle drivers. For all decision making around
>> selection of idle states, the cpuidle driver falls back to the scheduler.
>>
>> The current algorithm for idle state selection is the same as the logic used
>> by the menu governor. However, going ahead the heuristics will be tuned and
>> improved upon with metrics better known to the scheduler.
>
> I'd strongly suggest a different approach here. Instead of copying the
> menu governor code and tweaking it afterwards, it would be cleaner to
> literally start from scratch with a new governor. Said new governor
> would grow inside the scheduler with more design freedom instead of
> being strapped on the side.
>
> By copying existing code, the chance for cruft to remain for a long time
> is close to 100%. We already have one copy of it, let's keep it working
> and start afresh instead.
>
> By starting clean it is way easier to explain and justify additions to a
> new design than convincing ourselves about the removal of no longer
> needed pieces from a legacy design.
Ok. The reason I did it this way was that I did not find anything grossly
wrong in the current cpuidle governor algorithm. Of course it can be
improved, but I did not see strong reasons to wipe it away completely. I see
good scope to improve upon the existing algorithm with the additional
knowledge of *the idle states being mapped to scheduling domains*. That by
itself will give us a better algorithm and does not mandate significant
changes to the current one. So I really don't see why we need to start from
scratch.

The primary issue I found was that, with the goal being a power aware
scheduler, we must ensure that a governor can no longer register with
cpuidle to choose idle states. The reason is that there is just *one entity
that takes this decision, and there is no option about it*. This patch
intends to bring the focus to this specific detail.

Regards
Preeti U Murthy

>
>
>>
>> Note: cpu frequency management is still left disabled when
>> CONFIG_SCHED_POWER is selected.
>>
>> Signed-off-by: Preeti U Murthy <[email protected]>
>> ---
>>
>>  drivers/cpuidle/Kconfig           |   12 +
>>  drivers/cpuidle/cpuidle-powernv.c |    2 
>>  drivers/cpuidle/cpuidle.c         |   65 ++++-
>>  include/linux/sched.h             |    9 +
>>  kernel/sched/Makefile             |    1 
>>  kernel/sched/power.c              |  480 +++++++++++++++++++++++++++++++++++++
>>  6 files changed, 554 insertions(+), 15 deletions(-)
>>  create mode 100644 kernel/sched/power.c
>>
>> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
>> index 2c4ac79..4fa4cb1 100644
>> --- a/drivers/cpuidle/Kconfig
>> +++ b/drivers/cpuidle/Kconfig
>> @@ -3,16 +3,14 @@ menu "CPU Idle"
>>  config CPU_IDLE
>>          bool "CPU idle PM support"
>>          default y if ACPI || PPC_PSERIES
>> -        depends on !SCHED_POWER
>> -        select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
>> -        select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE)
>> +        select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE && !SCHED_POWER)
>> +        select CPU_IDLE_GOV_MENU if ((NO_HZ || NO_HZ_IDLE) && !SCHED_POWER)
>>          help
>>           CPU idle is a generic framework for supporting software-controlled
>>           idle processor power management. It includes modular cross-platform
>>           governors that can be swapped during runtime.
>>
>>           If you're using an ACPI-enabled platform, you should say Y here.
>> -         This feature will turn off if power aware scheduling is enabled.
>>
>>  if CPU_IDLE
>>
>> @@ -22,10 +20,16 @@ config CPU_IDLE_MULTIPLE_DRIVERS
>>  config CPU_IDLE_GOV_LADDER
>>          bool "Ladder governor (for periodic timer tick)"
>>          default y
>> +        depends on !SCHED_POWER
>> +        help
>> +         This feature will turn off if power aware scheduling is enabled.
>>
>>  config CPU_IDLE_GOV_MENU
>>          bool "Menu governor (for tickless system)"
>>          default y
>> +        depends on !SCHED_POWER
>> +        help
>> +         This feature will turn off if power aware scheduling is enabled.
>>
>>  menu "ARM CPU Idle Drivers"
>>          depends on ARM
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index fa79392..95ef533 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -70,7 +70,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,
>>          unsigned long new_lpcr;
>>
>>          if (powersave_nap < 2)
>> -                return;
>> +                return 0;
>>          if (unlikely(system_state < SYSTEM_RUNNING))
>>                  return index;
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index ee9df5e..38fb213 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -150,6 +150,19 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>>          return entered_state;
>>  }
>>
>> +#ifdef CONFIG_SCHED_POWER
>> +static int __cpuidle_select(struct cpuidle_driver *drv,
>> +                        struct cpuidle_device *dev)
>> +{
>> +        return cpuidle_sched_select(drv, dev);
>> +}
>> +#else
>> +static int __cpuidle_select(struct cpuidle_driver *drv,
>> +                        struct cpuidle_device *dev)
>> +{
>> +        return cpuidle_curr_governor->select(drv, dev);
>> +}
>> +#endif
>>  /**
>>   * cpuidle_select - ask the cpuidle framework to choose an idle state
>>   *
>> @@ -169,7 +182,7 @@ int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>>          if (unlikely(use_deepest_state))
>>                  return cpuidle_find_deepest_state(drv, dev);
>>
>> -        return cpuidle_curr_governor->select(drv, dev);
>> +        return __cpuidle_select(drv, dev);
>>  }
>>
>>  /**
>> @@ -190,6 +203,18 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>          return cpuidle_enter_state(dev, drv, index);
>>  }
>>
>> +#ifdef CONFIG_SCHED_POWER
>> +static void __cpuidle_reflect(struct cpuidle_device *dev, int index)
>> +{
>> +        cpuidle_sched_reflect(dev, index);
>> +}
>> +#else
>> +static void __cpuidle_reflect(struct cpuidle_device *dev, int index)
>> +{
>> +        if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
>> +                cpuidle_curr_governor->reflect(dev, index);
>> +}
>> +#endif
>>  /**
>>   * cpuidle_reflect - tell the underlying governor what was the state
>>   * we were in
>> @@ -200,8 +225,7 @@ int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>   */
>>  void cpuidle_reflect(struct cpuidle_device *dev, int index)
>>  {
>> -        if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
>> -                cpuidle_curr_governor->reflect(dev, index);
>> +        __cpuidle_reflect(dev, index);
>>  }
>>
>>  /**
>> @@ -265,6 +289,28 @@ void cpuidle_resume(void)
>>          mutex_unlock(&cpuidle_lock);
>>  }
>>
>> +#ifdef CONFIG_SCHED_POWER
>> +static int cpuidle_check_governor(struct cpuidle_driver *drv,
>> +                        struct cpuidle_device *dev, int enable)
>> +{
>> +        if (enable)
>> +                return cpuidle_sched_enable_device(drv, dev);
>> +        else
>> +                return 0;
>> +}
>> +#else
>> +static int cpuidle_check_governor(struct cpuidle_driver *drv,
>> +                        struct cpuidle_device *dev, int enable)
>> +{
>> +        if (!cpuidle_curr_governor)
>> +                return -EIO;
>> +
>> +        if (enable && cpuidle_curr_governor->enable)
>> +                return cpuidle_curr_governor->enable(drv, dev);
>> +        else if (cpuidle_curr_governor->disable)
>> +                cpuidle_curr_governor->disable(drv, dev);
>> +}
>> +#endif
>>  /**
>>   * cpuidle_enable_device - enables idle PM for a CPU
>>   * @dev: the CPU
>> @@ -285,7 +331,7 @@ int cpuidle_enable_device(struct cpuidle_device *dev)
>>
>>          drv = cpuidle_get_cpu_driver(dev);
>>
>> -        if (!drv || !cpuidle_curr_governor)
>> +        if (!drv)
>>                  return -EIO;
>>
>>          if (!dev->registered)
>> @@ -298,8 +344,8 @@ int cpuidle_enable_device(struct cpuidle_device *dev)
>>          if (ret)
>>                  return ret;
>>
>> -        if (cpuidle_curr_governor->enable &&
>> -            (ret = cpuidle_curr_governor->enable(drv, dev)))
>> +        ret = cpuidle_check_governor(drv, dev, 1);
>> +        if (ret)
>>                  goto fail_sysfs;
>>
>>          smp_wmb();
>> @@ -331,13 +377,12 @@ void cpuidle_disable_device(struct cpuidle_device *dev)
>>          if (!dev || !dev->enabled)
>>                  return;
>>
>> -        if (!drv || !cpuidle_curr_governor)
>> +        if (!drv)
>>                  return;
>> -
>> +
>>          dev->enabled = 0;
>>
>> -        if (cpuidle_curr_governor->disable)
>> -                cpuidle_curr_governor->disable(drv, dev);
>> +        cpuidle_check_governor(drv, dev, 0);
>>
>>          cpuidle_remove_device_sysfs(dev);
>>          enabled_devices--;
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 7c19d55..5dd99b5 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -26,6 +26,7 @@ struct sched_param {
>>  #include <linux/nodemask.h>
>>  #include <linux/mm_types.h>
>>  #include <linux/preempt_mask.h>
>> +#include <linux/cpuidle.h>
>>
>>  #include <asm/page.h>
>>  #include <asm/ptrace.h>
>> @@ -846,6 +847,14 @@ enum cpu_idle_type {
>>          CPU_MAX_IDLE_TYPES
>>  };
>>
>> +#ifdef CONFIG_SCHED_POWER
>> +extern void cpuidle_sched_reflect(struct cpuidle_device *dev, int index);
>> +extern int cpuidle_sched_select(struct cpuidle_driver *drv,
>> +                        struct cpuidle_device *dev);
>> +extern int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
>> +                        struct cpuidle_device *dev);
>> +#endif
>> +
>>  /*
>>   * Increase resolution of cpu_capacity calculations
>>   */
>> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
>> index ab32b7b..5b8e469 100644
>> --- a/kernel/sched/Makefile
>> +++ b/kernel/sched/Makefile
>> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
>>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
>>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
>> +obj-$(CONFIG_SCHED_POWER) += power.o
>> diff --git a/kernel/sched/power.c b/kernel/sched/power.c
>> new file mode 100644
>> index 0000000..63c9276
>> --- /dev/null
>> +++ b/kernel/sched/power.c
>> @@ -0,0 +1,480 @@
>> +/*
>> + * power.c - the power aware scheduler
>> + *
>> + * Author:
>> + *        Preeti U. Murthy <[email protected]>
>> + *
>> + * This code is a replica of drivers/cpuidle/governors/menu.c
>> + * To make the transition to power aware scheduler away from
>> + * the cpuidle governor model easy, we do exactly what the
>> + * governors do for now. Going ahead the heuristics will be
>> + * tuned and improved upon.
>> + *
>> + * This code is licenced under the GPL version 2 as described
>> + * in the COPYING file that acompanies the Linux Kernel.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/cpuidle.h>
>> +#include <linux/pm_qos.h>
>> +#include <linux/time.h>
>> +#include <linux/ktime.h>
>> +#include <linux/hrtimer.h>
>> +#include <linux/tick.h>
>> +#include <linux/sched.h>
>> +#include <linux/math64.h>
>> +#include <linux/module.h>
>> +
>> +/*
>> + * Please note when changing the tuning values:
>> + * If (MAX_INTERESTING-1) * RESOLUTION > UINT_MAX, the result of
>> + * a scaling operation multiplication may overflow on 32 bit platforms.
>> + * In that case, #define RESOLUTION as ULL to get 64 bit result:
>> + * #define RESOLUTION 1024ULL
>> + *
>> + * The default values do not overflow.
>> + */
>> +#define BUCKETS 12
>> +#define INTERVALS 8
>> +#define RESOLUTION 1024
>> +#define DECAY 8
>> +#define MAX_INTERESTING 50000
>> +
>> +
>> +/*
>> + * Concepts and ideas behind the power aware scheduler
>> + *
>> + * For the power aware scheduler, there are 3 decision factors for picking a C
>> + * state:
>> + * 1) Energy break even point
>> + * 2) Performance impact
>> + * 3) Latency tolerance (from pmqos infrastructure)
>> + * These these three factors are treated independently.
>> + *
>> + * Energy break even point
>> + * -----------------------
>> + * C state entry and exit have an energy cost, and a certain amount of time in
>> + * the C state is required to actually break even on this cost. CPUIDLE
>> + * provides us this duration in the "target_residency" field. So all that we
>> + * need is a good prediction of how long we'll be idle. Like the traditional
>> + * governors, we start with the actual known "next timer event" time.
>> + *
>> + * Since there are other source of wakeups (interrupts for example) than
>> + * the next timer event, this estimation is rather optimistic. To get a
>> + * more realistic estimate, a correction factor is applied to the estimate,
>> + * that is based on historic behavior. For example, if in the past the actual
>> + * duration always was 50% of the next timer tick, the correction factor will
>> + * be 0.5.
>> + *
>> + * power aware scheduler uses a running average for this correction factor,
>> + * however it uses a set of factors, not just a single factor. This stems from
>> + * the realization that the ratio is dependent on the order of magnitude of the
>> + * expected duration; if we expect 500 milliseconds of idle time the likelihood of
>> + * getting an interrupt very early is much higher than if we expect 50 micro
>> + * seconds of idle time. A second independent factor that has big impact on
>> + * the actual factor is if there is (disk) IO outstanding or not.
>> + * (as a special twist, we consider every sleep longer than 50 milliseconds
>> + * as perfect; there are no power gains for sleeping longer than this)
>> + *
>> + * For these two reasons we keep an array of 12 independent factors, that gets
>> + * indexed based on the magnitude of the expected duration as well as the
>> + * "is IO outstanding" property.
>> + *
>> + * Repeatable-interval-detector
>> + * ----------------------------
>> + * There are some cases where "next timer" is a completely unusable predictor:
>> + * Those cases where the interval is fixed, for example due to hardware
>> + * interrupt mitigation, but also due to fixed transfer rate devices such as
>> + * mice.
>> + * For this, we use a different predictor: We track the duration of the last 8
>> + * intervals and if the stand deviation of these 8 intervals is below a
>> + * threshold value, we use the average of these intervals as prediction.
>> + *
>> + * Limiting Performance Impact
>> + * ---------------------------
>> + * C states, especially those with large exit latencies, can have a real
>> + * noticeable impact on workloads, which is not acceptable for most sysadmins,
>> + * and in addition, less performance has a power price of its own.
>> + *
>> + * As a general rule of thumb, power aware sched assumes that the following
>> + * heuristic holds:
>> + * The busier the system, the less impact of C states is acceptable
>> + *
>> + * This rule-of-thumb is implemented using a performance-multiplier:
>> + * If the exit latency times the performance multiplier is longer than
>> + * the predicted duration, the C state is not considered a candidate
>> + * for selection due to a too high performance impact. So the higher
>> + * this multiplier is, the longer we need to be idle to pick a deep C
>> + * state, and thus the less likely a busy CPU will hit such a deep
>> + * C state.
>> + *
>> + * Two factors are used in determing this multiplier:
>> + * a value of 10 is added for each point of "per cpu load average" we have.
>> + * a value of 5 points is added for each process that is waiting for
>> + * IO on this CPU.
>> + * (these values are experimentally determined)
>> + *
>> + * The load average factor gives a longer term (few seconds) input to the
>> + * decision, while the iowait value gives a cpu local instantanious input.
>> + * The iowait factor may look low, but realize that this is also already
>> + * represented in the system load average.
>> + *
>> + */
>> +
>> +struct sched_cpuidle_info {
>> +        int last_state_idx;
>> +        int needs_update;
>> +
>> +        unsigned int next_timer_us;
>> +        unsigned int predicted_us;
>> +        unsigned int bucket;
>> +        unsigned int correction_factor[BUCKETS];
>> +        unsigned int intervals[INTERVALS];
>> +        int interval_ptr;
>> +};
>> +
>> +
>> +#define LOAD_INT(x) ((x) >> FSHIFT)
>> +#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
>> +
>> +static int get_loadavg(void)
>> +{
>> +        unsigned long this = this_cpu_load();
>> +
>> +
>> +        return LOAD_INT(this) * 10 + LOAD_FRAC(this) / 10;
>> +}
>> +
>> +static inline int which_bucket(unsigned int duration)
>> +{
>> +        int bucket = 0;
>> +
>> +        /*
>> +         * We keep two groups of stats; one with no
>> +         * IO pending, one without.
>> +         * This allows us to calculate
>> +         * E(duration)|iowait
>> +         */
>> +        if (nr_iowait_cpu(smp_processor_id()))
>> +                bucket = BUCKETS/2;
>> +
>> +        if (duration < 10)
>> +                return bucket;
>> +        if (duration < 100)
>> +                return bucket + 1;
>> +        if (duration < 1000)
>> +                return bucket + 2;
>> +        if (duration < 10000)
>> +                return bucket + 3;
>> +        if (duration < 100000)
>> +                return bucket + 4;
>> +        return bucket + 5;
>> +}
>> +
>> +/*
>> + * Return a multiplier for the exit latency that is intended
>> + * to take performance requirements into account.
>> + * The more performance critical we estimate the system
>> + * to be, the higher this multiplier, and thus the higher
>> + * the barrier to go to an expensive C state.
>> + */
>> +static inline int performance_multiplier(void)
>> +{
>> +        int mult = 1;
>> +
>> +        /* for higher loadavg, we are more reluctant */
>> +
>> +        mult += 2 * get_loadavg();
>> +
>> +        /* for IO wait tasks (per cpu!) we add 5x each */
>> +        mult += 10 * nr_iowait_cpu(smp_processor_id());
>> +
>> +        return mult;
>> +}
>> +
>> +static DEFINE_PER_CPU(struct sched_cpuidle_info, cpuidle_info );
>> +
>> +static void cpuidle_sched_update(struct cpuidle_driver *drv,
>> +                        struct cpuidle_device *dev);
>> +
>> +/* This implements DIV_ROUND_CLOSEST but avoids 64 bit division */
>> +static u64 div_round64(u64 dividend, u32 divisor)
>> +{
>> +        return div_u64(dividend + (divisor / 2), divisor);
>> +}
>> +
>> +/*
>> + * Try detecting repeating patterns by keeping track of the last 8
>> + * intervals, and checking if the standard deviation of that set
>> + * of points is below a threshold. If it is... then use the
>> + * average of these 8 points as the estimated value.
>> + */
>> +static void get_typical_interval(struct sched_cpuidle_info *data)
>> +{
>> +        int i, divisor;
>> +        unsigned int max, thresh;
>> +        uint64_t avg, stddev;
>> +
>> +        thresh = UINT_MAX; /* Discard outliers above this value */
>> +
>> +again:
>> +
>> +        /* First calculate the average of past intervals */
>> +        max = 0;
>> +        avg = 0;
>> +        divisor = 0;
>> +        for (i = 0; i < INTERVALS; i++) {
>> +                unsigned int value = data->intervals[i];
>> +                if (value <= thresh) {
>> +                        avg += value;
>> +                        divisor++;
>> +                        if (value > max)
>> +                                max = value;
>> +                }
>> +        }
>> +        do_div(avg, divisor);
>> +
>> +        /* Then try to determine standard deviation */
>> +        stddev = 0;
>> +        for (i = 0; i < INTERVALS; i++) {
>> +                unsigned int value = data->intervals[i];
>> +                if (value <= thresh) {
>> +                        int64_t diff = value - avg;
>> +                        stddev += diff * diff;
>> +                }
>> +        }
>> +        do_div(stddev, divisor);
>> +        /*
>> +         * The typical interval is obtained when standard deviation is small
>> +         * or standard deviation is small compared to the average interval.
>> +         *
>> +         * int_sqrt() formal parameter type is unsigned long. When the
>> +         * greatest difference to an outlier exceeds ~65 ms * sqrt(divisor)
>> +         * the resulting squared standard deviation exceeds the input domain
>> +         * of int_sqrt on platforms where unsigned long is 32 bits in size.
>> +         * In such case reject the candidate average.
>> +         *
>> +         * Use this result only if there is no timer to wake us up sooner.
>> +         */
>> +        if (likely(stddev <= ULONG_MAX)) {
>> +                stddev = int_sqrt(stddev);
>> +                if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
>> +                                || stddev <= 20) {
>> +                        if (data->next_timer_us > avg)
>> +                                data->predicted_us = avg;
>> +                        return;
>> +                }
>> +        }
>> +
>> +        /*
>> +         * If we have outliers to the upside in our distribution, discard
>> +         * those by setting the threshold to exclude these outliers, then
>> +         * calculate the average and standard deviation again. Once we get
>> +         * down to the bottom 3/4 of our samples, stop excluding samples.
>> +         *
>> +         * This can deal with workloads that have long pauses interspersed
>> +         * with sporadic activity with a bunch of short pauses.
>> +         */
>> +        if ((divisor * 4) <= INTERVALS * 3)
>> +                return;
>> +
>> +        thresh = max - 1;
>> +        goto again;
>> +}
>> +
>> +/**
>> + * cpuidle_sched_select - selects the next idle state to enter
>> + * @drv: cpuidle driver containing state data
>> + * @dev: the CPU
>> + */
>> +int cpuidle_sched_select(struct cpuidle_driver *drv,
>> +                        struct cpuidle_device *dev)
>> +{
>> +        struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
>> +        int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
>> +        int i;
>> +        unsigned int interactivity_req;
>> +        struct timespec t;
>> +
>> +        if (data->needs_update) {
>> +                cpuidle_sched_update(drv, dev);
>> +                data->needs_update = 0;
>> +        }
>> +
>> +        data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
>> +
>> +        /* Special case when user has set very strict latency requirement */
>> +        if (unlikely(latency_req == 0))
>> +                return 0;
>> +
>> +        /* determine the expected residency time, round up */
>> +        t = ktime_to_timespec(tick_nohz_get_sleep_length());
>> +        data->next_timer_us =
>> +                t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
>> +
>> +
>> +        data->bucket = which_bucket(data->next_timer_us);
>> +
>> +        /*
>> +         * Force the result of multiplication to be 64 bits even if both
>> +         * operands are 32 bits.
>> +         * Make sure to round up for half microseconds.
>> +         */
>> +        data->predicted_us = div_round64((uint64_t)data->next_timer_us *
>> +                                        data->correction_factor[data->bucket],
>> +                                        RESOLUTION * DECAY);
>> +
>> +        get_typical_interval(data);
>> +
>> +        /*
>> +         * Performance multiplier defines a minimum predicted idle
>> +         * duration / latency ratio. Adjust the latency limit if
>> +         * necessary.
>> +         */
>> +        interactivity_req = data->predicted_us / performance_multiplier();
>> +        if (latency_req > interactivity_req)
>> +                latency_req = interactivity_req;
>> +
>> +        /*
>> +         * We want to default to C1 (hlt), not to busy polling
>> +         * unless the timer is happening really really soon.
>> +         */
>> +        if (data->next_timer_us > 5 &&
>> +            !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
>> +            dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
>> +                data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
>> +
>> +        /*
>> +         * Find the idle state with the lowest power while satisfying
>> +         * our constraints.
>> +         */
>> +        for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
>> +                struct cpuidle_state *s = &drv->states[i];
>> +                struct cpuidle_state_usage *su = &dev->states_usage[i];
>> +
>> +                if (s->disabled || su->disable)
>> +                        continue;
>> +                if (s->target_residency > data->predicted_us)
>> +                        continue;
>> +                if (s->exit_latency > latency_req)
>> +                        continue;
>> +
>> +                data->last_state_idx = i;
>> +        }
>> +
>> +        return data->last_state_idx;
>> +}
>> +
>> +/**
>> + * cpuidle_sched_reflect - records that data structures need update
>> + * @dev: the CPU
>> + * @index: the index of actual entered state
>> + *
>> + * NOTE: it's important to be fast here because this operation will add to
>> + * the overall exit latency.
>> + */
>> +void cpuidle_sched_reflect(struct cpuidle_device *dev, int index)
>> +{
>> +        struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
>> +        data->last_state_idx = index;
>> +        if (index >= 0)
>> +                data->needs_update = 1;
>> +}
>> +
>> +/**
>> + * cpuidle_sched_update - attempts to guess what happened after entry
>> + * @drv: cpuidle driver containing state data
>> + * @dev: the CPU
>> + */
>> +static void cpuidle_sched_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>> +{
>> +        struct sched_cpuidle_info *data = &__get_cpu_var(cpuidle_info);
>> +        int last_idx = data->last_state_idx;
>> +        struct cpuidle_state *target = &drv->states[last_idx];
>> +        unsigned int measured_us;
>> +        unsigned int new_factor;
>> +
>> +        /*
>> +         * Try to figure out how much time passed between entry to low
>> +         * power state and occurrence of the wakeup event.
>> +         *
>> +         * If the entered idle state didn't support residency measurements,
>> +         * we are basically lost in the dark how much time passed.
>> +         * As a compromise, assume we slept for the whole expected time.
>> +         *
>> +         * Any measured amount of time will include the exit latency.
>> +         * Since we are interested in when the wakeup begun, not when it
>> +         * was completed, we must subtract the exit latency. However, if
>> +         * the measured amount of time is less than the exit latency,
>> +         * assume the state was never reached and the exit latency is 0.
>> +         */
>> +        if (unlikely(!(target->flags & CPUIDLE_FLAG_TIME_VALID))) {
>> +                /* Use timer value as is */
>> +                measured_us = data->next_timer_us;
>> +
>> +        } else {
>> +                /* Use measured value */
>> +                measured_us = cpuidle_get_last_residency(dev);
>> +
>> +                /* Deduct exit latency */
>> +                if (measured_us > target->exit_latency)
>> +                        measured_us -= target->exit_latency;
>> +
>> +                /* Make sure our coefficients do not exceed unity */
>> +                if (measured_us > data->next_timer_us)
>> +                        measured_us = data->next_timer_us;
>> +        }
>> +
>> +        /* Update our correction ratio */
>> +        new_factor = data->correction_factor[data->bucket];
>> +        new_factor -= new_factor / DECAY;
>> +
>> +        if (data->next_timer_us > 0 && measured_us < MAX_INTERESTING)
>> +                new_factor += RESOLUTION * measured_us / data->next_timer_us;
>> +        else
>> +                /*
>> +                 * we were idle so long that we count it as a perfect
>> +                 * prediction
>> +                 */
>> +                new_factor += RESOLUTION;
>> +
>> +        /*
>> +         * We don't want 0 as factor; we always want at least
>> +         * a tiny bit of estimated time. Fortunately, due to rounding,
>> +         * new_factor will stay nonzero regardless of measured_us values
>> +         * and the compiler can eliminate this test as long as DECAY > 1.
>> +         */
>> +        if (DECAY == 1 && unlikely(new_factor == 0))
>> +                new_factor = 1;
>> +
>> +        data->correction_factor[data->bucket] = new_factor;
>> +
>> +        /* update the repeating-pattern data */
>> +        data->intervals[data->interval_ptr++] = measured_us;
>> +        if (data->interval_ptr >= INTERVALS)
>> +                data->interval_ptr = 0;
>> +}
>> +
>> +/**
>> + * cpuidle_sched_enable_device - scans a CPU's states and does setup
>> + * @drv: cpuidle driver
>> + * @dev: the CPU
>> + */
>> +int cpuidle_sched_enable_device(struct cpuidle_driver *drv,
>> +                        struct cpuidle_device *dev)
>> +{
>> +        struct sched_cpuidle_info *data = &per_cpu(cpuidle_info, dev->cpu);
>> +        int i;
>> +
>> +        memset(data, 0, sizeof(struct sched_cpuidle_info));
>> +
>> +        /*
>> +         * if the correction factor is 0 (eg first time init or cpu hotplug
>> +         * etc), we actually want to start out with a unity factor.
>> +         */
>> +        for(i = 0; i < BUCKETS; i++)
>> +                data->correction_factor[i] = RESOLUTION * DECAY;
>> +
>> +        return 0;
>> +}
>> +
>>
>
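As a concrete illustration of the prediction arithmetic carried over in the patch above (cpuidle_sched_select() scales the next-timer estimate, cpuidle_sched_update() re-learns the scale), here is a minimal userspace sketch of the same correction-factor math. The constants mirror the patch; the helper names, the file name and the standalone main() are made up for illustration and are not kernel API.

/*
 * Userspace sketch of the menu-style correction-factor math reused by
 * kernel/sched/power.c. Build with: cc predict.c
 */
#include <stdint.h>
#include <stdio.h>

#define RESOLUTION 1024
#define DECAY 8

/* Scale the next-timer estimate by the learned correction factor. */
static unsigned int predict_idle_us(unsigned int next_timer_us,
                                    unsigned int correction_factor)
{
        uint64_t product = (uint64_t)next_timer_us * correction_factor;

        /* Round to nearest, as div_round64() does in the patch. */
        return (unsigned int)((product + (RESOLUTION * DECAY) / 2) /
                              (RESOLUTION * DECAY));
}

/* Decay the old factor, then blend in the latest observed ratio. */
static unsigned int update_factor(unsigned int factor,
                                  unsigned int measured_us,
                                  unsigned int next_timer_us)
{
        factor -= factor / DECAY;
        factor += RESOLUTION * measured_us / next_timer_us;
        return factor;
}

int main(void)
{
        /* Start from a unity factor, as cpuidle_sched_enable_device() does. */
        unsigned int factor = RESOLUTION * DECAY;
        unsigned int next_timer_us = 1000;
        int i;

        /* Suppose interrupts repeatedly wake the CPU after ~500us instead. */
        for (i = 0; i < 5; i++) {
                printf("sleep %d: predicted %u us\n",
                       i, predict_idle_us(next_timer_us, factor));
                factor = update_factor(factor, 500, next_timer_us);
        }
        return 0;
}

With these numbers the factor decays from unity towards 0.5 over a few sleeps, so the prediction converges on the roughly 500us the CPU actually slept rather than the 1000us the next timer suggested; per bucket, that is what the correction_factor[] array in the patch captures.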

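The repeating-interval detector in the patch (get_typical_interval()) can be paraphrased the same way. This sketch uses floating-point sqrt() for brevity where the kernel uses int_sqrt(), all names are illustrative only, and the patch additionally caps the result by the next-timer estimate, which is omitted here.

/*
 * Userspace sketch of the repeatable-interval detector in
 * kernel/sched/power.c. Build with: cc typical.c -lm
 */
#include <math.h>
#include <stdio.h>

#define INTERVALS 8

/* Return the typical interval in us, or 0 if no stable pattern is found. */
static unsigned int typical_interval(const unsigned int intervals[INTERVALS])
{
        unsigned int thresh = ~0u;      /* discard samples above this */

        for (;;) {
                unsigned long long avg = 0, variance = 0;
                unsigned int max = 0, divisor = 0;
                double stddev;
                int i;

                /* Average of the samples that survive the threshold. */
                for (i = 0; i < INTERVALS; i++) {
                        if (intervals[i] <= thresh) {
                                avg += intervals[i];
                                divisor++;
                                if (intervals[i] > max)
                                        max = intervals[i];
                        }
                }
                avg /= divisor;

                /* Their variance, then standard deviation. */
                for (i = 0; i < INTERVALS; i++) {
                        if (intervals[i] <= thresh) {
                                long long diff = (long long)intervals[i] - (long long)avg;
                                variance += diff * diff;
                        }
                }
                variance /= divisor;
                stddev = sqrt((double)variance);

                /* Accept when the spread is small relative to the average. */
                if ((avg > 6 * stddev && divisor * 4 >= INTERVALS * 3) ||
                    stddev <= 20)
                        return (unsigned int)avg;

                /* Stop once we are down to the bottom 3/4 of the samples. */
                if (divisor * 4 <= INTERVALS * 3)
                        return 0;

                /* Otherwise drop the largest outlier and try again. */
                thresh = max - 1;
        }
}

int main(void)
{
        /* A mouse-like fixed rate with one stray long gap. */
        unsigned int samples[INTERVALS] = {
                998, 1003, 1001, 999, 5000, 1002, 1000, 997
        };

        printf("typical interval: %u us\n", typical_interval(samples));
        return 0;
}

Fed a trace like the one above, the sketch rejects the first pass because the 5 ms outlier inflates the deviation, excludes it via the threshold, and settles on roughly 1000us, which is the behaviour the patch relies on when the next-timer value is a useless predictor.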
