On Monday, August 20, 2018 11:44:06 AM CEST Quentin Perret wrote:
> This patch series introduces Energy Aware Scheduling (EAS) for CFS tasks
> on platforms with asymmetric CPU topologies (e.g. Arm big.LITTLE).
>
> For more details about the ideas behind it and the overall design,
> please refer to the cover letter of version 5 [1].
>
>
> 1. Version History
> ------------------
>
> Changes v5[1]->v6:
>   - Rebased on Peter’s sched/core branch (that includes Morten's misfit
>     patches [2] and the automatic detection of SD_ASYM_CPUCAPACITY [3])
>   - Removed patch 13/14 (not needed with the automatic flag detection)
>   - Added patch creating a dependency between sugov and EAS
>   - Renamed frequency domains to performance domains to avoid creating too
>     deep assumptions in the code about the HW
>   - Renamed the sd_ea shortcut sd_asym_cpucapacity
>   - Added comment to explain why new tasks are not accounted when
>     detecting the 'overutilized' flag
>   - Added comment explaining why forkees don’t go in
>     find_energy_efficient_cpu()
>
> Changes v4[4]->v5:
>   - Removed the RCU protection of the EM tables and the associated
>     need for em_rescale_cpu_capacity().
>   - Factorized schedutil’s PELT aggregation function with EAS
>   - Improved comments/doc in the EM framework
>   - Added check on the uarch of CPUs in one fd in the EM framework
>   - Reduced CONFIG_ENERGY_MODEL ifdefery in kernel/sched/topology.c
>   - Cleaned-up update_sg_lb_stats parameters
>   - Improved comments in compute_energy() to explain the multi-rd
>     scenarios
>
> Changes v3[5]->v4:
>   - Replaced spinlock in EM framework by smp_store_release/READ_ONCE
>   - Fixed missing locks to protect rcu_assign_pointer in EM framework
>   - Fixed capacity calculation in EM framework on 32 bits system
>   - Fixed compilation issue for CONFIG_ENERGY_MODEL=n
>   - Removed cpumask from struct em_freq_domain, now dynamically allocated
>   - Power costs of the EM are specified in milliwatts
>   - Added example of CPUFreq driver modification
>   - Added doc/comments in the EM framework and better commit header
>   - Fixed integration issue with util_est in cpu_util_next()
>   - Changed scheduler topology code to have one freq. dom. list per rd
>   - Split sched topology patch in smaller patches
>   - Added doc/comments explaining the heuristic in the wake-up path
>   - Changed energy threshold for migration to from 1.5% to 6%
>
> Changes v2[6]->v3:
>   - Removed the PM_OPP dependency by implementing a new EM framework
>   - Modified the scheduler topology code to take references on the EM data
>     structures
>   - Simplified the overutilization mechanism into a system-wide flag
>   - Reworked the integration in the wake-up path using the sd_ea shortcut
>   - Rebased on tip/sched/core (247f2f6f3c70 "sched/core: Don't schedule
>     threads on pre-empted vCPUs")
>
> Changes v1[7]->v2:
>   - Reworked interface between fair.c and energy.[ch] (Remove #ifdef
>     CONFIG_PM_OPP from energy.c) (Greg KH)
>   - Fixed licence & header issue in energy.[ch] (Greg KH)
>   - Reordered EAS path in select_task_rq_fair() (Joel)
>   - Avoid prev_cpu if not allowed in select_task_rq_fair() (Morten/Joel)
>   - Refactored compute_energy() (Patrick)
>   - Account for RT/IRQ pressure in task_fits() (Patrick)
>   - Use UTIL_EST and DL utilization during OPP estimation (Patrick/Juri)
>   - Optimize selection of CPU candidates in the energy-aware wake-up path
>   - Rebased on top of tip/sched/core (commit b720342849fe “sched/core:
>     Update Preempt_notifier_key to modern API”)
>
>
> 2. Test results
> ---------------
>
> Two fundamentally different tests were executed. Firstly the energy test
> case shows the impact on energy consumption this patch-set has using a
> synthetic set of tasks. Secondly the performance test case provides the
> conventional hackbench metric numbers.
>
> The tests run on two arm64 big.LITTLE platforms: Hikey960 (4xA73 +
> 4xA53) and Juno r0 (2xA57 + 4xA53).
>
> Base kernel is tip/sched/core (4.18-rc5), with some Hikey960 and Juno
> specific patches, the SD_ASYM_CPUCAPACITY flag set at DIE sched domain
> level for arm64 and schedutil as cpufreq governor [8].
>
> 2.1 Energy test case
>
> 10 iterations of between 10 and 50 periodic rt-app tasks (16ms period,
> 5% duty-cycle) for 30 seconds with energy measurement. Unit is Joules.
> The goal is to save energy, so lower is better.
>
> 2.1.1 Hikey960
>
> Energy is measured with an ACME Cape on an instrumented board. Numbers
> include consumption of big and little CPUs, LPDDR memory, GPU and most
> of the other small components on the board. They do not include
> consumption of the radio chip (turned-off anyway) and external
> connectors.
>
> +----------+-----------------+-------------------------+
> |          | Without patches |      With patches       |
> +----------+--------+--------+------------------+------+
> | Tasks nb |  Mean  | RSD*   |       Mean       | RSD* |
> +----------+--------+--------+------------------+------+
> |       10 |  34.33 |   4.8% |  30.51 (-11.13%) | 6.4% |
> |       20 |  52.84 |   1.9% |  44.15 (-16.45%) | 2.0% |
> |       30 |  66.20 |   1.8% |   60.14 (-9.15%) | 4.8% |
> |       40 |  90.83 |   2.5% |   86.91 (-4.32%) | 2.7% |
> |       50 | 136.76 |   4.6% | 108.90 (-20.37%) | 4.7% |
> +----------+--------+--------+------------------+------+
>
> 2.1.2 Juno r0
>
> Energy is measured with the onboard energy meter. Numbers include
> consumption of big and little CPUs.
>
> +----------+-----------------+------------------------+
> |          | Without patches |      With patches      |
> +----------+--------+--------+-----------------+------+
> | Tasks nb |  Mean  | RSD*   |      Mean       | RSD* |
> +----------+--------+--------+-----------------+------+
> |       10 |  11.48 |   3.2% |  8.09 (-29.53%) | 3.1% |
> |       20 |  20.84 |   3.4% | 14.38 (-31.00%) | 1.1% |
> |       30 |  32.94 |   3.2% | 23.97 (-27.23%) | 1.0% |
> |       40 |  46.05 |   0.5% | 37.82 (-17.87%) | 6.2% |
> |       50 |  57.25 |   0.5% | 55.30 ( -3.41%) | 0.5% |
> +----------+--------+--------+-----------------+------+
>
>
> 2.2 Performance test case
>
> 30 iterations of perf bench sched messaging --pipe --thread --group G
> --loop L with G=[1 2 4 8] and L=50000 (Hikey960)/16000 (Juno r0).
>
> 2.2.1 Hikey960
>
> The impact of thermal capping was mitigated thanks to a heatsink, a
> fan, and a 30 sec delay between two successive executions. IPA is
> disabled to reduce the stddev.
>
> +----------------+-----------------+------------------------+
> |                | Without patches |      With patches      |
> +--------+-------+---------+-------+----------------+-------+
> | Groups | Tasks |  Mean   | RSD*  |      Mean      | RSD*  |
> +--------+-------+---------+-------+----------------+-------+
> |      1 |    40 |    8.04 | 0.88% |  8.22 (+2.31%) | 1.76% |
> |      2 |    80 |   14.78 | 0.67% | 14.83 (+0.35%) | 0.59% |
> |      4 |   160 |   30.92 | 0.57% | 30.95 (+0.09%) | 0.51% |
> |      8 |   320 |   65.54 | 0.32% | 65.57 (+0.04%) | 0.46% |
> +--------+-------+---------+-------+----------------+-------+
>
> 2.2.2 Juno r0
>
> +----------------+-----------------+-----------------------+
> |                | Without patches |     With patches      |
> +--------+-------+---------+-------+---------------+-------+
> | Groups | Tasks |  Mean   | RSD*  |     Mean      | RSD*  |
> +--------+-------+---------+-------+---------------+-------+
> |      1 |    40 |    7.74 | 0.13% |  7.82 (0.01%) | 0.12% |
> |      2 |    80 |   14.27 | 0.15% | 14.27 (0.00%) | 0.14% |
> |      4 |   160 |   27.07 | 0.35% | 26.96 (0.00%) | 0.18% |
> |      8 |   320 |   55.14 | 1.81% | 55.21 (0.00%) | 1.29% |
> +--------+-------+---------+-------+---------------+-------+
>
> *RSD: Relative Standard Deviation (std dev / mean)
>
>
> [1] https://marc.info/?l=linux-pm&m=153243513908731&w=2
> [2] https://marc.info/?l=linux-kernel&m=153069968022982&w=2
> [3] https://marc.info/?l=linux-kernel&m=153209362826476&w=2
> [4] https://marc.info/?l=linux-kernel&m=153018606728533&w=2
> [5] https://marc.info/?l=linux-kernel&m=152691273111941&w=2
> [6] https://marc.info/?l=linux-kernel&m=152302902427143&w=2
> [7] https://marc.info/?l=linux-kernel&m=152153905805048&w=2
> [8]
> http://www.linux-arm.org/git?p=linux-qp.git;a=shortlog;h=refs/heads/upstream/eas_v6
>
> Morten Rasmussen (1):
>   sched: Add over-utilization/tipping point indicator
>
> Quentin Perret (13):
>   sched: Relocate arch_scale_cpu_capacity
>   sched/cpufreq: Factor out utilization to frequency mapping
>   PM: Introduce an Energy Model management framework
>   PM / EM: Expose the Energy Model in sysfs
>   sched/topology: Reference the Energy Model of CPUs when available
>   sched/topology: Lowest CPU asymmetry sched_domain level pointer
>   sched/topology: Introduce sched_energy_present static key
>   sched/fair: Clean-up update_sg_lb_stats parameters
>   sched/cpufreq: Refactor the utilization aggregation method
>   sched/fair: Introduce an energy estimation helper function
>   sched/fair: Select an energy-efficient CPU on task wake-up
>   sched/topology: Make Energy Aware Scheduling depend on schedutil
>   OPTIONAL: cpufreq: dt: Register an Energy Model
>
>  drivers/cpufreq/cpufreq-dt.c     |  45 ++++-
>  drivers/cpufreq/cpufreq.c        |   4 +
>  include/linux/cpufreq.h          |   1 +
>  include/linux/energy_model.h     | 162 +++++++++++++++++
>  include/linux/sched/cpufreq.h    |   6 +
>  include/linux/sched/topology.h   |  19 ++
>  kernel/power/Kconfig             |  15 ++
>  kernel/power/Makefile            |   2 +
>  kernel/power/energy_model.c      | 289 +++++++++++++++++++++++++++++
>  kernel/sched/cpufreq_schedutil.c | 136 ++++++++++----
>  kernel/sched/fair.c              | 301 ++++++++++++++++++++++++++++---
>  kernel/sched/sched.h             |  65 ++++---
>  kernel/sched/topology.c          | 231 +++++++++++++++++++++++-
>  13 files changed, 1195 insertions(+), 81 deletions(-)
>  create mode 100644 include/linux/energy_model.h
>  create mode 100644 kernel/power/energy_model.c

I have now looked at all of the patches in the series, and I don't have
any major objections from the cpufreq (and, more generally, PM)
perspective.

There are some points of concern here and there, but they are mostly
details and things I would do differently. As a whole, this looks mostly
OK to me.

I will reply to the individual patches where I see issues.

Thanks,
Rafael