Hi,

On 14.04.2018 07:01, Srinivas Pandruvada wrote:
Hi Francisco,

[...]

Are you no longer interested in improving those aspects of the non-HWP
governor?  Is it that you're planning to delete it and move back to a
generic cpufreq governor for non-HWP platforms in the near future?

Yes, that is the plan for Atom platforms, which are the only non-HWP
platforms so far.  You have to show a good gain in performance and
performance/watt to carry and maintain such a big change, so we have to
see your performance and power numbers.

For the active cases, you can look at the links at the beginning / bottom of this mail thread; Francisco provided performance results for >100 benchmarks.


On this side of the Atlantic, we've been testing different versions of the patchset over the past few months with >50 Linux 3D benchmarks on 6 different platforms.

On Geminilake and a few BXT configurations (where 3D benchmarks are TDP limited), performance improves by 5-15% in many tests, including complex ones.  More importantly, there were no regressions.

(You can see details + links to more info in Jira ticket VIZ-12078.)

*In (fully) TDP-limited cases, power usage obviously stays the same, so performance/watt improvements can be derived directly from the measured performance improvements (e.g. a 10% performance gain at constant TDP is a 10% performance/watt gain).*


We also have data for earlier platforms from slightly older versions of the patchset, but on those it didn't have any significant impact on performance.

I think the main reason for this is that the BYT & BSW NUCs we have only have room for a single memory module.  Without a dual-channel memory configuration, benchmarks are too memory-bottlenecked to utilize the GPU enough to make things TDP limited on those platforms.

However, now that I look at the old BYT & BSW data (for the few benchmarks which improved most on BXT & GLK), I see that there is a reduction in CPU power utilization according to RAPL, at least on BSW.


        - Eero


This will benefit all architectures, including x86 + non-i915.


The current design encourages re-use of the IO utilization statistic
(see PATCH 1) by other governors as a mechanism driving the trade-off
between energy efficiency and responsiveness based on whether the system
is close to CPU-bound, in whatever way is applicable to each governor
(e.g. it would make sense for it to be hooked up to the EPP preference
knob in the case of the intel_pstate HWP governor, which would allow it
to achieve better energy efficiency in IO-bound situations just like
this series does for non-HWP parts).  There's nothing really x86- nor
i915-specific about it.
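
Purely to illustrate the kind of hook-up meant here -- this is a sketch
with invented names and thresholds, not code from the series -- the HWP
case could look roughly like this:

#include <stdint.h>

/*
 * Illustrative sketch only: map an IO-activity fraction (0..100, e.g. as
 * could be derived from the aggregated IO active time of PATCH 1) to an
 * HWP energy-performance preference hint (0 = max performance, 255 = max
 * energy savings).  The threshold and EPP values below are invented for
 * the example; they are not part of the series.
 */
static uint8_t io_util_to_epp(unsigned int io_active_pct)
{
	const unsigned int io_bound_threshold_pct = 80;	/* hypothetical */
	const uint8_t epp_io_bound = 128;		/* hypothetical */
	const uint8_t epp_cpu_bound = 32;		/* hypothetical */

	/*
	 * When most of the time is spent waiting on IO, a more energy-biased
	 * EPP avoids paying for CPU frequency that cannot translate into
	 * additional throughput; otherwise stay performance-biased to
	 * preserve responsiveness.
	 */
	if (io_active_pct >= io_bound_threshold_pct)
		return epp_io_bound;

	return epp_cpu_bound;
}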

BTW, intel_pstate can be driven by the schedutil governor (passive mode),
so if you prove the benefits on Broxton, this can be the default.
As before:
- No regression in idle power at all.  This is more important than
  benchmarks.
- Not just scores; performance/watt is important.


Is schedutil actually on par with the intel_pstate non-HWP governor as of
today, according to these metrics and the overall benchmark numbers?

Yes, except for a few cases.  I have not tested recently, so it may be
better now.

Thanks,
Srinivas



controller does, even though the frequent IO waits may actually be an
indication that the system is IO-bound (which means that the large energy
usage increase may not translate into any performance benefit in practice,
not to speak of performance being impacted negatively in TDP-bound
scenarios like GPU rendering).

Regarding run-time complexity, I haven't observed this governor to be
measurably more computationally intensive than the present one.  It's a
bunch more instructions indeed, but still within the same ballpark as
the current governor.  The average increase in CPU utilization on my BXT
with this series is less than 0.03% (sampled via ftrace for v1, I can
repeat the measurement for the v2 I have in the works, though I don't
expect the result to be substantially different).  If this is a problem
for you there are several optimization opportunities that would cut down
the number of CPU cycles get_target_pstate_lp() takes to execute by a
large percent (most of the optimization ideas I can think of right now
though would come at some accuracy/maintainability/debuggability cost,
but may still be worth pursuing), but the computational overhead is low
enough at this point that the impact on any benchmark or real workload
would be orders of magnitude lower than its variance, which makes it
kind of difficult to keep the discussion data-driven [as possibly any
performance optimization discussion should ever be ;)].


Thanks,
Srinivas




[Absolute benchmark results are unfortunately omitted from this letter
due to company policies, but the percent change and Student's T p-value
are included above and in the referenced benchmark results]

The most obvious impact of this series will likely be the overall
improvement in graphics performance on systems with an IGP integrated
into the processor package (though for the moment this is only enabled
on BXT+), because the TDP budget shared among CPU and GPU can
frequently become a limiting factor in low-power devices.  On heavily
TDP-bound devices this series improves performance of virtually any
non-trivial graphics rendering by a significant amount (of the order
of the energy efficiency improvement for that workload assuming the
optimization didn't cause it to become non-TDP-bound).
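
To make the TDP sharing mechanism concrete with purely illustrative
numbers (not measurements): on a package with, say, a 6 W TDP shared
between CPU and GPU, not boosting the CPU frequency during IO waits
might save on the order of 1 W, which leaves roughly 1 W of extra budget
for the GPU; if the workload is GPU-limited, that extra budget shows up
more or less directly as higher rendering throughput, which is why the
performance gain tends to track the energy-efficiency gain as described
above.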

See [1]-[5] for detailed numbers including various graphics benchmarks
and a sample of the Phoronix daily-system-tracker.  Some popular
graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
between 5% and 11% on our systems.  The exact improvement can vary
substantially between systems (compare the benchmark results from the
two different J3455 systems [1] and [3]) due to a number of factors,
including the ratio between CPU and GPU processing power, the behavior
of the userspace graphics driver, the windowing system and resolution,
the BIOS (which has an influence on the package TDP), the thermal
characteristics of the system, etc.

Unigine Valley and Heaven improve by a similar factor on some systems
(see the J3455 results [1]), but on others the improvement is lower
because the benchmark fails to fully utilize the GPU, which causes the
heuristic to remain in low-latency state for longer, which leaves a
reduced TDP budget available to the GPU, which prevents performance
from increasing further.  This can be avoided by using the alternative
heuristic parameters suggested in the commit message of PATCH 8, which
provide a lower IO utilization threshold and hysteresis for the
controller to attempt to save energy.  I'm not proposing those for
upstream (yet) because they would also increase the risk for
latency-sensitive IO-heavy workloads to regress (like SynMark2
OglTerrainFly* and some arguably poorly designed IPC-bound X11
benchmarks).

Discrete graphics aren't likely to experience that much of a visible
improvement from this, even though many non-IGP workloads *could*
benefit by reducing the system's energy usage while the discrete GPU
(or really, any other IO device) becomes a bottleneck, but this is not
attempted in this series, since that would involve making an energy
efficiency/latency trade-off that only the maintainers of the
respective drivers are in a position to make.  The cpufreq interface
introduced in PATCH 1 to achieve this is left as an opt-in for that
reason, only the i915 DRM driver is hooked up since it will get the
most direct pay-off due to the increased energy budget available to
the GPU, but other power-hungry third-party gadgets built into the
same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may be
able to benefit from this interface eventually by instrumenting the
driver in a similar way.

The cpufreq interface is not exclusively tied to the intel_pstate
driver, because other governors can make use of the statistic
calculated as a result to avoid over-optimizing for latency in
scenarios where a lower frequency would be able to achieve similar
throughput while using less energy.  The interpretation of this
statistic relies on the observation that for as long as the system is
CPU-bound, any IO load occurring as a result of the execution of a
program will scale roughly linearly with the clock frequency the
program is run at, so (assuming that the CPU has enough processing
power) a point will be reached at which the program won't be able to
execute faster with increasing CPU frequency because the throughput
limits of some device will have been attained.  Increasing frequencies
past that point only pessimizes energy usage for no real benefit -- the
optimal behavior is for the CPU to lock to the minimum frequency that
is able to keep the IO devices involved fully utilized (assuming we
are past the maximum-efficiency inflection point of the CPU's
power-to-frequency curve), which is roughly the goal of this series.
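
As a toy model of that control rule (again just a sketch with invented
names and thresholds, not the actual get_target_pstate_lp() logic):

/*
 * Toy model of the behavior described above, not the actual controller:
 * while the CPU is the bottleneck, request the top of the frequency
 * range; once the workload becomes IO-bound, back off toward the lowest
 * frequency that still keeps the IO device saturated, since anything
 * above that only costs energy.  All names and thresholds are invented
 * for illustration.
 */
static unsigned int pick_target_freq_khz(unsigned int cur_freq_khz,
					 unsigned int min_freq_khz,
					 unsigned int max_freq_khz,
					 unsigned int cpu_busy_pct,
					 unsigned int io_active_pct)
{
	/* CPU-bound: responsiveness wins, go to the top of the range. */
	if (cpu_busy_pct > 90 && io_active_pct < 50)
		return max_freq_khz;

	/*
	 * IO-bound: throughput no longer scales with CPU frequency, so
	 * step down gradually; if the CPU slows to the point where the IO
	 * device is no longer kept saturated, io_active_pct drops and the
	 * descent stops, which is what keeps the IO device fully utilized
	 * at the lowest possible frequency.
	 */
	if (io_active_pct > 90 && cur_freq_khz > min_freq_khz)
		return cur_freq_khz - (cur_freq_khz - min_freq_khz) / 4;

	/* Otherwise hold the current operating point. */
	return cur_freq_khz;
}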

PELT could be a useful extension for this model since its largely
heuristic assumptions would become more accurate if the IO and CPU
load could be tracked separately for each scheduling entity, but this
is not attempted in this series because the additional complexity and
computational cost of such an approach is hard to justify at this
stage, particularly since the current governor has similar
limitations.

Various frequency and step-function response graphs are available in
[6]-[9] for comparison (obtained empirically on a BXT J3455 system).
The response curves for the low-latency and low-power states of the
heuristic are shown separately -- as you can see they roughly bracket
the frequency response curve of the current governor.  The step
response of the aggressive heuristic is within a single update period
(even though it's not quite obvious from the graph with the levels of
zoom provided).  I'll attach benchmark results from a slower but
non-TDP-limited machine (which means there will be no TDP budget
increase that could possibly mask a performance regression of another
kind) as soon as they come out.

Thanks to Eero and Valtteri for testing a number of intermediate
revisions of this series (and there were quite a few of them) on more
than half a dozen systems; they helped spot quite a few issues in
earlier versions of this heuristic.

[PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
[PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
[PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
[PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
[PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
[PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
[PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
[PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
[PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.

[1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
[2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
[3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
[4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
[5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
[6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
[7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
[8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
[9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
