Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
Francisco Jerez writes:
> Francisco Jerez writes:
>
>> Chris Wilson writes:
>>
>>> Quoting Francisco Jerez (2020-03-10 21:41:55)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index b9b3f78f1324..a5d7a80b826d 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
        /* we need to manually load the submit queue */
        if (execlists->ctrl_reg)
                writel(EL_CTRL_LOAD, execlists->ctrl_reg);
+
+       if (execlists_num_ports(execlists) > 1 &&

>>> pending[1] is always defined, the minimum submission is one slot, with
>>> pending[1] as the sentinel NULL.

+           execlists->pending[1] &&
+           !atomic_xchg(&execlists->overload, 1))
+               intel_gt_pm_active_begin(&engine->i915->gt);

>>> engine->gt
>>>
>>
>> Applied your suggestions above locally, will probably wait to have a few
>> more changes batched up before sending a v2.
>>

 }

 static bool ctx_single_port_submission(const struct intel_context *ce)
@@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
        clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));

        WRITE_ONCE(execlists->active, execlists->inflight);
+
+       if (atomic_xchg(&execlists->overload, 0)) {
+               struct intel_engine_cs *engine =
+                       container_of(execlists, typeof(*engine), execlists);
+               intel_gt_pm_active_end(&engine->i915->gt);
+       }
 }

 static inline void
@@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
                        /* port0 completed, advanced to port1 */
                        trace_ports(execlists, "completed", execlists->active);

+                       if (atomic_xchg(&execlists->overload, 0))
+                               intel_gt_pm_active_end(&engine->i915->gt);

>>> So this loses track if we preempt a dual-ELSP submission with a
>>> single-ELSP submission (and never go back to dual).
>>>
>>
>> Yes, good point. You're right that if a dual-ELSP submission gets
>> preempted by a single-ELSP submission "overload" will remain signaled
>> until the first completion interrupt arrives (e.g. from the preempting
>> submission).
>>
>>> If you move this to the end of the loop and check
>>>
>>>     if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>>>             intel_gt_pm_active_end(engine->gt);
>>>
>>> so that it covers both preemption/promotion and completion.
>>>
>>
>> That sounds reasonable.
>>
>>> However, that will fluctuate quite rapidly. (And runs the risk of
>>> exceeding the sentinel.)
>>>
>>> An alternative approach would be to couple along
>>> schedule_in/schedule_out
>>>
>>>     atomic_set(overload, -1);
>>>
>>>     __execlists_schedule_in:
>>>             if (!atomic_fetch_inc(overload))
>>>                     intel_gt_pm_active_begin(engine->gt);
>>>     __execlists_schedule_out:
>>>             if (!atomic_dec_return(overload))
>>>                     intel_gt_pm_active_end(engine->gt);
>>>
>>> which would mean we are overloaded as soon as we try to submit an
>>> overlapping ELSP.
>>>
>>
>> That sounds good to me too, and AFAICT would have roughly the same
>> behavior as this metric except for the preemption corner case you
>> mention above. I'll try this and verify that I get approximately the
>> same performance numbers.
>>
>
> This suggestion seems to lead to some minor regressions, I'm
> investigating the issue. Will send a v2 as soon as I have something
> along the lines of what you suggested running with equivalent
> performance to v1.

I think I've figured out why both of the alternatives we were talking
about above lead to a couple percent regressions in latency-sensitive
workloads: In some scenarios it's possible for execlists_dequeue() to
execute after the GPU has gone idle, but before we've processed the
corresponding CSB entries, particularly when called from the
submit_queue() path. In that case __execlists_schedule_in() will think
that the next request is overlapping, and tell CPU power management to
relax, even though the GPU is starving intermittently.

How about we do the same:

| if (atomic_xchg(&execlists->overload, 0))
|         intel_gt_pm_active_end(engine->gt);

as in this patch from process_csb() in response to each completion CSB
entry, which ensures that the system is considered non-GPU-bound as soon
as the first context completes. Subsequently if another CSB entry
signals a dual-ELSP active-to-idle transition or a dual-ELSP preemption
we call intel_gt_pm_active_begin() directly from process_csb(). If we
hit a single-ELSP preemption CSB entry we call intel_gt_pm_active_end()
instead, in order to avoid the problem you pointed out in your previous
email. How does that sound?
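The per-CSB-entry scheme proposed above can be sketched as a small userspace model. This is only an illustration of the intended edge-triggered behavior, not the actual i915 code: the helper names and the plain counters standing in for intel_gt_pm_active_begin()/intel_gt_pm_active_end() are made up for the sketch.

```c
#include <stdatomic.h>

/* Stand-ins for the real PM QoS hooks; counters let us observe calls. */
static int begin_calls, end_calls;
static void pm_active_begin(void) { begin_calls++; }
static void pm_active_end(void)   { end_calls++; }

/* Edge-triggered overload flag, as in the patch: 0 -> 1 fires begin,
 * 1 -> 0 fires end, repeated events in the same state are no-ops. */
static atomic_int overload;

/* Hypothetical per-CSB-entry hook: ports_active is the number of ELSP
 * ports still occupied after processing the event. */
static void process_csb_event(int ports_active)
{
	if (ports_active > 1) {
		/* dual-ELSP execution: consider the GPU overloaded */
		if (!atomic_exchange(&overload, 1))
			pm_active_begin();
	} else {
		/* completion or single-ELSP preemption: relax immediately */
		if (atomic_exchange(&overload, 0))
			pm_active_end();
	}
}
```

The atomic_exchange() makes each transition fire exactly once even if several CSB entries report the same state in a row, which is the property the quoted snippet relies on.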
Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
Francisco Jerez writes:
> Chris Wilson writes:
>
>> Quoting Francisco Jerez (2020-03-10 21:41:55)
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> index b9b3f78f1324..a5d7a80b826d 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct
>>> intel_engine_cs *engine)
>>>         /* we need to manually load the submit queue */
>>>         if (execlists->ctrl_reg)
>>>                 writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>>> +
>>> +       if (execlists_num_ports(execlists) > 1 &&
>> pending[1] is always defined, the minimum submission is one slot, with
>> pending[1] as the sentinel NULL.
>>
>>> +           execlists->pending[1] &&
>>> +           !atomic_xchg(&execlists->overload, 1))
>>> +               intel_gt_pm_active_begin(&engine->i915->gt);
>>
>> engine->gt
>>
>
> Applied your suggestions above locally, will probably wait to have a few
> more changes batched up before sending a v2.
>
>>> }
>>>
>>> static bool ctx_single_port_submission(const struct intel_context *ce)
>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists *
>>> const execlists)
>>>         clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>>
>>>         WRITE_ONCE(execlists->active, execlists->inflight);
>>> +
>>> +       if (atomic_xchg(&execlists->overload, 0)) {
>>> +               struct intel_engine_cs *engine =
>>> +                       container_of(execlists, typeof(*engine), execlists);
>>> +               intel_gt_pm_active_end(&engine->i915->gt);
>>> +       }
>>> }
>>>
>>> static inline void
>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs
>>> *engine)
>>>                         /* port0 completed, advanced to port1 */
>>>                         trace_ports(execlists, "completed",
>>>                                     execlists->active);
>>>
>>> +                       if (atomic_xchg(&execlists->overload, 0))
>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>>
>> So this loses track if we preempt a dual-ELSP submission with a
>> single-ELSP submission (and never go back to dual).
>>
>
> Yes, good point. You're right that if a dual-ELSP submission gets
> preempted by a single-ELSP submission "overload" will remain signaled
> until the first completion interrupt arrives (e.g. from the preempting
> submission).
>
>> If you move this to the end of the loop and check
>>
>>      if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>>              intel_gt_pm_active_end(engine->gt);
>>
>> so that it covers both preemption/promotion and completion.
>>
>
> That sounds reasonable.
>
>> However, that will fluctuate quite rapidly. (And runs the risk of
>> exceeding the sentinel.)
>>
>> An alternative approach would be to couple along
>> schedule_in/schedule_out
>>
>>      atomic_set(overload, -1);
>>
>>      __execlists_schedule_in:
>>              if (!atomic_fetch_inc(overload))
>>                      intel_gt_pm_active_begin(engine->gt);
>>      __execlists_schedule_out:
>>              if (!atomic_dec_return(overload))
>>                      intel_gt_pm_active_end(engine->gt);
>>
>> which would mean we are overloaded as soon as we try to submit an
>> overlapping ELSP.
>>
>
> That sounds good to me too, and AFAICT would have roughly the same
> behavior as this metric except for the preemption corner case you
> mention above. I'll try this and verify that I get approximately the
> same performance numbers.
>

This suggestion seems to lead to some minor regressions, I'm
investigating the issue. Will send a v2 as soon as I have something
along the lines of what you suggested running with equivalent
performance to v1.

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
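The schedule_in/schedule_out counter quoted above can be modeled in userspace with C11 atomics. A minimal sketch, under these assumptions: the counter starts at -1 meaning "no requests in flight", the kernel's atomic_fetch_inc()/atomic_dec_return() are expressed with atomic_fetch_add()/atomic_fetch_sub(), and plain counters stand in for the PM hooks.

```c
#include <stdatomic.h>

/* Observable stand-ins for intel_gt_pm_active_begin()/_end(). */
static int begin_calls, end_calls;
static void pm_active_begin(void) { begin_calls++; }
static void pm_active_end(void)   { end_calls++; }

/* -1 == no requests in flight; 0 == one; 1 == two (overlapping ELSP). */
static atomic_int overload = -1;

static void schedule_in(void)
{
	/* atomic_fetch_inc() returns the old value: it is 0 exactly when
	 * a second request joins one already in flight. */
	if (atomic_fetch_add(&overload, 1) == 0)
		pm_active_begin();
}

static void schedule_out(void)
{
	/* atomic_dec_return() returns the new value (old - 1): it is 0
	 * exactly when we drop back from two in-flight requests to one. */
	if (atomic_fetch_sub(&overload, 1) - 1 == 0)
		pm_active_end();
}
```

So "overloaded" becomes true on the second overlapping in-flight request and false again when only one remains, which is the behavior the quoted pseudocode describes.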
Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
Francisco Jerez writes:
> Tvrtko Ursulin writes:
> [...]
>> Some time ago we entertained the idea of GPU "load average", where that
>> was defined as a count of runnable requests (so batch buffers). How
>> that, more generic metric, would behave here if used as an input signal
>> really intrigues me. Sadly I don't have a patch ready to give to you and
>> ask to please test it.
>>
>> Or maybe the key is count of runnable contexts as opposed to requests,
>> which would more match the ELSP[1] idea.
>>
> [..]
> This patch takes the rather conservative approach of limiting the
> application of the response frequency PM QoS request to the more
> restrictive set of cases where we are most certain that CPU latency
> shouldn't be an issue, in order to avoid regressions. But it might be
> that you find the additional energy efficiency benefit from the more
> aggressive approach to be worth the cost to a few execlists submission
> latency-sensitive applications. I'm trying to get some numbers
> comparing the two approaches now, will post them here once I have
> results so we can make a more informed trade-off.
>

I got some results from the promised comparison between the dual-ELSP
utilization approach used in this series and the more obvious alternative
of keeping track of the time that any request (or context) is in flight.
As expected there are quite a few performance improvements (numbers
relative to this approach), however most of them are either synthetic
benchmarks or off-screen variants of benchmarks (the corresponding
on-screen variant of each benchmark below doesn't show a significant
improvement):

synmark/OglCSDof: XXX ±0.15% x18 -> XXX ±0.22% x12 d=1.15% ±0.18% p=0.00%
synmark/OglDeferred: XXX ±0.31% x18 -> XXX ±0.15% x12 d=1.16% ±0.26% p=0.00%
synmark/OglTexFilterAniso: XXX ±0.18% x18 -> XXX ±0.21% x12 d=1.25% ±0.19% p=0.00%
synmark/OglPSPhong: XXX ±0.43% x18 -> XXX ±0.29% x12 d=1.28% ±0.38% p=0.00%
synmark/OglBatch0: XXX ±0.40% x18 -> XXX ±0.53% x12 d=1.29% ±0.46% p=0.00%
synmark/OglVSDiffuse8: XXX ±0.49% x17 -> XXX ±0.25% x12 d=1.30% ±0.41% p=0.00%
synmark/OglVSTangent: XXX ±0.53% x18 -> XXX ±0.31% x12 d=1.31% ±0.46% p=0.00%
synmark/OglGeomPoint: XXX ±0.56% x18 -> XXX ±0.15% x12 d=1.48% ±0.44% p=0.00%
gputest/plot3d: XXX ±0.16% x18 -> XXX ±0.11% x12 d=1.50% ±0.14% p=0.00%
gputest/tess_x32: XXX ±0.15% x18 -> XXX ±0.06% x12 d=1.59% ±0.13% p=0.00%
synmark/OglTexFilterTri: XXX ±0.15% x18 -> XXX ±0.19% x12 d=1.62% ±0.17% p=0.00%
synmark/OglBatch3: XXX ±0.57% x18 -> XXX ±0.33% x12 d=1.70% ±0.49% p=0.00%
synmark/OglBatch1: XXX ±0.41% x18 -> XXX ±0.34% x12 d=1.81% ±0.38% p=0.00%
synmark/OglShMapVsm: XXX ±0.53% x18 -> XXX ±0.38% x12 d=1.81% ±0.48% p=0.00%
synmark/OglTexMem128: XXX ±0.62% x18 -> XXX ±0.29% x12 d=1.87% ±0.52% p=0.00%
phoronix/x11perf/test=Scrolling 500 x 500 px: XXX ±0.35% x6 -> XXX ±0.56% x12 d=2.23% ±0.52% p=0.00%
phoronix/x11perf/test=500px Copy From Window To Window: XXX ±0.00% x3 -> XXX ±0.74% x12 d=2.41% ±0.70% p=0.01%
gfxbench/gl_trex_off: XXX ±0.04% x3 -> XXX ±0.34% x12 d=2.59% ±0.32% p=0.00%
synmark/OglBatch2: XXX ±0.85% x18 -> XXX ±0.21% x12 d=2.87% ±0.67% p=0.00%
glbenchmark/GLB27_EgyptHD_inherited_C24Z16_FixedTime_Offscreen: XXX ±0.35% x3 -> XXX ±0.84% x12 d=3.03% ±0.81% p=0.01%
glbenchmark/GLB27_TRex_C24Z16_Offscreen: XXX ±0.23% x3 -> XXX ±0.32% x12
Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
Tvrtko Ursulin writes:
> On 11/03/2020 19:54, Francisco Jerez wrote:
>> Tvrtko Ursulin writes:
>>
>>> On 10/03/2020 22:26, Chris Wilson wrote:

Quoting Francisco Jerez (2020-03-10 21:41:55)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index b9b3f78f1324..a5d7a80b826d 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
        /* we need to manually load the submit queue */
        if (execlists->ctrl_reg)
                writel(EL_CTRL_LOAD, execlists->ctrl_reg);
+
+       if (execlists_num_ports(execlists) > 1 &&

pending[1] is always defined, the minimum submission is one slot, with
pending[1] as the sentinel NULL.

+           execlists->pending[1] &&
+           !atomic_xchg(&execlists->overload, 1))
+               intel_gt_pm_active_begin(&engine->i915->gt);

engine->gt

 }

 static bool ctx_single_port_submission(const struct intel_context *ce)
@@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
        clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));

        WRITE_ONCE(execlists->active, execlists->inflight);
+
+       if (atomic_xchg(&execlists->overload, 0)) {
+               struct intel_engine_cs *engine =
+                       container_of(execlists, typeof(*engine), execlists);
+               intel_gt_pm_active_end(&engine->i915->gt);
+       }
 }

 static inline void
@@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
                        /* port0 completed, advanced to port1 */
                        trace_ports(execlists, "completed", execlists->active);

+                       if (atomic_xchg(&execlists->overload, 0))
+                               intel_gt_pm_active_end(&engine->i915->gt);

So this loses track if we preempt a dual-ELSP submission with a
single-ELSP submission (and never go back to dual).

If you move this to the end of the loop and check

        if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
                intel_gt_pm_active_end(engine->gt);

so that it covers both preemption/promotion and completion.

However, that will fluctuate quite rapidly. (And runs the risk of
exceeding the sentinel.)

An alternative approach would be to couple along schedule_in/schedule_out

        atomic_set(overload, -1);

        __execlists_schedule_in:
                if (!atomic_fetch_inc(overload))
                        intel_gt_pm_active_begin(engine->gt);
        __execlists_schedule_out:
                if (!atomic_dec_return(overload))
                        intel_gt_pm_active_end(engine->gt);

which would mean we are overloaded as soon as we try to submit an
overlapping ELSP.

>>> Putting it this low-level into submission code also would not work well
>>> with GuC.
>>>
>>
>> I wrote a patch at some point that added calls to
>> intel_gt_pm_active_begin() and intel_gt_pm_active_end() to the GuC
>> submission code in order to obtain a similar effect. However people
>> requested me to leave GuC submission alone for the moment in order to
>> avoid interference with SLPC. At some point it might make sense to hook
>> this up in combination with SLPC, because SLPC doesn't provide much of a
>> CPU energy efficiency advantage in comparison to this series.
>>
>>> How about we try to keep some accounting one level higher, as the i915
>>> scheduler is passing requests on to the backend for execution?
>>>
>>> Or number of runnable contexts, if the distinction between contexts and
>>> requests is better for this purpose.
>>>
>>> Problematic bit in going one level higher though is that the exit point
>>> is less precisely coupled to the actual state. Or maybe with aggressive
>>> engine retire we have nowadays it wouldn't be a problem.
>>>
>>
>> The main advantage of instrumenting the execlists submission code at a
>> low level is that it gives us visibility over the number of ELSP ports
>> pending execution, which can cause the performance of the workload to be
>> substantially more or less latency-sensitive. GuC submission shouldn't
>> care about this variable, so it kind of makes sense for its behavior to
>> be slightly different.
>>
>> Anyway if we're willing to give up the accuracy of keeping track of this
>> at a low level (and give GuC submission exactly the same treatment) it
>> should be possible to move the tracking one level up.
>
> The results you got are certainly extremely attractive and the approach
> and code looks tidy and mature - just so you don't get me wrong that I
> am not objecting to the idea.
Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
On 11/03/2020 19:54, Francisco Jerez wrote:

Tvrtko Ursulin writes:

On 10/03/2020 22:26, Chris Wilson wrote:

Quoting Francisco Jerez (2020-03-10 21:41:55)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index b9b3f78f1324..a5d7a80b826d 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
        /* we need to manually load the submit queue */
        if (execlists->ctrl_reg)
                writel(EL_CTRL_LOAD, execlists->ctrl_reg);
+
+       if (execlists_num_ports(execlists) > 1 &&

pending[1] is always defined, the minimum submission is one slot, with
pending[1] as the sentinel NULL.

+           execlists->pending[1] &&
+           !atomic_xchg(&execlists->overload, 1))
+               intel_gt_pm_active_begin(&engine->i915->gt);

engine->gt

 }

 static bool ctx_single_port_submission(const struct intel_context *ce)
@@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
        clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));

        WRITE_ONCE(execlists->active, execlists->inflight);
+
+       if (atomic_xchg(&execlists->overload, 0)) {
+               struct intel_engine_cs *engine =
+                       container_of(execlists, typeof(*engine), execlists);
+               intel_gt_pm_active_end(&engine->i915->gt);
+       }
 }

 static inline void
@@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
                        /* port0 completed, advanced to port1 */
                        trace_ports(execlists, "completed", execlists->active);

+                       if (atomic_xchg(&execlists->overload, 0))
+                               intel_gt_pm_active_end(&engine->i915->gt);

So this loses track if we preempt a dual-ELSP submission with a
single-ELSP submission (and never go back to dual).

If you move this to the end of the loop and check

        if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
                intel_gt_pm_active_end(engine->gt);

so that it covers both preemption/promotion and completion.

However, that will fluctuate quite rapidly. (And runs the risk of
exceeding the sentinel.)

An alternative approach would be to couple along schedule_in/schedule_out

        atomic_set(overload, -1);

        __execlists_schedule_in:
                if (!atomic_fetch_inc(overload))
                        intel_gt_pm_active_begin(engine->gt);
        __execlists_schedule_out:
                if (!atomic_dec_return(overload))
                        intel_gt_pm_active_end(engine->gt);

which would mean we are overloaded as soon as we try to submit an
overlapping ELSP.

Putting it this low-level into submission code also would not work well
with GuC.

I wrote a patch at some point that added calls to
intel_gt_pm_active_begin() and intel_gt_pm_active_end() to the GuC
submission code in order to obtain a similar effect. However people
requested me to leave GuC submission alone for the moment in order to
avoid interference with SLPC. At some point it might make sense to hook
this up in combination with SLPC, because SLPC doesn't provide much of a
CPU energy efficiency advantage in comparison to this series.

How about we try to keep some accounting one level higher, as the i915
scheduler is passing requests on to the backend for execution?

Or number of runnable contexts, if the distinction between contexts and
requests is better for this purpose.

Problematic bit in going one level higher though is that the exit point
is less precisely coupled to the actual state. Or maybe with aggressive
engine retire we have nowadays it wouldn't be a problem.

The main advantage of instrumenting the execlists submission code at a
low level is that it gives us visibility over the number of ELSP ports
pending execution, which can cause the performance of the workload to be
substantially more or less latency-sensitive. GuC submission shouldn't
care about this variable, so it kind of makes sense for its behavior to
be slightly different.

Anyway if we're willing to give up the accuracy of keeping track of this
at a low level (and give GuC submission exactly the same treatment) it
should be possible to move the tracking one level up.

The results you got are certainly extremely attractive and the approach
and code looks tidy and mature - just so you don't get me wrong that I
am not objecting to the idea.

What I'd like to see is an easier to read breakdown of results, at
minimum with separate perf and perf-per-Watt results. A graph with
sorted results and error bars would also be nice.

Secondly, in the commit message of this particular patch I'd like to
read some more thought about why ELSP[1] occupancy is thought to be the
desired signal. Why for instance a deep ELSP[0] shouldn't benefit from
more TDP budget towards the GPU and similar. Also a description of the
control processing "rf_qos" function do with
Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
Tvrtko Ursulin writes:
> On 10/03/2020 22:26, Chris Wilson wrote:
>> Quoting Francisco Jerez (2020-03-10 21:41:55)
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> index b9b3f78f1324..a5d7a80b826d 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct
>>> intel_engine_cs *engine)
>>>         /* we need to manually load the submit queue */
>>>         if (execlists->ctrl_reg)
>>>                 writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>>> +
>>> +       if (execlists_num_ports(execlists) > 1 &&
>> pending[1] is always defined, the minimum submission is one slot, with
>> pending[1] as the sentinel NULL.
>>
>>> +           execlists->pending[1] &&
>>> +           !atomic_xchg(&execlists->overload, 1))
>>> +               intel_gt_pm_active_begin(&engine->i915->gt);
>>
>> engine->gt
>>
>>> }
>>>
>>> static bool ctx_single_port_submission(const struct intel_context *ce)
>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists *
>>> const execlists)
>>>         clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>>
>>>         WRITE_ONCE(execlists->active, execlists->inflight);
>>> +
>>> +       if (atomic_xchg(&execlists->overload, 0)) {
>>> +               struct intel_engine_cs *engine =
>>> +                       container_of(execlists, typeof(*engine), execlists);
>>> +               intel_gt_pm_active_end(&engine->i915->gt);
>>> +       }
>>> }
>>>
>>> static inline void
>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs
>>> *engine)
>>>                         /* port0 completed, advanced to port1 */
>>>                         trace_ports(execlists, "completed",
>>>                                     execlists->active);
>>>
>>> +                       if (atomic_xchg(&execlists->overload, 0))
>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>>
>> So this loses track if we preempt a dual-ELSP submission with a
>> single-ELSP submission (and never go back to dual).
>>
>> If you move this to the end of the loop and check
>>
>>      if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>>              intel_gt_pm_active_end(engine->gt);
>>
>> so that it covers both preemption/promotion and completion.
>>
>> However, that will fluctuate quite rapidly. (And runs the risk of
>> exceeding the sentinel.)
>>
>> An alternative approach would be to couple along
>> schedule_in/schedule_out
>>
>>      atomic_set(overload, -1);
>>
>>      __execlists_schedule_in:
>>              if (!atomic_fetch_inc(overload))
>>                      intel_gt_pm_active_begin(engine->gt);
>>      __execlists_schedule_out:
>>              if (!atomic_dec_return(overload))
>>                      intel_gt_pm_active_end(engine->gt);
>>
>> which would mean we are overloaded as soon as we try to submit an
>> overlapping ELSP.
>
> Putting it this low-level into submission code also would not work well
> with GuC.
>

I wrote a patch at some point that added calls to
intel_gt_pm_active_begin() and intel_gt_pm_active_end() to the GuC
submission code in order to obtain a similar effect. However people
requested me to leave GuC submission alone for the moment in order to
avoid interference with SLPC. At some point it might make sense to hook
this up in combination with SLPC, because SLPC doesn't provide much of a
CPU energy efficiency advantage in comparison to this series.

> How about we try to keep some accounting one level higher, as the i915
> scheduler is passing requests on to the backend for execution?
>
> Or number of runnable contexts, if the distinction between contexts and
> requests is better for this purpose.
>
> Problematic bit in going one level higher though is that the exit point
> is less precisely coupled to the actual state. Or maybe with aggressive
> engine retire we have nowadays it wouldn't be a problem.
>

The main advantage of instrumenting the execlists submission code at a
low level is that it gives us visibility over the number of ELSP ports
pending execution, which can cause the performance of the workload to be
substantially more or less latency-sensitive. GuC submission shouldn't
care about this variable, so it kind of makes sense for its behavior to
be slightly different.

Anyway if we're willing to give up the accuracy of keeping track of this
at a low level (and give GuC submission exactly the same treatment) it
should be possible to move the tracking one level up.

> Regards,
>
> Tvrtko
>

Thank you.
Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
Quoting Tvrtko Ursulin (2020-03-11 10:00:41)
> On 10/03/2020 22:26, Chris Wilson wrote:
>> Quoting Francisco Jerez (2020-03-10 21:41:55)
>>> static inline void
>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs
>>> *engine)
>>>                         /* port0 completed, advanced to port1 */
>>>                         trace_ports(execlists, "completed",
>>>                                     execlists->active);
>>>
>>> +                       if (atomic_xchg(&execlists->overload, 0))
>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>>
>> So this loses track if we preempt a dual-ELSP submission with a
>> single-ELSP submission (and never go back to dual).
>>
>> If you move this to the end of the loop and check
>>
>>      if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>>              intel_gt_pm_active_end(engine->gt);
>>
>> so that it covers both preemption/promotion and completion.
>>
>> However, that will fluctuate quite rapidly. (And runs the risk of
>> exceeding the sentinel.)
>>
>> An alternative approach would be to couple along
>> schedule_in/schedule_out
>>
>>      atomic_set(overload, -1);
>>
>>      __execlists_schedule_in:
>>              if (!atomic_fetch_inc(overload))
>>                      intel_gt_pm_active_begin(engine->gt);
>>      __execlists_schedule_out:
>>              if (!atomic_dec_return(overload))
>>                      intel_gt_pm_active_end(engine->gt);
>>
>> which would mean we are overloaded as soon as we try to submit an
>> overlapping ELSP.
>
> Putting it this low-level into submission code also would not work well
> with GuC.

We can cross that bridge when it is built. [The GuC is also likely to
not want to play with us anyway, and just use SLPC.]

Now, I suspect we may want to use an engine utilisation (busy-stats or
equivalent) metric, but honestly if we can finally land this work it
brings huge benefit for GPU bound TDP constrained workloads. (p-state
loves to starve the GPU even when it provides no extra benefit for the
CPU.) We can raise the bar, establish expected behaviour and then work
to maintain and keep on improving.
-Chris
Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
On 10/03/2020 22:26, Chris Wilson wrote: Quoting Francisco Jerez (2020-03-10 21:41:55) diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c index b9b3f78f1324..a5d7a80b826d 100644 --- a/drivers/gpu/drm/i915/gt/intel_lrc.c +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine) /* we need to manually load the submit queue */ if (execlists->ctrl_reg) writel(EL_CTRL_LOAD, execlists->ctrl_reg); + + if (execlists_num_ports(execlists) > 1 && pending[1] is always defined, the minimum submission is one slot, with pending[1] as the sentinel NULL. + execlists->pending[1] && + !atomic_xchg(>overload, 1)) + intel_gt_pm_active_begin(>i915->gt); engine->gt } static bool ctx_single_port_submission(const struct intel_context *ce) @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists) clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight)); WRITE_ONCE(execlists->active, execlists->inflight); + + if (atomic_xchg(>overload, 0)) { + struct intel_engine_cs *engine = + container_of(execlists, typeof(*engine), execlists); + intel_gt_pm_active_end(>i915->gt); + } } static inline void @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine) /* port0 completed, advanced to port1 */ trace_ports(execlists, "completed", execlists->active); + if (atomic_xchg(>overload, 0)) + intel_gt_pm_active_end(>i915->gt); So this looses track if we preempt a dual-ELSP submission with a single-ELSP submission (and never go back to dual). If you move this to the end of the loop and check if (!execlists->active[1] && atomic_xchg(>overload, 0)) intel_gt_pm_active_end(engine->gt); so that it covers both preemption/promotion and completion. However, that will fluctuate quite rapidly. (And runs the risk of exceeding the sentinel.) 
> An alternative approach would be to couple along
> schedule_in/schedule_out
>
>         atomic_set(overload, -1);
>
>         __execlists_schedule_in:
>                 if (!atomic_fetch_inc(overload))
>                         intel_gt_pm_active_begin(engine->gt);
>         __execlists_schedule_out:
>                 if (!atomic_dec_return(overload))
>                         intel_gt_pm_active_end(engine->gt);
>
> which would mean we are overloaded as soon as we try to submit an
> overlapping ELSP.

Putting it this low-level into the submission code also would not work
well with GuC. How about we try to keep some accounting one level
higher, as the i915 scheduler is passing requests on to the backend for
execution? Or count the number of runnable contexts, if the distinction
between contexts and requests is better for this purpose.

The problematic bit in going one level higher, though, is that the exit
point is less precisely coupled to the actual state. Or maybe with the
aggressive engine retire we have nowadays it wouldn't be a problem.

Regards,

Tvrtko

> The metric feels very multiple client (game + display server, or
> saturated transcode) centric. In the endless kernel world, we expect
> 100% engine utilisation from a single context, and never a dual-ELSP
> submission. They are also likely to want to avoid being throttled to
> conserve TDP for the CPU.
>
> Should we also reduce the overload for the number of clients who are
> waiting for interrupts from the GPU, so that their wakeup latency is
> not impacted?
>
> -Chris
Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
Chris Wilson writes:

> Quoting Francisco Jerez (2020-03-10 21:41:55)
>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c
>> b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> index b9b3f78f1324..a5d7a80b826d 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>         /* we need to manually load the submit queue */
>>         if (execlists->ctrl_reg)
>>                 writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>> +
>> +       if (execlists_num_ports(execlists) > 1 &&
>
> pending[1] is always defined, the minimum submission is one slot, with
> pending[1] as the sentinel NULL.
>
>> +           execlists->pending[1] &&
>> +           !atomic_xchg(&execlists->overload, 1))
>> +               intel_gt_pm_active_begin(&engine->i915->gt);
>
> engine->gt
>

Applied your suggestions above locally, will probably wait to have a few
more changes batched up before sending a v2.

>>  }
>>
>>  static bool ctx_single_port_submission(const struct intel_context *ce)
>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>>         clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>
>>         WRITE_ONCE(execlists->active, execlists->inflight);
>> +
>> +       if (atomic_xchg(&execlists->overload, 0)) {
>> +               struct intel_engine_cs *engine =
>> +                       container_of(execlists, typeof(*engine), execlists);
>> +               intel_gt_pm_active_end(&engine->i915->gt);
>> +       }
>>  }
>>
>>  static inline void
>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>                         /* port0 completed, advanced to port1 */
>>                         trace_ports(execlists, "completed", execlists->active);
>>
>> +                       if (atomic_xchg(&execlists->overload, 0))
>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>
> So this loses track if we preempt a dual-ELSP submission with a
> single-ELSP submission (and never go back to dual).
>

Yes, good point.
You're right that if a dual-ELSP submission gets preempted by a
single-ELSP submission, "overload" will remain signaled until the first
completion interrupt arrives (e.g. from the preempting submission).

> If you move this to the end of the loop and check
>
>         if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>                 intel_gt_pm_active_end(engine->gt);
>
> so that it covers both preemption/promotion and completion.
>

That sounds reasonable.

> However, that will fluctuate quite rapidly. (And runs the risk of
> exceeding the sentinel.)
>
> An alternative approach would be to couple along
> schedule_in/schedule_out
>
>         atomic_set(overload, -1);
>
>         __execlists_schedule_in:
>                 if (!atomic_fetch_inc(overload))
>                         intel_gt_pm_active_begin(engine->gt);
>         __execlists_schedule_out:
>                 if (!atomic_dec_return(overload))
>                         intel_gt_pm_active_end(engine->gt);
>
> which would mean we are overloaded as soon as we try to submit an
> overlapping ELSP.
>

That sounds good to me too, and AFAICT would have roughly the same
behavior as this metric except for the preemption corner case you
mention above. I'll try this and verify that I get approximately the
same performance numbers.

> The metric feels very multiple client (game + display server, or
> saturated transcode) centric. In the endless kernel world, we expect
> 100% engine utilisation from a single context, and never a dual-ELSP
> submission. They are also likely to want to avoid being throttled to
> conserve TDP for the CPU.
>

Yes, this metric is fairly conservative: it won't trigger in all cases
which would potentially benefit from the energy efficiency optimization,
only where we can be reasonably certain that CPU latency is not critical
in order to keep the GPU busy (e.g. because the CS has an additional
ELSP port pending execution that will immediately kick in as soon as the
current one completes).
My original approach was to call intel_gt_pm_active_begin() directly as
soon as the first ELSP is submitted to the GPU. That was somewhat more
effective at improving the energy efficiency of the system than waiting
for the second port to be in use, but it involved a slight execlists
submission latency cost that led to some regressions. It would certainly
cover the single-context case you have in mind, though. I'll get some
updated numbers with my previous approach so we can decide which one
provides a better trade-off.

> Should we also reduce the overload for the number of clients who are
> waiting for interrupts from the GPU, so that their wakeup latency is
> not impacted?

A number of clients waiting doesn't necessarily indicate that wake-up
latency is a concern. It frequently indicates the opposite: that the GPU
has a bottleneck which will only be exacerbated by attempting to reduce
the ramp-up latency of the CPU. IOW, I think we
Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
Quoting Francisco Jerez (2020-03-10 21:41:55)
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c
> b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index b9b3f78f1324..a5d7a80b826d 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>         /* we need to manually load the submit queue */
>         if (execlists->ctrl_reg)
>                 writel(EL_CTRL_LOAD, execlists->ctrl_reg);
> +
> +       if (execlists_num_ports(execlists) > 1 &&

pending[1] is always defined, the minimum submission is one slot, with
pending[1] as the sentinel NULL.

> +           execlists->pending[1] &&
> +           !atomic_xchg(&execlists->overload, 1))
> +               intel_gt_pm_active_begin(&engine->i915->gt);

engine->gt

>  }
>
>  static bool ctx_single_port_submission(const struct intel_context *ce)
> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>         clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>
>         WRITE_ONCE(execlists->active, execlists->inflight);
> +
> +       if (atomic_xchg(&execlists->overload, 0)) {
> +               struct intel_engine_cs *engine =
> +                       container_of(execlists, typeof(*engine), execlists);
> +               intel_gt_pm_active_end(&engine->i915->gt);
> +       }
>  }
>
>  static inline void
> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>                         /* port0 completed, advanced to port1 */
>                         trace_ports(execlists, "completed", execlists->active);
>
> +                       if (atomic_xchg(&execlists->overload, 0))
> +                               intel_gt_pm_active_end(&engine->i915->gt);

So this loses track if we preempt a dual-ELSP submission with a
single-ELSP submission (and never go back to dual).

If you move this to the end of the loop and check

        if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
                intel_gt_pm_active_end(engine->gt);

so that it covers both preemption/promotion and completion. However,
that will fluctuate quite rapidly. (And runs the risk of exceeding the
sentinel.)
An alternative approach would be to couple along
schedule_in/schedule_out

        atomic_set(overload, -1);

        __execlists_schedule_in:
                if (!atomic_fetch_inc(overload))
                        intel_gt_pm_active_begin(engine->gt);
        __execlists_schedule_out:
                if (!atomic_dec_return(overload))
                        intel_gt_pm_active_end(engine->gt);

which would mean we are overloaded as soon as we try to submit an
overlapping ELSP.

The metric feels very multiple client (game + display server, or
saturated transcode) centric. In the endless kernel world, we expect
100% engine utilisation from a single context, and never a dual-ELSP
submission. They are also likely to want to avoid being throttled to
conserve TDP for the CPU.

Should we also reduce the overload for the number of clients who are
waiting for interrupts from the GPU, so that their wakeup latency is not
impacted?

-Chris