Re: [Intel-gfx] [PATCH 36/56] drm/i915: Fair low-latency scheduling

2021-01-07 Thread Matthew Brost
On Thu, Jan 07, 2021 at 04:45:31PM +, Chris Wilson wrote:
> Quoting Matthew Brost (2021-01-07 16:05:07)
> > On Tue, Dec 29, 2020 at 12:01:25PM +, Chris Wilson wrote:
> > > [commit message snipped -- quoted in full in the original review below]
> > 
> > I haven't dug into 

Re: [Intel-gfx] [PATCH 36/56] drm/i915: Fair low-latency scheduling

2021-01-07 Thread Chris Wilson
Quoting Matthew Brost (2021-01-07 16:05:07)
> On Tue, Dec 29, 2020 at 12:01:25PM +, Chris Wilson wrote:
> > [commit message snipped -- quoted in full in the original review below]
> 
> I haven't dug into this series deeply, but it does seem plausible for
> execlist submission. However, this new scheduler seems completely broken
> for GuC submission.

Underneath it's the same topological sort, just with a different 

Re: [Intel-gfx] [PATCH 36/56] drm/i915: Fair low-latency scheduling

2021-01-07 Thread Matthew Brost
On Tue, Dec 29, 2020 at 12:01:25PM +, Chris Wilson wrote:
> The first "scheduler" was a topographical sorting of requests into
> priority order. The execution order was deterministic, the earliest
> submitted, highest priority request would be executed first. Priority
> inheritance ensured that inversions were kept at bay, and allowed us to
> dynamically boost priorities (e.g. for interactive pageflips).
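> 
> (For illustration only -- not code from this series -- a tiny userspace
> sketch of that ordering rule, with an invented request struct:)
> 
>   /* Old ordering: highest priority first, and within equal
>    * priority, earliest submission (FIFO) first. */
>   struct toy_request {
>           int prio;               /* higher value = more important */
>           unsigned long seqno;    /* monotonic submission order */
>   };
> 
>   static int old_sched_cmp(const struct toy_request *a,
>                            const struct toy_request *b)
>   {
>           if (a->prio != b->prio)
>                   return b->prio - a->prio;    /* higher prio runs first */
>           return a->seqno < b->seqno ? -1 : 1; /* then earliest submitted */
>   }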
> 
> The minimalistic timeslicing scheme was an attempt to introduce fairness
> between long running requests, by evicting the active request at the end
> of a timeslice and moving it to the back of its priority queue (while
> ensuring that dependencies were kept in order). For short running
> requests from many clients of equal priority, the scheme is still very
> much FIFO submission ordering, and as unfair as before.
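> 
> (Again just an illustrative sketch, not the driver's code: "moving it to
> the back of its priority queue" is, in effect, a tail requeue on a
> per-priority FIFO of ready requests:)
> 
>   #include <stddef.h>
> 
>   /* On timeslice expiry, rotate the active request to the tail of
>    * its (ready-only) priority queue so equal-priority peers get a
>    * turn before it runs again. */
>   struct toy_node { struct toy_node *next; };
> 
>   struct toy_fifo { struct toy_node *head, *tail; };
> 
>   static void toy_requeue_active(struct toy_fifo *q)
>   {
>           struct toy_node *active = q->head;
> 
>           if (!active || !active->next)
>                   return;                 /* nothing else to run */
> 
>           q->head = active->next;         /* evict from the front ... */
>           active->next = NULL;
>           q->tail->next = active;         /* ... append at the back */
>           q->tail = active;
>   }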
> 
> To impose fairness, we need an external metric that ensures that clients
> are interspersed, so we don't execute one long chain from client A before
> executing any of client B. This could be imposed by the clients
> themselves by using fences based on an external clock; that is, they only
> submit work for a "frame" at frame intervals, instead of submitting as
> much work as they are able to. The standard SwapBuffers approach is akin
> to double buffering, where, as one frame is being executed, the next is
> being submitted, such that there is always a maximum of two frames per
> client in the pipeline and so ideally maintains consistent input-output
> latency. Even this scheme exhibits unfairness under load as a single
> client will execute two frames back to back before the next, and with
> enough clients, deadlines will be missed.
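> 
> (A hedged sketch of that client-side throttle; submit_frame() and
> wait_for_fence() are hypothetical stand-ins for the client's submission
> and fence-wait calls:)
> 
>   /* Keep at most two frames in flight per client: before submitting
>    * frame N, wait for the fence of frame N-2. */
>   #define FRAMES_IN_FLIGHT 2
> 
>   int submit_frame(int frame);        /* hypothetical: returns a fence */
>   void wait_for_fence(int fence);     /* hypothetical: blocks on it */
> 
>   static void submit_throttled(int frame, int fences[FRAMES_IN_FLIGHT])
>   {
>           int slot = frame % FRAMES_IN_FLIGHT;
> 
>           if (frame >= FRAMES_IN_FLIGHT)
>                   wait_for_fence(fences[slot]);   /* frame N-2 complete? */
>           fences[slot] = submit_frame(frame);
>   }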
> 
> The idea introduced by BFS/MuQSS is that fairness is achieved by
> metering with an external clock. Every request, when it becomes ready to
> execute, is assigned a virtual deadline, and execution order is then
> determined by the earliest deadline. Priority is used as a hint, rather than
> for strict ordering, where high-priority requests have earlier deadlines,
> but not necessarily earlier than outstanding work. Thus work is executed
> in order of 'readiness', with timeslicing to demote long running work.
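> 
> (A rough userspace sketch of that idea -- not the implementation in this
> patch; the slice scaling below is invented purely to show the shape of it:)
> 
>   #include <stdint.h>
> 
>   /* When a request becomes ready it is stamped with a virtual
>    * deadline: now plus a slice scaled by priority.  The runqueue is
>    * then kept in earliest-deadline order, so priority only biases
>    * position rather than giving absolute precedence. */
>   struct toy_rq {
>           int prio;               /* hint: higher prio => shorter slice */
>           uint64_t deadline;      /* ns; smaller value runs first */
>   };
> 
>   static uint64_t prio_to_slice_ns(int prio)
>   {
>           uint64_t slice = 1000 * 1000;   /* invented nominal 1ms slice */
> 
>           if (prio > 0)
>                   slice >>= (prio > 10 ? 10 : prio);
>           else if (prio < 0)
>                   slice <<= (-prio > 10 ? 10 : -prio);
>           return slice;
>   }
> 
>   static void toy_mark_ready(struct toy_rq *rq, uint64_t now_ns)
>   {
>           rq->deadline = now_ns + prio_to_slice_ns(rq->prio);
>   }
> 
>   static const struct toy_rq *toy_earliest(const struct toy_rq *a,
>                                            const struct toy_rq *b)
>   {
>           return a->deadline <= b->deadline ? a : b;
>   }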
> 
> The Achilles' heel of this scheduler is its strong preference for
> low latency and its favouring of new queues. Whereas it was easy to dominate
> the old scheduler by flooding it with many requests over a short period
> of time, the new scheduler can be dominated by a 'synchronous' client
> that waits for each of its requests to complete before submitting the
> next. As such a client has no history, it is always considered
> ready-to-run and receives an earlier deadline than the long running
> requests. This is compensated for by refreshing the current execution's
> deadline and by disallowing preemption for timeslice shuffling.
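> 
> (Another purely illustrative sketch of those two compensations --
> refreshing the running request's deadline, and only preempting for
> genuinely earlier deadlines rather than for timeslice shuffles:)
> 
>   #include <stdbool.h>
>   #include <stdint.h>
> 
>   struct toy_active { uint64_t deadline; };
> 
>   /* On timeslice expiry the running request gets a fresh deadline,
>    * so a history-less synchronous client cannot keep undercutting it. */
>   static void toy_refresh_deadline(struct toy_active *active,
>                                    uint64_t now_ns, uint64_t slice_ns)
>   {
>           active->deadline = now_ns + slice_ns;
>   }
> 
>   /* Preempt only for work that is genuinely more urgent than the
>    * (refreshed) deadline of what is already running. */
>   static bool toy_should_preempt(const struct toy_active *active,
>                                  uint64_t newcomer_deadline)
>   {
>           return newcomer_deadline < active->deadline;
>   }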
> 
> To check the impact on throughput (often the downfall of latency-sensitive
> schedulers), we used gem_wsim to simulate various transcode workloads
> with different load balancers while varying the number of competing
> [heterogeneous] clients.
> 
> [ministat distribution plot of delta% (per-run throughput change) elided;
>  summary statistics:]
>
>       N         Min        Max        Median           Avg      Stddev
>     108  -23.982194  28.421527  -0.077474828  -0.072650418  0.16179718
> 
> The impact under contention was about 0.1% on average, due to the change in
> context execution order and the number of context switches. The biggest
> swings are due to the execution ordering favouring one client or another,
> and there may be room for improvement there.
> 

I haven't dug into this series deeply, but it does seem plausible for
execlist submission. However, this new scheduler seems completely broken
for GuC submission. The scheduler really isn't used that much with GuC
submission, but when it is, the old one works perfectly. The vfunc for
the scheduler is deleted in patch #25. I don't think that is a good idea,
as it appears we are trending towards two separate schedulers.

Lastly this