Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-18 Thread Liviu Dudau
On Mon, Apr 03, 2023 at 05:22:01PM -0700, Matthew Brost wrote:
> Hello,

Hello,

Jumping on this thread a bit late, as I was waiting on some approvals and then
holidays kicked in. I would like to (re)introduce myself and the people I work
with, and to let you know that we are interested in the changes proposed here
and would like to help if we can.

I currently maintain a number of Arm Mali Display drivers, but I have recently
moved to the Mali GPU team and we now have approval to start making
contributions to the upstream driver(s). We're planning to collaborate on
Boris' new Mali driver and make it work well on Mali GPUs. One of the first
things to look at (besides bringing the driver up on internal dev platforms)
is the scheduler changes proposed here.

As such, I would like to ask that people start including John Reitan,
Ketil Johnsen and me on patches. As soon as we have something working that we
can comment on, we will do so.

Best regards,
Liviu


> 
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
> 
> This can be thought of as 4 parts, detailed below.
> 
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
> 
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC), which is a new paradigm WRT the DRM scheduler and presents
> several problems, as the DRM scheduler was originally designed to schedule
> jobs on hardware queues. The main problem is that the DRM scheduler expects
> the submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler,
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution was originally thought of by Faith during
> the initial prototype of Xe: create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use this approach as well for the Mali driver [4].
> 
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM scheduler which isn't needed in a 1 to 1
> relationship. The new scheduling mode should unify all drivers' usage of a
> 1 to 1 relationship and can be thought of as using the scheduler as a
> dependency / in-flight job tracker rather than a true scheduler.
> 
> - Generic messaging interface for DRM scheduler
> 
> The idea is to be able to communicate with the submission backend via in-band
> (relative to the main execution function) messages. Messages are backend
> defined and flexible enough for any use case. In Xe we use these
> messages to clean up entities, set properties for entities, and suspend /
> resume execution of an entity [5]. I suspect other drivers can leverage
> this messaging concept too, as it is a convenient way to avoid races in the
> backend.
> 
> - Support for using TDR for all error paths of a scheduler / entity
> 
> Fix a few races / bugs and add a function to dynamically set the TDR timeout.
> 
> - Annotate dma-fences for long running workloads.
> 
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> this approach is that the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> rules.
> 
> Again, this is a first draft and we are looking forward to feedback.
> 
> Enjoy - Matt
> 
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/ 
> [3] https://patchwork.freedesktop.org/series/114772/
> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188=1
> [5] 
> https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> 
> Matthew Brost (8):
>   drm/sched: Convert drm scheduler to use a work queue rather than
> kthread
>   drm/sched: Move schedule policy to scheduler / entity
>   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>   drm/sched: Add generic scheduler message interface
>   drm/sched: Start run wq before TDR in drm_sched_start
>   drm/sched: Submit job before starting TDR
>   drm/sched: Add helper to set TDR timeout
>   drm/syncobj: Warn on long running dma-fences
> 
> Thomas Hellström (2):
>   dma-buf/dma-fence: Introduce long-running completion fences
>   drm/sched: Support long-running sched entities
> 
>  drivers/dma-buf/dma-fence.c | 142 +++---
>  drivers/dma-buf/dma-resv.c  |   5 +
>  
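The 1-to-1 model described in the cover letter can be illustrated with a
minimal, hedged sketch: one drm_gpu_scheduler paired with exactly one
drm_sched_entity per user queue, with run_job handing the job straight to the
firmware. All example_* names and the firmware-submit helper are hypothetical;
the RFC's actual API (e.g. the proposed DRM_SCHED_POLICY_SINGLE_ENTITY
plumbing) may differ.

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

/* Hypothetical helpers; not part of drm_sched or the RFC as posted. */
static struct dma_fence *example_fw_submit(struct drm_sched_job *job);
static enum drm_gpu_sched_stat example_timedout_job(struct drm_sched_job *job);
static void example_free_job(struct drm_sched_job *job);

/* One scheduler and exactly one entity per user queue. */
struct example_queue {
	struct drm_gpu_scheduler sched;	/* dependency / in-flight job tracker */
	struct drm_sched_entity entity;	/* the single entity fed to it */
};

static struct dma_fence *example_run_job(struct drm_sched_job *job)
{
	/*
	 * Hand the job straight to the firmware scheduler (e.g. the GuC).
	 * Cross-queue ordering is the firmware's problem, not drm_sched's,
	 * which is why submission order == completion order no longer holds.
	 */
	return example_fw_submit(job);
}

static const struct drm_sched_backend_ops example_sched_ops = {
	.run_job	= example_run_job,
	.timedout_job	= example_timedout_job,
	.free_job	= example_free_job,
};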

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-17 Thread Daniel Vetter
On Mon, Apr 17, 2023 at 08:47:19AM +0200, Christian König wrote:
> Am 11.04.23 um 16:13 schrieb Daniel Vetter:
> > On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote:
> > > The point is that this not only requires some work in the drm_scheduler,
> > > but rather it then makes only little sense to use the drm_scheduler in
> > > the first place.
> > > 
> > > The whole point of the drm_scheduler is to provide dma_fence
> > > implementation for the submitted jobs.
> > > 
> > > We also have dependency handling, but as Daniel and I said this can be
> > > easily extracted into a separate object/component.
> > Uh that's not what I meant. My take is that minimally patching drm/sched
> > to make the out-fence either optional, or complete it right away, is the
> > simplest way to get at the dependency handling. For me at least the major
> > part of drm/sched is the dep handling and timeout stuff. And the latter can
> > be reused with some glue to handle preempt timeouts too and other things;
> > since tdr is a work struct you can just issue any other gpu timeouts on
> > the same workqueue using roughly the same pattern as the ->timed_out
> > hook and it'll just work.
> 
> Well that strongly sounds like what I had in mind as well.
> 
> Whether we move the dependency/timeout functionality into a new component or
> move the scheduler fence into a new component doesn't seem to matter; the
> high level goal is that we have separated the two functionalities, and both
> approaches will work for that.

Ah ok, I guess I just got confused about your wording then.

> > The entire "oh we also make sure your hw fence doesn't leak into public
> > fences and causes lifetime mayhem" seems pretty minor. And maybe also
> > something we want to replicate for the preempt-ctx dma_fence that some
> > long-running contexts need (but more as part of drm_sched_entity I guess).
> > 
> > We can of course bikeshed how much flexibility really should be in the
> > different parts of drm/sched, but imo that's a bikeshed.
> 
> Well the dependency handling in a separate component would still be
> interesting to have since we need something similar for user queues as well.

Yeah it might be neater to refactor, but I think that part is optional at
least near-term. There's always room to polish shared code, and often it's
better to do that once you have at least some in-tree users for the new
need :-)
-Daniel

> 
> Christian.
> 
> > -Daniel
> > 
> > 
> > > Regards,
> > > Christian.
> > > 
> > > Am 07.04.23 um 02:20 schrieb Zeng, Oak:
> > > > So this series basically goes with option 2. The part of option 2 that makes
> > > > me uncomfortable is: dma-fence doesn't work for long running workloads, so
> > > > why do we generate it in the first place? As long as a dma-fence is
> > > > generated, it will become a source of confusion in the future. It
> > > > doesn't matter how much you annotate it/document it. So if we decide to
> > > > go with option 2, the bottom line is: don't generate a dma-fence for long
> > > > running workloads during job submission. This requires some rework in the
> > > > drm scheduler.
> > > > 
> > > > The cleanest solution to me is option 3. Dma-fence is a very old
> > > > technology. When it was created, no GPU supported page faults. Obviously
> > > > this is not a good technology for modern GPUs with page fault support. I
> > > > think the best way is to create a new scheduler and dependency tracking
> > > > mechanism that works for both page fault enabled and page fault disabled
> > > > contexts. I think this matches what Christian said below. Maybe nobody
> > > > thinks this is easy?
> > > > 
> > > > Thanks,
> > > > Oak
> > > > 
> > > > > -Original Message-
> > > > > From: Brost, Matthew 
> > > > > Sent: April 5, 2023 2:53 PM
> > > > > To: Zeng, Oak 
> > > > > Cc: Christian König ; Vetter, Daniel
> > > > > ; Thomas Hellström
> > > > > ; dri-devel@lists.freedesktop.org; 
> > > > > intel-
> > > > > x...@lists.freedesktop.org; robdcl...@chromium.org; airl...@linux.ie;
> > > > > l...@asahilina.net; boris.brezil...@collabora.com; 
> > > > > faith.ekstr...@collabora.com
> > > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running 
> > > > > workload
> >
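Daniel's point earlier in this message, that the TDR is a work struct and
other GPU timeouts can be issued on the same workqueue using roughly the
->timed_out pattern, might look something like the sketch below. This is a
hedged illustration only; the example_* names, the preempt-timeout handler and
the way the timeout workqueue is obtained are assumptions, not the RFC's
actual code.

#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical: a per-queue preempt timeout serialized with the scheduler's
 * own TDR by living on the same (assumed ordered) timeout workqueue. */
struct example_exec_queue {
	struct drm_gpu_scheduler sched;		/* its TDR is also a delayed work */
	struct delayed_work preempt_timeout;	/* extra timeout, same pattern */
};

static void example_handle_stuck_preempt(struct example_exec_queue *q);

static void example_preempt_timeout_fn(struct work_struct *w)
{
	struct example_exec_queue *q =
		container_of(w, struct example_exec_queue, preempt_timeout.work);

	/* Same shape as a ->timed_out hook: stop, recover, restart. */
	example_handle_stuck_preempt(q);
}

/* Arm the preempt timeout on the workqueue the TDR already uses, so (on an
 * ordered workqueue) the two timeout handlers never run concurrently. */
static void example_arm_preempt_timeout(struct example_exec_queue *q,
					struct workqueue_struct *timeout_wq,
					unsigned long timeout_jiffies)
{
	INIT_DELAYED_WORK(&q->preempt_timeout, example_preempt_timeout_fn);
	queue_delayed_work(timeout_wq, &q->preempt_timeout, timeout_jiffies);
}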

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-17 Thread Christian König

Am 11.04.23 um 16:13 schrieb Daniel Vetter:

On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote:

The point is that this not only requires some work in the drm_scheduler, but
rather it then makes only little sense to use the drm_scheduler in the first
place.

The whole point of the drm_scheduler is to provide dma_fence implementation
for the submitted jobs.

We also have dependency handling, but as Daniel and I said this can be
easily extracted into a separate object/component.

Uh that's not what I meant. My take is that minimally patching drm/sched
to make the out-fence either optional, or complete it right away, is the
simplest way to get at the dependency handling. For me at least the major
part of drm/sched is the dep handling and timeout stuff. And the latter can 
be reused with some glue to handle preempt timeouts too and other things, 
since tdr is a work struct you can just issue any other gpu timeouts on 
the same workqueue using roughly the same pattern as the ->timed_out 
hook and it'll just work.


Well that strongly sounds like what I had in mind as well.

Whether we move the dependency/timeout functionality into a new component or 
move the scheduler fence into a new component doesn't seem to matter; the 
high level goal is that we have separated the two functionalities, and both 
approaches will work for that.



The entire "oh we also make sure your hw fence doesn't leak into public
fences and causes lifetime mayhem" seems pretty minor. And maybe also
something we want to replicate for the preempt-ctx dma_fence that some
long-running contexts need (but more as part of drm_sched_entity I guess).

We can of course bikeshed how much flexibility really should be in the
different parts of drm/sched, but imo that's a bikeshed.


Well the dependency handling in a separate component would still be 
interesting to have since we need something similar for user queues as well.


Christian.


-Daniel



Regards,
Christian.

Am 07.04.23 um 02:20 schrieb Zeng, Oak:

So this series basically goes with option 2. The part of option 2 that makes me 
uncomfortable is: dma-fence doesn't work for long running workloads, so why do we 
generate it in the first place? As long as a dma-fence is generated, it will 
become a source of confusion in the future. It doesn't matter how much you 
annotate it/document it. So if we decide to go with option 2, the bottom line 
is: don't generate a dma-fence for long running workloads during job submission. 
This requires some rework in the drm scheduler.

The cleanest solution to me is option 3. Dma-fence is a very old technology. 
When it was created, no GPU supported page faults. Obviously this is not a good 
technology for modern GPUs with page fault support. I think the best way is to 
create a new scheduler and dependency tracking mechanism that works for both page 
fault enabled and page fault disabled contexts. I think this matches what 
Christian said below. Maybe nobody thinks this is easy?

Thanks,
Oak


-Original Message-
From: Brost, Matthew 
Sent: April 5, 2023 2:53 PM
To: Zeng, Oak 
Cc: Christian König ; Vetter, Daniel
; Thomas Hellström
; dri-devel@lists.freedesktop.org; intel-
x...@lists.freedesktop.org; robdcl...@chromium.org; airl...@linux.ie;
l...@asahilina.net; boris.brezil...@collabora.com; faith.ekstr...@collabora.com
Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
plans

On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:

Hi,

Using dma-fence for completion/dependency tracking for long-run
workload (more precisely, on-demand paging/page fault enabled workload) can
cause deadlock. This seems to be the significant issue here. Other issues, such as the
drm scheduler completion order implication etc., are minor and can be solved
inside the framework of the drm scheduler. We need to evaluate the paths below:

1) Still use the drm scheduler for job submission, and use dma-fence for job
completion waiting/dependency tracking. This is the solution proposed in this series.
Annotate dma-fence for long-run workload: user can still wait on a dma-fence for job
completion but can't wait on a dma-fence while holding any memory management
locks. We still use dma-fence for dependency tracking, but it very easily
runs into deadlock when on-demand paging is in the picture. The annotation helps
us detect deadlock but not solve deadlock problems. Seems *not* a complete
solution: it is almost impossible to completely avoid dependency deadlock in a
complex runtime environment.

No one can wait on an LR fence, so it is impossible to deadlock. The
annotations enforce this. Literally this is only for flow controlling the
ring / holding pending jobs in the DRM scheduler list.


2) Still use the drm scheduler but don't use dma-fence for completion signaling
and dependency tracking. This way we still get some free functions (reset, error
handling, ring flow control as Matt said) from the drm scheduler, but push the
dependency/completion tracking completely to user space usi
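Matt's point above, that no one can wait on an LR fence and the annotation
only exists so the scheduler can flow control the ring, amounts to keeping
such fences strictly internal. A hedged sketch of what that boundary check
could look like follows (the flag name, helper and export function below are
hypothetical; the real annotation is whatever the dma-fence patches in this
series define):

#include <linux/dma-fence.h>
#include <linux/bug.h>

/* Hypothetical long-running annotation bit, stacked on the existing flag space. */
#define EXAMPLE_DMA_FENCE_FLAG_LR_BIT	(DMA_FENCE_FLAG_USER_BITS + 1)

static bool example_fence_is_long_running(struct dma_fence *fence)
{
	return test_bit(EXAMPLE_DMA_FENCE_FLAG_LR_BIT, &fence->flags);
}

/*
 * At any point where a fence would become visible outside the scheduler
 * (sync_file export, drm_syncobj, dma_resv), refuse long-running fences so
 * they can never be waited on under memory-management locks.
 */
static int example_export_fence(struct dma_fence *fence)
{
	if (WARN_ON(example_fence_is_long_running(fence)))
		return -EINVAL;	/* LR fences stay internal to the scheduler */

	return 0;		/* normal export path would continue here */
}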

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-16 Thread Matthew Brost
On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
> On 04/04/2023 10.58, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> > > Hi, thanks for the Cc!
> > > 
> > 
> > No problem.
> > 
> > > On 04/04/2023 09.22, Matthew Brost wrote:
> > > > Hello,
> > > > 
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > > 
> > > > This can be thought of as 4 parts, detailed below.
> > > > 
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > entity (patches 1-3)
> > > > 
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC), which is a new paradigm WRT the DRM scheduler and presents
> > > > several problems, as the DRM scheduler was originally designed to schedule
> > > > jobs on hardware queues. The main problem is that the DRM scheduler expects
> > > > the submission order of jobs to be the completion order of jobs even across
> > > > multiple entities. This assumption falls apart with a firmware scheduler,
> > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > of order. A novel solution was originally thought of by Faith during
> > > > the initial prototype of Xe: create a 1 to 1 relationship between scheduler
> > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > Boris may use this approach as well for the Mali driver [4].
> > > > 
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM scheduler which isn't needed in a 1 to 1
> > > > relationship. The new scheduling mode should unify all drivers' usage of a
> > > > 1 to 1 relationship and can be thought of as using the scheduler as a
> > > > dependency / in-flight job tracker rather than a true scheduler.
> > > 
> > > Yup, we're in the exact same situation with drm/asahi, so this is very
> > > welcome! We've been using the existing scheduler as-is, but this should 
> > > help
> > > remove some unneeded complexity in this use case.
> > > 
> > 
> > That's the idea.
> > 
> > > Do you want me to pull in this series into our tree and make sure this all
> > > works out for us?
> > > 
> > 
> > We tested this in Xe and it definitely works for us but the more testing
> > the better.
> > 
> 
> I haven't gotten around to testing this series yet, but after more debugging
> of drm_sched issues I want to hear more about how Xe uses the scheduler.
> 
> From what I can tell, and from what Christian says, drm_sched has the hidden
> requirement that all job objects outlive the scheduler. I've run into
> several UAF bugs due to this. Not only that, it also currently has the
> requirement that all drm_sched fences outlive the scheduler object.
> 
> These requirements are subtle and only manifest as kernel oopses in rare
> corner cases, so it wasn't at all obvious to me that this was somehow a
> fundamental design assumption when I started using it.
> 
> As far as I can tell, this design is going to work in 99% of cases for
> global-schedulers-per-GPU models, where those corner cases would have to be
> hit on top of a GPU removal scenario (and GPU remove is... well, not the
> most tested/exercised use case). When the scheduler basically lives forever,
> none of this really matters.
> 
> But with a one-scheduler-per-queue model, how do you deal with this when the
> queue goes away? So far, without any of the partial bugfixes I have sent so
> far (which Christian objected to):
> 
> - If you try to tear down a scheduler with any jobs currently scheduled at
> the hardware, drm_sched will oops when those jobs complete and the hw fences
> signal.
> - If you try to tear down an entity (which should cancel all its pending
> jobs) and then the scheduler it was attached to without actually waiting for
> all the free_job() callbacks to be called on every job that ever existed for
> that entity, you can oops (entity cleanup is asynchronous in some cases like
> killed processes, so it will return before all jobs are freed and then that
> asynchronous process will crash and burn if the scheduler goes away out from
> under its feet). Waiting for job completion fences is not enough for this,
> you have to wait until free_job() has actually been called for all jobs.
> - Even if you actually wait for all jobs to be truly gone and then tear down
> the scheduler, if any scheduler job fences remain alive, that will then oops
> if you try to call the debug functions on them (like cat
> /sys/kernel/debug/dma_buf/bufinfo).
> 
> I tried to fix these things, but Christian objected implying it was the
> driver's job to keep a 
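One way to address the teardown hazards Lina lists above is to make queue
destruction wait until free_job() has actually run for every job, not merely
until the completion fences signal. The sketch below is only an illustration
of that idea under assumed names (example_*); it is not how Xe, drm/asahi or
the RFC actually implement teardown.

#include <linux/wait.h>
#include <linux/atomic.h>
#include <linux/slab.h>
#include <drm/gpu_scheduler.h>

struct example_queue {
	struct drm_gpu_scheduler sched;
	atomic_t jobs_outstanding;	/* incremented when a job is armed */
	wait_queue_head_t idle_wq;
};

static void example_free_job(struct drm_sched_job *sched_job)
{
	struct example_queue *q =
		container_of(sched_job->sched, struct example_queue, sched);

	drm_sched_job_cleanup(sched_job);
	/* ...free the driver's job container here... */

	/* Only count the job as gone once free_job() has run. */
	if (atomic_dec_and_test(&q->jobs_outstanding))
		wake_up(&q->idle_wq);
}

static void example_queue_destroy(struct example_queue *q)
{
	/* Wait for free_job() of every job; only then is teardown safe. */
	wait_event(q->idle_wq, !atomic_read(&q->jobs_outstanding));
	drm_sched_fini(&q->sched);
	kfree(q);
}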

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-12 Thread Daniel Vetter
On Wed, Apr 12, 2023 at 02:47:52PM +0900, Asahi Lina wrote:
> On 11/04/2023 23.07, Daniel Vetter wrote:
> > On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
> > > On 04/04/2023 10.58, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> > > > > Hi, thanks for the Cc!
> > > > > 
> > > > 
> > > > No problem.
> > > > 
> > > > > On 04/04/2023 09.22, Matthew Brost wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > > > have been asked to merge our common DRM scheduler patches first as 
> > > > > > well
> > > > > > as develop a common solution for long running workloads with the DRM
> > > > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > > > welcome any and all feedback.
> > > > > > 
> > > > > > This can be thought of as 4 parts, detailed below.
> > > > > > 
> > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler 
> > > > > > and
> > > > > > entity (patches 1-3)
> > > > > > 
> > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler
> > > > > > (the GuC), which is a new paradigm WRT the DRM scheduler and presents
> > > > > > several problems, as the DRM scheduler was originally designed to
> > > > > > schedule jobs on hardware queues. The main problem is that the DRM
> > > > > > scheduler expects the submission order of jobs to be the completion
> > > > > > order of jobs even across multiple entities. This assumption falls
> > > > > > apart with a firmware scheduler, as a firmware scheduler has no concept
> > > > > > of jobs and jobs can complete out of order. A novel solution was
> > > > > > originally thought of by Faith during the initial prototype of Xe:
> > > > > > create a 1 to 1 relationship between scheduler and entity. I believe
> > > > > > the AGX driver [3] is using this approach and Boris may use this
> > > > > > approach as well for the Mali driver [4].
> > > > > > 
> > > > > > To support a 1 to 1 relationship we move the main execution function
> > > > > > from a kthread to a work queue and add a new scheduling mode which
> > > > > > bypasses code in the DRM scheduler which isn't needed in a 1 to 1
> > > > > > relationship. The new scheduling mode should unify all drivers' usage
> > > > > > of a 1 to 1 relationship and can be thought of as using the scheduler
> > > > > > as a dependency / in-flight job tracker rather than a true scheduler.
> > > > > 
> > > > > Yup, we're in the exact same situation with drm/asahi, so this is very
> > > > > welcome! We've been using the existing scheduler as-is, but this 
> > > > > should help
> > > > > remove some unneeded complexity in this use case.
> > > > > 
> > > > 
> > > > That's the idea.
> > > > 
> > > > > Do you want me to pull in this series into our tree and make sure 
> > > > > this all
> > > > > works out for us?
> > > > > 
> > > > 
> > > > We tested this in Xe and it definitely works for us but the more testing
> > > > the better.
> > > > 
> > > 
> > > I haven't gotten around to testing this series yet, but after more 
> > > debugging
> > > of drm_sched issues I want to hear more about how Xe uses the scheduler.
> > > 
> > >  From what I can tell, and from what Christian says, drm_sched has the 
> > > hidden
> > > requirement that all job objects outlive the scheduler. I've run into
> > > several UAF bugs due to this. Not only that, it also currently has the
> > > requirement that all drm_sched fences outlive the scheduler object.
> > > 
> > > These requirements are subtle and only manifest as kernel oopses in rare
> > > corner cases, so it wasn't at all obvious to me that this was somehow a
> > > fundamental design assumption when I started using it.
> > > 
> > > As far as I can tell, this design is going to work in 99% of cases for
> > > global-schedulers-per-GPU models, where those corner cases would have to 
> > > be
> > > hit on top of a GPU removal scenario (and GPU remove is... well, not the
> > > most tested/exercised use case). When the scheduler basically lives 
> > > forever,
> > > none of this really matters.
> > > 
> > > But with a one-scheduler-per-queue model, how do you deal with this when 
> > > the
> > > queue goes away? So far, without any of the partial bugfixes I have sent 
> > > so
> > > far (which Christian objected to):
> > > 
> > > - If you try to tear down a scheduler with any jobs currently scheduled at
> > > the hardware, drm_sched will oops when those jobs complete and the hw 
> > > fences
> > > signal.
> > > - If you try to tear down an entity (which should cancel all its pending
> > > jobs) and then the scheduler it was attached to without actually waiting 
> > > for
> > > all the free_job() callbacks to be called on every job that ever existed 
> > > for
> > > that entity, you can oops (entity cleanup is asynchronous in some cases 
> > 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-11 Thread Asahi Lina

On 11/04/2023 23.07, Daniel Vetter wrote:

On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:

On 04/04/2023 10.58, Matthew Brost wrote:

On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:

Hi, thanks for the Cc!



No problem.


On 04/04/2023 09.22, Matthew Brost wrote:

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can be thought of as 4 parts, detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC), which is a new paradigm WRT the DRM scheduler and presents
several problems, as the DRM scheduler was originally designed to schedule jobs on
hardware queues. The main problem is that the DRM scheduler expects the
submission order of jobs to be the completion order of jobs even across
multiple entities. This assumption falls apart with a firmware scheduler,
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution was originally thought of by Faith during
the initial prototype of Xe: create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use this approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM scheduler which isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers' usage of a 1 to 1
relationship and can be thought of as using the scheduler as a dependency /
in-flight job tracker rather than a true scheduler.


Yup, we're in the exact same situation with drm/asahi, so this is very
welcome! We've been using the existing scheduler as-is, but this should help
remove some unneeded complexity in this use case.



That's the idea.


Do you want me to pull in this series into our tree and make sure this all
works out for us?



We tested this in Xe and it definitely works for us but the more testing
the better.



I haven't gotten around to testing this series yet, but after more debugging
of drm_sched issues I want to hear more about how Xe uses the scheduler.

 From what I can tell, and from what Christian says, drm_sched has the hidden
requirement that all job objects outlive the scheduler. I've run into
several UAF bugs due to this. Not only that, it also currently has the
requirement that all drm_sched fences outlive the scheduler object.

These requirements are subtle and only manifest as kernel oopses in rare
corner cases, so it wasn't at all obvious to me that this was somehow a
fundamental design assumption when I started using it.

As far as I can tell, this design is going to work in 99% of cases for
global-schedulers-per-GPU models, where those corner cases would have to be
hit on top of a GPU removal scenario (and GPU remove is... well, not the
most tested/exercised use case). When the scheduler basically lives forever,
none of this really matters.

But with a one-scheduler-per-queue model, how do you deal with this when the
queue goes away? So far, without any of the partial bugfixes I have sent so
far (which Christian objected to):

- If you try to tear down a scheduler with any jobs currently scheduled at
the hardware, drm_sched will oops when those jobs complete and the hw fences
signal.
- If you try to tear down an entity (which should cancel all its pending
jobs) and then the scheduler it was attached to without actually waiting for
all the free_job() callbacks to be called on every job that ever existed for
that entity, you can oops (entity cleanup is asynchronous in some cases like
killed processes, so it will return before all jobs are freed and then that
asynchronous process will crash and burn if the scheduler goes away out from
under its feet). Waiting for job completion fences is not enough for this,
you have to wait until free_job() has actually been called for all jobs.
- Even if you actually wait for all jobs to be truly gone and then tear down
the scheduler, if any scheduler job fences remain alive, that will then oops
if you try to call the debug functions on them (like cat
/sys/kernel/debug/dma_buf/bufinfo).

I tried to fix these things, but Christian objected implying it was the
driver's job to keep a reference from jobs and hw fences to the scheduler.
But I find that completely broken, because besides the extra memory/resource
usage keeping the scheduler alive when you're trying to free resources as
fast as possible when a process goes away, you can't even use normal
reference counting for that: if you try to drop the last drm_sched reference
from within a free_job() callback, the whole thing 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-11 Thread Daniel Vetter
On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote:
> The point is that this not only requires some work in the drm_scheduler, but
> rather it then makes only little sense to use the drm_scheduler in the first
> place.
> 
> The whole point of the drm_scheduler is to provide dma_fence implementation
> for the submitted jobs.
> 
> We also have dependency handling, but as Daniel and I said this can be
> easily extracted into a separate object/component.

Uh that's not what I meant. My take is that minimally patching drm/sched
to make the out-fence either optional, or complete it right away, is the
simplest way to get at the dependency handling. For me at least the major
part of drm/sched is the dep handling and timeout stuff. And the latter can
be reused with some glue to handle preempt timeouts too and other things;
since tdr is a work struct you can just issue any other gpu timeouts on
the same workqueue using roughly the same pattern as the ->timed_out
hook and it'll just work.

The entire "oh we also make sure your hw fence doesn't leak into public
fences and causes lifetime mayhem" seems pretty minor. And maybe also
something we want to replicate for the preempt-ctx dma_fence that some
long-running contexts need (but more as part of drm_sched_entity I guess).

We can of course bikeshed how much flexibility really should be in the
different parts of drm/sched, but imo that's a bikeshed.
-Daniel


> 
> Regards,
> Christian.
> 
> Am 07.04.23 um 02:20 schrieb Zeng, Oak:
> > So this series basically goes with option 2. The part of option 2 that makes me 
> > uncomfortable is: dma-fence doesn't work for long running workloads, so why do we 
> > generate it in the first place? As long as a dma-fence is generated, it will 
> > become a source of confusion in the future. It doesn't matter how much you 
> > annotate it/document it. So if we decide to go with option 2, the bottom 
> > line is: don't generate a dma-fence for long running workloads during job 
> > submission. This requires some rework in the drm scheduler.
> > 
> > The cleanest solution to me is option 3. Dma-fence is a very old technology. 
> > When it was created, no GPU supported page faults. Obviously this is not a 
> > good technology for modern GPUs with page fault support. I think the best 
> > way is to create a new scheduler and dependency tracking mechanism that works 
> > for both page fault enabled and page fault disabled contexts. I think this 
> > matches what Christian said below. Maybe nobody thinks this is easy?
> > 
> > Thanks,
> > Oak
> > 
> > > -Original Message-
> > > From: Brost, Matthew 
> > > Sent: April 5, 2023 2:53 PM
> > > To: Zeng, Oak 
> > > Cc: Christian König ; Vetter, Daniel
> > > ; Thomas Hellström
> > > ; dri-devel@lists.freedesktop.org; 
> > > intel-
> > > x...@lists.freedesktop.org; robdcl...@chromium.org; airl...@linux.ie;
> > > l...@asahilina.net; boris.brezil...@collabora.com; 
> > > faith.ekstr...@collabora.com
> > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > plans
> > > 
> > > On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> > > > Hi,
> > > > 
> > > > Using dma-fence for completion/dependency tracking for long-run
> > > > workload (more precisely, on-demand paging/page fault enabled workload) can
> > > > cause deadlock. This seems to be the significant issue here. Other issues
> > > > such as the drm scheduler completion order implication etc. are minor and
> > > > can be solved inside the framework of the drm scheduler. We need to
> > > > evaluate the paths below:
> > > > 1) Still use the drm scheduler for job submission, and use dma-fence for job
> > > > completion waiting/dependency tracking. This is the solution proposed in this
> > > > series. Annotate dma-fence for long-run workload: user can still wait on a
> > > > dma-fence for job completion but can't wait on a dma-fence while holding any
> > > > memory management locks. We still use dma-fence for dependency tracking, but
> > > > it very easily runs into deadlock when on-demand paging is in the picture.
> > > > The annotation helps us detect deadlock but not solve deadlock problems.
> > > > Seems *not* a complete solution: it is almost impossible to completely
> > > > avoid dependency deadlock in a complex runtime environment.
> > > 
> > > No one can wait on an LR fence, so it is impossible to deadlock. The
> > > annotations enforce this. Literally t

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-11 Thread Daniel Vetter
On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
> On 04/04/2023 10.58, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> > > Hi, thanks for the Cc!
> > > 
> > 
> > No problem.
> > 
> > > On 04/04/2023 09.22, Matthew Brost wrote:
> > > > Hello,
> > > > 
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > > 
> > > > This can be thought of as 4 parts, detailed below.
> > > > 
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > entity (patches 1-3)
> > > > 
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC), which is a new paradigm WRT the DRM scheduler and presents
> > > > several problems, as the DRM scheduler was originally designed to schedule
> > > > jobs on hardware queues. The main problem is that the DRM scheduler expects
> > > > the submission order of jobs to be the completion order of jobs even across
> > > > multiple entities. This assumption falls apart with a firmware scheduler,
> > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > of order. A novel solution was originally thought of by Faith during
> > > > the initial prototype of Xe: create a 1 to 1 relationship between scheduler
> > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > Boris may use this approach as well for the Mali driver [4].
> > > > 
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM scheduler which isn't needed in a 1 to 1
> > > > relationship. The new scheduling mode should unify all drivers' usage of a
> > > > 1 to 1 relationship and can be thought of as using the scheduler as a
> > > > dependency / in-flight job tracker rather than a true scheduler.
> > > 
> > > Yup, we're in the exact same situation with drm/asahi, so this is very
> > > welcome! We've been using the existing scheduler as-is, but this should 
> > > help
> > > remove some unneeded complexity in this use case.
> > > 
> > 
> > That's the idea.
> > 
> > > Do you want me to pull in this series into our tree and make sure this all
> > > works out for us?
> > > 
> > 
> > We tested this in Xe and it definitely works for us but the more testing
> > the better.
> > 
> 
> I haven't gotten around to testing this series yet, but after more debugging
> of drm_sched issues I want to hear more about how Xe uses the scheduler.
> 
> From what I can tell, and from what Christian says, drm_sched has the hidden
> requirement that all job objects outlive the scheduler. I've run into
> several UAF bugs due to this. Not only that, it also currently has the
> requirement that all drm_sched fences outlive the scheduler object.
> 
> These requirements are subtle and only manifest as kernel oopses in rare
> corner cases, so it wasn't at all obvious to me that this was somehow a
> fundamental design assumption when I started using it.
> 
> As far as I can tell, this design is going to work in 99% of cases for
> global-schedulers-per-GPU models, where those corner cases would have to be
> hit on top of a GPU removal scenario (and GPU remove is... well, not the
> most tested/exercised use case). When the scheduler basically lives forever,
> none of this really matters.
> 
> But with a one-scheduler-per-queue model, how do you deal with this when the
> queue goes away? So far, without any of the partial bugfixes I have sent so
> far (which Christian objected to):
> 
> - If you try to tear down a scheduler with any jobs currently scheduled at
> the hardware, drm_sched will oops when those jobs complete and the hw fences
> signal.
> - If you try to tear down an entity (which should cancel all its pending
> jobs) and then the scheduler it was attached to without actually waiting for
> all the free_job() callbacks to be called on every job that ever existed for
> that entity, you can oops (entity cleanup is asynchronous in some cases like
> killed processes, so it will return before all jobs are freed and then that
> asynchronous process will crash and burn if the scheduler goes away out from
> under its feet). Waiting for job completion fences is not enough for this,
> you have to wait until free_job() has actually been called for all jobs.
> - Even if you actually wait for all jobs to be truly gone and then tear down
> the scheduler, if any scheduler job fences remain alive, that will then oops
> if you try to call the debug functions on them (like cat
> /sys/kernel/debug/dma_buf/bufinfo).
> 
> I tried to fix these things, but Christian objected implying it was the
> driver's job to keep a 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-11 Thread Christian König
The point is that this not only requires some work in the drm_scheduler, 
but rather it then makes only little sense to use the drm_scheduler in 
the first place.


The whole point of the drm_scheduler is to provide dma_fence 
implementation for the submitted jobs.


We also have dependency handling, but as Daniel and I said this can be 
easily extracted into a separate object/component.


Regards,
Christian.

Am 07.04.23 um 02:20 schrieb Zeng, Oak:

So this series basically goes with option 2. The part of option 2 that makes me 
uncomfortable is: dma-fence doesn't work for long running workloads, so why do we 
generate it in the first place? As long as a dma-fence is generated, it will 
become a source of confusion in the future. It doesn't matter how much you 
annotate it/document it. So if we decide to go with option 2, the bottom line 
is: don't generate a dma-fence for long running workloads during job submission. 
This requires some rework in the drm scheduler.

The cleanest solution to me is option 3. Dma-fence is a very old technology. 
When it was created, no GPU supported page faults. Obviously this is not a good 
technology for modern GPUs with page fault support. I think the best way is to 
create a new scheduler and dependency tracking mechanism that works for both page 
fault enabled and page fault disabled contexts. I think this matches what 
Christian said below. Maybe nobody thinks this is easy?

Thanks,
Oak


-Original Message-
From: Brost, Matthew 
Sent: April 5, 2023 2:53 PM
To: Zeng, Oak 
Cc: Christian König ; Vetter, Daniel
; Thomas Hellström
; dri-devel@lists.freedesktop.org; intel-
x...@lists.freedesktop.org; robdcl...@chromium.org; airl...@linux.ie;
l...@asahilina.net; boris.brezil...@collabora.com; faith.ekstr...@collabora.com
Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
plans

On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:

Hi,

Using dma-fence for completion/dependency tracking for long-run
workload (more precisely, on-demand paging/page fault enabled workload) can
cause deadlock. This seems to be the significant issue here. Other issues, such as the
drm scheduler completion order implication etc., are minor and can be solved
inside the framework of the drm scheduler. We need to evaluate the paths below:

1) Still use the drm scheduler for job submission, and use dma-fence for job
completion waiting/dependency tracking. This is the solution proposed in this series.
Annotate dma-fence for long-run workload: user can still wait on a dma-fence for job
completion but can't wait on a dma-fence while holding any memory management
locks. We still use dma-fence for dependency tracking, but it very easily
runs into deadlock when on-demand paging is in the picture. The annotation helps
us detect deadlock but not solve deadlock problems. Seems *not* a complete
solution: it is almost impossible to completely avoid dependency deadlock in a
complex runtime environment.

No one can wait on an LR fence, so it is impossible to deadlock. The
annotations enforce this. Literally this is only for flow controlling the
ring / holding pending jobs in the DRM scheduler list.


2) Still use the drm scheduler but don't use dma-fence for completion signaling
and dependency tracking. This way we still get some free functions (reset, error
handling, ring flow control as Matt said) from the drm scheduler, but push the
dependency/completion tracking completely to user space using techniques such
as user space fences. User space doesn't have a chance to wait on a fence while
holding a kernel memory management lock, thus the dma-fence deadlock issue is solved.

We use user space fences for syncs.


3) Completely discard the drm scheduler and dma-fence for long-run
workload. Use user queue/doorbell for super fast submission, directly interact
with the fw scheduler. Use user fences for completion/dependency tracking.

This is a hard no from me, I want 1 submission path in Xe. Either we use
the DRM scheduler or we don't.

Matt


Thanks,
Oak


-Original Message-
From: Christian König 
Sent: April 5, 2023 3:30 AM
To: Brost, Matthew ; Zeng, Oak

Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org;
robdcl...@chromium.org; thomas.hellst...@linux.intel.com;

airl...@linux.ie;

l...@asahilina.net; boris.brezil...@collabora.com;

faith.ekstr...@collabora.com

Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
plans

Am 04.04.23 um 20:08 schrieb Matthew Brost:

On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:

Hi Matt, Thomas,

Some very bold out of box thinking in this area:

1. So you want to use the drm scheduler and dma-fence for long running
workload. Why do you want to do this in the first place? What is the benefit?
The drm scheduler is pretty much a software scheduler. Modern GPUs have a
scheduler built at the fw/hw level; as you said below, for Intel this is the
GuC. Can the xe driver just directly submit jobs to the GuC, bypassing the drm
scheduler?

If we did that now we have 2 paths for dependency track

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-08 Thread Asahi Lina

On 04/04/2023 10.58, Matthew Brost wrote:

On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:

Hi, thanks for the Cc!



No problem.


On 04/04/2023 09.22, Matthew Brost wrote:

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can be thought of as 4 parts, detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC), which is a new paradigm WRT the DRM scheduler and presents
several problems, as the DRM scheduler was originally designed to schedule jobs on
hardware queues. The main problem is that the DRM scheduler expects the
submission order of jobs to be the completion order of jobs even across
multiple entities. This assumption falls apart with a firmware scheduler,
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution was originally thought of by Faith during
the initial prototype of Xe: create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use this approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM scheduler which isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers' usage of a 1 to 1
relationship and can be thought of as using the scheduler as a dependency /
in-flight job tracker rather than a true scheduler.


Yup, we're in the exact same situation with drm/asahi, so this is very
welcome! We've been using the existing scheduler as-is, but this should help
remove some unneeded complexity in this use case.



That's the idea.


Do you want me to pull in this series into our tree and make sure this all
works out for us?



We tested this in Xe and it definitely works for us but the more testing
the better.



I haven't gotten around to testing this series yet, but after more 
debugging of drm_sched issues I want to hear more about how Xe uses the 
scheduler.


From what I can tell, and from what Christian says, drm_sched has the 
hidden requirement that all job objects outlive the scheduler. I've run 
into several UAF bugs due to this. Not only that, it also currently has 
the requirement that all drm_sched fences outlive the scheduler object.


These requirements are subtle and only manifest as kernel oopses in rare 
corner cases, so it wasn't at all obvious to me that this was somehow a 
fundamental design assumption when I started using it.


As far as I can tell, this design is going to work in 99% of cases for 
global-schedulers-per-GPU models, where those corner cases would have to 
be hit on top of a GPU removal scenario (and GPU remove is... well, not 
the most tested/exercised use case). When the scheduler basically lives 
forever, none of this really matters.


But with a one-scheduler-per-queue model, how do you deal with this when 
the queue goes away? So far, without any of the partial bugfixes I have 
sent so far (which Christian objected to):


- If you try to tear down a scheduler with any jobs currently scheduled 
at the hardware, drm_sched will oops when those jobs complete and the hw 
fences signal.
- If you try to tear down an entity (which should cancel all its pending 
jobs) and then the scheduler it was attached to without actually waiting 
for all the free_job() callbacks to be called on every job that ever 
existed for that entity, you can oops (entity cleanup is asynchronous in 
some cases like killed processes, so it will return before all jobs are 
freed and then that asynchronous process will crash and burn if the 
scheduler goes away out from under its feet). Waiting for job completion 
fences is not enough for this, you have to wait until free_job() has 
actually been called for all jobs.
- Even if you actually wait for all jobs to be truly gone and then tear 
down the scheduler, if any scheduler job fences remain alive, that will 
then oops if you try to call the debug functions on them (like cat 
/sys/kernel/debug/dma_buf/bufinfo).


I tried to fix these things, but Christian objected implying it was the 
driver's job to keep a reference from jobs and hw fences to the 
scheduler. But I find that completely broken, because besides the extra 
memory/resource usage keeping the scheduler alive when you're trying to 
free resources as fast as possible when a process goes away, you can't 
even use normal reference counting for that: if you try to drop the last 
drm_sched reference from within a free_job() callback, the whole thing 
deadlocks since that will be running in the scheduler's 
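For readers following the lifetime argument above, the "jobs and hw fences
keep the scheduler alive" scheme could be sketched roughly as below, with the
comment marking exactly the spot Lina objects to: dropping the last reference
from inside free_job() would run the release (and the scheduler teardown) on
the scheduler's own execution context. All example_* names are hypothetical,
not code from any of the drivers discussed here.

#include <linux/kref.h>
#include <linux/slab.h>
#include <drm/gpu_scheduler.h>

struct example_sched_container {
	struct kref ref;
	struct drm_gpu_scheduler sched;
};

struct example_job {
	struct drm_sched_job base;
	struct example_sched_container *container;	/* pins the scheduler */
};

static void example_sched_release(struct kref *ref)
{
	struct example_sched_container *c =
		container_of(ref, struct example_sched_container, ref);

	drm_sched_fini(&c->sched);	/* tears down the submit/TDR context */
	kfree(c);
}

static void example_free_job(struct drm_sched_job *sched_job)
{
	struct example_job *job = container_of(sched_job, struct example_job, base);
	struct example_sched_container *c = job->container;

	drm_sched_job_cleanup(sched_job);
	kfree(job);

	/*
	 * If this is the last reference, example_sched_release() runs right
	 * here -- inside the scheduler's own execution context -- which is
	 * the deadlock / use-after-free hazard being debated in this thread.
	 */
	kref_put(&c->ref, example_sched_release);
}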

RE: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-06 Thread Zeng, Oak
So this series basically goes with option 2. The part of option 2 that makes me 
uncomfortable is: dma-fence doesn't work for long running workloads, so why do we 
generate it in the first place? As long as a dma-fence is generated, it will 
become a source of confusion in the future. It doesn't matter how much you 
annotate it/document it. So if we decide to go with option 2, the bottom line 
is: don't generate a dma-fence for long running workloads during job submission. 
This requires some rework in the drm scheduler.

The cleanest solution to me is option 3. Dma-fence is a very old technology. 
When it was created, no GPU supported page faults. Obviously this is not a good 
technology for modern GPUs with page fault support. I think the best way is to 
create a new scheduler and dependency tracking mechanism that works for both page 
fault enabled and page fault disabled contexts. I think this matches what 
Christian said below. Maybe nobody thinks this is easy?

Thanks,
Oak

> -Original Message-
> From: Brost, Matthew 
> Sent: April 5, 2023 2:53 PM
> To: Zeng, Oak 
> Cc: Christian König ; Vetter, Daniel
> ; Thomas Hellström
> ; dri-devel@lists.freedesktop.org; intel-
> x...@lists.freedesktop.org; robdcl...@chromium.org; airl...@linux.ie;
> l...@asahilina.net; boris.brezil...@collabora.com; 
> faith.ekstr...@collabora.com
> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> plans
> 
> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> > Hi,
> >
> > Using dma-fence for completion/dependency tracking for long-run
> > workload (more precisely, on-demand paging/page fault enabled workload) can
> > cause deadlock. This seems to be the significant issue here. Other issues such as
> > the drm scheduler completion order implication etc. are minor and can be solved
> > inside the framework of the drm scheduler. We need to evaluate the paths below:
> >
> > 1) Still use the drm scheduler for job submission, and use dma-fence for job
> > completion waiting/dependency tracking. This is the solution proposed in this
> > series. Annotate dma-fence for long-run workload: user can still wait on a
> > dma-fence for job completion but can't wait on a dma-fence while holding any
> > memory management locks. We still use dma-fence for dependency tracking, but it
> > very easily runs into deadlock when on-demand paging is in the picture. The
> > annotation helps us detect deadlock but not solve deadlock problems. Seems *not*
> > a complete solution: it is almost impossible to completely avoid dependency
> > deadlock in a complex runtime environment.
> >
> 
> No one can wait on an LR fence, so it is impossible to deadlock. The
> annotations enforce this. Literally this is only for flow controlling the
> ring / holding pending jobs in the DRM scheduler list.
> 
> > 2) Still use the drm scheduler but don't use dma-fence for completion signaling
> > and dependency tracking. This way we still get some free functions (reset, error
> > handling, ring flow control as Matt said) from the drm scheduler, but push the
> > dependency/completion tracking completely to user space using techniques such
> > as user space fences. User space doesn't have a chance to wait on a fence while
> > holding a kernel memory management lock, thus the dma-fence deadlock issue is
> > solved.
> >
> 
> We use user space fences for syncs.
> 
> > 3) Completely discard the drm scheduler and dma-fence for long-run
> > workload. Use user queue/doorbell for super fast submission, directly interact
> > with the fw scheduler. Use user fences for completion/dependency tracking.
> >
> 
> This is a hard no from me, I want 1 submission path in Xe. Either we use
> the DRM scheduler or we don't.
> 
> Matt
> 
> > Thanks,
> > Oak
> >
> > > -Original Message-
> > > From: Christian König 
> > > Sent: April 5, 2023 3:30 AM
> > > To: Brost, Matthew ; Zeng, Oak
> > > 
> > > Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org;
> > > robdcl...@chromium.org; thomas.hellst...@linux.intel.com;
> airl...@linux.ie;
> > > l...@asahilina.net; boris.brezil...@collabora.com;
> faith.ekstr...@collabora.com
> > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > plans
> > >
> > > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > >> Hi Matt, Thomas,
> > > >>
> > > >> Some very bold out of box thinking in this area:
> > > >>
> > > >> 1. so you want to use drm scheduler and dma-fence for long running
> workload.
> > > Why do you want to do this in the first place? Wha

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-06 Thread Daniel Vetter
On Thu, Apr 06, 2023 at 12:14:36PM +0200, Christian König wrote:
> Am 06.04.23 um 08:37 schrieb Daniel Vetter:
> > On Thu, Apr 06, 2023 at 02:08:10AM +, Matthew Brost wrote:
> > > On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
> > > > On Wed, 5 Apr 2023 at 11:57, Christian König  
> > > > wrote:
> > > > > Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > > > > > [SNIP]
> > > > > > > I would approach it from the complete other side. This component 
> > > > > > > here is a
> > > > > > > tool to decide what job should run next.
> > > > > > > 
> > > > > > > How that is then signaled and run should not be part of the
> > > > > > > scheduler, but another, higher level component.
> > > > > > > 
> > > > > > > This way you also don't have a problem with not using DMA-fences as
> > > > > > > dependencies, as well as constraints for running more jobs.
> > > > > > I think we're talking about two things here and mixing them up.
> > > > > > 
> > > > > > For the dependencies I agree with you, and imo that higher level 
> > > > > > tool
> > > > > > should probably just be an on-demand submit thread in userspace for 
> > > > > > the
> > > > > > rare case where the kernel would need to sort out a dependency 
> > > > > > otherwise
> > > > > > (due to running out of ringspace in the per-ctx ringbuffer).
> > > > > > 
> > > > > > The other thing is the message passing stuff, and this is what I was
> > > > > > talking about above. This has nothing to do with handling 
> > > > > > dependencies,
> > > > > > but with talking to the gpu fw. Here the intel design issue is that 
> > > > > > the fw
> > > > > > only provides a single queue, and it's in-order. Which means it
> > > > > > fundamentally has the stalling issue you describe as a point 
> > > > > > against a
> > > > > > message passing design. And fundamentally we need to be able to 
> > > > > > talk to
> > > > > > the fw in the scheduler ->run_job callback.
> > > > > > 
> > > > > > The proposal here for the message passing part is that since it has 
> > > > > > the
> > > > > > stalling issue already anyway, and the scheduler needs to be 
> > > > > > involved
> > > > > > anyway, it makes sense to integrated this (as an optional thing, 
> > > > > > only for
> > > > > > drivers which have this kind of fw interface) into the scheduler.
> > > > > > Otherwise you just end up with two layers for no reason and more 
> > > > > > ping-pong
> > > > > > delay because the ->run_job needs to kick off the subordinate 
> > > > > > driver layer
> > > > > > first. Note that for this case the optional message passing support 
> > > > > > in the
> > > > > > drm/scheduler actually makes things better, because it allows you 
> > > > > > to cut
> > > > > > out one layer.
> > > > > > 
> > > > > > Of course if a driver with better fw interface uses this message 
> > > > > > passing
> > > > > > support, then that's bad. Hence the big warning in the kerneldoc.
> > > > > Well what I wanted to say is that if you design the dependency 
> > > > > handling
> > > > > / scheduler properly you don't need the message passing through it.
> > > > > 
> > > > > For example if the GPU scheduler component uses a work item to do it's
> > > > > handling instead of a kthread you could also let the driver specify 
> > > > > the
> > > > > work queue where this work item is executed on.
> > > > > 
> > > > > When you design it like this the driver specifies the thread context 
> > > > > of
> > > > > execution for it's job. In other words it can specify a single 
> > > > > threaded
> > > > > firmware work queue as well.
> > > > > 
> > > > > When you then have other messages which needs to be passed to the
> > > > > firmware you can also use the same single threaded workqueue for this.
> > > > > 
> > > > > Drivers which have a different firmware interface would just use one 
> > > > > of
> > > > > the system work queues instead.
> > > > > 
> > > > > This approach basically decouples the GPU scheduler component from the
> > > > > message passing functionality.
> > > > Hm I guess we've been talking past each another big time, because
> > > > that's really what I thought was under discussions? Essentially the
> > > > current rfc, but implementing with some polish.
> > > > 
> > > I think Daniel pretty much nailed it here (thanks), to recap:
> > > 
> > > 1. I want the messages in the same worker so run_job / free_job /
> > > process_msg execution is mutual exclusive and also so during reset paths
> > > if the worker is stopped all the entry points can't be entered.
> > > 
> > > If this is a NAK, then another worker is fine I guess. A lock between
> > > run_job / free_job + process_msg should solve the exclusion issue and the
> > > reset paths can also stop this new worker too. That being said I'd
> > > rather leave this as is but will not fight this point.
> > > 
> > > 2. process_msg is just used to communicate with the firmware using the
> > > same queue as submission. Waiting 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-06 Thread Christian König

Am 06.04.23 um 08:37 schrieb Daniel Vetter:

On Thu, Apr 06, 2023 at 02:08:10AM +, Matthew Brost wrote:

On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:

On Wed, 5 Apr 2023 at 11:57, Christian König  wrote:

Am 05.04.23 um 11:07 schrieb Daniel Vetter:

[SNIP]

I would approach it from the complete other side. This component here is a
tool to decide what job should run next.

How that is then signaled and run should not be part of the scheduler, but
another, higher-level component.

This way you also don't have a problem with not using DMA-fences as
dependencies as well as constraints for running more jobs.

I think we're talking about two things here and mixing them up.

For the dependencies I agree with you, and imo that higher level tool
should probably just be an on-demand submit thread in userspace for the
rare case where the kernel would need to sort out a dependency otherwise
(due to running out of ringspace in the per-ctx ringbuffer).

The other thing is the message passing stuff, and this is what I was
talking about above. This has nothing to do with handling dependencies,
but with talking to the gpu fw. Here the intel design issue is that the fw
only provides a single queue, and it's in-order. Which means it
fundamentally has the stalling issue you describe as a point against a
message passing design. And fundamentally we need to be able to talk to
the fw in the scheduler ->run_job callback.

The proposal here for the message passing part is that since it has the
stalling issue already anyway, and the scheduler needs to be involved
anyway, it makes sense to integrate this (as an optional thing, only for
drivers which have this kind of fw interface) into the scheduler.
Otherwise you just end up with two layers for no reason and more ping-pong
delay because the ->run_job needs to kick off the subordinate driver layer
first. Note that for this case the optional message passing support in the
drm/scheduler actually makes things better, because it allows you to cut
out one layer.

Of course if a driver with better fw interface uses this message passing
support, then that's bad. Hence the big warning in the kerneldoc.
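
To make the "single in-order fw queue" point concrete, a minimal sketch; the
struct and function names here are made up for illustration and are not the
actual Xe/GuC or RFC interface:

    #include <linux/mutex.h>
    #include <linux/types.h>

    /* Hypothetical backend state: one single, in-order firmware queue. */
    struct fw_backend {
            struct mutex q_lock;    /* serialises writes to the fw queue */
    };

    /*
     * May block waiting for space in the in-order fw queue. Since run_job()
     * ends up here, this wait sits in the dma-fence signalling path and must
     * be bounded.
     */
    static int fw_queue_write(struct fw_backend *be, const void *cmd, size_t len)
    {
            mutex_lock(&be->q_lock);
            /* ... wait for space, copy cmd into the ring, ring the doorbell ... */
            mutex_unlock(&be->q_lock);
            return 0;
    }

    /* Job submission goes through the single fw queue ... */
    static int backend_submit_job(struct fw_backend *be, const void *cmd, size_t len)
    {
            return fw_queue_write(be, cmd, len);
    }

    /* ... and so does every control message (suspend, resume, cleanup, ...),
     * which is why routing messages through the scheduler worker does not add
     * a stall that wasn't already there. */
    static int backend_process_msg(struct fw_backend *be, const void *msg, size_t len)
    {
            return fw_queue_write(be, msg, len);
    }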

Well what I wanted to say is that if you design the dependency handling
/ scheduler properly you don't need the message passing through it.

For example if the GPU scheduler component uses a work item to do its
handling instead of a kthread, you could also let the driver specify the
work queue this work item is executed on.

When you design it like this the driver specifies the thread context of
execution for its job. In other words it can specify a single-threaded
firmware work queue as well.

When you then have other messages which need to be passed to the
firmware you can also use the same single-threaded workqueue for this.

Drivers which have a different firmware interface would just use one of
the system work queues instead.

This approach basically decouples the GPU scheduler component from the
message passing functionality.
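
A minimal sketch of that idea, assuming the scheduler (or the driver backend
wrapping it) can be handed a caller-provided workqueue; the names below are
illustrative, not an existing API:

    #include <linux/errno.h>
    #include <linux/workqueue.h>

    struct my_backend {
            struct workqueue_struct *submit_wq; /* single-threaded, ordered */
            struct work_struct run_work;        /* scheduler main loop */
            struct work_struct msg_work;        /* fw control messages */
    };

    static void my_run_work(struct work_struct *w) { /* run_job / free_job */ }
    static void my_msg_work(struct work_struct *w) { /* talk to the fw */ }

    static int my_backend_init(struct my_backend *be)
    {
            /*
             * Ordered workqueue: items queued here execute strictly one at
             * a time, so the scheduler work and any fw message work are
             * mutually exclusive without an extra lock.
             */
            be->submit_wq = alloc_ordered_workqueue("my-fw-submit", 0);
            if (!be->submit_wq)
                    return -ENOMEM;

            INIT_WORK(&be->run_work, my_run_work);
            INIT_WORK(&be->msg_work, my_msg_work);
            return 0;
    }

    static void my_backend_kick(struct my_backend *be)
    {
            /* Both land on the same single-threaded queue. */
            queue_work(be->submit_wq, &be->run_work);
            queue_work(be->submit_wq, &be->msg_work);
    }

Drivers whose firmware interface doesn't need this serialisation would simply
pass (or default to) one of the system workqueues instead.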

Hm, I guess we've been talking past one another big time, because
that's really what I thought was under discussion? Essentially the
current RFC, but implemented with some polish.


I think Daniel pretty much nailed it here (thanks), to recap:

1. I want the messages in the same worker so run_job / free_job /
process_msg execution is mutually exclusive, and also so that during reset
paths, if the worker is stopped, none of the entry points can be entered.

If this is a NAK, then another worker is fine I guess. A lock between
run_job / free_job + process_msg should solve the exclusion issue and the
reset paths can also stop this new worker too. That being said I'd
rather leave this as is but will not fight this point.

2. process_msg is just used to communicate with the firmware using the
same queue as submission. Waiting for space in this queue is the only
place this function can block (same as submission); well, actually we
have the concept of a preempt time slice, but that sleeps for 10 ms by
default. Also, preempt is only used in LR entities, so I don't think it is
relevant in this case either.

3. Agree this is in the dma-fence signaling path (if process_msg is in
the submission worker) so we can't block indefinitely or an unreasonable
period of time (i.e. we must obey dma-fence rules).

Just to hammer this in: not just process_msg is in the dma_fence signalling
path, but the entire fw queue that everything is funneled through,
including whatever the fw is doing to process these.

Yes this is terrible and blew up a few times already :-/

But also, probably something that the docs really need to hammer in, to
make sure people don't look at this and think "hey this seems to be the
recommended way to do this on Linux". We don't want hw people to build
more of these designs, they're an absolute pain to deal with given Linux's
dma_fence signalling and gpu job scheduling rules.

It's just that if you're stuck with such fw, then integrating 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-06 Thread Christian König

Am 05.04.23 um 20:53 schrieb Matthew Brost:

On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:

Hi,

Using dma-fence for completion/dependency tracking for long-run workload(more 
precisely on-demand paging/page fault enabled workload) can cause deadlock. 
This seems the significant issue here. Other issues such as the drm scheduler 
completion order implication etc are minors which can be solve inside the 
framework of drm scheduler. We need to evaluate below paths:

1) still use drm scheduler for job submission, and use dma-fence for 
job completion waiting/dependency tracking. This is solution proposed in this 
series. Annotate dma-fence for long-run workload: user can still wait dma-fence 
for job completion but can't wait dma-fence while holding any memory management 
locks.  We still use dma-fence for dependency tracking. But it is just very 
easily run into deadlock when on-demand paging is in the picture. The 
annotation helps us to detect deadlock but not solve deadlock problems. Seems 
*not* a complete solution: It is almost impossible to completely avoid 
dependency deadlock in complex runtime environment


No one can wait on an LR fence, so it is impossible to deadlock. The
annotations enforce this. Literally this is only for flow controlling the
ring / holding pending jobs in the DRM scheduler list.


You can still have someone depend on the LR fence and cause a deadlock 
without waiting on it.


See my attempted solution to this problem. It tries to inherit the LR
flag of a fence when something depends on it.


For example if you create a fence container and one of the fences inside
the container is an LR fence, then the container itself would be an LR
fence as well.
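
A minimal sketch of that inheritance rule (the LR flag bit below is
hypothetical, no such bit exists upstream today):

    #include <linux/dma-fence.h>
    #include <linux/dma-fence-array.h>

    /* Hypothetical flag marking a long-running fence. */
    #define MY_DMA_FENCE_FLAG_LR_BIT (DMA_FENCE_FLAG_USER_BITS + 2)

    static bool fence_is_long_running(struct dma_fence *fence)
    {
            struct dma_fence_array *array = to_dma_fence_array(fence);
            unsigned int i;

            if (test_bit(MY_DMA_FENCE_FLAG_LR_BIT, &fence->flags))
                    return true;

            /* A container inherits the LR property from its members. */
            if (array) {
                    for (i = 0; i < array->num_fences; i++)
                            if (fence_is_long_running(array->fences[i]))
                                    return true;
            }

            return false;
    }

In practice the flag would presumably be propagated once at container
creation time rather than walked on every query, but the idea is the same.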



2) Still use drm scheduler but not use dma-fence for completion 
signaling and dependency tracking. This way we still get some free functions 
(reset, err handling ring flow control as Matt said)from drm scheduler, but 
push the dependency/completion tracking completely to user space using 
techniques such as user space fence. User space doesn't have chance to wait 
fence while holding a kernel memory management lock, thus the dma-fence 
deadlock issue is solved.


We use user space fence for syncs.


3) Completely discard drm scheduler and dma-fence for long-run 
workload. Use user queue/doorbell for super fast submission, directly interact 
with fw scheduler. Use user fence for completion/dependency tracking.


This is a hard no from me, I want 1 submission path in Xe. Either we use
the DRM scheduler or we don't.


Well, I don't think that this will be acceptable, especially if you have
not only long running submission but also things like page faults/HMM in
those jobs.


Regards,
Christian.



Matt


Thanks,
Oak


-Original Message-
From: Christian König 
Sent: April 5, 2023 3:30 AM
To: Brost, Matthew ; Zeng, Oak

Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org;
robdcl...@chromium.org; thomas.hellst...@linux.intel.com; airl...@linux.ie;
l...@asahilina.net; boris.brezil...@collabora.com; faith.ekstr...@collabora.com
Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
plans

Am 04.04.23 um 20:08 schrieb Matthew Brost:

On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:

Hi Matt, Thomas,

Some very bold out of box thinking in this area:

1. so you want to use drm scheduler and dma-fence for long running workload.

Why do you want to do this in the first place? What is the benefit? The drm
scheduler is pretty much a software scheduler. Modern GPUs have a scheduler
built at the fw/hw level; as you said below, for Intel this is the GuC. Can
the xe driver just directly submit jobs to the GuC, bypassing the drm
scheduler?

If we did that, now we have two paths for dependency tracking, flow
controlling the ring, and resets / error handling / backend submission
implementations. We don't want this.

Well exactly that's the point: why?

As far as I can see those are two completely distinct use cases, so you
absolutely do want two completely distinct implementations for this.


2. using dma-fence for long run workload: I am well aware that page fault
(and the consequent memory allocation/lock acquiring to fix the fault) can
cause deadlock for a dma-fence wait. But I am not convinced that dma-fence
can't be used purely because of the nature of the workload, namely that it
runs very long (indefinitely). I did the math: the dma_fence_wait_timeout
function's third param is the timeout, which is a signed long type. If HZ is
1000, this is about 23 days. If 23 days is not long enough, can we just
change the timeout parameter to signed 64 bits so it is much longer than our
lifetime...

So my main argument that we can't use dma-fence for long-run workloads is
not that the workload runs very long, but rather the fact that we use page
faults for long-run workloads. If we enabled page faults for short-run
workloads, we couldn't use dma-fence either. Page faults are the key thing
here.

Now since we use page fault which

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-06 Thread Daniel Vetter
On Thu, Apr 06, 2023 at 02:08:10AM +, Matthew Brost wrote:
> On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
> > On Wed, 5 Apr 2023 at 11:57, Christian König  
> > wrote:
> > >
> > > Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > > > [SNIP]
> > > >> I would approach it from the complete other side. This component here 
> > > >> is a
> > > >> tool to decide what job should run next.
> > > >>
> > > >> How that is then signaled and run should not be part of the scheduler, 
> > > >> but
> > > >> another more higher level component.
> > > >>
> > > >> This way you also don't have a problem with not using DMA-fences as
> > > >> dependencies as well as constrains for running more jobs.
> > > > I think we're talking about two things here and mixing them up.
> > > >
> > > > For the dependencies I agree with you, and imo that higher level tool
> > > > should probably just be an on-demand submit thread in userspace for the
> > > > rare case where the kernel would need to sort out a dependency otherwise
> > > > (due to running out of ringspace in the per-ctx ringbuffer).
> > > >
> > > > The other thing is the message passing stuff, and this is what I was
> > > > talking about above. This has nothing to do with handling dependencies,
> > > > but with talking to the gpu fw. Here the intel design issue is that the 
> > > > fw
> > > > only provides a single queue, and it's in-order. Which means it
> > > > fundamentally has the stalling issue you describe as a point against a
> > > > message passing design. And fundamentally we need to be able to talk to
> > > > the fw in the scheduler ->run_job callback.
> > > >
> > > > The proposal here for the message passing part is that since it has the
> > > > stalling issue already anyway, and the scheduler needs to be involved
> > > > anyway, it makes sense to integrated this (as an optional thing, only 
> > > > for
> > > > drivers which have this kind of fw interface) into the scheduler.
> > > > Otherwise you just end up with two layers for no reason and more 
> > > > ping-pong
> > > > delay because the ->run_job needs to kick off the subordinate driver 
> > > > layer
> > > > first. Note that for this case the optional message passing support in 
> > > > the
> > > > drm/scheduler actually makes things better, because it allows you to cut
> > > > out one layer.
> > > >
> > > > Of course if a driver with better fw interface uses this message passing
> > > > support, then that's bad. Hence the big warning in the kerneldoc.
> > >
> > > Well what I wanted to say is that if you design the dependency handling
> > > / scheduler properly you don't need the message passing through it.
> > >
> > > For example if the GPU scheduler component uses a work item to do it's
> > > handling instead of a kthread you could also let the driver specify the
> > > work queue where this work item is executed on.
> > >
> > > When you design it like this the driver specifies the thread context of
> > > execution for it's job. In other words it can specify a single threaded
> > > firmware work queue as well.
> > >
> > > When you then have other messages which needs to be passed to the
> > > firmware you can also use the same single threaded workqueue for this.
> > >
> > > Drivers which have a different firmware interface would just use one of
> > > the system work queues instead.
> > >
> > > This approach basically decouples the GPU scheduler component from the
> > > message passing functionality.
> > 
> > Hm I guess we've been talking past each another big time, because
> > that's really what I thought was under discussions? Essentially the
> > current rfc, but implementing with some polish.
> >
> 
> I think Daniel pretty much nailed it here (thanks), to recap:
> 
> 1. I want the messages in the same worker so run_job / free_job /
> process_msg execution is mutual exclusive and also so during reset paths
> if the worker is stopped all the entry points can't be entered.
> 
> If this is a NAK, then another worker is fine I guess. A lock between
> run_job / free_job + process_msg should solve the exclusion issue and the
> reset paths can also stop this new worker too. That being said I'd
> rather leave this as is but will not fight this point.
> 
> 2. process_msg is just used to communicate with the firmware using the
> same queue as submission. Waiting for space in this queue is the only
> place this function can block (same as submission), well actually we
> have the concept a preempt time slice but that sleeps for 10 ms by
> default. Also preempt is only used in LR entities so I don't think it is
> relavent in this case either.
> 
> 3. Agree this is in the dma-fence signaling path (if process_msg is in
> the submission worker) so we can't block indefinitely or an unreasonable
> period of time (i.e. we must obey dma-fence rules).

Just to hammer this in: Not just process_msg is in the dma_fence signaling
path, but the entire fw queue where everything is being funneled through,
including 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Matthew Brost
On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
> On Wed, 5 Apr 2023 at 11:57, Christian König  wrote:
> >
> > Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > > [SNIP]
> > >> I would approach it from the complete other side. This component here is 
> > >> a
> > >> tool to decide what job should run next.
> > >>
> > >> How that is then signaled and run should not be part of the scheduler, 
> > >> but
> > >> another more higher level component.
> > >>
> > >> This way you also don't have a problem with not using DMA-fences as
> > >> dependencies as well as constrains for running more jobs.
> > > I think we're talking about two things here and mixing them up.
> > >
> > > For the dependencies I agree with you, and imo that higher level tool
> > > should probably just be an on-demand submit thread in userspace for the
> > > rare case where the kernel would need to sort out a dependency otherwise
> > > (due to running out of ringspace in the per-ctx ringbuffer).
> > >
> > > The other thing is the message passing stuff, and this is what I was
> > > talking about above. This has nothing to do with handling dependencies,
> > > but with talking to the gpu fw. Here the intel design issue is that the fw
> > > only provides a single queue, and it's in-order. Which means it
> > > fundamentally has the stalling issue you describe as a point against a
> > > message passing design. And fundamentally we need to be able to talk to
> > > the fw in the scheduler ->run_job callback.
> > >
> > > The proposal here for the message passing part is that since it has the
> > > stalling issue already anyway, and the scheduler needs to be involved
> > > anyway, it makes sense to integrated this (as an optional thing, only for
> > > drivers which have this kind of fw interface) into the scheduler.
> > > Otherwise you just end up with two layers for no reason and more ping-pong
> > > delay because the ->run_job needs to kick off the subordinate driver layer
> > > first. Note that for this case the optional message passing support in the
> > > drm/scheduler actually makes things better, because it allows you to cut
> > > out one layer.
> > >
> > > Of course if a driver with better fw interface uses this message passing
> > > support, then that's bad. Hence the big warning in the kerneldoc.
> >
> > Well what I wanted to say is that if you design the dependency handling
> > / scheduler properly you don't need the message passing through it.
> >
> > For example if the GPU scheduler component uses a work item to do it's
> > handling instead of a kthread you could also let the driver specify the
> > work queue where this work item is executed on.
> >
> > When you design it like this the driver specifies the thread context of
> > execution for it's job. In other words it can specify a single threaded
> > firmware work queue as well.
> >
> > When you then have other messages which needs to be passed to the
> > firmware you can also use the same single threaded workqueue for this.
> >
> > Drivers which have a different firmware interface would just use one of
> > the system work queues instead.
> >
> > This approach basically decouples the GPU scheduler component from the
> > message passing functionality.
> 
> Hm I guess we've been talking past each another big time, because
> that's really what I thought was under discussions? Essentially the
> current rfc, but implementing with some polish.
>

I think Daniel pretty much nailed it here (thanks), to recap:

1. I want the messages in the same worker so run_job / free_job /
process_msg execution is mutually exclusive, and also so that during reset
paths, if the worker is stopped, none of the entry points can be entered
(rough sketch of what I mean below).

If this is a NAK, then another worker is fine I guess. A lock between
run_job / free_job + process_msg should solve the exclusion issue and the
reset paths can also stop this new worker too. That being said I'd
rather leave this as is but will not fight this point.

2. process_msg is just used to communicate with the firmware using the
same queue as submission. Waiting for space in this queue is the only
place this function can block (same as submission); well, actually we
have the concept of a preempt time slice, but that sleeps for 10 ms by
default. Also, preempt is only used in LR entities, so I don't think it is
relevant in this case either.

3. Agree this is in the dma-fence signaling path (if process_msg is in
the submission worker) so we can't block indefinitely or an unreasonable
period of time (i.e. we must obey dma-fence rules).

4. Agree the documentation for the usage of the messaging interface
needs to be clear.

5. Agree that my code could always use polishing.

Let's close on #1; then can I get a general Ack on this part of the RFC
and apply the polish in the full review process?
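
For reference, a very rough sketch of what I mean in #1; the struct and
helper names are mine, not the RFC's exact API:

    #include <linux/workqueue.h>

    struct my_sched { struct work_struct work; /* ... */ };
    struct my_msg;
    struct my_job;

    /* Hypothetical helpers standing in for the scheduler internals: */
    struct my_msg *my_pop_msg(struct my_sched *s);
    struct my_job *my_pop_finished_job(struct my_sched *s);
    struct my_job *my_pop_ready_job(struct my_sched *s);
    void my_process_msg(struct my_sched *s, struct my_msg *m);
    void my_free_job(struct my_sched *s, struct my_job *j);
    void my_run_job(struct my_sched *s, struct my_job *j);

    /*
     * Single entry point into the backend: drain messages first, then do the
     * normal free_job / run_job processing. Because it is all one work item,
     * process_msg / free_job / run_job are mutually exclusive by construction,
     * and stopping this one work item quiesces the whole backend for reset.
     */
    static void sched_main_work(struct work_struct *w)
    {
            struct my_sched *sched = container_of(w, struct my_sched, work);
            struct my_msg *msg;
            struct my_job *job;

            while ((msg = my_pop_msg(sched)))
                    my_process_msg(sched, msg);

            while ((job = my_pop_finished_job(sched)))
                    my_free_job(sched, job);

            if ((job = my_pop_ready_job(sched)))
                    my_run_job(sched, job);
    }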

Matt

> iow I agree with you (I think at least).
> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Matthew Brost
On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> Hi,
> 
> Using dma-fence for completion/dependency tracking for long-run workload(more 
> precisely on-demand paging/page fault enabled workload) can cause deadlock. 
> This seems the significant issue here. Other issues such as the drm scheduler 
> completion order implication etc are minors which can be solve inside the 
> framework of drm scheduler. We need to evaluate below paths:
> 
>   1) still use drm scheduler for job submission, and use dma-fence for 
> job completion waiting/dependency tracking. This is solution proposed in this 
> series. Annotate dma-fence for long-run workload: user can still wait 
> dma-fence for job completion but can't wait dma-fence while holding any 
> memory management locks.  We still use dma-fence for dependency tracking. But 
> it is just very easily run into deadlock when on-demand paging is in the 
> picture. The annotation helps us to detect deadlock but not solve deadlock 
> problems. Seems *not* a complete solution: It is almost impossible to 
> completely avoid dependency deadlock in complex runtime environment
>

No one can wait on an LR fence, so it is impossible to deadlock. The
annotations enforce this. Literally this is only for flow controlling the
ring / holding pending jobs in the DRM scheduler list.

>   2) Still use drm scheduler but not use dma-fence for completion 
> signaling and dependency tracking. This way we still get some free functions 
> (reset, err handling ring flow control as Matt said)from drm scheduler, but 
> push the dependency/completion tracking completely to user space using 
> techniques such as user space fence. User space doesn't have chance to wait 
> fence while holding a kernel memory management lock, thus the dma-fence 
> deadlock issue is solved.
>

We use user space fence for syncs.

>   3) Completely discard drm scheduler and dma-fence for long-run 
> workload. Use user queue/doorbell for super fast submission, directly 
> interact with fw scheduler. Use user fence for completion/dependency tracking.
> 

This is a hard no from me, I want 1 submission path in Xe. Either we use
the DRM scheduler or we don't.

Matt

> Thanks,
> Oak
> 
> > -Original Message-
> > From: Christian König 
> > Sent: April 5, 2023 3:30 AM
> > To: Brost, Matthew ; Zeng, Oak
> > 
> > Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org;
> > robdcl...@chromium.org; thomas.hellst...@linux.intel.com; airl...@linux.ie;
> > l...@asahilina.net; boris.brezil...@collabora.com; 
> > faith.ekstr...@collabora.com
> > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > plans
> > 
> > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > >> Hi Matt, Thomas,
> > >>
> > >> Some very bold out of box thinking in this area:
> > >>
> > >> 1. so you want to use drm scheduler and dma-fence for long running 
> > >> workload.
> > Why you want to do this in the first place? What is the benefit? Drm 
> > scheduler is
> > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> > level, as you said below for intel this is Guc. Can xe driver just directly 
> > submit job
> > to Guc, bypassing drm scheduler?
> > >>
> > > If we did that now we have 2 paths for dependency track, flow controling
> > > the ring, resets / error handling / backend submission implementations.
> > > We don't want this.
> > 
> > Well exactly that's the point: Why?
> > 
> > As far as I can see that are two completely distinct use cases, so you
> > absolutely do want two completely distinct implementations for this.
> > 
> > >> 2. using dma-fence for long run workload: I am well aware that page 
> > >> fault (and
> > the consequent memory allocation/lock acquiring to fix the fault) can cause
> > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't 
> > be
> > used purely because the nature of the workload that it runs very long 
> > (indefinite).
> > I did a math: the dma_fence_wait_timeout function's third param is the 
> > timeout
> > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 
> > days is not long
> > enough, can we just change the timeout parameter to signed 64 bits so it is 
> > much
> > longer than our life time...
> > >>
> > >> So I mainly argue we can't use dma-fence for long-run workload is not
> > because the workload runs very long, rather because of 

RE: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Zeng, Oak
Hi,

Using dma-fence for completion/dependency tracking for long-run workloads
(more precisely, on-demand paging / page-fault-enabled workloads) can cause
deadlock. This seems to be the significant issue here. Other issues, such as
the drm scheduler completion-order implication, are minor and can be solved
inside the framework of the drm scheduler. We need to evaluate the paths
below:

1) still use the drm scheduler for job submission, and use dma-fence for
job completion waiting/dependency tracking. This is the solution proposed in
this series. Annotate dma-fence for long-run workloads: the user can still
wait on a dma-fence for job completion but can't wait on a dma-fence while
holding any memory management locks. We still use dma-fence for dependency
tracking, but it very easily runs into deadlock when on-demand paging is in
the picture. The annotation helps us to detect deadlock but not solve
deadlock problems. Seems *not* a complete solution: it is almost impossible
to completely avoid dependency deadlock in a complex runtime environment

2) Still use the drm scheduler but don't use dma-fence for completion
signaling and dependency tracking. This way we still get some free functions
(reset, error handling, ring flow control, as Matt said) from the drm
scheduler, but push the dependency/completion tracking completely to user
space using techniques such as user space fences. User space doesn't have a
chance to wait on a fence while holding a kernel memory management lock,
thus the dma-fence deadlock issue is solved.

3) Completely discard the drm scheduler and dma-fence for long-run
workloads. Use a user queue/doorbell for super-fast submission and interact
directly with the fw scheduler. Use a user fence for completion/dependency
tracking.

Thanks,
Oak

> -Original Message-
> From: Christian König 
> Sent: April 5, 2023 3:30 AM
> To: Brost, Matthew ; Zeng, Oak
> 
> Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org;
> robdcl...@chromium.org; thomas.hellst...@linux.intel.com; airl...@linux.ie;
> l...@asahilina.net; boris.brezil...@collabora.com; 
> faith.ekstr...@collabora.com
> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> plans
> 
> Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> >> Hi Matt, Thomas,
> >>
> >> Some very bold out of box thinking in this area:
> >>
> >> 1. so you want to use drm scheduler and dma-fence for long running 
> >> workload.
> Why you want to do this in the first place? What is the benefit? Drm 
> scheduler is
> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> level, as you said below for intel this is Guc. Can xe driver just directly 
> submit job
> to Guc, bypassing drm scheduler?
> >>
> > If we did that now we have 2 paths for dependency track, flow controling
> > the ring, resets / error handling / backend submission implementations.
> > We don't want this.
> 
> Well exactly that's the point: Why?
> 
> As far as I can see that are two completely distinct use cases, so you
> absolutely do want two completely distinct implementations for this.
> 
> >> 2. using dma-fence for long run workload: I am well aware that page fault 
> >> (and
> the consequent memory allocation/lock acquiring to fix the fault) can cause
> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be
> used purely because the nature of the workload that it runs very long 
> (indefinite).
> I did a math: the dma_fence_wait_timeout function's third param is the timeout
> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days 
> is not long
> enough, can we just change the timeout parameter to signed 64 bits so it is 
> much
> longer than our life time...
> >>
> >> So I mainly argue we can't use dma-fence for long-run workload is not
> because the workload runs very long, rather because of the fact that we use
> page fault for long-run workload. If we enable page fault for short-run 
> workload,
> we can't use dma-fence either. Page fault is the key thing here.
> >>
> >> Now since we use page fault which is *fundamentally* controversial with
> dma-fence design, why now just introduce a independent concept such as user-
> fence instead of extending existing dma-fence?
> >>
> >> I like unified design. If drm scheduler, dma-fence can be extended to work 
> >> for
> everything, it is beautiful. But seems we have some fundamental problem here.
> >>
> > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > the signal / CB infrastructure) and enforce we don't use use these
> > dma-fenc

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Daniel Vetter
On Wed, 5 Apr 2023 at 11:57, Christian König  wrote:
>
> Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > [SNIP]
> >> I would approach it from the complete other side. This component here is a
> >> tool to decide what job should run next.
> >>
> >> How that is then signaled and run should not be part of the scheduler, but
> >> another more higher level component.
> >>
> >> This way you also don't have a problem with not using DMA-fences as
> >> dependencies as well as constrains for running more jobs.
> > I think we're talking about two things here and mixing them up.
> >
> > For the dependencies I agree with you, and imo that higher level tool
> > should probably just be an on-demand submit thread in userspace for the
> > rare case where the kernel would need to sort out a dependency otherwise
> > (due to running out of ringspace in the per-ctx ringbuffer).
> >
> > The other thing is the message passing stuff, and this is what I was
> > talking about above. This has nothing to do with handling dependencies,
> > but with talking to the gpu fw. Here the intel design issue is that the fw
> > only provides a single queue, and it's in-order. Which means it
> > fundamentally has the stalling issue you describe as a point against a
> > message passing design. And fundamentally we need to be able to talk to
> > the fw in the scheduler ->run_job callback.
> >
> > The proposal here for the message passing part is that since it has the
> > stalling issue already anyway, and the scheduler needs to be involved
> > anyway, it makes sense to integrated this (as an optional thing, only for
> > drivers which have this kind of fw interface) into the scheduler.
> > Otherwise you just end up with two layers for no reason and more ping-pong
> > delay because the ->run_job needs to kick off the subordinate driver layer
> > first. Note that for this case the optional message passing support in the
> > drm/scheduler actually makes things better, because it allows you to cut
> > out one layer.
> >
> > Of course if a driver with better fw interface uses this message passing
> > support, then that's bad. Hence the big warning in the kerneldoc.
>
> Well what I wanted to say is that if you design the dependency handling
> / scheduler properly you don't need the message passing through it.
>
> For example if the GPU scheduler component uses a work item to do it's
> handling instead of a kthread you could also let the driver specify the
> work queue where this work item is executed on.
>
> When you design it like this the driver specifies the thread context of
> execution for it's job. In other words it can specify a single threaded
> firmware work queue as well.
>
> When you then have other messages which needs to be passed to the
> firmware you can also use the same single threaded workqueue for this.
>
> Drivers which have a different firmware interface would just use one of
> the system work queues instead.
>
> This approach basically decouples the GPU scheduler component from the
> message passing functionality.

Hm, I guess we've been talking past one another big time, because
that's really what I thought was under discussion? Essentially the
current RFC, but implemented with some polish.

iow I agree with you (I think at least).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Christian König

Am 05.04.23 um 11:07 schrieb Daniel Vetter:

[SNIP]

I would approach it from the complete other side. This component here is a
tool to decide what job should run next.

How that is then signaled and run should not be part of the scheduler, but
another more higher level component.

This way you also don't have a problem with not using DMA-fences as
dependencies as well as constrains for running more jobs.

I think we're talking about two things here and mixing them up.

For the dependencies I agree with you, and imo that higher level tool
should probably just be an on-demand submit thread in userspace for the
rare case where the kernel would need to sort out a dependency otherwise
(due to running out of ringspace in the per-ctx ringbuffer).

The other thing is the message passing stuff, and this is what I was
talking about above. This has nothing to do with handling dependencies,
but with talking to the gpu fw. Here the intel design issue is that the fw
only provides a single queue, and it's in-order. Which means it
fundamentally has the stalling issue you describe as a point against a
message passing design. And fundamentally we need to be able to talk to
the fw in the scheduler ->run_job callback.

The proposal here for the message passing part is that since it has the
stalling issue already anyway, and the scheduler needs to be involved
anyway, it makes sense to integrated this (as an optional thing, only for
drivers which have this kind of fw interface) into the scheduler.
Otherwise you just end up with two layers for no reason and more ping-pong
delay because the ->run_job needs to kick off the subordinate driver layer
first. Note that for this case the optional message passing support in the
drm/scheduler actually makes things better, because it allows you to cut
out one layer.

Of course if a driver with better fw interface uses this message passing
support, then that's bad. Hence the big warning in the kerneldoc.


Well what I wanted to say is that if you design the dependency handling 
/ scheduler properly you don't need the message passing through it.


For example if the GPU scheduler component uses a work item to do its
handling instead of a kthread, you could also let the driver specify the
work queue this work item is executed on.


When you design it like this the driver specifies the thread context of
execution for its job. In other words it can specify a single-threaded
firmware work queue as well.


When you then have other messages which need to be passed to the
firmware you can also use the same single-threaded workqueue for this.


Drivers which have a different firmware interface would just use one of 
the system work queues instead.


This approach basically decouples the GPU scheduler component from the 
message passing functionality.


Regards,
Christian.



-Daniel





Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Daniel Vetter
On Wed, Apr 05, 2023 at 10:53:26AM +0200, Christian König wrote:
> Am 05.04.23 um 10:34 schrieb Daniel Vetter:
> > On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote:
> > > Am 04.04.23 um 15:37 schrieb Matthew Brost:
> > > > On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
> > > > > Hi,
> > > > > 
> > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > Hello,
> > > > > > 
> > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > > > have been asked to merge our common DRM scheduler patches first as 
> > > > > > well
> > > > > > as develop a common solution for long running workloads with the DRM
> > > > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > > > welcome any and all feedback.
> > > > > > 
> > > > > > This can we thought of as 4 parts detailed below.
> > > > > > 
> > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler 
> > > > > > and
> > > > > > entity (patches 1-3)
> > > > > > 
> > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler 
> > > > > > (the
> > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > > > severals problems as the DRM was originally designed to schedule 
> > > > > > jobs on
> > > > > > hardware queues. The main problem being that DRM scheduler expects 
> > > > > > the
> > > > > > submission order of jobs to be the completion order of jobs even 
> > > > > > across
> > > > > > multiple entities. This assumption falls apart with a firmware 
> > > > > > scheduler
> > > > > > as a firmware scheduler has no concept of jobs and jobs can 
> > > > > > complete out
> > > > > > of order. A novel solution for was originally thought of by Faith 
> > > > > > during
> > > > > > the initial prototype of Xe, create a 1 to 1 relationship between 
> > > > > > scheduler
> > > > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > > > Boris may use approach as well for the Mali driver [4].
> > > > > > 
> > > > > > To support a 1 to 1 relationship we move the main execution function
> > > > > > from a kthread to a work queue and add a new scheduling mode which
> > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 
> > > > > > relationship.
> > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > > > relationship and can be thought of as using scheduler as a 
> > > > > > dependency /
> > > > > > infligt job tracker rather than a true scheduler.
> > > > > > 
> > > > > > - Generic messaging interface for DRM scheduler
> > > > > > 
> > > > > > Idea is to be able to communicate to the submission backend with in 
> > > > > > band
> > > > > > (relative to main execution function) messages. Messages are backend
> > > > > > defined and flexable enough for any use case. In Xe we use these
> > > > > > messages to clean up entites, set properties for entites, and 
> > > > > > suspend /
> > > > > > resume execution of an entity [5]. I suspect other driver can 
> > > > > > leverage
> > > > > > this messaging concept too as it a convenient way to avoid races in 
> > > > > > the
> > > > > > backend.
> > > > > Oh, please absolutely *don't* do this.
> > > > > 
> > > > > This is basically the design which makes a bunch of stuff so horrible 
> > > > > broken
> > > > > on Windows.
> > > > > 
> > > > > I can explain it in more detail if necessary, but I strongly 
> > > > > recommend to
> > > > > not go down this path.
> > > > > 
> > > > I'm afraid we are going to have to discuss this further. Let me explain
> > > > my reasoning, basically the idea is to have a single main entry point to
> > > > backend - the work queue. This avoids the need for lock between run_job
> > > > and any message that changes an entites state, also it really helps
> > > > during the reset flows (either TDR or GT reset) as we can call
> > > > drm_sched_run_wq_stop can ensure that nothing else is in the backend
> > > > changing an entity state. It all works out really nicely actually, our
> > > > GuC backend is incredibly stable (hasn't really had a bug pop in about a
> > > > year) and way simpler than what we did in the i915. I think the simplity
> > > > to largely due to this design of limiting the entry points.
> > > > 
> > > > I personally don't see how this a poor design, limiting entry points
> > > > absolutely makes sense to me, if it didn't why not just call cleanup_job
> > > > bypassing the main execution thread (now worker), this is the exact same
> > > > concept.
> > > Well then I strongly suggest to read a few analyses on the failure of the
> > > message processing loop on Windows.
> > > 
> > > Have you ever wondered why classic Win32 applications sometimes seems to 
> > > be
> > > stuck and don't do anything? This design pattern combine with timeouts to
> > > solve deadlocks is the reason for that.
> > > 
> > > The major problem with this approach is that analyzing tools like lockdep
> > 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Christian König

Am 05.04.23 um 10:34 schrieb Daniel Vetter:

On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote:

Am 04.04.23 um 15:37 schrieb Matthew Brost:

On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:

Hi,

Am 04.04.23 um 02:22 schrieb Matthew Brost:

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can we thought of as 4 parts detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC) which is a new paradigm WRT to the DRM scheduler and presents
severals problems as the DRM was originally designed to schedule jobs on
hardware queues. The main problem being that DRM scheduler expects the
submission order of jobs to be the completion order of jobs even across
multiple entities. This assumption falls apart with a firmware scheduler
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution for was originally thought of by Faith during
the initial prototype of Xe, create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers usage with a 1 to 1
relationship and can be thought of as using scheduler as a dependency /
infligt job tracker rather than a true scheduler.

- Generic messaging interface for DRM scheduler

Idea is to be able to communicate to the submission backend with in band
(relative to main execution function) messages. Messages are backend
defined and flexable enough for any use case. In Xe we use these
messages to clean up entites, set properties for entites, and suspend /
resume execution of an entity [5]. I suspect other driver can leverage
this messaging concept too as it a convenient way to avoid races in the
backend.

Oh, please absolutely *don't* do this.

This is basically the design which makes a bunch of stuff so horribly broken
on Windows.

I can explain it in more detail if necessary, but I strongly recommend to
not go down this path.


I'm afraid we are going to have to discuss this further. Let me explain
my reasoning: basically the idea is to have a single main entry point to
the backend - the work queue. This avoids the need for a lock between run_job
and any message that changes an entity's state, and it also really helps
during the reset flows (either TDR or GT reset), as we can call
drm_sched_run_wq_stop to ensure that nothing else is in the backend
changing an entity's state. It all works out really nicely actually; our
GuC backend is incredibly stable (hasn't really had a bug pop in about a
year) and way simpler than what we did in the i915. I think the simplicity
is largely due to this design of limiting the entry points.

I personally don't see how this is a poor design; limiting entry points
absolutely makes sense to me. If it didn't, why not just call cleanup_job
bypassing the main execution thread (now a worker)? This is the exact same
concept.
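
Concretely, the reset flow looks roughly like this; drm_sched_run_wq_stop()
is the name from this RFC, while the matching _start() and the sanitize
helper are assumed here purely for illustration:

    #include <drm/gpu_scheduler.h>

    struct my_engine {
            struct drm_gpu_scheduler sched;
            /* ... */
    };

    void my_engine_sanitize_state(struct my_engine *e); /* hypothetical */

    static void my_engine_reset(struct my_engine *e)
    {
            /* Park the single entry point: after this returns, no run_job /
             * free_job / process_msg can be executing in the backend. */
            drm_sched_run_wq_stop(&e->sched);

            /* Safe to fix up entity / firmware state without racing the
             * submission worker. */
            my_engine_sanitize_state(e);

            drm_sched_run_wq_start(&e->sched);
    }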

Well then I strongly suggest reading a few analyses of the failure of the
message processing loop on Windows.

Have you ever wondered why classic Win32 applications sometimes seem to be
stuck and don't do anything? This design pattern, combined with timeouts to
solve deadlocks, is the reason for that.

The major problem with this approach is that analyzing tools like lockdep
have a hard time grasping the dependencies.

wq is fully annotated and actually splats. Plain kthread doesn't, without
adding something like the dma_fence_signalling stuff.

But yeah if you block badly in the work items and stall the entire queue,
then things go sideways real bad. There's not really any tools we have in
the kernel to enforce this, since we still want to allow mutex and
sleeping and stuff like that.
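
The annotation in question is the existing dma_fence_begin_signalling() /
dma_fence_end_signalling() pair; a minimal sketch of a submission worker
using it (the worker body itself is hypothetical):

    #include <linux/dma-fence.h>
    #include <linux/workqueue.h>

    void do_backend_submission(struct work_struct *w); /* hypothetical */

    static void submit_work_fn(struct work_struct *w)
    {
            bool cookie;

            /*
             * Everything between begin/end is a dma-fence signalling critical
             * section as far as lockdep is concerned: blocking here on
             * anything that can (transitively) wait on a dma-fence will
             * produce a splat.
             */
            cookie = dma_fence_begin_signalling();

            do_backend_submission(w);

            dma_fence_end_signalling(cookie);
    }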


What you can do is to offload all your operations which are supposed to be
run in the same thread as work items into a work queue. This is something
lockdep understands and is able to scream out loud if someone messes up the
deadlock dependencies.

I thought that's the plan here?


At least from my impression, that didn't look like what was implemented
here.



  Or at least what I thought the plan was,
and why I really think we need a per-engine workqueue to make it work well
(and also why I suggested the refactoring to split up drm_scheduler into
the driver API struct, which stays per-engine, and the internal backend
which would be per drm_sched_entity 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Daniel Vetter
On Wed, Apr 05, 2023 at 09:30:11AM +0200, Christian König wrote:
> Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > Hi Matt, Thomas,
> > > 
> > > Some very bold out of box thinking in this area:
> > > 
> > > 1. so you want to use drm scheduler and dma-fence for long running 
> > > workload. Why you want to do this in the first place? What is the 
> > > benefit? Drm scheduler is pretty much a software scheduler. Modern gpu 
> > > has scheduler built at fw/hw level, as you said below for intel this is 
> > > Guc. Can xe driver just directly submit job to Guc, bypassing drm 
> > > scheduler?
> > > 
> > If we did that now we have 2 paths for dependency track, flow controling
> > the ring, resets / error handling / backend submission implementations.
> > We don't want this.
> 
> Well exactly that's the point: Why?
> 
> As far as I can see that are two completely distinct use cases, so you
> absolutely do want two completely distinct implementations for this.
> 
> > > 2. using dma-fence for long run workload: I am well aware that page fault 
> > > (and the consequent memory allocation/lock acquiring to fix the fault) 
> > > can cause deadlock for a dma-fence wait. But I am not convinced that 
> > > dma-fence can't be used purely because the nature of the workload that it 
> > > runs very long (indefinite). I did a math: the dma_fence_wait_timeout 
> > > function's third param is the timeout which is a signed long type. If HZ 
> > > is 1000, this is about 23 days. If 23 days is not long enough, can we 
> > > just change the timeout parameter to signed 64 bits so it is much longer 
> > > than our life time...
> > > 
> > > So I mainly argue we can't use dma-fence for long-run workload is not 
> > > because the workload runs very long, rather because of the fact that we 
> > > use page fault for long-run workload. If we enable page fault for 
> > > short-run workload, we can't use dma-fence either. Page fault is the key 
> > > thing here.
> > > 
> > > Now since we use page fault which is *fundamentally* controversial with 
> > > dma-fence design, why now just introduce a independent concept such as 
> > > user-fence instead of extending existing dma-fence?
> > > 
> > > I like unified design. If drm scheduler, dma-fence can be extended to 
> > > work for everything, it is beautiful. But seems we have some fundamental 
> > > problem here.
> > > 
> > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > the signal / CB infrastructure) and enforce we don't use use these
> > dma-fences from the scheduler in memory reclaim paths or export these to
> > user space or other drivers. Think of this mode as SW only fence.
> 
> Yeah and I truly think this is a really bad idea.
> 
> The signal/CB infrastructure in the dma_fence turned out to be the
> absolute nightmare I initially predicted. Sorry to say that, but in this
> case the "I've told you so" is appropriate in my opinion.
> 
> If we need infrastructure for long running dependency tracking we should
> encapsulate that in a new framework and not try to mangle the existing code
> for something it was never intended for.

Concurring hard (already typed that up somewhere else). I'd go one step
further and ask whether the kernel really has to handle dependencies for
these long-running compute jobs. The entire design with userspace memory
fences assumes that this is userspace's job.

Also for drm_syncobj we've pushed a lot of the dependency handling to
userspace, with submit threads in Mesa. So if there is any blocking to be
done (running out of ring space), why can't we sort that out the same way?
Meaning:
1. superfast direct-to-hw submit path (using doorbells or whatever)
2. submit ioctl which only succeeds if it doesn't have to do any userspace
memory fence waits, otherwise you get EWOULDBLOCK
3. userspace sorts out the mess in a submit thread if it gets an
EWOULDBLOCK, because fundamentally the kernel cannot guarantee a
bottomless queue. If userspace wants bottomless, they need to handle the
allocating and delaying imo

You can even make 3 entirely as-needed, which means for the usual
fast-path you'll never see the userspace thread created unless you do hit
an EWOULDBLOCK.
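
Roughly, from userspace's side, the whole thing could look like this (the
ioctl, errno contract and helpers are made up to illustrate the flow, not an
existing uAPI):

    #include <errno.h>

    struct ctx;  /* hypothetical userspace driver context */
    struct job;

    int try_doorbell_submit(struct ctx *c, struct job *j);    /* step 1 */
    int submit_ioctl(struct ctx *c, struct job *j);           /* step 2 */
    int queue_to_submit_thread(struct ctx *c, struct job *j); /* step 3 */

    static int submit(struct ctx *c, struct job *j)
    {
            /* 1. fast path: ring the doorbell directly. */
            if (try_doorbell_submit(c, j) == 0)
                    return 0;

            /* 2. submit ioctl that never blocks on userspace memory fences;
             *    it fails with EWOULDBLOCK instead. */
            if (submit_ioctl(c, j) == 0)
                    return 0;
            if (errno != EWOULDBLOCK)
                    return -1;

            /* 3. only now pay for the submit thread; it is created lazily on
             *    first use and handles the waiting and resubmission. */
            return queue_to_submit_thread(c, j);
    }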

If we insist that the kernel handles the long-running dependencies fully
then all we end up doing is implementing step 3, but entirely in the
kernel instead of userspace. And in the kernel every bug gets you halfway
to a CVE, and I just don't think that makes much sense for something which
is the fallback of the fallback - once you run out of ring space you're
not going to have a great day no matter what.

I'd go as far as to say that if we want step 3 in the kernel, someone needs
to supply the real-world (i.e. real application running real workloads, not
some microbenchmark) benchmark to prove it's actually worth the pain.
Otherwise on-demand userspace submit thread.
-Daniel

> 
> Christian.
> 
> > 
> > Matt
> > > 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Daniel Vetter
On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote:
> Am 04.04.23 um 15:37 schrieb Matthew Brost:
> > On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
> > > Hi,
> > > 
> > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > Hello,
> > > > 
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > > 
> > > > This can we thought of as 4 parts detailed below.
> > > > 
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > entity (patches 1-3)
> > > > 
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > submission order of jobs to be the completion order of jobs even across
> > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > of order. A novel solution for was originally thought of by Faith during
> > > > the initial prototype of Xe, create a 1 to 1 relationship between 
> > > > scheduler
> > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > Boris may use approach as well for the Mali driver [4].
> > > > 
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > relationship and can be thought of as using scheduler as a dependency /
> > > > infligt job tracker rather than a true scheduler.
> > > > 
> > > > - Generic messaging interface for DRM scheduler
> > > > 
> > > > Idea is to be able to communicate to the submission backend with in band
> > > > (relative to main execution function) messages. Messages are backend
> > > > defined and flexable enough for any use case. In Xe we use these
> > > > messages to clean up entites, set properties for entites, and suspend /
> > > > resume execution of an entity [5]. I suspect other driver can leverage
> > > > this messaging concept too as it a convenient way to avoid races in the
> > > > backend.
> > > Oh, please absolutely *don't* do this.
> > > 
> > > This is basically the design which makes a bunch of stuff so horrible 
> > > broken
> > > on Windows.
> > > 
> > > I can explain it in more detail if necessary, but I strongly recommend to
> > > not go down this path.
> > > 
> > I'm afraid we are going to have to discuss this further. Let me explain
> > my reasoning, basically the idea is to have a single main entry point to
> > backend - the work queue. This avoids the need for lock between run_job
> > and any message that changes an entites state, also it really helps
> > during the reset flows (either TDR or GT reset) as we can call
> > drm_sched_run_wq_stop can ensure that nothing else is in the backend
> > changing an entity state. It all works out really nicely actually, our
> > GuC backend is incredibly stable (hasn't really had a bug pop in about a
> > year) and way simpler than what we did in the i915. I think the simplity
> > to largely due to this design of limiting the entry points.
> > 
> > I personally don't see how this a poor design, limiting entry points
> > absolutely makes sense to me, if it didn't why not just call cleanup_job
> > bypassing the main execution thread (now worker), this is the exact same
> > concept.
> 
> Well then I strongly suggest to read a few analyses on the failure of the
> message processing loop on Windows.
> 
> Have you ever wondered why classic Win32 applications sometimes seems to be
> stuck and don't do anything? This design pattern combine with timeouts to
> solve deadlocks is the reason for that.
> 
> The major problem with this approach is that analyzing tools like lockdep
> have a hard time grasping the dependencies.

wq is fully annotated and actually splats. A plain kthread doesn't, without
adding something like the dma_fence_signalling stuff.

But yeah, if you block badly in the work items and stall the entire queue,
then things go sideways real bad. There aren't really any tools we have in
the kernel to enforce this, since we still want to allow mutexes and
sleeping and stuff like that.
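
[Editor's note: a minimal sketch of the annotation being referred to,
using the dma_fence_begin_signalling()/dma_fence_end_signalling()
helpers; the work function and everything around it are made up for
illustration.]

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

/* Treat the job-pick / submit path as a fence-signalling critical
 * section so lockdep can splat if it ever blocks on something that may
 * itself wait on a fence (memory reclaim, fence waits, ...). */
static void my_sched_run_job_work(struct work_struct *w)
{
        bool cookie = dma_fence_begin_signalling();

        /* pick a job, push it to the firmware ring, signal the
         * scheduled fence, ... */

        dma_fence_end_signalling(cookie);
}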

> What you can do is to offload all your operations which are supposed to be
> run in the same thread as work items into a work queue. This is something
> lockdep understands and is able to scream out loud if someone messes up the
> deadlock dependencies.

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Christian König

Am 04.04.23 um 15:37 schrieb Matthew Brost:

On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:

Hi,

Am 04.04.23 um 02:22 schrieb Matthew Brost:

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can we thought of as 4 parts detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC) which is a new paradigm WRT to the DRM scheduler and presents
severals problems as the DRM was originally designed to schedule jobs on
hardware queues. The main problem being that DRM scheduler expects the
submission order of jobs to be the completion order of jobs even across
multiple entities. This assumption falls apart with a firmware scheduler
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution for was originally thought of by Faith during
the initial prototype of Xe, create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers usage with a 1 to 1
relationship and can be thought of as using scheduler as a dependency /
infligt job tracker rather than a true scheduler.

- Generic messaging interface for DRM scheduler

Idea is to be able to communicate to the submission backend with in band
(relative to main execution function) messages. Messages are backend
defined and flexable enough for any use case. In Xe we use these
messages to clean up entites, set properties for entites, and suspend /
resume execution of an entity [5]. I suspect other driver can leverage
this messaging concept too as it a convenient way to avoid races in the
backend.

Oh, please absolutely *don't* do this.

This is basically the design which makes a bunch of stuff so horrible broken
on Windows.

I can explain it in more detail if necessary, but I strongly recommend to
not go down this path.


I'm afraid we are going to have to discuss this further. Let me explain
my reasoning, basically the idea is to have a single main entry point to
backend - the work queue. This avoids the need for lock between run_job
and any message that changes an entites state, also it really helps
during the reset flows (either TDR or GT reset) as we can call
drm_sched_run_wq_stop can ensure that nothing else is in the backend
changing an entity state. It all works out really nicely actually, our
GuC backend is incredibly stable (hasn't really had a bug pop in about a
year) and way simpler than what we did in the i915. I think the simplity
to largely due to this design of limiting the entry points.

I personally don't see how this a poor design, limiting entry points
absolutely makes sense to me, if it didn't why not just call cleanup_job
bypassing the main execution thread (now worker), this is the exact same
concept.


Well then I strongly suggest reading a few analyses of the failure of 
the message processing loop on Windows.

Have you ever wondered why classic Win32 applications sometimes seem to 
be stuck and don't do anything? This design pattern, combined with 
timeouts to resolve deadlocks, is the reason for that.

The major problem with this approach is that analysis tools like 
lockdep have a hard time grasping the dependencies.

What you can do is offload all of your operations which are supposed to 
run in the same thread as work items onto a work queue. This is 
something lockdep understands and is able to scream out loud about if someone 
messes up the deadlock dependencies.
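
[Editor's note: a minimal sketch of that offloading pattern, assuming an
ordered workqueue created with alloc_ordered_workqueue(); the struct,
opcode values and function names are illustrative only.]

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct backend_op {
        struct work_struct work;
        int opcode;             /* e.g. CLEANUP, SUSPEND, RESUME */
};

/* Every operation runs from the same ordered workqueue, so ordering is
 * preserved and lockdep sees each work item's lock dependencies. */
static void backend_op_work(struct work_struct *work)
{
        struct backend_op *op = container_of(work, struct backend_op, work);

        /* apply op->opcode to the entity / firmware state here */
        kfree(op);
}

static int backend_queue_op(struct workqueue_struct *wq, int opcode)
{
        struct backend_op *op = kzalloc(sizeof(*op), GFP_KERNEL);

        if (!op)
                return -ENOMEM;
        op->opcode = opcode;
        INIT_WORK(&op->work, backend_op_work);
        queue_work(wq, &op->work);
        return 0;
}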


Regards,
Christian.



FWIW Asahi liked the idea as well and think it could be useful for AGX.
Matt


Regards,
Christian.


- Support for using TDR for all error paths of a scheduler / entity

Fix a few races / bugs, add function to dynamically set the TDR timeout.

- Annotate dma-fences for long running workloads.

The idea here is to use dma-fences only as sync points within the
scheduler and never export them for long running workloads. By
annotating these fences as long running we ensure that these dma-fences
are never used in a way that breaks the dma-fence rules. A benefit of
thus approach is the scheduler can still safely flow control the
execution ring buffer via the job limit without breaking the dma-fence
rules.

Again this a first draft and looking forward to feedback.

Enjoy - Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel
[2] 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-05 Thread Christian König

Am 04.04.23 um 20:08 schrieb Matthew Brost:

On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:

Hi Matt, Thomas,

Some very bold out of box thinking in this area:

1. so you want to use drm scheduler and dma-fence for long running workload. 
Why you want to do this in the first place? What is the benefit? Drm scheduler 
is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw 
level, as you said below for intel this is Guc. Can xe driver just directly 
submit job to Guc, bypassing drm scheduler?


If we did that now we have 2 paths for dependency track, flow controling
the ring, resets / error handling / backend submission implementations.
We don't want this.


Well, exactly, that's the point: why?

As far as I can see those are two completely distinct use cases, so you 
absolutely do want two completely distinct implementations for this.



2. using dma-fence for long run workload: I am well aware that page fault (and 
the consequent memory allocation/lock acquiring to fix the fault) can cause 
deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be 
used purely because the nature of the workload that it runs very long 
(indefinite). I did a math: the dma_fence_wait_timeout function's third param 
is the timeout which is a signed long type. If HZ is 1000, this is about 23 
days. If 23 days is not long enough, can we just change the timeout parameter 
to signed 64 bits so it is much longer than our life time...

So I mainly argue we can't use dma-fence for long-run workload is not because 
the workload runs very long, rather because of the fact that we use page fault 
for long-run workload. If we enable page fault for short-run workload, we can't 
use dma-fence either. Page fault is the key thing here.

Now since we use page fault which is *fundamentally* controversial with 
dma-fence design, why now just introduce a independent concept such as 
user-fence instead of extending existing dma-fence?

I like unified design. If drm scheduler, dma-fence can be extended to work for 
everything, it is beautiful. But seems we have some fundamental problem here.


Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
the signal / CB infrastructure) and enforce we don't use use these
dma-fences from the scheduler in memory reclaim paths or export these to
user space or other drivers. Think of this mode as SW only fence.


Yeah, and I truly think this is a really bad idea.

The signal/CB infrastructure in the dma_fence turned out to be the 
absolute nightmare I initially predicted. Sorry to say that, but in 
this case the "I've told you so" is appropriate in my opinion.


If we need infrastructure for long running dependency tracking we should 
encapsulate that in a new framework and not try to mangle the existing 
code for something it was never intended for.


Christian.



Matt
  

Thanks,
Oak


-Original Message-
From: dri-devel  On Behalf Of
Matthew Brost
Sent: April 3, 2023 8:22 PM
To: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org
Cc: robdcl...@chromium.org; thomas.hellst...@linux.intel.com; airl...@linux.ie;
l...@asahilina.net; boris.brezil...@collabora.com; Brost, Matthew
; christian.koe...@amd.com;
faith.ekstr...@collabora.com
Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can we thought of as 4 parts detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC) which is a new paradigm WRT to the DRM scheduler and presents
severals problems as the DRM was originally designed to schedule jobs on
hardware queues. The main problem being that DRM scheduler expects the
submission order of jobs to be the completion order of jobs even across
multiple entities. This assumption falls apart with a firmware scheduler
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution for was originally thought of by Faith during
the initial prototype of Xe, create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers usage with a 1 to 1
relationship and can be thought of as using scheduler as a dependency /
infligt job tracker rather than a true scheduler.

- Generic 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Daniel Vetter
On Tue, 4 Apr 2023 at 19:29, Tvrtko Ursulin
 wrote:
>
>
> On 04/04/2023 14:52, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote:
> >>
> >> On 04/04/2023 01:22, Matthew Brost wrote:
> >>> Hello,
> >>>
> >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> >>> have been asked to merge our common DRM scheduler patches first as well
> >>> as develop a common solution for long running workloads with the DRM
> >>> scheduler. This RFC series is our first attempt at doing this. We
> >>> welcome any and all feedback.
> >>>
> >>> This can we thought of as 4 parts detailed below.
> >>>
> >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> >>> entity (patches 1-3)
> >>>
> >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> >>> severals problems as the DRM was originally designed to schedule jobs on
> >>> hardware queues. The main problem being that DRM scheduler expects the
> >>> submission order of jobs to be the completion order of jobs even across
> >>> multiple entities. This assumption falls apart with a firmware scheduler
> >>> as a firmware scheduler has no concept of jobs and jobs can complete out
> >>> of order. A novel solution for was originally thought of by Faith during
> >>> the initial prototype of Xe, create a 1 to 1 relationship between 
> >>> scheduler
> >>> and entity. I believe the AGX driver [3] is using this approach and
> >>> Boris may use approach as well for the Mali driver [4].
> >>>
> >>> To support a 1 to 1 relationship we move the main execution function
> >>> from a kthread to a work queue and add a new scheduling mode which
> >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> >>> The new scheduling mode should unify all drivers usage with a 1 to 1
> >>> relationship and can be thought of as using scheduler as a dependency /
> >>> infligt job tracker rather than a true scheduler.
> >>
> >> Once you add capability for a more proper 1:1 via
> >> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace
> >> kthreads with a wq?
> >>
> >> Or in other words, what purpose does the offloading of a job picking code 
> >> to
> >> a separate execution context serve? Could it be done directly in the 1:1
> >> mode and leave kthread setup for N:M?
> >>
> >
> > Addressed the other two on my reply to Christian...
> >
> > For this one basically the concept of a single entity point IMO is a
> > very good concept which I'd like to keep. But most important reason
> > being the main execution thread (now worker) is kicked when a dependency
> > for a job is resolved, dependencies are dma-fences signaled via a
> > callback, and these call backs can be signaled in IRQ contexts. We
> > absolutely do not want to enter the backend in an IRQ context for a
> > variety of reasons.
>
> Sounds like a fair enough requirement but if drivers will not be
> comfortable with the wq conversion, it is probably possible to introduce
> some vfuncs for the 1:1 case which would allow scheduler users override
> the scheduler wakeup and select a special "pick one job" path. That
> could allow 1:1 users do their thing, leaving rest as is. I mean you
> already have the special single entity scheduler, you'd just need to add
> some more specialization on the init, wake up, etc paths.
>
> And I will mention once more that I find a wq item with a loop such as:
>
> while (!READ_ONCE(sched->pause_run_wq)) {
> ...
>
> A bit dodgy. If you piggy back on any system_wq it smells of system wide
> starvation so for me any proposal with an option to use a system shared
> wq is a no go.

Yeah, I think the argument for a wq-based scheduler would need a
per-drm_scheduler wq, like we currently have a per-scheduler kthread.
It might still need some serious work to replace the
kthread_stop/start() with a wq-native equivalent (because really this
is the tricky stuff we shouldn't hand-roll unless someone is willing
to write a few papers on the lockless design that's done), but it would
look a bunch more reasonable. Having a per-sched workqueue might also
help with the big sched_stop/start/fini state transitions, which I
really think should still go over all the per-entity schedulers even
in the 1:1 case (because otherwise you get some funky code in drivers
that does the iterations needed, which probably tosses the fairly nice
design the current scheduler has of relying on the kthread_stop/start
primitives for this).
-Daniel
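
[Editor's note: a rough sketch of what a per-scheduler workqueue could
look like at init/fini time, assuming alloc_ordered_workqueue() and a
high-priority queue as discussed elsewhere in the thread; the struct
and field names are invented for illustration.]

#include <linux/errno.h>
#include <linux/workqueue.h>

struct my_sched {
        struct workqueue_struct *run_wq;
        struct work_struct run_work;
};

static int my_sched_init(struct my_sched *sched, const char *name)
{
        /* One ordered, high-priority queue per scheduler instead of
         * falling back to system_wq. */
        sched->run_wq = alloc_ordered_workqueue("%s-run", WQ_HIGHPRI, name);
        if (!sched->run_wq)
                return -ENOMEM;
        return 0;
}

static void my_sched_fini(struct my_sched *sched)
{
        /* Wait for in-flight work and tear the queue down; a proper
         * stop/start equivalent of kthread parking needs more thought,
         * as noted above. */
        destroy_workqueue(sched->run_wq);
}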

>
> Regards,
>
> Tvrtko
>
>
> >> Apart from those design level questions, low level open IMO still is that
> >> default fallback of using the system_wq has the potential to affect latency
> >> for other drivers. But that's for those driver owners to approve.
> >>
> >> Regards,
> >>
> >> Tvrtko
> >>
> >>> - Generic messaging interface for DRM scheduler
> >>>
> >>> Idea is to be able to communicate to the 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Matthew Brost
On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> Hi Matt, Thomas,
> 
> Some very bold out of box thinking in this area:
> 
> 1. so you want to use drm scheduler and dma-fence for long running workload. 
> Why you want to do this in the first place? What is the benefit? Drm 
> scheduler is pretty much a software scheduler. Modern gpu has scheduler built 
> at fw/hw level, as you said below for intel this is Guc. Can xe driver just 
> directly submit job to Guc, bypassing drm scheduler? 
>

If we did that we would now have two paths for dependency tracking, flow
controlling the ring, resets / error handling, and the backend submission
implementation. We don't want this.
 
> 2. using dma-fence for long run workload: I am well aware that page fault 
> (and the consequent memory allocation/lock acquiring to fix the fault) can 
> cause deadlock for a dma-fence wait. But I am not convinced that dma-fence 
> can't be used purely because the nature of the workload that it runs very 
> long (indefinite). I did a math: the dma_fence_wait_timeout function's third 
> param is the timeout which is a signed long type. If HZ is 1000, this is 
> about 23 days. If 23 days is not long enough, can we just change the timeout 
> parameter to signed 64 bits so it is much longer than our life time... 
> 
> So I mainly argue we can't use dma-fence for long-run workload is not because 
> the workload runs very long, rather because of the fact that we use page 
> fault for long-run workload. If we enable page fault for short-run workload, 
> we can't use dma-fence either. Page fault is the key thing here.
> 
> Now since we use page fault which is *fundamentally* controversial with 
> dma-fence design, why now just introduce a independent concept such as 
> user-fence instead of extending existing dma-fence? 
> 
> I like unified design. If drm scheduler, dma-fence can be extended to work 
> for everything, it is beautiful. But seems we have some fundamental problem 
> here.
>

Thomas's patches turn a dma-fence into a KMD-only sync point (i.e. we just use
the signal / CB infrastructure) and enforce that we don't use these
dma-fences from the scheduler in memory reclaim paths or export them to
user space or other drivers. Think of this mode as a SW-only fence.
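
[Editor's note: a hedged illustration of that "SW-only fence" idea;
dma_fence_add_callback() is the real helper, but the struct, field and
function names around it are assumptions made for this sketch, not the
RFC's actual API.]

#include <linux/dma-fence.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

struct my_sched {
        struct workqueue_struct *run_wq;
        struct work_struct run_work;
        struct dma_fence_cb dep_cb;
};

static void my_sched_dep_cb(struct dma_fence *f, struct dma_fence_cb *cb)
{
        struct my_sched *sched = container_of(cb, struct my_sched, dep_cb);

        queue_work(sched->run_wq, &sched->run_work);
}

/* Track a dependency purely as an internal sync point: the fence is
 * only used to kick the scheduler worker, never installed in a
 * dma_resv or exported through sync_file / drm_syncobj. */
static void my_sched_track_dep(struct my_sched *sched, struct dma_fence *f)
{
        if (dma_fence_add_callback(f, &sched->dep_cb, my_sched_dep_cb))
                /* already signalled: kick the worker directly */
                queue_work(sched->run_wq, &sched->run_work);
}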

Matt
 
> Thanks,
> Oak
> 
> > -Original Message-
> > From: dri-devel  On Behalf Of
> > Matthew Brost
> > Sent: April 3, 2023 8:22 PM
> > To: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org
> > Cc: robdcl...@chromium.org; thomas.hellst...@linux.intel.com; 
> > airl...@linux.ie;
> > l...@asahilina.net; boris.brezil...@collabora.com; Brost, Matthew
> > ; christian.koe...@amd.com;
> > faith.ekstr...@collabora.com
> > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
> > 
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> > 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too 

RE: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Zeng, Oak
Hi Matt, Thomas,

Some very bold out of box thinking in this area:

1. So you want to use the drm scheduler and dma-fence for long running workloads. 
Why do you want to do this in the first place? What is the benefit? The drm scheduler 
is pretty much a software scheduler. A modern GPU has a scheduler built in at the fw/hw 
level; as you said below, for Intel this is the GuC. Can the xe driver just directly 
submit jobs to the GuC, bypassing the drm scheduler?

2. Using dma-fence for long running workloads: I am well aware that a page fault (and 
the consequent memory allocation / lock acquisition to fix the fault) can cause 
a deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be 
used purely because of the nature of a workload that runs very long 
(indefinitely). I did the math: the dma_fence_wait_timeout function's third param 
is the timeout, which is a signed long type. If HZ is 1000, this is about 23 
days. If 23 days is not long enough, can we just change the timeout parameter 
to a signed 64-bit type so it is much longer than our lifetime...
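
[Editor's note: a quick back-of-the-envelope check of that figure,
assuming the jiffies timeout is a 32-bit signed long and HZ=1000; this
is just illustrative userspace arithmetic, not kernel code.]

#include <stdio.h>

int main(void)
{
        const long max_jiffies = 2147483647L;   /* LONG_MAX for a 32-bit long */
        const long hz = 1000;                   /* jiffies per second */
        double days = (double)max_jiffies / hz / 60.0 / 60.0 / 24.0;

        /* prints roughly 24.9 days, i.e. the same order of magnitude */
        printf("longest expressible dma_fence_wait_timeout: ~%.1f days\n", days);
        return 0;
}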

So my main argument is that we can't use dma-fence for long-running workloads not because 
the workload runs very long, but rather because of the fact that we use page faults 
for long-running workloads. If we enable page faults for short-running workloads, we can't 
use dma-fence either. Page faults are the key thing here.

Now, since we use page faults, which are *fundamentally* in conflict with the 
dma-fence design, why not just introduce an independent concept such as a 
user-fence instead of extending the existing dma-fence?

I like a unified design. If the drm scheduler and dma-fence can be extended to work for 
everything, that is beautiful. But it seems we have some fundamental problems here.

Thanks,
Oak

> -Original Message-
> From: dri-devel  On Behalf Of
> Matthew Brost
> Sent: April 3, 2023 8:22 PM
> To: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org
> Cc: robdcl...@chromium.org; thomas.hellst...@linux.intel.com; 
> airl...@linux.ie;
> l...@asahilina.net; boris.brezil...@collabora.com; Brost, Matthew
> ; christian.koe...@amd.com;
> faith.ekstr...@collabora.com
> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
> 
> Hello,
> 
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
> 
> This can we thought of as 4 parts detailed below.
> 
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
> 
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
> 
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.
> 
> - Generic messaging interface for DRM scheduler
> 
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.
> 
> - Support for using TDR for all error paths of a scheduler / entity
> 
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> 
> - Annotate dma-fences for long running workloads.
> 
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Tvrtko Ursulin



On 04/04/2023 14:52, Matthew Brost wrote:

On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote:


On 04/04/2023 01:22, Matthew Brost wrote:

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can we thought of as 4 parts detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC) which is a new paradigm WRT to the DRM scheduler and presents
severals problems as the DRM was originally designed to schedule jobs on
hardware queues. The main problem being that DRM scheduler expects the
submission order of jobs to be the completion order of jobs even across
multiple entities. This assumption falls apart with a firmware scheduler
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution for was originally thought of by Faith during
the initial prototype of Xe, create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers usage with a 1 to 1
relationship and can be thought of as using scheduler as a dependency /
infligt job tracker rather than a true scheduler.


Once you add capability for a more proper 1:1 via
DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace
kthreads with a wq?

Or in other words, what purpose does the offloading of a job picking code to
a separate execution context serve? Could it be done directly in the 1:1
mode and leave kthread setup for N:M?



Addressed the other two on my reply to Christian...

For this one basically the concept of a single entity point IMO is a
very good concept which I'd like to keep. But most important reason
being the main execution thread (now worker) is kicked when a dependency
for a job is resolved, dependencies are dma-fences signaled via a
callback, and these call backs can be signaled in IRQ contexts. We
absolutely do not want to enter the backend in an IRQ context for a
variety of reasons.


Sounds like a fair enough requirement, but if drivers will not be 
comfortable with the wq conversion, it is probably possible to introduce 
some vfuncs for the 1:1 case which would allow scheduler users to override 
the scheduler wakeup and select a special "pick one job" path. That 
could allow 1:1 users to do their thing, leaving the rest as is. I mean you 
already have the special single entity scheduler, you'd just need to add 
some more specialization on the init, wake up, etc. paths.

And I will mention once more that I find a wq item with a loop such as:

while (!READ_ONCE(sched->pause_run_wq)) {
...

a bit dodgy. If you piggyback on any system_wq it smells of system-wide 
starvation, so for me any proposal with an option to use a system-shared 
wq is a no-go.
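
[Editor's note: one hedged alternative to such a loop, for illustration
only (all names are invented): process a single job per work item and
re-queue, so even a shared workqueue is never monopolised.]

#include <linux/compiler.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

struct my_sched {
        struct workqueue_struct *run_wq;
        struct work_struct run_work;
        bool pause_run_wq;
};

/* Driver-specific stub: push one ready job to the firmware ring and
 * return true if more jobs are still queued. */
static bool process_one_job(struct my_sched *sched)
{
        return false;
}

static void my_sched_run_work(struct work_struct *w)
{
        struct my_sched *sched = container_of(w, struct my_sched, run_work);

        if (READ_ONCE(sched->pause_run_wq))
                return;

        /* No loop: do one unit of work, then give the queue back. */
        if (process_one_job(sched))
                queue_work(sched->run_wq, &sched->run_work);
}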


Regards,

Tvrtko



Apart from those design level questions, low level open IMO still is that
default fallback of using the system_wq has the potential to affect latency
for other drivers. But that's for those driver owners to approve.

Regards,

Tvrtko


- Generic messaging interface for DRM scheduler

Idea is to be able to communicate to the submission backend with in band
(relative to main execution function) messages. Messages are backend
defined and flexable enough for any use case. In Xe we use these
messages to clean up entites, set properties for entites, and suspend /
resume execution of an entity [5]. I suspect other driver can leverage
this messaging concept too as it a convenient way to avoid races in the
backend.

- Support for using TDR for all error paths of a scheduler / entity

Fix a few races / bugs, add function to dynamically set the TDR timeout.

- Annotate dma-fences for long running workloads.

The idea here is to use dma-fences only as sync points within the
scheduler and never export them for long running workloads. By
annotating these fences as long running we ensure that these dma-fences
are never used in a way that breaks the dma-fence rules. A benefit of
thus approach is the scheduler can still safely flow control the
execution ring buffer via the job limit without breaking the dma-fence
rules.

Again this a first draft and looking forward to feedback.

Enjoy - Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel
[2] https://patchwork.freedesktop.org/series/112188/
[3] https://patchwork.freedesktop.org/series/114772/
[4] 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Matthew Brost
On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote:
> 
> On 04/04/2023 01:22, Matthew Brost wrote:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> 
> Once you add capability for a more proper 1:1 via
> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace
> kthreads with a wq?
> 
> Or in other words, what purpose does the offloading of a job picking code to
> a separate execution context serve? Could it be done directly in the 1:1
> mode and leave kthread setup for N:M?
> 

Addressed the other two in my reply to Christian...

For this one, basically the concept of a single entry point IMO is a
very good concept which I'd like to keep. But the most important reason
is that the main execution thread (now a worker) is kicked when a dependency
for a job is resolved; dependencies are dma-fences signaled via a
callback, and these callbacks can be signaled in IRQ contexts. We
absolutely do not want to enter the backend in an IRQ context for a
variety of reasons.

Matt

> Apart from those design level questions, low level open IMO still is that
> default fallback of using the system_wq has the potential to affect latency
> for other drivers. But that's for those driver owners to approve.
> 
> Regards,
> 
> Tvrtko
> 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> > 
> > - Support for using TDR for all error paths of a scheduler / entity
> > 
> > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > 
> > - Annotate dma-fences for long running workloads.
> > 
> > The idea here is to use dma-fences only as sync points within the
> > scheduler and never export them for long running workloads. By
> > annotating these fences as long running we ensure that these dma-fences
> > are never used in a way that breaks the dma-fence rules. A benefit of
> > thus approach is the scheduler can still safely flow control the
> > execution ring buffer via the job limit without breaking the dma-fence
> > rules.
> > 
> > Again this a first draft and looking forward to feedback.
> > 
> > Enjoy - Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > [2] https://patchwork.freedesktop.org/series/112188/
> > [3] https://patchwork.freedesktop.org/series/114772/
> > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188=1
> > [5] 
> > https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > 
> > Matthew Brost (8):
> >drm/sched: Convert drm scheduler to use a work queue rather than
> >  kthread
> >drm/sched: Move schedule policy to scheduler / entity
> >drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >drm/sched: Add generic scheduler message interface
> >drm/sched: Start run wq before TDR in drm_sched_start
> >drm/sched: Submit job before 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Matthew Brost
On Tue, Apr 04, 2023 at 11:48:36AM +0200, Christian König wrote:
> Am 04.04.23 um 11:43 schrieb Tvrtko Ursulin:
> > 
> > On 04/04/2023 01:22, Matthew Brost wrote:
> > > Hello,
> > > 
> > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > have been asked to merge our common DRM scheduler patches first as well
> > > as develop a common solution for long running workloads with the DRM
> > > scheduler. This RFC series is our first attempt at doing this. We
> > > welcome any and all feedback.
> > > 
> > > This can we thought of as 4 parts detailed below.
> > > 
> > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > entity (patches 1-3)
> > > 
> > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > severals problems as the DRM was originally designed to schedule jobs on
> > > hardware queues. The main problem being that DRM scheduler expects the
> > > submission order of jobs to be the completion order of jobs even across
> > > multiple entities. This assumption falls apart with a firmware scheduler
> > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > of order. A novel solution for was originally thought of by Faith during
> > > the initial prototype of Xe, create a 1 to 1 relationship between
> > > scheduler
> > > and entity. I believe the AGX driver [3] is using this approach and
> > > Boris may use approach as well for the Mali driver [4].
> > > 
> > > To support a 1 to 1 relationship we move the main execution function
> > > from a kthread to a work queue and add a new scheduling mode which
> > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > relationship and can be thought of as using scheduler as a dependency /
> > > infligt job tracker rather than a true scheduler.
> > 
> > Once you add capability for a more proper 1:1 via
> > DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to
> > replace kthreads with a wq?
> 
> Yeah, I fail to see the need for that as well. On the other hand it would be
> really nice to get rid of the rq/priority design in general.
> 

WRT replacing the kthread with a worker, I think the idea is that you don't
want to tie kthread creation directly to a uAPI, as a user could then
create 1000s of kthreads.

FWIW, in a private email about a year ago you actually suggested using a work
queue, Christian.

> > 
> > Or in other words, what purpose does the offloading of a job picking
> > code to a separate execution context serve? Could it be done directly in
> > the 1:1 mode and leave kthread setup for N:M?
> 
> Well moving from kthread to work item is beneficial on it's own since the
> later usually follows the the source of it's queue. E.g. when this is
> triggered by an interrupt we run on the CPU of the interrupt and not have
> inter CPU signaling.
> 
> > 
> > Apart from those design level questions, low level open IMO still is
> > that default fallback of using the system_wq has the potential to affect
> > latency for other drivers. But that's for those driver owners to
> > approve.
> 
> Oh, yeah that's a good point as well. This needs some high priority queue.
>

system_highpri_wq?

Matt

> Christian.
> 
> > 
> > Regards,
> > 
> > Tvrtko
> > 
> > > - Generic messaging interface for DRM scheduler
> > > 
> > > Idea is to be able to communicate to the submission backend with in band
> > > (relative to main execution function) messages. Messages are backend
> > > defined and flexable enough for any use case. In Xe we use these
> > > messages to clean up entites, set properties for entites, and suspend /
> > > resume execution of an entity [5]. I suspect other driver can leverage
> > > this messaging concept too as it a convenient way to avoid races in the
> > > backend.
> > > 
> > > - Support for using TDR for all error paths of a scheduler / entity
> > > 
> > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > 
> > > - Annotate dma-fences for long running workloads.
> > > 
> > > The idea here is to use dma-fences only as sync points within the
> > > scheduler and never export them for long running workloads. By
> > > annotating these fences as long running we ensure that these dma-fences
> > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > thus approach is the scheduler can still safely flow control the
> > > execution ring buffer via the job limit without breaking the dma-fence
> > > rules.
> > > 
> > > Again this a first draft and looking forward to feedback.
> > > 
> > > Enjoy - Matt
> > > 
> > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > [2] https://patchwork.freedesktop.org/series/112188/
> > > [3] https://patchwork.freedesktop.org/series/114772/
> > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188=1
> > > [5] 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Matthew Brost
On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
> Hi,
> 
> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> > 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> 
> Oh, please absolutely *don't* do this.
> 
> This is basically the design which makes a bunch of stuff so horrible broken
> on Windows.
> 
> I can explain it in more detail if necessary, but I strongly recommend to
> not go down this path.
> 

I'm afraid we are going to have to discuss this further. Let me explain
my reasoning: basically the idea is to have a single main entry point to
the backend - the work queue. This avoids the need for a lock between run_job
and any message that changes an entity's state, and it also really helps
during the reset flows (either TDR or GT reset) as we can call
drm_sched_run_wq_stop to ensure that nothing else is in the backend
changing an entity's state. It all works out really nicely actually; our
GuC backend is incredibly stable (hasn't really had a bug pop up in about a
year) and way simpler than what we did in the i915. I think the simplicity
is largely due to this design of limiting the entry points.

I personally don't see how this is a poor design; limiting entry points
absolutely makes sense to me. If it didn't, why not just call cleanup_job
directly, bypassing the main execution thread (now a worker)? This is the exact
same concept.
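
[Editor's note: a hedged sketch to make the reset argument concrete.
drm_sched_run_wq_stop() is the helper named above, but its signature,
the restart helper and the fixup function are illustrative assumptions,
not the actual RFC API.]

#include <drm/gpu_scheduler.h>

/* Driver-specific repair of entity / ring state; placeholder only. */
static void my_fixup_entity_state(struct drm_gpu_scheduler *sched)
{
}

static void my_engine_reset(struct drm_gpu_scheduler *sched)
{
        /* Park the single entry point: after this returns, neither
         * run_job nor any message handler can be executing. */
        drm_sched_run_wq_stop(sched);

        /* Safe to walk and fix up entity state without racing the
         * submission backend. */
        my_fixup_entity_state(sched);

        /* Resume submission (name assumed to mirror the stop helper). */
        drm_sched_run_wq_start(sched);
}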

FWIW Asahi liked the idea as well and thinks it could be useful for AGX.
Matt

> Regards,
> Christian.
> 
> > 
> > - Support for using TDR for all error paths of a scheduler / entity
> > 
> > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > 
> > - Annotate dma-fences for long running workloads.
> > 
> > The idea here is to use dma-fences only as sync points within the
> > scheduler and never export them for long running workloads. By
> > annotating these fences as long running we ensure that these dma-fences
> > are never used in a way that breaks the dma-fence rules. A benefit of
> > thus approach is the scheduler can still safely flow control the
> > execution ring buffer via the job limit without breaking the dma-fence
> > rules.
> > 
> > Again this a first draft and looking forward to feedback.
> > 
> > Enjoy - Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > [2] https://patchwork.freedesktop.org/series/112188/
> > [3] https://patchwork.freedesktop.org/series/114772/
> > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188=1
> > [5] 
> > https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > 
> > Matthew Brost (8):
> >drm/sched: Convert drm scheduler to use a work queue rather than
> >  kthread
> >drm/sched: Move schedule policy to scheduler / entity
> >drm/sched: Add 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Matthew Brost
On Tue, Apr 04, 2023 at 11:04:54AM +0200, Christian König wrote:
> Please make sure to CC Luben on scheduler patches.
> 

Sure, figured I was missing a few people.

Matt

> Regards,
> Christian.
> 
> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> > 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> > 
> > - Support for using TDR for all error paths of a scheduler / entity
> > 
> > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > 
> > - Annotate dma-fences for long running workloads.
> > 
> > The idea here is to use dma-fences only as sync points within the
> > scheduler and never export them for long running workloads. By
> > annotating these fences as long running we ensure that these dma-fences
> > are never used in a way that breaks the dma-fence rules. A benefit of
> > thus approach is the scheduler can still safely flow control the
> > execution ring buffer via the job limit without breaking the dma-fence
> > rules.
> > 
> > Again this a first draft and looking forward to feedback.
> > 
> > Enjoy - Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > [2] https://patchwork.freedesktop.org/series/112188/
> > [3] https://patchwork.freedesktop.org/series/114772/
> > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188=1
> > [5] 
> > https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > 
> > Matthew Brost (8):
> >drm/sched: Convert drm scheduler to use a work queue rather than
> >  kthread
> >drm/sched: Move schedule policy to scheduler / entity
> >drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >drm/sched: Add generic scheduler message interface
> >drm/sched: Start run wq before TDR in drm_sched_start
> >drm/sched: Submit job before starting TDR
> >drm/sched: Add helper to set TDR timeout
> >drm/syncobj: Warn on long running dma-fences
> > 
> > Thomas Hellström (2):
> >dma-buf/dma-fence: Introduce long-running completion fences
> >drm/sched: Support long-running sched entities
> > 
> >   drivers/dma-buf/dma-fence.c | 142 +++---
> >   drivers/dma-buf/dma-resv.c  |   5 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >   drivers/gpu/drm/drm_syncobj.c   |   5 +-
> >   drivers/gpu/drm/etnaviv/etnaviv_sched.c |   5 +-
> >   drivers/gpu/drm/lima/lima_sched.c   |   5 +-
> >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >   drivers/gpu/drm/msm/msm_ringbuffer.c|   5 +-
> >   drivers/gpu/drm/panfrost/panfrost_job.c |   5 +-
> >   drivers/gpu/drm/scheduler/sched_entity.c| 127 +++--
> >   

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Christian König

Am 04.04.23 um 11:43 schrieb Tvrtko Ursulin:


On 04/04/2023 01:22, Matthew Brost wrote:

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can we thought of as 4 parts detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC) which is a new paradigm WRT to the DRM scheduler and presents
severals problems as the DRM was originally designed to schedule jobs on
hardware queues. The main problem being that DRM scheduler expects the
submission order of jobs to be the completion order of jobs even across
multiple entities. This assumption falls apart with a firmware scheduler
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution for was originally thought of by Faith during
the initial prototype of Xe, create a 1 to 1 relationship between 
scheduler

and entity. I believe the AGX driver [3] is using this approach and
Boris may use approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers usage with a 1 to 1
relationship and can be thought of as using scheduler as a dependency /
infligt job tracker rather than a true scheduler.


Once you add capability for a more proper 1:1 via 
DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to 
replace kthreads with a wq?


Yeah, I fail to see the need for that as well. On the other hand it 
would be really nice to get rid of the rq/priority design in general.




Or in other words, what purpose does the offloading of a job picking 
code to a separate execution context serve? Could it be done directly 
in the 1:1 mode and leave kthread setup for N:M?


Well, moving from a kthread to a work item is beneficial on its own, since 
the latter usually follows the source of its queue. E.g. when this 
is triggered by an interrupt we run on the CPU of the interrupt and don't 
have inter-CPU signaling.
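
[Editor's note: a minimal sketch of that locality point, assuming a
bound (non-WQ_UNBOUND) workqueue so that queue_work() from the interrupt
handler runs the work on the same CPU; all names here are invented.]

#include <linux/interrupt.h>
#include <linux/workqueue.h>

struct my_sched {
        struct workqueue_struct *run_wq;        /* bound, i.e. not WQ_UNBOUND */
        struct work_struct run_work;
};

static irqreturn_t my_job_done_irq(int irq, void *data)
{
        struct my_sched *sched = data;

        /* On a bound workqueue the work item is typically executed on
         * the local CPU, so no cross-CPU wakeup is needed. */
        queue_work(sched->run_wq, &sched->run_work);
        return IRQ_HANDLED;
}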




Apart from those design level questions, low level open IMO still is 
that default fallback of using the system_wq has the potential to 
affect latency for other drivers. But that's for those driver owners 
to approve.


Oh, yeah that's a good point as well. This needs some high priority queue.

Christian.



Regards,

Tvrtko


- Generic messaging interface for DRM scheduler

Idea is to be able to communicate to the submission backend with in band
(relative to main execution function) messages. Messages are backend
defined and flexable enough for any use case. In Xe we use these
messages to clean up entites, set properties for entites, and suspend /
resume execution of an entity [5]. I suspect other driver can leverage
this messaging concept too as it a convenient way to avoid races in the
backend.

- Support for using TDR for all error paths of a scheduler / entity

Fix a few races / bugs, add function to dynamically set the TDR timeout.

- Annotate dma-fences for long running workloads.

The idea here is to use dma-fences only as sync points within the
scheduler and never export them for long running workloads. By
annotating these fences as long running we ensure that these dma-fences
are never used in a way that breaks the dma-fence rules. A benefit of
thus approach is the scheduler can still safely flow control the
execution ring buffer via the job limit without breaking the dma-fence
rules.

Again this a first draft and looking forward to feedback.

Enjoy - Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel
[2] https://patchwork.freedesktop.org/series/112188/
[3] https://patchwork.freedesktop.org/series/114772/
[4] https://patchwork.freedesktop.org/patch/515854/?series=112188=1
[5] 
https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031


Matthew Brost (8):
   drm/sched: Convert drm scheduler to use a work queue rather than
 kthread
   drm/sched: Move schedule policy to scheduler / entity
   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
   drm/sched: Add generic scheduler message interface
   drm/sched: Start run wq before TDR in drm_sched_start
   drm/sched: Submit job before starting TDR
   drm/sched: Add helper to set TDR timeout
   drm/syncobj: Warn on long running dma-fences

Thomas Hellström (2):
   dma-buf/dma-fence: Introduce long-running completion fences
   drm/sched: Support long-running sched entities

  drivers/dma-buf/dma-fence.c | 142 +++---
  drivers/dma-buf/dma-resv.c

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Tvrtko Ursulin



On 04/04/2023 01:22, Matthew Brost wrote:

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can be thought of as 4 parts, detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC), which is a new paradigm WRT the DRM scheduler and presents
several problems, as the DRM scheduler was originally designed to schedule
jobs on hardware queues. The main problem is that the DRM scheduler expects
the submission order of jobs to be the completion order of jobs, even across
multiple entities. This assumption falls apart with a firmware scheduler,
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution was originally thought of by Faith during
the initial prototyping of Xe: create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use this approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM scheduler that isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers' usage of a 1 to 1
relationship and can be thought of as using the scheduler as a dependency /
inflight job tracker rather than a true scheduler.


Once you add the capability for a more proper 1:1 via
DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to
replace kthreads with a wq?


Or in other words, what purpose does offloading the job-picking code to a
separate execution context serve? Could it be done directly in the 1:1 mode,
leaving the kthread setup for N:M?


Apart from those design level questions, a low level open issue IMO is that
the default fallback of using the system_wq has the potential to affect
latency for other drivers. But that's for those driver owners to approve.


Regards,

Tvrtko


- Generic messaging interface for DRM scheduler

The idea is to be able to communicate with the submission backend via in-band
(relative to the main execution function) messages. Messages are backend
defined and flexible enough for any use case. In Xe we use these
messages to clean up entities, set properties for entities, and suspend /
resume execution of an entity [5]. I suspect other drivers can leverage
this messaging concept too, as it is a convenient way to avoid races in the
backend.

- Support for using TDR for all error paths of a scheduler / entity

Fix a few races / bugs, and add a function to dynamically set the TDR timeout.

- Annotate dma-fences for long running workloads.

The idea here is to use dma-fences only as sync points within the
scheduler and never export them for long running workloads. By
annotating these fences as long running we ensure that these dma-fences
are never used in a way that breaks the dma-fence rules. A benefit of
this approach is that the scheduler can still safely flow control the
execution ring buffer via the job limit without breaking the dma-fence
rules.

Again, this is a first draft and we are looking forward to feedback.

Enjoy - Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel
[2] https://patchwork.freedesktop.org/series/112188/
[3] https://patchwork.freedesktop.org/series/114772/
[4] https://patchwork.freedesktop.org/patch/515854/?series=112188=1
[5] 
https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031

Matthew Brost (8):
   drm/sched: Convert drm scheduler to use a work queue rather than
 kthread
   drm/sched: Move schedule policy to scheduler / entity
   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
   drm/sched: Add generic scheduler message interface
   drm/sched: Start run wq before TDR in drm_sched_start
   drm/sched: Submit job before starting TDR
   drm/sched: Add helper to set TDR timeout
   drm/syncobj: Warn on long running dma-fences

Thomas Hellström (2):
   dma-buf/dma-fence: Introduce long-running completion fences
   drm/sched: Support long-running sched entities

  drivers/dma-buf/dma-fence.c | 142 +++---
  drivers/dma-buf/dma-resv.c  |   5 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
  drivers/gpu/drm/drm_syncobj.c   |   5 +-
  drivers/gpu/drm/etnaviv/etnaviv_sched.c |   5 +-
  drivers/gpu/drm/lima/lima_sched.c   |   5 +-
  drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
  drivers/gpu/drm/msm/msm_ringbuffer.c|   5 +-
  drivers/gpu/drm/panfrost/panfrost_job.c |   5 +-
  drivers/gpu/drm/scheduler/sched_entity.c| 127 +++--
  

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Christian König

Hi,

Am 04.04.23 um 02:22 schrieb Matthew Brost:

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can be thought of as 4 parts, detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC), which is a new paradigm WRT the DRM scheduler and presents
several problems, as the DRM scheduler was originally designed to schedule
jobs on hardware queues. The main problem is that the DRM scheduler expects
the submission order of jobs to be the completion order of jobs, even across
multiple entities. This assumption falls apart with a firmware scheduler,
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution was originally thought of by Faith during
the initial prototyping of Xe: create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use this approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM scheduler that isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers' usage of a 1 to 1
relationship and can be thought of as using the scheduler as a dependency /
inflight job tracker rather than a true scheduler.

- Generic messaging interface for DRM scheduler

The idea is to be able to communicate with the submission backend via in-band
(relative to the main execution function) messages. Messages are backend
defined and flexible enough for any use case. In Xe we use these
messages to clean up entities, set properties for entities, and suspend /
resume execution of an entity [5]. I suspect other drivers can leverage
this messaging concept too, as it is a convenient way to avoid races in the
backend.


Oh, please absolutely *don't* do this.

This is basically the design which makes a bunch of stuff so horribly
broken on Windows.


I can explain it in more detail if necessary, but I strongly recommend
not going down this path.


Regards,
Christian.



- Support for using TDR for all error paths of a scheduler / entity

Fix a few races / bugs, and add a function to dynamically set the TDR timeout.

- Annotate dma-fences for long running workloads.

The idea here is to use dma-fences only as sync points within the
scheduler and never export them for long running workloads. By
annotating these fences as long running we ensure that these dma-fences
are never used in a way that breaks the dma-fence rules. A benefit of
this approach is that the scheduler can still safely flow control the
execution ring buffer via the job limit without breaking the dma-fence
rules.

Again, this is a first draft and we are looking forward to feedback.

Enjoy - Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel
[2] https://patchwork.freedesktop.org/series/112188/
[3] https://patchwork.freedesktop.org/series/114772/
[4] https://patchwork.freedesktop.org/patch/515854/?series=112188=1
[5] 
https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031

Matthew Brost (8):
   drm/sched: Convert drm scheduler to use a work queue rather than
 kthread
   drm/sched: Move schedule policy to scheduler / entity
   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
   drm/sched: Add generic scheduler message interface
   drm/sched: Start run wq before TDR in drm_sched_start
   drm/sched: Submit job before starting TDR
   drm/sched: Add helper to set TDR timeout
   drm/syncobj: Warn on long running dma-fences

Thomas Hellström (2):
   dma-buf/dma-fence: Introduce long-running completion fences
   drm/sched: Support long-running sched entities

  drivers/dma-buf/dma-fence.c | 142 +++---
  drivers/dma-buf/dma-resv.c  |   5 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
  drivers/gpu/drm/drm_syncobj.c   |   5 +-
  drivers/gpu/drm/etnaviv/etnaviv_sched.c |   5 +-
  drivers/gpu/drm/lima/lima_sched.c   |   5 +-
  drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
  drivers/gpu/drm/msm/msm_ringbuffer.c|   5 +-
  drivers/gpu/drm/panfrost/panfrost_job.c |   5 +-
  drivers/gpu/drm/scheduler/sched_entity.c| 127 +++--
  drivers/gpu/drm/scheduler/sched_fence.c |   6 +-
  drivers/gpu/drm/scheduler/sched_main.c  | 278 +++-
  drivers/gpu/drm/v3d/v3d_sched.c |  25 +-
  include/drm/gpu_scheduler.h | 130 +++--
  include/linux/dma-fence.h   |  60 -
  16 files changed, 649 

Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-04 Thread Christian König

Please make sure to CC Luben on scheduler patches.

Regards,
Christian.

Am 04.04.23 um 02:22 schrieb Matthew Brost:

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can be thought of as 4 parts, detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC), which is a new paradigm WRT the DRM scheduler and presents
several problems, as the DRM scheduler was originally designed to schedule
jobs on hardware queues. The main problem is that the DRM scheduler expects
the submission order of jobs to be the completion order of jobs, even across
multiple entities. This assumption falls apart with a firmware scheduler,
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution was originally thought of by Faith during
the initial prototyping of Xe: create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use this approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM scheduler that isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers' usage of a 1 to 1
relationship and can be thought of as using the scheduler as a dependency /
inflight job tracker rather than a true scheduler.

- Generic messaging interface for DRM scheduler

The idea is to be able to communicate with the submission backend via in-band
(relative to the main execution function) messages. Messages are backend
defined and flexible enough for any use case. In Xe we use these
messages to clean up entities, set properties for entities, and suspend /
resume execution of an entity [5]. I suspect other drivers can leverage
this messaging concept too, as it is a convenient way to avoid races in the
backend.

- Support for using TDR for all error paths of a scheduler / entity

Fix a few races / bugs, and add a function to dynamically set the TDR timeout.

- Annotate dma-fences for long running workloads.

The idea here is to use dma-fences only as sync points within the
scheduler and never export them for long running workloads. By
annotating these fences as long running we ensure that these dma-fences
are never used in a way that breaks the dma-fence rules. A benefit of
this approach is that the scheduler can still safely flow control the
execution ring buffer via the job limit without breaking the dma-fence
rules.

Again, this is a first draft and we are looking forward to feedback.

Enjoy - Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel
[2] https://patchwork.freedesktop.org/series/112188/
[3] https://patchwork.freedesktop.org/series/114772/
[4] https://patchwork.freedesktop.org/patch/515854/?series=112188=1
[5] 
https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031

Matthew Brost (8):
   drm/sched: Convert drm scheduler to use a work queue rather than
 kthread
   drm/sched: Move schedule policy to scheduler / entity
   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
   drm/sched: Add generic scheduler message interface
   drm/sched: Start run wq before TDR in drm_sched_start
   drm/sched: Submit job before starting TDR
   drm/sched: Add helper to set TDR timeout
   drm/syncobj: Warn on long running dma-fences

Thomas Hellström (2):
   dma-buf/dma-fence: Introduce long-running completion fences
   drm/sched: Support long-running sched entities

  drivers/dma-buf/dma-fence.c | 142 +++---
  drivers/dma-buf/dma-resv.c  |   5 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
  drivers/gpu/drm/drm_syncobj.c   |   5 +-
  drivers/gpu/drm/etnaviv/etnaviv_sched.c |   5 +-
  drivers/gpu/drm/lima/lima_sched.c   |   5 +-
  drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
  drivers/gpu/drm/msm/msm_ringbuffer.c|   5 +-
  drivers/gpu/drm/panfrost/panfrost_job.c |   5 +-
  drivers/gpu/drm/scheduler/sched_entity.c| 127 +++--
  drivers/gpu/drm/scheduler/sched_fence.c |   6 +-
  drivers/gpu/drm/scheduler/sched_main.c  | 278 +++-
  drivers/gpu/drm/v3d/v3d_sched.c |  25 +-
  include/drm/gpu_scheduler.h | 130 +++--
  include/linux/dma-fence.h   |  60 -
  16 files changed, 649 insertions(+), 184 deletions(-)





Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-03 Thread Matthew Brost
On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> Hi, thanks for the Cc!
> 

No problem.

> On 04/04/2023 09.22, Matthew Brost wrote:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can be thought of as 4 parts, detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC), which is a new paradigm WRT the DRM scheduler and presents
> > several problems, as the DRM scheduler was originally designed to schedule
> > jobs on hardware queues. The main problem is that the DRM scheduler expects
> > the submission order of jobs to be the completion order of jobs, even across
> > multiple entities. This assumption falls apart with a firmware scheduler,
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution was originally thought of by Faith during
> > the initial prototyping of Xe: create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use this approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM scheduler that isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers' usage of a 1 to 1
> > relationship and can be thought of as using the scheduler as a dependency /
> > inflight job tracker rather than a true scheduler.
> 
> Yup, we're in the exact same situation with drm/asahi, so this is very
> welcome! We've been using the existing scheduler as-is, but this should help
> remove some unneeded complexity in this use case.
>

That's the idea.

> Do you want me to pull in this series into our tree and make sure this all
> works out for us?
>

We tested this in Xe and it definitely works for us but the more testing
the better.

> I also have a couple bugfixes for drm/sched I need to send out, but I think
> the rebase/merge with this series should be trivial. I'll send that out this
> week.
> 
> > - Generic messaging interface for DRM scheduler
> > 
> > The idea is to be able to communicate with the submission backend via in-band
> > (relative to the main execution function) messages. Messages are backend
> > defined and flexible enough for any use case. In Xe we use these
> > messages to clean up entities, set properties for entities, and suspend /
> > resume execution of an entity [5]. I suspect other drivers can leverage
> > this messaging concept too, as it is a convenient way to avoid races in the
> > backend.
> 
> We haven't needed this so far (mostly by using fine-grained locking and
> refcounting all over the place) but I can see it being useful to simplify
> some of those constructs and maybe avoid potential deadlocks in some places.
> I'm not sure yet whether we can fully get rid of the main queue
> refcounting/locking (our completion/error signaling path doesn't map well to
> DMA fences directly so we still need something there to get from the global
> GPU completion signaling thread to individual queues) but it might be a step
> in the right direction at least!
>

With this messaging interface we essentially have a lockless submission
backend which is really nice compared to what we did in the i915.

Matt

> ~~ Lina
> 


Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans

2023-04-03 Thread Asahi Lina

Hi, thanks for the Cc!

On 04/04/2023 09.22, Matthew Brost wrote:

Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can be thought of as 4 parts, detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC), which is a new paradigm WRT the DRM scheduler and presents
several problems, as the DRM scheduler was originally designed to schedule
jobs on hardware queues. The main problem is that the DRM scheduler expects
the submission order of jobs to be the completion order of jobs, even across
multiple entities. This assumption falls apart with a firmware scheduler,
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution was originally thought of by Faith during
the initial prototyping of Xe: create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use this approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM scheduler that isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers' usage of a 1 to 1
relationship and can be thought of as using the scheduler as a dependency /
inflight job tracker rather than a true scheduler.


Yup, we're in the exact same situation with drm/asahi, so this is very 
welcome! We've been using the existing scheduler as-is, but this should 
help remove some unneeded complexity in this use case.


Do you want me to pull in this series into our tree and make sure this 
all works out for us?


I also have a couple bugfixes for drm/sched I need to send out, but I 
think the rebase/merge with this series should be trivial. I'll send 
that out this week.



- Generic messaging interface for DRM scheduler

The idea is to be able to communicate with the submission backend via in-band
(relative to the main execution function) messages. Messages are backend
defined and flexible enough for any use case. In Xe we use these
messages to clean up entities, set properties for entities, and suspend /
resume execution of an entity [5]. I suspect other drivers can leverage
this messaging concept too, as it is a convenient way to avoid races in the
backend.


We haven't needed this so far (mostly by using fine-grained locking and 
refcounting all over the place) but I can see it being useful to 
simplify some of those constructs and maybe avoid potential deadlocks in 
some places. I'm not sure yet whether we can fully get rid of the main 
queue refcounting/locking (our completion/error signaling path doesn't 
map well to DMA fences directly so we still need something there to get 
from the global GPU completion signaling thread to individual queues) 
but it might be a step in the right direction at least!


~~ Lina