On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> Hi,
> 
> Using dma-fence for completion/dependency tracking for long-running workloads
> (more precisely, on-demand paging/page-fault enabled workloads) can cause
> deadlock. This seems to be the significant issue here. Other issues, such as
> the drm scheduler completion order implication, are minor and can be solved
> inside the framework of the drm scheduler. We need to evaluate the paths below:
> 
>       1) Still use the drm scheduler for job submission, and use dma-fence for
> job completion waiting/dependency tracking. This is the solution proposed in
> this series. Annotate dma-fences for long-running workloads: the user can
> still wait on a dma-fence for job completion but can't wait on a dma-fence
> while holding any memory management locks. We still use dma-fence for
> dependency tracking, but it is very easy to run into deadlock when on-demand
> paging is in the picture. The annotation helps us detect deadlocks but does
> not solve the deadlock problem. Seems *not* a complete solution: it is almost
> impossible to completely avoid dependency deadlocks in a complex runtime
> environment.
>

No one can wait on an LR fence, so it is impossible to deadlock. The
annotations enforce this. This is literally only for flow controlling the
ring / holding pending jobs in the DRM scheduler's list.
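
To illustrate, a minimal sketch (dma_fence_is_lr() is assumed from
Thomas's annotation patch in this series; the guard helper itself is
illustrative, not code from the series):

	#include <linux/dma-fence.h>

	/* LR fences are KMD-internal sync points: never exported and
	 * never waitable from unknown context, so a wait under a memory
	 * management lock can never block on one.
	 */
	static inline bool dma_fence_wait_allowed(struct dma_fence *fence)
	{
		return !WARN_ON(dma_fence_is_lr(fence));
	}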

>       2) Still use the drm scheduler but don't use dma-fence for completion
> signaling and dependency tracking. This way we still get some functions for
> free (reset, error handling, ring flow control, as Matt said) from the drm
> scheduler, but push the dependency/completion tracking completely to user
> space using techniques such as user-space fences. User space doesn't have a
> chance to wait on a fence while holding a kernel memory management lock, thus
> the dma-fence deadlock issue is solved.
>

We use user space fence for syncs.
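
A user fence is just an (address, value) pair in memory that the GPU /
firmware writes; userspace waits on it with no kernel locks held, so
none of the dma-fence deadlock rules apply. A minimal userspace-side
sketch (illustrative, not the Xe uAPI):

	#include <stdatomic.h>
	#include <stdint.h>

	struct user_fence {
		_Atomic uint64_t *addr;	/* mapped, GPU-writable seqno */
		uint64_t value;		/* complete when *addr >= value */
	};

	static void user_fence_wait(const struct user_fence *uf)
	{
		/* A real implementation would futex- or ioctl-wait
		 * instead of spinning.
		 */
		while (atomic_load_explicit(uf->addr,
					    memory_order_acquire) < uf->value)
			;
	}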

>       3) Completely discard the drm scheduler and dma-fence for long-running
> workloads. Use a user queue/doorbell for super-fast submission, directly
> interacting with the fw scheduler. Use user fences for completion/dependency
> tracking.
> 

This is a hard no from me, I want 1 submission path in Xe. Either we use
the DRM scheduler or we don't.

Matt

> Thanks,
> Oak
> 
> > -----Original Message-----
> > From: Christian König <christian.koe...@amd.com>
> > Sent: April 5, 2023 3:30 AM
> > To: Brost, Matthew <matthew.br...@intel.com>; Zeng, Oak
> > <oak.z...@intel.com>
> > Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org;
> > robdcl...@chromium.org; thomas.hellst...@linux.intel.com; airl...@linux.ie;
> > l...@asahilina.net; boris.brezil...@collabora.com; 
> > faith.ekstr...@collabora.com
> > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > plans
> > 
> > On 04.04.23 20:08, Matthew Brost wrote:
> > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > >> Hi Matt, Thomas,
> > >>
> > >> Some very bold out of box thinking in this area:
> > >>
> > >> 1. So you want to use the drm scheduler and dma-fence for long-running
> > >> workloads. Why do you want to do this in the first place? What is the
> > >> benefit? The drm scheduler is pretty much a software scheduler. Modern
> > >> GPUs have a scheduler built in at the fw/hw level; as you said below, for
> > >> Intel this is the GuC. Can the xe driver just submit jobs directly to the
> > >> GuC, bypassing the drm scheduler?
> > >>
> > > If we did that, we would now have 2 paths for dependency tracking, flow
> > > controlling the ring, resets / error handling / backend submission
> > > implementations. We don't want this.
> > 
> > Well exactly that's the point: Why?
> > 
> > As far as I can see those are two completely distinct use cases, so you
> > absolutely do want two completely distinct implementations for this.
> > 
> > >> 2. Using dma-fence for long-running workloads: I am well aware that page
> > >> faults (and the consequent memory allocation/lock acquisition to fix the
> > >> fault) can cause deadlock for a dma-fence wait. But I am not convinced
> > >> that dma-fence can't be used purely because of the nature of the
> > >> workload, i.e. that it runs very long (indefinitely). I did the math: the
> > >> dma_fence_wait_timeout function's third param is the timeout, which is a
> > >> signed long type. If HZ is 1000, a 32-bit signed long gives
> > >> (2^31 - 1) / 1000 s, which is about 25 days. If 25 days is not long
> > >> enough, can we just change the timeout parameter to signed 64 bits so it
> > >> is much longer than our lifetime...
> > >>
> > >> So my main argument is that we can't use dma-fence for long-running
> > >> workloads not because the workload runs very long, but rather because of
> > >> the fact that we use page faults for long-running workloads. If we
> > >> enabled page faults for short-running workloads, we couldn't use
> > >> dma-fence either. Page faults are the key thing here.
> > >>
> > >> Now since we use page faults, which are *fundamentally* at odds with the
> > >> dma-fence design, why not just introduce an independent concept such as a
> > >> user fence instead of extending the existing dma-fence?
> > >>
> > >> I like a unified design. If the drm scheduler and dma-fence can be
> > >> extended to work for everything, it is beautiful. But it seems we have
> > >> some fundamental problems here.
> > >>
> > > Thomas's patches turn a dma-fence into a KMD sync point (e.g. we just use
> > > the signal / CB infrastructure) and enforce that we don't use these
> > > dma-fences from the scheduler in memory reclaim paths or export these to
> > > user space or other drivers. Think of this mode as a SW-only fence.
> > 
> > Yeah, and I truly think this is a really bad idea.
> > 
> > The signal/CB infrastructure in the dma_fence turned out to be the
> > absolute nightmare I initially predicted. Sorry to say that, but in
> > this case the "I've told you so" is appropriate in my opinion.
> > 
> > If we need infrastructure for long running dependency tracking we should
> > encapsulate that in a new framework and not try to mangle the existing
> > code for something it was never intended for.
> > 
> > Christian.
> > 
> > >
> > > Matt
> > >
> > >> Thanks,
> > >> Oak
> > >>
> > >>> -----Original Message-----
> > >>> From: dri-devel <dri-devel-boun...@lists.freedesktop.org> On Behalf Of
> > >>> Matthew Brost
> > >>> Sent: April 3, 2023 8:22 PM
> > >>> To: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org
> > >>> Cc: robdcl...@chromium.org; thomas.hellst...@linux.intel.com;
> > >>> airl...@linux.ie;
> > >>> l...@asahilina.net; boris.brezil...@collabora.com; Brost, Matthew
> > >>> <matthew.br...@intel.com>; christian.koe...@amd.com;
> > >>> faith.ekstr...@collabora.com
> > >>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > >>> plans
> > >>>
> > >>> Hello,
> > >>>
> > >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > >>> have been asked to merge our common DRM scheduler patches first as well
> > >>> as develop a common solution for long running workloads with the DRM
> > >>> scheduler. This RFC series is our first attempt at doing this. We
> > >>> welcome any and all feedback.
> > >>>
> > >>> This can be thought of as 4 parts, detailed below.
> > >>>
> > >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > >>> entity (patches 1-3)
> > >>>
> > >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > >>> GuC), which is a new paradigm WRT the DRM scheduler and presents
> > >>> several problems, as the DRM scheduler was originally designed to
> > >>> schedule jobs on hardware queues. The main problem is that the DRM
> > >>> scheduler expects the submission order of jobs to be the completion
> > >>> order of jobs, even across multiple entities. This assumption falls
> > >>> apart with a firmware scheduler, as a firmware scheduler has no concept
> > >>> of jobs and jobs can complete out of order. A novel solution was
> > >>> originally thought of by Faith during the initial prototype of Xe:
> > >>> create a 1 to 1 relationship between scheduler and entity. I believe
> > >>> the AGX driver [3] is using this approach and Boris may use this
> > >>> approach as well for the Mali driver [4].
> > >>>
> > >>> To support a 1 to 1 relationship we move the main execution function
> > >>> from a kthread to a work queue and add a new scheduling mode which
> > >>> bypasses code in the DRM scheduler which isn't needed in a 1 to 1
> > >>> relationship. The new scheduling mode should unify all drivers' usage
> > >>> with a 1 to 1 relationship and can be thought of as using the scheduler
> > >>> as a dependency / inflight job tracker rather than a true scheduler; a
> > >>> rough sketch follows.
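> > >>>
> > >>> A minimal sketch of the 1 to 1 mode (illustrative, not the final Xe
> > >>> code; xe_exec_queue is a hypothetical name here):
> > >>>
> > >>>	#include <drm/gpu_scheduler.h>
> > >>>
> > >>>	/* 1 to 1: the scheduler degenerates into a dependency /
> > >>>	 * inflight job tracker and the GuC does the actual scheduling.
> > >>>	 */
> > >>>	struct xe_exec_queue {
> > >>>		struct drm_gpu_scheduler sched;	/* one scheduler ... */
> > >>>		struct drm_sched_entity entity;	/* ... per entity */
> > >>>	};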
> > >>>
> > >>> - Generic messaging interface for DRM scheduler
> > >>>
> > >>> The idea is to be able to communicate with the submission backend via
> > >>> in-band (relative to the main execution function) messages. Messages are
> > >>> backend-defined and flexible enough for any use case. In Xe we use these
> > >>> messages to clean up entities, set properties for entities, and suspend /
> > >>> resume execution of an entity [5]. I suspect other drivers can leverage
> > >>> this messaging concept too, as it is a convenient way to avoid races in
> > >>> the backend; a rough sketch follows.
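> > >>>
> > >>> Roughly (a sketch: names follow this series' messaging patch, but
> > >>> treat exact fields and signatures as approximate):
> > >>>
> > >>>	struct drm_sched_msg {
> > >>>		struct list_head link;	/* on the scheduler's msg list */
> > >>>		void *private_data;	/* backend-defined payload */
> > >>>		unsigned int opcode;	/* backend-defined opcode */
> > >>>	};
> > >>>
> > >>>	/* Queued to the scheduler and processed in band from the main
> > >>>	 * execution function, so it cannot race job submission.
> > >>>	 */
> > >>>	void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
> > >>>			       struct drm_sched_msg *msg);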
> > >>>
> > >>> - Support for using TDR for all error paths of a scheduler / entity
> > >>>
> > >>> Fix a few races / bugs and add a function to dynamically set the TDR
> > >>> timeout.
> > >>>
> > >>> - Annotate dma-fences for long running workloads.
> > >>>
> > >>> The idea here is to use dma-fences only as sync points within the
> > >>> scheduler and never export them for long-running workloads. By
> > >>> annotating these fences as long-running we ensure that these dma-fences
> > >>> are never used in a way that breaks the dma-fence rules. A benefit of
> > >>> this approach is that the scheduler can still safely flow control the
> > >>> execution ring buffer via the job limit without breaking the dma-fence
> > >>> rules.
> > >>>
> > >>> Again, this is a first draft and we are looking forward to feedback.
> > >>>
> > >>> Enjoy - Matt
> > >>>
> > >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > >>> [2] https://patchwork.freedesktop.org/series/112188/
> > >>> [3] https://patchwork.freedesktop.org/series/114772/
> > >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > >>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > >>>
> > >>> Matthew Brost (8):
> > >>>    drm/sched: Convert drm scheduler to use a work queue rather than
> > >>>      kthread
> > >>>    drm/sched: Move schedule policy to scheduler / entity
> > >>>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > >>>    drm/sched: Add generic scheduler message interface
> > >>>    drm/sched: Start run wq before TDR in drm_sched_start
> > >>>    drm/sched: Submit job before starting TDR
> > >>>    drm/sched: Add helper to set TDR timeout
> > >>>    drm/syncobj: Warn on long running dma-fences
> > >>>
> > >>> Thomas Hellström (2):
> > >>>    dma-buf/dma-fence: Introduce long-running completion fences
> > >>>    drm/sched: Support long-running sched entities
> > >>>
> > >>>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > >>>   drivers/dma-buf/dma-resv.c                  |   5 +
> > >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > >>>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > >>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > >>>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > >>>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > >>>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > >>>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > >>>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > >>>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > >>>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > >>>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > >>>   include/drm/gpu_scheduler.h                 | 130 +++++++--
> > >>>   include/linux/dma-fence.h                   |  60 ++++-
> > >>>   16 files changed, 649 insertions(+), 184 deletions(-)
> > >>>
> > >>> --
> > >>> 2.34.1
> 
