On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> Hi,
>
> Using dma-fence for completion/dependency tracking for long-run workloads (more
> precisely, on-demand paging/page fault enabled workloads) can cause deadlock.
> This seems to be the significant issue here. Other issues, such as the drm
> scheduler completion order implication etc., are minor and can be solved inside
> the framework of the drm scheduler. We need to evaluate the paths below:
>
> 	1) Still use the drm scheduler for job submission, and use dma-fence for
> job completion waiting/dependency tracking. This is the solution proposed in this
> series. Annotate dma-fence for long-run workloads: user can still wait on a
> dma-fence for job completion but can't wait on a dma-fence while holding any
> memory management locks. We still use dma-fence for dependency tracking, but
> it just very easily runs into deadlock when on-demand paging is in the
> picture. The annotation helps us to detect deadlocks but does not solve the
> deadlock problem. Seems *not* a complete solution: it is almost impossible to
> completely avoid dependency deadlocks in a complex runtime environment.
>
No one can wait on an LR fence, so it is impossible to deadlock. The
annotations enforce this. Literally this is only for flow controlling the
ring / holding pending jobs in the DRM scheduler's list.

> 	2) Still use the drm scheduler but do not use dma-fence for completion
> signaling and dependency tracking. This way we still get some free functions
> (reset, error handling, ring flow control, as Matt said) from the drm scheduler,
> but push the dependency/completion tracking completely to user space using
> techniques such as user space fences. User space doesn't have a chance to wait
> on a fence while holding a kernel memory management lock, thus the dma-fence
> deadlock issue is solved.
>

We use user space fences for syncs.

> 	3) Completely discard the drm scheduler and dma-fence for long-run
> workloads. Use a user queue/doorbell for super fast submission, directly
> interact with the fw scheduler. Use user fences for completion/dependency tracking.
>

This is a hard no from me, I want 1 submission path in Xe. Either we use
the DRM scheduler or we don't.

Matt

> Thanks,
> Oak
>
> > -----Original Message-----
> > From: Christian König <christian.koe...@amd.com>
> > Sent: April 5, 2023 3:30 AM
> > To: Brost, Matthew <matthew.br...@intel.com>; Zeng, Oak
> > <oak.z...@intel.com>
> > Cc: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org;
> > robdcl...@chromium.org; thomas.hellst...@linux.intel.com; airl...@linux.ie;
> > l...@asahilina.net; boris.brezil...@collabora.com;
> > faith.ekstr...@collabora.com
> > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > plans
> >
> > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > >> Hi Matt, Thomas,
> > >>
> > >> Some very bold out-of-the-box thinking in this area:
> > >>
> > >> 1. So you want to use the drm scheduler and dma-fence for long running
> > >> workloads.
> > Why do you want to do this in the first place? What is the benefit? The drm
> > scheduler is pretty much a software scheduler. Modern GPUs have a scheduler
> > built at the fw/hw level; as you said below, for Intel this is the GuC. Can
> > the xe driver just directly submit jobs to the GuC, bypassing the drm
> > scheduler?
> > >>
> > > If we did that, now we have 2 paths for dependency tracking, flow
> > > controlling the ring, resets / error handling / backend submission
> > > implementations. We don't want this.
> >
> > Well exactly that's the point: Why?
> >
> > As far as I can see those are two completely distinct use cases, so you
> > absolutely do want two completely distinct implementations for this.
> >
> > >> 2. Using dma-fence for long run workloads: I am well aware that page
> > >> fault (and
> > the consequent memory allocation/lock acquiring to fix the fault) can cause
> > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
> > be used purely because of the nature of the workload, that it runs very long
> > (indefinitely).
> > I did the math: the dma_fence_wait_timeout function's third param is the
> > timeout, which is a signed long type. If HZ is 1000, this is about 23 days.
> > If 23 days is not long enough, can we just change the timeout parameter to
> > signed 64 bits so it is much longer than our lifetime...
> > >>
> > >> So I mainly argue that we can't use dma-fence for long-run workloads not
> > because the workload runs very long, but rather because of the fact that we
> > use page faults for long-run workloads. If we enabled page faults for
> > short-run workloads, we couldn't use dma-fence either.
Page fault is the key thing here.

> > >>
> > >> Now since we use page faults, which are *fundamentally* at odds with the
> > dma-fence design, why not just introduce an independent concept such as a user-
> > fence instead of extending the existing dma-fence?
> > >>
> > >> I like a unified design. If the drm scheduler and dma-fence can be extended to
> > >> work for
> > everything, it is beautiful. But it seems we have some fundamental problems
> > here.
> > >>
> > > Thomas's patches turn a dma-fence into a KMD sync point (e.g. we just use
> > > the signal / CB infrastructure) and enforce that we don't use these
> > > dma-fences from the scheduler in memory reclaim paths or export these to
> > > user space or other drivers. Think of this mode as a SW-only fence.
> >
> > Yeah, and I truly think this is a really bad idea.
> >
> > The signal/CB infrastructure in the dma_fence turned out to be the
> > absolute nightmare I initially predicted. Sorry to say that, but in
> > this case the "I've told you so" is appropriate in my opinion.
> >
> > If we need infrastructure for long running dependency tracking we should
> > encapsulate that in a new framework and not try to mangle the existing
> > code for something it was never intended for.
> >
> > Christian.
> >
> > >
> > > Matt
> > >
> > >> Thanks,
> > >> Oak
> > >>
> > >>> -----Original Message-----
> > >>> From: dri-devel <dri-devel-boun...@lists.freedesktop.org> On Behalf Of
> > >>> Matthew Brost
> > >>> Sent: April 3, 2023 8:22 PM
> > >>> To: dri-devel@lists.freedesktop.org; intel...@lists.freedesktop.org
> > >>> Cc: robdcl...@chromium.org; thomas.hellst...@linux.intel.com;
> > airl...@linux.ie;
> > >>> l...@asahilina.net; boris.brezil...@collabora.com; Brost, Matthew
> > >>> <matthew.br...@intel.com>; christian.koe...@amd.com;
> > >>> faith.ekstr...@collabora.com
> > >>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > plans
> > >>>
> > >>> Hello,
> > >>>
> > >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > >>> have been asked to merge our common DRM scheduler patches first as well
> > >>> as develop a common solution for long running workloads with the DRM
> > >>> scheduler. This RFC series is our first attempt at doing this. We
> > >>> welcome any and all feedback.
> > >>>
> > >>> This can be thought of as 4 parts, detailed below.
> > >>>
> > >>> - DRM scheduler changes for a 1 to 1 relationship between scheduler and
> > >>>   entity (patches 1-3)
> > >>>
> > >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > >>> GuC), which is a new paradigm WRT the DRM scheduler and presents
> > >>> several problems, as the DRM scheduler was originally designed to
> > >>> schedule jobs on hardware queues. The main problem is that the DRM
> > >>> scheduler expects the submission order of jobs to be the completion
> > >>> order of jobs even across multiple entities. This assumption falls apart
> > >>> with a firmware scheduler, as a firmware scheduler has no concept of
> > >>> jobs and jobs can complete out of order. A novel solution was originally
> > >>> thought of by Faith during the initial prototype of Xe: create a 1 to 1
> > >>> relationship between scheduler and entity. I believe the AGX driver [3]
> > >>> is using this approach and Boris may use this approach as well for the
> > >>> Mali driver [4].
> > >>>
> > >>> To support a 1 to 1 relationship we move the main execution function
> > >>> from a kthread to a work queue and add a new scheduling mode which
> > >>> bypasses code in the DRM scheduler which isn't needed in a 1 to 1
> > >>> relationship. The new scheduling mode should unify all drivers' usage
> > >>> with a 1 to 1 relationship and can be thought of as using the scheduler
> > >>> as a dependency / inflight job tracker rather than a true scheduler.
> > >>>
> > >>> - Generic messaging interface for DRM scheduler
> > >>>
> > >>> The idea is to be able to communicate with the submission backend via
> > >>> in-band (relative to the main execution function) messages. Messages are
> > >>> backend defined and flexible enough for any use case. In Xe we use these
> > >>> messages to clean up entities, set properties for entities, and suspend /
> > >>> resume execution of an entity [5]. I suspect other drivers can leverage
> > >>> this messaging concept too, as it is a convenient way to avoid races in
> > >>> the backend.
> > >>>
> > >>> - Support for using TDR for all error paths of a scheduler / entity
> > >>>
> > >>> Fix a few races / bugs, add a function to dynamically set the TDR timeout.
> > >>>
> > >>> - Annotate dma-fences for long running workloads.
> > >>>
> > >>> The idea here is to use dma-fences only as sync points within the
> > >>> scheduler and never export them for long running workloads. By
> > >>> annotating these fences as long running we ensure that these dma-fences
> > >>> are never used in a way that breaks the dma-fence rules. A benefit of
> > >>> this approach is that the scheduler can still safely flow control the
> > >>> execution ring buffer via the job limit without breaking the dma-fence
> > >>> rules.
> > >>>
> > >>> Again, this is a first draft and we are looking forward to feedback.
> > >>>
> > >>> Enjoy - Matt
> > >>>
> > >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > >>> [2] https://patchwork.freedesktop.org/series/112188/
> > >>> [3] https://patchwork.freedesktop.org/series/114772/
> > >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > >>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > >>>
> > >>> Matthew Brost (8):
> > >>>   drm/sched: Convert drm scheduler to use a work queue rather than
> > >>>     kthread
> > >>>   drm/sched: Move schedule policy to scheduler / entity
> > >>>   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > >>>   drm/sched: Add generic scheduler message interface
> > >>>   drm/sched: Start run wq before TDR in drm_sched_start
> > >>>   drm/sched: Submit job before starting TDR
> > >>>   drm/sched: Add helper to set TDR timeout
> > >>>   drm/syncobj: Warn on long running dma-fences
> > >>>
> > >>> Thomas Hellström (2):
> > >>>   dma-buf/dma-fence: Introduce long-running completion fences
> > >>>   drm/sched: Support long-running sched entities
> > >>>
> > >>>  drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > >>>  drivers/dma-buf/dma-resv.c                  |   5 +
> > >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > >>>  drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > >>>  drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > >>>  drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > >>>  drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > >>>  drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > >>>  drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > >>>  drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > >>>  drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > >>>  drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > >>>  drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > >>>  include/drm/gpu_scheduler.h                 | 130 +++++++--
> > >>>  include/linux/dma-fence.h                   |  60 ++++-
> > >>>  16 files changed, 649 insertions(+), 184 deletions(-)
> > >>>
> > >>> --
> > >>> 2.34.1
>
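
[Editorial sketch] To make the "long-running fence" idea discussed in this
thread concrete, below is a minimal, self-contained toy model in C. It is
not the dma-fence API and not the code in this series; every type and
function name is invented for illustration. It shows the two halves of the
argument: anything outside the scheduler that tries to wait on a fence
annotated as long-running is refused (the deadlock the annotation is meant
to catch), while the scheduler backend may still inspect fence state,
without blocking, to flow control the ring via the job limit.

/* toy_lr_fence.c - toy model of the long-running fence annotation */
#include <stdbool.h>
#include <stdio.h>

struct toy_fence {
	bool signaled;
	bool long_running;	/* annotation: fence backs a long-running workload */
};

/*
 * External waiters (other drivers, export paths, memory reclaim) must never
 * block on a long-running fence -- that is exactly the deadlock the
 * annotation is meant to catch. Here we simply refuse the wait.
 */
static int toy_fence_wait(struct toy_fence *f)
{
	if (f->long_running) {
		fprintf(stderr, "refusing to wait on a long-running fence\n");
		return -1;
	}
	while (!f->signaled)
		;	/* stand-in for a real blocking wait */
	return 0;
}

/*
 * The scheduler backend, by contrast, only looks at (never blocks on) fence
 * state to decide whether another job fits on the ring: the "flow control
 * via the job limit" use described in the cover letter.
 */
static bool toy_ring_has_space(const struct toy_fence *oldest_pending,
			       int inflight, int job_limit)
{
	return oldest_pending->signaled || inflight < job_limit;
}

int main(void)
{
	struct toy_fence lr   = { .signaled = false, .long_running = true };
	struct toy_fence done = { .signaled = true,  .long_running = false };

	if (toy_fence_wait(&lr))
		printf("wait on LR fence rejected, as intended\n");
	printf("wait on ordinary signaled fence: %d\n", toy_fence_wait(&done));
	printf("ring has space for another job: %d\n",
	       toy_ring_has_space(&lr, 2, 4));
	return 0;
}

Running this prints the rejected-wait message followed by the flow-control
result, mirroring the point made above that nothing outside the scheduler
may ever block on an LR fence.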