Re: [PATCH 01/13] drm: execution context for GEM buffers v4
On 6/19/23 11:48, Christian König wrote:

Hi,

Am 19.06.23 um 11:33 schrieb Thomas Hellström (Intel):

[SNIP]

Sometimes you want to just drop the contended lock after the above relaxation (eviction would be one example), and not add it as prelocked, if the contended object goes out of scope.

Eviction would in some situations be one such example; -EDEADLOCK leading to an error path where the object should otherwise be freed is another. Perhaps we could add an argument to prepare_obj() as to whether the object should be immediately put after relaxation.

I was considering a try_prepare version as well, that should cover this use case.

That sounds a bit different from this use-case. The use-case above would, on -EDEADLOCK, actually unlock everything, then lock-slow the contending lock and then immediately unlock it and drop.

Hui? What would that be good for?

It's for the case where you have nested locking, the contended lock has gone out-of-scope and you're probably not going to need it on the next attempt. I think the refcounted "prelocked" object that is lacking in the i915 variant will resolve all correctness / uaf issues, but still the object might be needlessly carried around for yet another locking round.

It sounds like try_prepare would just skip locking and continue with everything locked so far still locked?

Correct.

+	ret = drm_exec_obj_locked(exec, obj);
+	if (unlikely(ret)) {
+		dma_resv_unlock(obj->resv);
+		goto error_dropref;
+	}
+
+	swap(exec->prelocked, obj);
+
+error_dropref:
+	/* Always cleanup the contention so that error handling can kick in */
+	drm_gem_object_put(obj);
+	exec->contended = NULL;
+	return ret;
+}
+
+/**
+ * drm_exec_prepare_obj - prepare a GEM object for use
+ * @exec: the drm_exec object with the state
+ * @obj: the GEM object to prepare
+ * @num_fences: how many fences to reserve
+ *
+ * Prepare a GEM object for use by locking it and reserving fence slots. All
+ * successfully locked objects are put into the locked container.
+ *
+ * Returns: -EDEADLK if a contention is detected, -EALREADY when object is
+ * already locked, -ENOMEM when memory allocation failed and zero for success.
+ */
+int drm_exec_prepare_obj(struct drm_exec *exec, struct drm_gem_object *obj,
+			 unsigned int num_fences)
+{
+	int ret;
+
+	ret = drm_exec_lock_contended(exec);
+	if (unlikely(ret))
+		return ret;
+
+	if (exec->prelocked == obj) {
+		drm_gem_object_put(exec->prelocked);
+		exec->prelocked = NULL;
+
+		return dma_resv_reserve_fences(obj->resv, num_fences);
+	}
+
+	if (exec->flags & DRM_EXEC_FLAG_INTERRUPTIBLE)
+		ret = dma_resv_lock_interruptible(obj->resv, &exec->ticket);
+	else
+		ret = dma_resv_lock(obj->resv, &exec->ticket);
+
+	if (unlikely(ret == -EDEADLK)) {
+		drm_gem_object_get(obj);
+		exec->contended = obj;
+		return -EDEADLK;
+	}
+
+	if (unlikely(ret == -EALREADY &&
+		     (exec->flags & DRM_EXEC_FLAG_ALLOW_DUPLICATES)))
+		goto reserve_fences;
+
+	if (unlikely(ret))
+		return ret;
+
+	ret = drm_exec_obj_locked(exec, obj);
+	if (ret)
+		goto error_unlock;
+
+reserve_fences:
+	/* Keep locked when reserving fences fails */
+	return dma_resv_reserve_fences(obj->resv, num_fences);

Ugh, what is the use-case for keeping things locked here? How would a caller tell the difference between an error where everything is locked and one where nothing is locked? IMO, we should unlock on error here. If there indeed is a use-case, we should add a separate function for reserving fences for all locked objects, rather than returning sometimes locked on error, sometimes not.

We return the object locked here because it was too much churn to remove it again from the array and we are getting fully cleaned up at the end anyway.

OK, so if we add an unlock functionality, we could just have a consistent locking state on error return?

Yeah, that should work. Going to work on this.

Great.

Thanks,
Thomas

Regards,
Christian.

Thanks,
Thomas

Regards,
Christian.
Thanks,
Thomas

+
+error_unlock:
+	dma_resv_unlock(obj->resv);
+	return ret;
+}
+EXPORT_SYMBOL(drm_exec_prepare_obj);
+
+/**
+ * drm_exec_prepare_array - helper to prepare an array of objects
+ * @exec: the drm_exec object with the state
+ * @objects: array of GEM object to prepare
+ * @num_objects: number of GEM objects in the array
+ * @num_fences: number of fences to reserve on each GEM object
+ *
+ * Prepares all GEM objects in an array, handles contention but aborts on first
+ * error otherwise. Reserves @num_fences on each GEM object after locking it.
+ *
+ * Returns: -EALREADY when object is already locked, -ENOMEM when memory
+ * allocation failed and zero for success.
+ */
+int drm_exec_prepare_array(struct drm_exec *exec,
+
Re: [PATCH 01/13] drm: execution context for GEM buffers v4
Hi!

On 6/19/23 11:20, Christian König wrote:

Hi guys,

Am 19.06.23 um 10:59 schrieb Thomas Hellström (Intel):

[SNIP]

I really need to find some time to work on that anyway.

I've been playing with drm_exec for a couple of weeks now, and I wanted to share something I hacked to try and make the API simpler and more robust against misuse (see the below diff, which is a slightly adjusted version of your work).

It would be good if we could have someone taking charge of this series and addressing all review comments; I see some of my comments getting lost, we have multiple submitters and I can't find a dri-devel patchwork entry for this. Anyway, some comments below.

I can try to find some time for the series this week (as long as nobody comes along and has any burning roof).

In this version, the user is no longer in control of the retry loop. Instead, it provides an expression (a call to a sub-function) to be re-evaluated each time a contention is detected. IMHO, this makes the 'prepare-objs' functions easier to apprehend, and avoids any mistake like calling drm_exec_continue_on_contention() in an inner loop, or breaking out of the drm_exec_while_all_locked() loop unintentionally.

In i915 we've had a very similar helper to this, and while I agree this newer version would probably help make code cleaner, OTOH there also are some places where the short drm_exec_while_all_locked()-like blocks don't really motivate a separate function. Porting i915 to the current version will take some work. For the xe driver both versions would work fine.

Yeah, this is actually what my first version of this looked like. But I abandoned that approach because we have a lot of cases where we just quickly want to lock a few GEM objects and don't want the extra overhead of putting all the state into some bag to forward it to a function.
Some additional review comments not related to the interface change below:

It also makes the internal management a bit simpler, since we no longer call drm_exec_cleanup() on the first attempt, and can thus get rid of the DRM_EXEC_DUMMY trick.

In the below diff, I also re-introduced native support for duplicates as an opt-in, so we don't have to do things like:

	ret = drm_exec_prepare_obj(exec, obj, num_fences);
	if (ret == -EALREADY)
		ret = dma_resv_reserve_fences(obj->resv, num_fences);
	if (ret)
		return ret;

and can just do:

	ret = drm_exec_prepare_obj(exec, obj, num_fences);
	if (ret)
		return;

Of course drivers can open-code a wrapper doing the same thing, but given at least pvr and radeon need this, it'd be nice if the core could support it natively.

That's mostly it. Just wanted to share what I had in case you're interested. If not, that's fine too.

Regards,

Boris

---
 Documentation/gpu/drm-mm.rst |  12 ++
 drivers/gpu/drm/Kconfig      |   6 +
 drivers/gpu/drm/Makefile     |   2 +
 drivers/gpu/drm/drm_exec.c   | 274 +++
 include/drm/drm_exec.h       | 130 +
 5 files changed, 424 insertions(+)
 create mode 100644 drivers/gpu/drm/drm_exec.c
 create mode 100644 include/drm/drm_exec.h

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index fe40ee686f6e..c9f120cfe730 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -524,6 +524,18 @@ DRM Sync Objects
 .. kernel-doc:: drivers/gpu/drm/drm_syncobj.c
    :export:
 
+DRM Execution context
+=====================
+
+.. kernel-doc:: drivers/gpu/drm/drm_exec.c
+   :doc: Overview
+
+.. kernel-doc:: include/drm/drm_exec.h
+   :internal:
+
+.. kernel-doc:: drivers/gpu/drm/drm_exec.c
+   :export:
+
 GPU Scheduler
 =============

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 76991720637c..01a38fcdb1c4 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -194,6 +194,12 @@ config DRM_TTM
 	  GPU memory types. Will be enabled automatically if a device driver
 	  uses it.
+config DRM_EXEC
+	tristate
+	depends on DRM
+	help
+	  Execution context for command submissions
+
 config DRM_BUDDY
 	tristate
 	depends on DRM

diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 1873f64db171..18a02eaf2d49 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -79,6 +79,8 @@ obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
 #
 # Memory-management helpers
 #
+#
+obj-$(CONFIG_DRM_EXEC) += drm_exec.o

 obj-$(CONFIG_DRM_BUDDY) += drm_buddy.o

diff --git a/drivers/gpu/drm/drm_exec.c b/drivers/gpu/drm/drm_exec.c
new file mode 100644
index ..e0ad1a3e1610
--- /dev/null
+++ b/drivers/gpu/drm/drm_exec.c
@@ -0,0 +1,274 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+
+#include
+#include
+#include
+
+/**
+ * DOC: Overview
+ *
+ * This component mainly abstracts the retry loop necessary for locking
+ * multiple GEM objects while preparing hardware operations (e.g. command
+ * submissions, page table updates etc..).
Re: [PATCH 01/13] drm: execution context for GEM buffers v4
On 6/17/23 13:54, Boris Brezillon wrote:

+Matthew who's been using drm_exec in Xe if I'm correct.

Hello Christian,

On Wed, 14 Jun 2023 15:02:52 +0200 Boris Brezillon wrote:

On Wed, 14 Jun 2023 14:30:53 +0200 Christian König wrote:

Am 14.06.23 um 14:23 schrieb Boris Brezillon:

Hi Christian,

On Thu, 4 May 2023 13:51:47 +0200 "Christian König" wrote:

This adds the infrastructure for an execution context for GEM buffers which is similar to the existing TTMs execbuf util and intended to replace it in the long term. The basic functionality is that we abstract the necessary loop to lock many different GEM buffers with automated deadlock and duplicate handling.

As many other drivers do already, we are considering using drm_exec() for our resv locking in the PowerVR driver, so we might have more questions/comments in the coming days/weeks, but I already have a couple right now (see below).

v3: drop duplicate tracking, radeon is really the only one needing that

I think we'd actually be interested in duplicate tracking. Is there any way we can make it an optional feature through some extra helpers/flags? Doesn't have to be done in this patch series, I'm just wondering if this is something we can share as well.

You can still capture the -EALREADY error and act appropriately in your driver. For radeon it just means ignoring the error code and going ahead, but that behavior doesn't seem to be desired in most cases. Initially I thought we needed to separately track how many times and how often BOs are duplicated, but there is simply no use for this.

[...]

+/**
+ * DOC: Overview
+ *
+ * This component mainly abstracts the retry loop necessary for locking
+ * multiple GEM objects while preparing hardware operations (e.g. command
+ * submissions, page table updates etc..).
+ *
+ * If a contention is detected while locking a GEM object the cleanup procedure
+ * unlocks all previously locked GEM objects and locks the contended one first
+ * before locking any further objects.
+ *
+ * After an object is locked fences slots can optionally be reserved on the
+ * dma_resv object inside the GEM object.
+ *
+ * A typical usage pattern should look like this::
+ *
+ *	struct drm_gem_object *obj;
+ *	struct drm_exec exec;
+ *	unsigned long index;
+ *	int ret;
+ *
+ *	drm_exec_init(&exec, true);
+ *	drm_exec_while_not_all_locked(&exec) {
+ *		ret = drm_exec_prepare_obj(&exec, boA, 1);
+ *		drm_exec_continue_on_contention(&exec);
+ *		if (ret)
+ *			goto error;
+ *

Have you considered defining a drm_exec_try_prepare_obj_or_retry() combining drm_exec_prepare_obj() and drm_exec_continue_on_contention()?

#define drm_exec_try_prepare_obj_or_retry(exec, obj, num_fences) \
	({ \
		int __ret = drm_exec_prepare_obj(exec, obj, num_fences); \
		if (unlikely(drm_exec_is_contended(exec))) \
			continue; \
		__ret; \
	})

This way the following pattern

	ret = drm_exec_prepare_obj(&exec, boA, 1);
	drm_exec_continue_on_contention(&exec);
	if (ret)
		goto error;

can be turned into something more conventional:

	ret = drm_exec_try_prepare_obj_or_retry(&exec, boA, 1);
	if (ret)
		goto error;

Yeah, I was considering that as well. But then abandoned it as too complicated.

I really need to find some time to work on that anyway.

I've been playing with drm_exec for a couple of weeks now, and I wanted to share something I hacked to try and make the API simpler and more robust against misuse (see the below diff, which is a slightly adjusted version of your work).

It would be good if we could have someone taking charge of this series and addressing all review comments; I see some of my comments getting lost, we have multiple submitters and I can't find a dri-devel patchwork entry for this. Anyway, some comments below.

In this version, the user is no longer in control of the retry loop. Instead, it provides an expression (a call to a sub-function) to be re-evaluated each time a contention is detected.
IMHO, this makes the 'prepare-objs' functions easier to apprehend, and avoids any mistake like calling drm_exec_continue_on_contention() in an inner loop, or breaking out of the drm_exec_while_all_locked() loop unintentionally.

In i915 we've had a very similar helper to this, and while I agree this newer version would probably help make code cleaner, OTOH there also are some places where the short drm_exec_while_all_locked()-like blocks don't really motivate a separate function. Porting i915 to the current version will take some work. For the xe driver both versions would work fine.

Some additional review comments not related to the interface change below:

It also makes the internal management a bit simpler, since we no longer call drm_exec_cleanup() on the first attempt.
Re: [PATCH 01/13] drm: execution context for GEM buffers v4
On 5/4/23 13:51, Christian König wrote:

This adds the infrastructure for an execution context for GEM buffers which is similar to the existing TTMs execbuf util and intended to replace it in the long term. The basic functionality is that we abstract the necessary loop to lock many different GEM buffers with automated deadlock and duplicate handling.

v2: drop xarray and use dynamic resized array instead, the locking overhead is unnecessary and measurable.
v3: drop duplicate tracking, radeon is really the only one needing that.
v4: fixes issues pointed out by Danilo, some typos in comments and a helper for lock arrays of GEM objects.

Signed-off-by: Christian König

...

+/**
+ * struct drm_exec - Execution context
+ */
+struct drm_exec {
+	/**
+	 * @interruptible: If locks should be taken interruptible
+	 */
+	bool			interruptible;
+
+	/**
+	 * @ticket: WW ticket used for acquiring locks
+	 */
+	struct ww_acquire_ctx	ticket;
+
+	/**
+	 * @num_objects: number of objects locked
+	 */
+	unsigned int		num_objects;
+
+	/**
+	 * @max_objects: maximum objects in array
+	 */
+	unsigned int		max_objects;
+
+	/**
+	 * @objects: array of the locked objects
+	 */
+	struct drm_gem_object	**objects;

Hi, Christian.

Did you consider using a list here with links embedded in gem objects, now that only locked objects are to be put on the list / array? That should work as only the process owning the lock may access the list link. Apart from getting rid of reallocing, this is beneficial for the more general types of ww transactions that are used by i915 (and to a minor extent xe as well, I think). In those cases we would want to unlock a temporarily held object within the while_not_all_locked() loop and would then have to search the entire array for the correct pointer. Instead one could just remove it from the list of locked objects.

Thanks,
Thomas
Re: [PATCH 1/9] drm: execution context for GEM buffers v3
Hi Christian,

On 2/28/23 09:33, Christian König wrote:

This adds the infrastructure for an execution context for GEM buffers which is similar to the existing TTMs execbuf util and intended to replace it in the long term. The basic functionality is that we abstract the necessary loop to lock many different GEM buffers with automated deadlock and duplicate handling.

v2: drop xarray and use dynamic resized array instead, the locking overhead is unnecessary and measurable.
v3: drop duplicate tracking, radeon is really the only one needing that.

Signed-off-by: Christian König

Nice. This seems to have all we need for now for Xe as well, although not for i915 ATM.

I see future extension of this for things like sleeping lock evictions (where locks may go out-of-scope during the locking loop), including other resv containers, passing through dma-buf move_notify() etc. What are your thoughts around that? Extend this, or build a different functionality and make this a specialization of that?

What about the drm_gem_lock_reservations() interface? While a bit less capable, it's confusing having two APIs doing essentially the same thing; could we deprecate that?

Finally a comment below:

---
 Documentation/gpu/drm-mm.rst |  12 ++
 drivers/gpu/drm/Kconfig      |   6 +
 drivers/gpu/drm/Makefile     |   2 +
 drivers/gpu/drm/drm_exec.c   | 249 +++
 include/drm/drm_exec.h       | 115
 5 files changed, 384 insertions(+)
 create mode 100644 drivers/gpu/drm/drm_exec.c
 create mode 100644 include/drm/drm_exec.h

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index a79fd3549ff8..a52e6f4117d6 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -493,6 +493,18 @@ DRM Sync Objects
 .. kernel-doc:: drivers/gpu/drm/drm_syncobj.c
    :export:
 
+DRM Execution context
+=====================
+
+.. kernel-doc:: drivers/gpu/drm/drm_exec.c
+   :doc: Overview
+
+.. kernel-doc:: include/drm/drm_exec.h
+   :internal:
+
+..
kernel-doc:: drivers/gpu/drm/drm_exec.c
+   :export:
+
 GPU Scheduler
 =============

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 17d252dc25e2..84a5fc28c48d 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -200,6 +200,12 @@ config DRM_TTM
 	  GPU memory types. Will be enabled automatically if a device driver
 	  uses it.
 
+config DRM_EXEC
+	tristate
+	depends on DRM
+	help
+	  Execution context for command submissions
+
 config DRM_BUDDY
 	tristate
 	depends on DRM

diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index ab4460fcd63f..d40defbb0347 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -78,6 +78,8 @@ obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o
 #
 # Memory-management helpers
 #
+#
+obj-$(CONFIG_DRM_EXEC) += drm_exec.o

 obj-$(CONFIG_DRM_BUDDY) += drm_buddy.o

diff --git a/drivers/gpu/drm/drm_exec.c b/drivers/gpu/drm/drm_exec.c
new file mode 100644
index ..df546cc5a227
--- /dev/null
+++ b/drivers/gpu/drm/drm_exec.c
@@ -0,0 +1,249 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+
+#include
+#include
+#include
+
+/**
+ * DOC: Overview
+ *
+ * This component mainly abstracts the retry loop necessary for locking
+ * multiple GEM objects while preparing hardware operations (e.g. command
+ * submissions, page table updates etc..).
+ *
+ * If a contention is detected while locking a GEM object the cleanup procedure
+ * unlocks all previously locked GEM objects and locks the contended one first
+ * before locking any further objects.
+ *
+ * After an object is locked fences slots can optionally be reserved on the
+ * dma_resv object inside the GEM object.
+ *
+ * A typical usage pattern should look like this::
+ *
+ *	struct drm_gem_object *obj;
+ *	struct drm_exec exec;
+ *	unsigned long index;
+ *	int ret;
+ *
+ *	drm_exec_init(&exec, true);
+ *	drm_exec_while_not_all_locked(&exec) {
+ *		ret = drm_exec_prepare_obj(&exec, boA, 1);
+ *		drm_exec_continue_on_contention(&exec);
+ *		if (ret)
+ *			goto error;
+ *
+ *		ret = drm_exec_lock(&exec, boB, 1);

Should be drm_exec_prepare_obj()?

+ *		drm_exec_continue_on_contention(&exec);
+ *		if (ret)
+ *			goto error;
+ *	}
+ *
+ *	drm_exec_for_each_locked_object(&exec, index, obj) {
+ *		dma_resv_add_fence(obj->resv, fence, DMA_RESV_USAGE_READ);
+ *		...
+ *	}
+ *	drm_exec_fini(&exec);
+ *
+ * See struct dma_exec for more details.
+ */
+
+/* Dummy value used to initially enter the retry loop */
+#define DRM_EXEC_DUMMY	(void *)~0
+
+/* Unlock all objects and drop references */
+static void drm_exec_unlock_all(struct drm_exec *exec)
+{
+	struct drm_gem_object *obj;
+	unsigned long index;
+
Re: [PATCH 9/9] drm: move ttm_execbuf_util into vmwgfx
Hi,

On 3/9/23 06:14, Zack Rusin wrote:

On Wed, 2023-03-08 at 10:10 +0100, Christian König wrote:

Am 08.03.23 um 06:14 schrieb Zack Rusin:

On Tue, 2023-02-28 at 09:34 +0100, Christian König wrote:

VMWGFX is the only remaining user of this and should probably be moved over to drm_exec when it starts using GEM as well.

Is this because vmwgfx piggybacks buffer-id relocations on top of ttm validations, or did you just find it too hard to port it over? I'd prefer to avoid ttm moves to vmwgfx and at least have a clear idea of what we need to do to port.

I've just found it too hard to port it over because vmwgfx does some strange things with the validation code here. If you want we can take a deeper look at this together, but I need to find some time. Alternatively just tell me how to do it and I will add that to the patch set :)

I don't want to hold up the set (it looks good btw), because I had to look at something else today and tomorrow. We overload the validation lists to do quite a bit more than just reservations though. There are, I think, four separate things that need to be refactored there (Christian, feel free to skip this section, this is mainly for VMware folks on the team):

1) Relocations - userspace uses the id's of the bo's in the command stream, but on the kernel side those id's are different (or in vmwgfx terminology gem id != mob id), so the buffer id's in the command stream need to be replaced.

2) Resource validation. vmwgfx splits the userspace objects into buffers and resources (shaders, surfaces, contexts). The resources are not buffers but are backed by them. A single buffer can back multiple different resources, and sometimes the kernel has to actually allocate a buffer to back a resource and attach it to it (i.e. in common terminology the buffer is the memory and resources are placed in it). Now this shouldn't be in the kernel at all; the resources shouldn't have been kernel objects and instead we should have left them completely to userspace.
The reason behind this is partly historical: initially the resources (IIRC surfaces and shaders at that time) were per-device and not per-context, and not backed by buffer objects. The main reason for not doing the transition to user-space for resources was the SVGA device "between-draw-call validation". If you for example unbound a mob that was backing a resource that was bound to a context, you'd need to start a whole chain of resource invalidations to avoid the device complaining about invalid setups at awkward moments, for example when it felt like suspending. Not sure whether that is gone now though, or whether there are better ways to handle that.

/Thomas

3) Coherency tracking. We use validation lists as a central place for tracking which bo's/resources are used in a command buffer, and we use it to keep track of which buffers/resources will end up dirty to implement coherency.

4) Central place to allocate memory for relocation/validation nodes.

Where we want to end up is with 2 completely gone from the kernel side and 1, 3 and 4 refactored and cleaned up. I think there are at least 4 separate patches to this port, so it's not a trivial thing.

We will take a look at this on Friday in more detail to see what we can do.

z
Re: [PATCH 9/9] drm: move ttm_execbuf_util into vmwgfx
On 3/8/23 10:10, Christian König wrote:

Am 08.03.23 um 06:14 schrieb Zack Rusin:

On Tue, 2023-02-28 at 09:34 +0100, Christian König wrote:

VMWGFX is the only remaining user of this and should probably be moved over to drm_exec when it starts using GEM as well.

Is this because vmwgfx piggybacks buffer-id relocations on top of ttm validations, or did you just find it too hard to port it over? I'd prefer to avoid ttm moves to vmwgfx and at least have a clear idea of what we need to do to port.

I've just found it too hard to port it over because vmwgfx does some strange things with the validation code here. If you want we can take a deeper look at this together, but I need to find some time. Alternatively just tell me how to do it and I will add that to the patch set :)

Regards,
Christian.

Hmm. It turns out Xe was using these from the very start as well. But I will take a look at what it takes to deprecate that usage, so don't let that stop this removal. We need a more flexible WW transaction handling anyway.

/Thomas

z
Re: [PATCH v6 02/22] drm/gem: Move mapping of imported dma-bufs to drm_gem_mmap_obj()
On 6/29/22 10:22, Dmitry Osipenko wrote:

On 6/29/22 09:40, Thomas Hellström (Intel) wrote:

On 5/27/22 01:50, Dmitry Osipenko wrote:

Drivers that use drm_gem_mmap() and drm_gem_mmap_obj() helpers don't handle imported dma-bufs properly, which results in mapping of something else than the imported dma-buf. For example, on NVIDIA Tegra we get a hard lockup when userspace writes to the memory mapping of a dma-buf that was imported into Tegra's DRM GEM.

To fix this bug, move mapping of imported dma-bufs to drm_gem_mmap_obj(). Now mmapping of imported dma-bufs works properly for all DRM drivers.

Same comment about Fixes: as in patch 1,

Cc: sta...@vger.kernel.org
Signed-off-by: Dmitry Osipenko
---
 drivers/gpu/drm/drm_gem.c              | 3 +++
 drivers/gpu/drm/drm_gem_shmem_helper.c | 9 -
 drivers/gpu/drm/tegra/gem.c            | 4
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 86d670c71286..7c0b025508e4 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1038,6 +1038,9 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
 	if (obj_size < vma->vm_end - vma->vm_start)
 		return -EINVAL;
 
+	if (obj->import_attach)
+		return dma_buf_mmap(obj->dma_buf, vma, 0);

If we start enabling mmapping of imported dma-bufs on a majority of drivers in this way, how do we ensure that user-space is not blindly using the object mmap without calling the needed DMA_BUF_IOCTL_SYNC, which is required before and after cpu access of mmap'ed dma-bufs? I was under the impression (admittedly without looking) that the few drivers that actually called into dma_buf_mmap() had some private user-mode driver code in place that ensured this happened.

Since it's userspace that does the mapping, it should be the responsibility of userspace to do all the necessary syncing.

Sure, but nothing prohibits user-space from ignoring the syncing, thinking "it works anyway", testing those drivers where the syncing is a NOP.
And when a driver that finally needs syncing is tested, it's too late to fix all broken user-space.

I'm not sure whether anyone in userspace really needs to map imported dma-bufs in practice. Nevertheless, this use-case is broken and should be fixed by either allowing the mapping or prohibiting it.

Then I'd vote for prohibiting it, at least for now. And moving forward we could perhaps revisit the dma-buf need for syncing, requiring those drivers that actually need it to implement emulated coherent memory, which can be done not too inefficiently (vmwgfx being one example).

/Thomas
Re: [PATCH v6 08/22] drm/virtio: Unlock reservations on dma_resv_reserve_fences() error
On 5/27/22 01:50, Dmitry Osipenko wrote:

Unlock reservations on dma_resv_reserve_fences() error to fix recursive locking of the reservations when this error happens.

Fixes:
Cc: sta...@vger.kernel.org

With that fixed,
Reviewed-by: Thomas Hellström

Signed-off-by: Dmitry Osipenko
---
 drivers/gpu/drm/virtio/virtgpu_gem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_gem.c b/drivers/gpu/drm/virtio/virtgpu_gem.c
index 580a78809836..7db48d17ee3a 100644
--- a/drivers/gpu/drm/virtio/virtgpu_gem.c
+++ b/drivers/gpu/drm/virtio/virtgpu_gem.c
@@ -228,8 +228,10 @@ int virtio_gpu_array_lock_resv(struct virtio_gpu_object_array *objs)
 
 	for (i = 0; i < objs->nents; ++i) {
 		ret = dma_resv_reserve_fences(objs->objs[i]->resv, 1);
-		if (ret)
+		if (ret) {
+			virtio_gpu_array_unlock_resv(objs);
 			return ret;
+		}
 	}
 	return ret;
 }
Re: [PATCH v6 02/22] drm/gem: Move mapping of imported dma-bufs to drm_gem_mmap_obj()
On 5/27/22 01:50, Dmitry Osipenko wrote:

Drivers that use drm_gem_mmap() and drm_gem_mmap_obj() helpers don't handle imported dma-bufs properly, which results in mapping of something else than the imported dma-buf. For example, on NVIDIA Tegra we get a hard lockup when userspace writes to the memory mapping of a dma-buf that was imported into Tegra's DRM GEM.

To fix this bug, move mapping of imported dma-bufs to drm_gem_mmap_obj(). Now mmapping of imported dma-bufs works properly for all DRM drivers.

Same comment about Fixes: as in patch 1,

Cc: sta...@vger.kernel.org
Signed-off-by: Dmitry Osipenko
---
 drivers/gpu/drm/drm_gem.c              | 3 +++
 drivers/gpu/drm/drm_gem_shmem_helper.c | 9 -
 drivers/gpu/drm/tegra/gem.c            | 4
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 86d670c71286..7c0b025508e4 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1038,6 +1038,9 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
 	if (obj_size < vma->vm_end - vma->vm_start)
 		return -EINVAL;
 
+	if (obj->import_attach)
+		return dma_buf_mmap(obj->dma_buf, vma, 0);

If we start enabling mmapping of imported dma-bufs on a majority of drivers in this way, how do we ensure that user-space is not blindly using the object mmap without calling the needed DMA_BUF_IOCTL_SYNC, which is required before and after cpu access of mmap'ed dma-bufs? I was under the impression (admittedly without looking) that the few drivers that actually called into dma_buf_mmap() had some private user-mode driver code in place that ensured this happened.

/Thomas

+	/* Take a ref for this mapping of the object, so that the fault
+	 * handler can dereference the mmap offset's pointer to the object.
+	 * This reference is cleaned up by the corresponding vm_close

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
index 8ad0e02991ca..6190f5018986 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -609,17 +609,8 @@ EXPORT_SYMBOL_GPL(drm_gem_shmem_vm_ops);
  */
 int drm_gem_shmem_mmap(struct drm_gem_shmem_object *shmem, struct vm_area_struct *vma)
 {
-	struct drm_gem_object *obj = &shmem->base;
 	int ret;
 
-	if (obj->import_attach) {
-		/* Drop the reference drm_gem_mmap_obj() acquired.*/
-		drm_gem_object_put(obj);
-		vma->vm_private_data = NULL;
-
-		return dma_buf_mmap(obj->dma_buf, vma, 0);
-	}
-
 	ret = drm_gem_shmem_get_pages(shmem);
 	if (ret) {
 		drm_gem_vm_close(vma);

diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
index 7c7dd84e6db8..f92aa20d63bb 100644
--- a/drivers/gpu/drm/tegra/gem.c
+++ b/drivers/gpu/drm/tegra/gem.c
@@ -564,6 +564,10 @@ int __tegra_gem_mmap(struct drm_gem_object *gem, struct vm_area_struct *vma)
 {
 	struct tegra_bo *bo = to_tegra_bo(gem);
 
+	/* imported dma-buf is mapped by drm_gem_mmap_obj() */
+	if (gem->import_attach)
+		return 0;
+
 	if (!bo->pages) {
 		unsigned long vm_pgoff = vma->vm_pgoff;
 		int err;
Re: [PATCH v6 14/22] dma-buf: Introduce new locking convention
On 5/30/22 15:57, Dmitry Osipenko wrote:

On 5/30/22 16:41, Christian König wrote:

Hi Dmitry,

Am 30.05.22 um 15:26 schrieb Dmitry Osipenko:

Hello Christian,

On 5/30/22 09:50, Christian König wrote:

Hi Dmitry,

First of all please separate out this patch from the rest of the series, since this is a complex separate structural change.

I assume all the patches will go via the DRM tree in the end since the rest of the DRM patches in this series depend on this dma-buf change. But I see that separation may ease reviewing of the dma-buf changes, so let's try it.

That sounds like you are underestimating a bit how much trouble this will be. I have tried this before and failed because catching all the locks in the right code paths is very tricky. So expect some fallout from this and make sure the kernel test robot and CI systems are clean.

Sure, I'll fix up all the reported things in the next iteration. BTW, have you ever posted your version of the patch? It would be great if we could compare the changed code paths.

No, I never even finished creating it after realizing how much work it would be.

This patch introduces a new locking convention for dma-buf users. From now on all dma-buf importers are responsible for holding the dma-buf reservation lock around operations performed over dma-bufs. This patch implements the new dma-buf locking convention by:

1. Making dma-buf API functions take the reservation lock.

2. Adding new locked variants of the dma-buf API functions for drivers that need to manage imported dma-bufs under the held lock.

Instead of adding new locked variants, please mark all variants which expect to be called without a lock with an _unlocked postfix. This should make it easier to remove those in a follow-up patch set and then fully move the locking into the importer.

Do we really want to move all the locks to the importers? Seems the majority of drivers should be happy with the dma-buf helpers handling the locking for them.

Yes, I clearly think so.

3.
Converting all drivers to the new locking scheme.

I have strong doubts that you got all of them. At least radeon and nouveau should grab the reservation lock in their ->attach callbacks somehow.

Radeon and Nouveau use gem_prime_import_sg_table() and they take the resv lock already, so it seems they should be okay (?)

You are looking at the wrong side. You need to fix the export code path, not the import ones. See for example how attach on radeon works: drm_gem_map_attach->drm_gem_pin->radeon_gem_prime_pin->radeon_bo_reserve->ttm_bo_reserve->dma_resv_lock.

Yeah, I was looking at both sides, but missed this one.

Also, i915 will run into trouble with attach. In particular since i915 starts a full ww transaction in its attach callback to be able to lock other objects if migration is needed. I think i915 CI would catch this in a selftest.

Perhaps it's worthwhile to take a step back and figure out, if the importer is required to lock, which callbacks might need a ww acquire context? (And off-topic: since we do a lot of fancy stuff under dma-resv locks, including waiting for fences and other locks, IMO taking these locks uninterruptible should ring a warning bell.)

/Thomas

Same for nouveau and probably a few other exporters as well. That will certainly cause a deadlock if you don't fix it. I strongly suggest doing this step by step, first attach/detach and then the rest.

Thank you very much for the suggestions. I'll implement them in the next version.
Re: [PATCH v6 01/22] drm/gem: Properly annotate WW context on drm_gem_lock_reservations() error
Hi,

On 5/27/22 01:50, Dmitry Osipenko wrote:

Use ww_acquire_fini() in the error code paths. Otherwise lockdep thinks that the lock is held when the lock's memory is freed after a drm_gem_lock_reservations() error. The WW needs to be annotated as "freed"

s/WW/ww_acquire_context/ ?
s/"freed"/"released"/ ?

, which fixes the noisy "WARNING: held lock freed!" splat of the VirtIO-GPU driver with CONFIG_DEBUG_MUTEXES=y and lockdep enabled.

Cc: sta...@vger.kernel.org

Can you dig up the commit in error and add a Fixes: tag? Using that and "dim fixes" will also make the Cc: stable tag a bit more verbose.

With that fixed,
Reviewed-by: Thomas Hellström

Signed-off-by: Dmitry Osipenko
---
 drivers/gpu/drm/drm_gem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index eb0c2d041f13..86d670c71286 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1226,7 +1226,7 @@ drm_gem_lock_reservations(struct drm_gem_object **objs, int count,
 		ret = dma_resv_lock_slow_interruptible(obj->resv, acquire_ctx);
 		if (ret) {
-			ww_acquire_done(acquire_ctx);
+			ww_acquire_fini(acquire_ctx);
 			return ret;
 		}
 	}
@@ -1251,7 +1251,7 @@ drm_gem_lock_reservations(struct drm_gem_object **objs, int count,
 		goto retry;
 	}

-	ww_acquire_done(acquire_ctx);
+	ww_acquire_fini(acquire_ctx);

 	return ret;
 }
Re: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY
On 5/31/21 2:02 PM, Christian König wrote: Am 31.05.21 um 13:19 schrieb Thomas Hellström (Intel): On 5/31/21 12:56 PM, Christian König wrote: Am 31.05.21 um 12:46 schrieb Thomas Hellström (Intel): On 5/31/21 12:32 PM, Christian König wrote: Am 31.05.21 um 11:52 schrieb Thomas Hellström (Intel): Hi, Lang, On 5/31/21 10:19 AM, Yu, Lang wrote: [Public] Hi, On 5/27/21 3:30 AM, Lang Yu wrote: Make TTM_PL_FLAG_* start from zero and add TTM_PL_FLAG_TEMPORARY flag for temporary GTT allocation use. GTT is a driver private acronym, right? And it doesn't look like TTM_PL_FLAG_TEMPORARY will be used in core TTM, so should we instead set aside a mask in the PL flag for driver-private use? Hi Thomas, Thanks for your comments and advice, GTT means Graphics Translation Table here, seems a general acronym. TTM_PL_FLAG_TEMPORARY may also be used by other drives. I have made other patches for this. Please help review. Regards, Lang My point was really that the flag naming and documentation should reflect what core ttm is doing with that flag. If there is no specific core TTM usage, IMO we should move it to a driver specific flag to avoid future confusion. In particular a writer of a generic TTM resource manager should be able to know without looking at an old commit message what the placement flag is intended for. So here we add a flag where core TTM forces a bo move on validate and that's it. And that appears to be how it's used when an amdgpu bo is evicted to GTT, (btw should it be accounted in this situation?) The other use is to force the amdgpu driver to temporarily accept it into GTT when there is a lack of space, and IMHO that's a driver specific use and we should allocate a mask for driver specific flags for that. So shouldn't this be two flags, really? Well one flag makes sense for the use case at hand that drivers want to signal to TTM that an allocation is only temporary and not considered valid. 
That we then use this flag to implement temporary GTT allocations to avoid problems during eviction is just extending that use case.

OK, but it looked like there were actually two use-cases. One for possibly long-term VRAM evictions to GTT, the other for the temporary use-case where the hop resource allocations sometimes failed. Or did I misunderstand the code?

Ok, sounds like we need more documentation here. That's really one use case. Key point is we need a temporary allocation during multihop, which should be handled differently to normal allocations.

Yes, that part is clear from the patches. The part that I can't fit into that pattern is why the evict flags, when evicting from visible VRAM to GTT or ordinary VRAM, are marked with TTM_PL_FLAG_TEMPORARY. Wouldn't those remain evicted in that placement until re-validated to visible VRAM at an unknown future time?

Not necessarily. The situation we ran into was the following:

1. OOM on VRAM, we try to evict.
2. GTT space is used up as well, ok, evict directly to SYSTEM.
3. For VRAM->SYSTEM eviction we use a temporary bounce buffer.
4. Waiting for the bounce buffer to become idle is interrupted by a signal, so the BO is still backed by the bounce buffer.
5. Next CS, the BO is validated with VRAM|GTT. TTM sees that it is in GTT and doesn't move the buffer.
6. The CS goes into nirvana because bounce buffers are not meant to be used for CS (we can ignore alignment and accounting for them).

Yes, makes sense to me.

/Thomas

___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY
On 5/31/21 12:56 PM, Christian König wrote: Am 31.05.21 um 12:46 schrieb Thomas Hellström (Intel): On 5/31/21 12:32 PM, Christian König wrote: Am 31.05.21 um 11:52 schrieb Thomas Hellström (Intel): Hi, Lang, On 5/31/21 10:19 AM, Yu, Lang wrote: [Public] Hi, On 5/27/21 3:30 AM, Lang Yu wrote: Make TTM_PL_FLAG_* start from zero and add TTM_PL_FLAG_TEMPORARY flag for temporary GTT allocation use. GTT is a driver private acronym, right? And it doesn't look like TTM_PL_FLAG_TEMPORARY will be used in core TTM, so should we instead set aside a mask in the PL flag for driver-private use? Hi Thomas, Thanks for your comments and advice, GTT means Graphics Translation Table here, seems a general acronym. TTM_PL_FLAG_TEMPORARY may also be used by other drives. I have made other patches for this. Please help review. Regards, Lang My point was really that the flag naming and documentation should reflect what core ttm is doing with that flag. If there is no specific core TTM usage, IMO we should move it to a driver specific flag to avoid future confusion. In particular a writer of a generic TTM resource manager should be able to know without looking at an old commit message what the placement flag is intended for. So here we add a flag where core TTM forces a bo move on validate and that's it. And that appears to be how it's used when an amdgpu bo is evicted to GTT, (btw should it be accounted in this situation?) The other use is to force the amdgpu driver to temporarily accept it into GTT when there is a lack of space, and IMHO that's a driver specific use and we should allocate a mask for driver specific flags for that. So shouldn't this be two flags, really? Well one flag makes sense for the use case at hand that drivers want to signal to TTM that an allocation is only temporary and not considered valid. That we then use this flag to implement temporary GTT allocations to avoid problems during eviction is just extending that use case. 
OK, but it looked like there were actually two use-cases. One for possibly long-term VRAM evictions to GTT, the other for the temporary use-case where the hop resource allocations sometimes failed. Or did I misunderstand the code?

Ok, sounds like we need more documentation here. That's really one use case. Key point is we need a temporary allocation during multihop, which should be handled differently to normal allocations.

Yes, that part is clear from the patches. The part that I can't fit into that pattern is why the evict flags, when evicting from visible VRAM to GTT or ordinary VRAM, are marked with TTM_PL_FLAG_TEMPORARY. Wouldn't those remain evicted in that placement until re-validated to visible VRAM at an unknown future time?

Patch 3/3:

amdgpu_bo_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
abo->placements[0].flags |= TTM_PL_FLAG_TEMPORARY;

Christian.

/Thomas

Christian.

TTM_PL_FLAG_FORCE_MOVE and AMDGPU_PL_FLAG_TEMPORARY

Thanks,
/Thomas

Thomas

-Original Message-
From: Yu, Lang
Sent: Thursday, May 27, 2021 9:31 AM
To: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Cc: Koenig, Christian ; Huang, Ray ; Deucher, Alexander ; Yu, Lang
Subject: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY

Make TTM_PL_FLAG_* start from zero and add TTM_PL_FLAG_TEMPORARY flag for temporary GTT allocation use.

Signed-off-by: Lang Yu
---
 include/drm/ttm/ttm_placement.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index aa6ba4d0cf78..9f5cfc7c2d5a 100644
--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -47,8 +47,9 @@
 * top of the memory area, instead of the bottom.
 */

-#define TTM_PL_FLAG_CONTIGUOUS (1 << 19)
-#define TTM_PL_FLAG_TOPDOWN (1 << 22)
+#define TTM_PL_FLAG_CONTIGUOUS (1 << 0)
+#define TTM_PL_FLAG_TOPDOWN (1 << 1)
+#define TTM_PL_FLAG_TEMPORARY (1 << 2)

 /**
  * struct ttm_place
--
2.25.1
Re: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY
On 5/31/21 12:32 PM, Christian König wrote: Am 31.05.21 um 11:52 schrieb Thomas Hellström (Intel): Hi, Lang, On 5/31/21 10:19 AM, Yu, Lang wrote: [Public] Hi, On 5/27/21 3:30 AM, Lang Yu wrote: Make TTM_PL_FLAG_* start from zero and add TTM_PL_FLAG_TEMPORARY flag for temporary GTT allocation use. GTT is a driver private acronym, right? And it doesn't look like TTM_PL_FLAG_TEMPORARY will be used in core TTM, so should we instead set aside a mask in the PL flag for driver-private use? Hi Thomas, Thanks for your comments and advice, GTT means Graphics Translation Table here, seems a general acronym. TTM_PL_FLAG_TEMPORARY may also be used by other drives. I have made other patches for this. Please help review. Regards, Lang My point was really that the flag naming and documentation should reflect what core ttm is doing with that flag. If there is no specific core TTM usage, IMO we should move it to a driver specific flag to avoid future confusion. In particular a writer of a generic TTM resource manager should be able to know without looking at an old commit message what the placement flag is intended for. So here we add a flag where core TTM forces a bo move on validate and that's it. And that appears to be how it's used when an amdgpu bo is evicted to GTT, (btw should it be accounted in this situation?) The other use is to force the amdgpu driver to temporarily accept it into GTT when there is a lack of space, and IMHO that's a driver specific use and we should allocate a mask for driver specific flags for that. So shouldn't this be two flags, really? Well one flag makes sense for the use case at hand that drivers want to signal to TTM that an allocation is only temporary and not considered valid. That we then use this flag to implement temporary GTT allocations to avoid problems during eviction is just extending that use case. OK, but it looked like there were actually two use-cases. 
One for possibly long-term VRAM evictions to GTT, the other for the temporary use-case where the hop resource allocations sometimes failed. Or did I misunderstand the code?

/Thomas

Christian.

TTM_PL_FLAG_FORCE_MOVE and AMDGPU_PL_FLAG_TEMPORARY

Thanks,
/Thomas

Thomas

-Original Message-
From: Yu, Lang
Sent: Thursday, May 27, 2021 9:31 AM
To: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Cc: Koenig, Christian ; Huang, Ray ; Deucher, Alexander ; Yu, Lang
Subject: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY

Make TTM_PL_FLAG_* start from zero and add TTM_PL_FLAG_TEMPORARY flag for temporary GTT allocation use.

Signed-off-by: Lang Yu
---
 include/drm/ttm/ttm_placement.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index aa6ba4d0cf78..9f5cfc7c2d5a 100644
--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -47,8 +47,9 @@
 * top of the memory area, instead of the bottom.
 */

-#define TTM_PL_FLAG_CONTIGUOUS (1 << 19)
-#define TTM_PL_FLAG_TOPDOWN (1 << 22)
+#define TTM_PL_FLAG_CONTIGUOUS (1 << 0)
+#define TTM_PL_FLAG_TOPDOWN (1 << 1)
+#define TTM_PL_FLAG_TEMPORARY (1 << 2)

 /**
  * struct ttm_place
--
2.25.1
Re: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY
Hi, Lang,

On 5/31/21 10:19 AM, Yu, Lang wrote:

[Public]

Hi,

On 5/27/21 3:30 AM, Lang Yu wrote:

Make TTM_PL_FLAG_* start from zero and add TTM_PL_FLAG_TEMPORARY flag for temporary GTT allocation use.

GTT is a driver-private acronym, right? And it doesn't look like TTM_PL_FLAG_TEMPORARY will be used in core TTM, so should we instead set aside a mask in the PL flag for driver-private use?

Hi Thomas,

Thanks for your comments and advice. GTT means Graphics Translation Table here, which seems to be a general acronym. TTM_PL_FLAG_TEMPORARY may also be used by other drivers. I have made other patches for this. Please help review.

Regards,
Lang

My point was really that the flag naming and documentation should reflect what core ttm is doing with that flag. If there is no specific core TTM usage, IMO we should move it to a driver-specific flag to avoid future confusion. In particular, a writer of a generic TTM resource manager should be able to know, without looking at an old commit message, what the placement flag is intended for.

So here we add a flag where core TTM forces a bo move on validate, and that's it. And that appears to be how it's used when an amdgpu bo is evicted to GTT (btw, should it be accounted in this situation?). The other use is to force the amdgpu driver to temporarily accept it into GTT when there is a lack of space, and IMHO that's a driver-specific use and we should allocate a mask for driver-specific flags for that.

So shouldn't this be two flags, really? TTM_PL_FLAG_FORCE_MOVE and AMDGPU_PL_FLAG_TEMPORARY

Thanks,
/Thomas

Thomas

-Original Message-
From: Yu, Lang
Sent: Thursday, May 27, 2021 9:31 AM
To: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Cc: Koenig, Christian ; Huang, Ray ; Deucher, Alexander ; Yu, Lang
Subject: [PATCH 1/2] drm/ttm: cleanup and add TTM_PL_FLAG_TEMPORARY

Make TTM_PL_FLAG_* start from zero and add TTM_PL_FLAG_TEMPORARY flag for temporary GTT allocation use.
Signed-off-by: Lang Yu
---
 include/drm/ttm/ttm_placement.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index aa6ba4d0cf78..9f5cfc7c2d5a 100644
--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -47,8 +47,9 @@
 * top of the memory area, instead of the bottom.
 */

-#define TTM_PL_FLAG_CONTIGUOUS (1 << 19)
-#define TTM_PL_FLAG_TOPDOWN (1 << 22)
+#define TTM_PL_FLAG_CONTIGUOUS (1 << 0)
+#define TTM_PL_FLAG_TOPDOWN (1 << 1)
+#define TTM_PL_FLAG_TEMPORARY (1 << 2)

 /**
  * struct ttm_place
--
2.25.1
Re: [PATCH] drm/ttm: stop warning on TT shrinker failure
On 3/23/21 4:45 PM, Christian König wrote: Am 23.03.21 um 16:13 schrieb Michal Hocko: On Tue 23-03-21 14:56:54, Christian König wrote: Am 23.03.21 um 14:41 schrieb Michal Hocko: [...] Anyway, I am wondering whether the overall approach is sound. Why don't you simply use shmem as your backing storage from the beginning and pin those pages if they are used by the device? Yeah, that is exactly what the Intel guys are doing for their integrated GPUs :) Problem is for TTM I need to be able to handle dGPUs and those have all kinds of funny allocation restrictions. In other words I need to guarantee that the allocated memory is coherent accessible to the GPU without using SWIOTLB. The simple case is that the device can only do DMA32, but you also got device which can only do 40bits or 48bits. On top of that you also got AGP, CMA and stuff like CPU cache behavior changes (write back vs. write through, vs. uncached). OK, so the underlying problem seems to be that gfp mask (thus mapping_gfp_mask) cannot really reflect your requirements, right? Would it help if shmem would allow to provide an allocation callback to override alloc_page_vma which is used currently? I am pretty sure there will be more to handle but going through shmem for the whole life time is just so much easier to reason about than some tricks to abuse shmem just for the swapout path. Well it's a start, but the pages can have special CPU cache settings. So direct IO from/to them usually doesn't work as expected. Additional to that for AGP and CMA I need to make sure that I give those pages back to the relevant subsystems instead of just dropping the page reference. So I would need to block for the swapio to be completed. Anyway I probably need to revert those patches for now since this isn't working as we hoped it would. Thanks for the explanation how stuff works here. 
Another alternative here that I've tried before, without being successful, would perhaps be to drop shmem completely and, if it's a normal page (no dma or funny caching attributes), just use add_to_swap_cache()? If it's something else, try to alloc a page with the relevant gfp attributes, copy and add_to_swap_cache()? Or perhaps that doesn't work well from a shrinker either?

/Thomas

Christian.

___ dri-devel mailing list dri-de...@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone
On 3/4/21 6:58 PM, Felix Kuehling wrote: Am 2021-03-01 um 3:46 a.m. schrieb Thomas Hellström (Intel): On 3/1/21 9:32 AM, Daniel Vetter wrote: On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote: From: Philip Yang Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to allocate vram backing pages for page migration. Signed-off-by: Philip Yang Signed-off-by: Felix Kuehling So maybe I'm getting this all wrong, but I think that the current ttm fault code relies on devmap pte entries (especially for hugepte entries) to stop get_user_pages. But this only works if the pte happens to not point at a range with devmap pages. I don't think that's in TTM yet, but the proposed fix, yes (see email I just sent in another thread), but only for huge ptes. This patch here changes that, and so probably breaks this devmap pte hack ttm is using? If I'm not wrong here then I think we need to first fix up the ttm code to not use the devmap hack anymore, before a ttm based driver can register a dev_pagemap. Also adding Thomas since that just came up in another discussion. It doesn't break the ttm devmap hack per se, but it indeed allows gup to the range registered, but here's where my lack of understanding why we can't allow gup-ing TTM ptes if there indeed is a backing struct-page? Because registering MEMORY_DEVICE_PRIVATE implies that, right? I wasn't aware that TTM used devmap at all. If it does, what type of memory does it use? MEMORY_DEVICE_PRIVATE is like swapped out memory. It cannot be mapped in the CPU page table. GUP would cause a page fault to swap it back into system memory. We are looking into use MEMORY_DEVICE_GENERIC for a future coherent memory architecture, where device memory can be coherently accessed by the CPU and GPU. As I understand it, our DEVICE_PRIVATE registration is not tied to an actual physical address. Thus your devmap registration and our devmap registration could probably coexist without any conflict. 
You'll just have the overhead of two sets of struct pages for the same memory.

Regards,
Felix

Hi, Felix.

TTM doesn't use devmap yet, but we're thinking of using it for faking pmd_special(), which isn't available. That would mean pmd_devmap() + no_registered_dev_pagemap meaning special in the sense documented by vm_normal_page(). The implication here would be that if you register memory like above, TTM would never be able to set up a huge page table entry to it. But it sounds like that's not an issue?

/Thomas
Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone
On 3/1/21 9:58 AM, Daniel Vetter wrote: On Mon, Mar 01, 2021 at 09:46:44AM +0100, Thomas Hellström (Intel) wrote: On 3/1/21 9:32 AM, Daniel Vetter wrote: On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote: From: Philip Yang Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to allocate vram backing pages for page migration. Signed-off-by: Philip Yang Signed-off-by: Felix Kuehling So maybe I'm getting this all wrong, but I think that the current ttm fault code relies on devmap pte entries (especially for hugepte entries) to stop get_user_pages. But this only works if the pte happens to not point at a range with devmap pages. I don't think that's in TTM yet, but the proposed fix, yes (see email I just sent in another thread), but only for huge ptes. This patch here changes that, and so probably breaks this devmap pte hack ttm is using? If I'm not wrong here then I think we need to first fix up the ttm code to not use the devmap hack anymore, before a ttm based driver can register a dev_pagemap. Also adding Thomas since that just came up in another discussion. It doesn't break the ttm devmap hack per se, but it indeed allows gup to the range registered, but here's where my lack of understanding why we can't allow gup-ing TTM ptes if there indeed is a backing struct-page? Because registering MEMORY_DEVICE_PRIVATE implies that, right? We need to keep supporting buffer based memory management for all the non-compute users. Because those require end-of-batch dma_fence semantics, which prevents us from using gpu page faults, which makes hmm not really work. And for buffer based memory manager we can't have gup pin random pages in there, that's not really how it works. Worst case ttm just assumes it can actually move buffers and reallocate them as it sees fit, and your gup mapping (for direct i/o or whatever) now points at a page of a buffer that you don't even own anymore. That's not good. 
Hence also all the discussions about preventing gup for bo mappings in general. Once we throw hmm into the mix we need to be really careful that the two worlds don't collide. Pure hmm is fine, pure bo-managed memory is fine, mixing them is tricky.

-Daniel

Hmm, OK, so then registering MEMORY_DEVICE_PRIVATE means we can't set pxx_devmap, because that would allow gup, which, in turn, means no huge TTM ptes.

/Thomas
Re: [PATCH 17/35] drm/amdkfd: register HMM device private zone
On 3/1/21 9:32 AM, Daniel Vetter wrote: On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote: From: Philip Yang Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to allocate vram backing pages for page migration. Signed-off-by: Philip Yang Signed-off-by: Felix Kuehling So maybe I'm getting this all wrong, but I think that the current ttm fault code relies on devmap pte entries (especially for hugepte entries) to stop get_user_pages. But this only works if the pte happens to not point at a range with devmap pages. I don't think that's in TTM yet, but the proposed fix, yes (see email I just sent in another thread), but only for huge ptes. This patch here changes that, and so probably breaks this devmap pte hack ttm is using? If I'm not wrong here then I think we need to first fix up the ttm code to not use the devmap hack anymore, before a ttm based driver can register a dev_pagemap. Also adding Thomas since that just came up in another discussion. It doesn't break the ttm devmap hack per se, but it indeed allows gup to the range registered, but here's where my lack of understanding why we can't allow gup-ing TTM ptes if there indeed is a backing struct-page? Because registering MEMORY_DEVICE_PRIVATE implies that, right? /Thomas -Daniel ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
On 2020-07-22 16:23, Christian König wrote: On 22.07.20 16:07, Daniel Vetter wrote: On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel) wrote: On 2020-07-22 14:41, Daniel Vetter wrote:

I'm pretty sure there are more bugs, I just haven't heard from them yet. Also, due to the opt-in nature of dma-fence we can limit the scope of what we fix fairly naturally, just don't put them where no one cares :-) Of course that also hides general locking issues in dma_fence signalling code, but well, *shrug*.

Hmm, yes. Another potential big problem would be drivers that want to use gpu page faults in the dma-fence critical sections with the batch-based programming model.

Yeah, that's a massive can of worms. But luckily there's no such driver merged in upstream, so hopefully we can think about all the constraints and how to best annotate this before we land any code and have big regrets.

Do you want the bad news? I once made a prototype for that when Vega10 came out. But we abandoned this approach for the batch-based approach because of the horrible performance.

In context of the previous discussion I'd consider the fact that it's not performant in the batch-based model good news :)

Thomas

KFD is going to see that, but this is only with user queues and no dma_fence involved whatsoever.

Christian.

-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
On 2020-07-22 14:41, Daniel Vetter wrote: Ah I think I misunderstood which options you want to compare here. I'm not sure how much pain fixing up "dma-fence as memory fence" really is. That's kinda why I want a lot more testing on my annotation patches, to figure that out. Not much feedback aside from amdgpu and intel, and those two drivers pretty much need to sort out their memory fence issues anyway (because of userptr and stuff like that). The only other issues outside of these two drivers I'm aware of: - various scheduler drivers doing allocations in the drm/scheduler critical section. Since all arm-soc drivers have a mildly shoddy memory model of "we just pin everything" they don't really have to deal with this. So we might just declare arm as a platform broken and not taint the dma-fence critical sections with fs_reclaim. Otoh we need to fix this for drm/scheduler anyway, I think best option would be to have a mempool for hw fences in the scheduler itself, and at that point fixing the other drivers shouldn't be too onerous. - vmwgfx doing a dma_resv in the atomic commit tail. Entirely orthogonal to the entire memory fence discussion. With vmwgfx there is another issue that is hit when the gpu signals an error. At that point the batch might be restarted with a new meta command buffer that needs to be allocated out of a dma pool. in the fence critical section. That's probably a bit nasty to fix, but not impossible. I'm pretty sure there's more bugs, I just haven't heard from them yet. Also due to the opt-in nature of dma-fence we can limit the scope of what we fix fairly naturally, just don't put them where no one cares :-) Of course that also hides general locking issues in dma_fence signalling code, but well *shrug*. Hmm, yes. Another potential big problem would be drivers that want to use gpu page faults in the dma-fence critical sections with the batch-based programming model. 
/Thomas
Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
On 2020-07-22 13:39, Daniel Vetter wrote: On Wed, Jul 22, 2020 at 12:31 PM Thomas Hellström (Intel) wrote: On 2020-07-22 11:45, Daniel Vetter wrote: On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel) wrote: On 2020-07-22 09:11, Daniel Vetter wrote: On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel) wrote: On 2020-07-22 00:45, Dave Airlie wrote: On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel) wrote: On 7/21/20 9:45 AM, Christian König wrote: Am 21.07.20 um 09:41 schrieb Daniel Vetter: On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote: Hi, On 7/9/20 2:33 PM, Daniel Vetter wrote: Comes up every few years, gets somewhat tedious to discuss, let's write this down once and for all. What I'm not sure about is whether the text should be more explicit in flat out mandating the amdkfd eviction fences for long running compute workloads or workloads where userspace fencing is allowed. Although (in my humble opinion) it might be possible to completely untangle kernel-introduced fences for resource management and dma-fences used for completion- and dependency tracking and lift a lot of restrictions for the dma-fences, including prohibiting infinite ones, I think this makes sense describing the current state. Yeah I think a future patch needs to type up how we want to make that happen (for some cross driver consistency) and what needs to be considered. Some of the necessary parts are already there (with like the preemption fences amdkfd has as an example), but I think some clear docs on what's required from both hw, drivers and userspace would be really good. I'm currently writing that up, but probably still need a few days for this. Great! 
I put down some (very) initial thoughts a couple of weeks ago building on eviction fences for various hardware complexity levels here: https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt We are seeing HW that has recoverable GPU page faults but only for compute tasks, and scheduler without semaphores hw for graphics. So a single driver may have to expose both models to userspace and also introduces the problem of how to interoperate between the two models on one card. Dave. Hmm, yes. To begin with, it's important to note that this is not a replacement for new programming models or APIs. This is something that takes place internally in drivers to mitigate many of the restrictions that are currently imposed on dma-fence and documented in this and previous series. It's basically the driver-private narrow completions Jason suggested in the lockdep patches discussions implemented the same way as eviction-fences. The memory fence API would be local to helpers and middle-layers like TTM, and the corresponding drivers. The only cross-driver-like visibility would be that the dma-buf move_notify() callback would not be allowed to wait on dma-fences or something that depends on a dma-fence. Because we can't preempt (on some engines at least) we already have the requirement that cross driver buffer management can get stuck on a dma-fence. Not even taking into account the horrors we do with userptr, which are cross driver no matter what. Limiting move_notify to memory fences only doesn't work, since the pte clearing might need to wait for a dma_fence first. Hence this becomes a full end-of-batch fence, not just a limited kernel-internal memory fence. For non-preemptible hardware the memory fence typically *is* the end-of-batch fence. (Unless, as documented, there is a scheduler consuming sync-file dependencies in which case the memory fence wait needs to be able to break out of that). 
The key thing is not that we can break out of execution, but that we can break out of dependencies, since when we're executing all dependencies (modulo semaphores) are already fulfilled. That's what's eliminating the deadlocks. That's kinda why I think the only reasonable option is to toss in the towel and declare dma-fence to be the memory fence (and suck up all the consequences of that decision as uapi, which is kinda where we are), and construct something new free-wheeling for userspace fencing. But only for engines that allow enough preempt/gpu page faulting to make that possible. Free wheeling userspace fences/gpu semaphores or whatever you want to call them (on windows I think it's monitored fence) only work if you can preempt to decouple the memory fences from your gpu command execution. There's the in-between step of just decoupling the batchbuffer submission prep for hw without any preempt (but a scheduler), but that seems kinda pointless. Modern execbuf should be O(1) fastpath, with all the allocation/mapping work pulled out ahead. vk exposes that model directly to clients, GL drivers could use it internally too, so I see zero value in spending lots of time engineering very tricky kernel code just for old userspace. Much more reasonable to do that in userspace, where we have
Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
On 2020-07-22 11:45, Daniel Vetter wrote: On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel) wrote: On 2020-07-22 09:11, Daniel Vetter wrote: On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel) wrote: On 2020-07-22 00:45, Dave Airlie wrote: On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel) wrote: On 7/21/20 9:45 AM, Christian König wrote: Am 21.07.20 um 09:41 schrieb Daniel Vetter: On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote: Hi, On 7/9/20 2:33 PM, Daniel Vetter wrote: Comes up every few years, gets somewhat tedious to discuss, let's write this down once and for all. What I'm not sure about is whether the text should be more explicit in flat out mandating the amdkfd eviction fences for long running compute workloads or workloads where userspace fencing is allowed. Although (in my humble opinion) it might be possible to completely untangle kernel-introduced fences for resource management and dma-fences used for completion- and dependency tracking and lift a lot of restrictions for the dma-fences, including prohibiting infinite ones, I think this makes sense describing the current state. Yeah I think a future patch needs to type up how we want to make that happen (for some cross driver consistency) and what needs to be considered. Some of the necessary parts are already there (with like the preemption fences amdkfd has as an example), but I think some clear docs on what's required from both hw, drivers and userspace would be really good. I'm currently writing that up, but probably still need a few days for this. Great! I put down some (very) initial thoughts a couple of weeks ago building on eviction fences for various hardware complexity levels here: https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt We are seeing HW that has recoverable GPU page faults but only for compute tasks, and scheduler without semaphores hw for graphics. 
So a single driver may have to expose both models to userspace and also introduces the problem of how to interoperate between the two models on one card. Dave. Hmm, yes. To begin with, it's important to note that this is not a replacement for new programming models or APIs. This is something that takes place internally in drivers to mitigate many of the restrictions that are currently imposed on dma-fence and documented in this and previous series. It's basically the driver-private narrow completions Jason suggested in the lockdep patches discussions implemented the same way as eviction-fences. The memory fence API would be local to helpers and middle-layers like TTM, and the corresponding drivers. The only cross-driver-like visibility would be that the dma-buf move_notify() callback would not be allowed to wait on dma-fences or something that depends on a dma-fence. Because we can't preempt (on some engines at least) we already have the requirement that cross driver buffer management can get stuck on a dma-fence. Not even taking into account the horrors we do with userptr, which are cross driver no matter what. Limiting move_notify to memory fences only doesn't work, since the pte clearing might need to wait for a dma_fence first. Hence this becomes a full end-of-batch fence, not just a limited kernel-internal memory fence. For non-preemptible hardware the memory fence typically *is* the end-of-batch fence. (Unless, as documented, there is a scheduler consuming sync-file dependencies in which case the memory fence wait needs to be able to break out of that). The key thing is not that we can break out of execution, but that we can break out of dependencies, since when we're executing all dependencies (modulo semaphores) are already fulfilled. That's what's eliminating the deadlocks. 
That's kinda why I think the only reasonable option is to toss in the towel and declare dma-fence to be the memory fence (and suck up all the consequences of that decision as uapi, which is kinda where we are), and construct something new free-wheeling for userspace fencing. But only for engines that allow enough preempt/gpu page faulting to make that possible. Free wheeling userspace fences/gpu semaphores or whatever you want to call them (on windows I think it's monitored fence) only work if you can preempt to decouple the memory fences from your gpu command execution. There's the in-between step of just decoupling the batchbuffer submission prep for hw without any preempt (but a scheduler), but that seems kinda pointless. Modern execbuf should be O(1) fastpath, with all the allocation/mapping work pulled out ahead. vk exposes that model directly to clients, GL drivers could use it internally too, so I see zero value in spending lots of time engineering very tricky kernel code just for old userspace. Much more reasonable to do that in userspace, where we have real debuggers and no panics about security bugs (or well, a lot less, webgl is still a thing, but at least
Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
On 2020-07-22 09:11, Daniel Vetter wrote: On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel) wrote: On 2020-07-22 00:45, Dave Airlie wrote: On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel) wrote: On 7/21/20 9:45 AM, Christian König wrote: Am 21.07.20 um 09:41 schrieb Daniel Vetter: On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote: Hi, On 7/9/20 2:33 PM, Daniel Vetter wrote: Comes up every few years, gets somewhat tedious to discuss, let's write this down once and for all. What I'm not sure about is whether the text should be more explicit in flat out mandating the amdkfd eviction fences for long running compute workloads or workloads where userspace fencing is allowed. Although (in my humble opinion) it might be possible to completely untangle kernel-introduced fences for resource management and dma-fences used for completion- and dependency tracking and lift a lot of restrictions for the dma-fences, including prohibiting infinite ones, I think this makes sense describing the current state. Yeah I think a future patch needs to type up how we want to make that happen (for some cross driver consistency) and what needs to be considered. Some of the necessary parts are already there (with like the preemption fences amdkfd has as an example), but I think some clear docs on what's required from both hw, drivers and userspace would be really good. I'm currently writing that up, but probably still need a few days for this. Great! I put down some (very) initial thoughts a couple of weeks ago building on eviction fences for various hardware complexity levels here: https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt We are seeing HW that has recoverable GPU page faults but only for compute tasks, and scheduler without semaphores hw for graphics. 
So a single driver may have to expose both models to userspace and also introduces the problem of how to interoperate between the two models on one card. Dave. Hmm, yes. To begin with, it's important to note that this is not a replacement for new programming models or APIs. This is something that takes place internally in drivers to mitigate many of the restrictions that are currently imposed on dma-fence and documented in this and previous series. It's basically the driver-private narrow completions Jason suggested in the lockdep patches discussions implemented the same way as eviction-fences. The memory fence API would be local to helpers and middle-layers like TTM, and the corresponding drivers. The only cross-driver-like visibility would be that the dma-buf move_notify() callback would not be allowed to wait on dma-fences or something that depends on a dma-fence. Because we can't preempt (on some engines at least) we already have the requirement that cross driver buffer management can get stuck on a dma-fence. Not even taking into account the horrors we do with userptr, which are cross driver no matter what. Limiting move_notify to memory fences only doesn't work, since the pte clearing might need to wait for a dma_fence first. Hence this becomes a full end-of-batch fence, not just a limited kernel-internal memory fence. For non-preemptible hardware the memory fence typically *is* the end-of-batch fence. (Unless, as documented, there is a scheduler consuming sync-file dependencies in which case the memory fence wait needs to be able to break out of that). The key thing is not that we can break out of execution, but that we can break out of dependencies, since when we're executing all dependencies (modulo semaphores) are already fulfilled. That's what's eliminating the deadlocks. 
That's kinda why I think the only reasonable option is to toss in the towel and declare dma-fence to be the memory fence (and suck up all the consequences of that decision as uapi, which is kinda where we are), and construct something new free-wheeling for userspace fencing. But only for engines that allow enough preempt/gpu page faulting to make that possible. Free wheeling userspace fences/gpu semaphores or whatever you want to call them (on windows I think it's monitored fence) only work if you can preempt to decouple the memory fences from your gpu command execution. There's the in-between step of just decoupling the batchbuffer submission prep for hw without any preempt (but a scheduler), but that seems kinda pointless. Modern execbuf should be O(1) fastpath, with all the allocation/mapping work pulled out ahead. vk exposes that model directly to clients, GL drivers could use it internally too, so I see zero value in spending lots of time engineering very tricky kernel code just for old userspace. Much more reasonable to do that in userspace, where we have real debuggers and no panics about security bugs (or well, a lot less, webgl is still a thing, but at least browsers realized you need to contain that completely). Sure, it's definitely a big chunk of work
Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
On 2020-07-22 00:45, Dave Airlie wrote: On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel) wrote: On 7/21/20 9:45 AM, Christian König wrote: Am 21.07.20 um 09:41 schrieb Daniel Vetter: On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote: Hi, On 7/9/20 2:33 PM, Daniel Vetter wrote: Comes up every few years, gets somewhat tedious to discuss, let's write this down once and for all. What I'm not sure about is whether the text should be more explicit in flat out mandating the amdkfd eviction fences for long running compute workloads or workloads where userspace fencing is allowed. Although (in my humble opinion) it might be possible to completely untangle kernel-introduced fences for resource management and dma-fences used for completion- and dependency tracking and lift a lot of restrictions for the dma-fences, including prohibiting infinite ones, I think this makes sense describing the current state. Yeah I think a future patch needs to type up how we want to make that happen (for some cross driver consistency) and what needs to be considered. Some of the necessary parts are already there (with like the preemption fences amdkfd has as an example), but I think some clear docs on what's required from both hw, drivers and userspace would be really good. I'm currently writing that up, but probably still need a few days for this. Great! I put down some (very) initial thoughts a couple of weeks ago building on eviction fences for various hardware complexity levels here: https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt We are seeing HW that has recoverable GPU page faults but only for compute tasks, and scheduler without semaphores hw for graphics. So a single driver may have to expose both models to userspace and also introduces the problem of how to interoperate between the two models on one card. Dave. 
Hmm, yes. To begin with, it's important to note that this is not a replacement for new programming models or APIs. This is something that takes place internally in drivers to mitigate many of the restrictions that are currently imposed on dma-fence and documented in this and previous series. It's basically the driver-private narrow completions Jason suggested in the lockdep patches discussions implemented the same way as eviction-fences. The memory fence API would be local to helpers and middle-layers like TTM, and the corresponding drivers. The only cross-driver-like visibility would be that the dma-buf move_notify() callback would not be allowed to wait on dma-fences or something that depends on a dma-fence. So with that in mind, I don't foresee engines with different capabilities on the same card being a problem. /Thomas
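The move_notify() contract proposed above (a callback may wait on kernel-internal memory fences, but never on a dma-fence) can be modelled with a small self-contained stub. The types and function names below are illustrative stand-ins, not the real dma-buf API:

```c
#include <assert.h>
#include <stdbool.h>

enum fence_kind { DMA_FENCE, MEMORY_FENCE };

/* Illustrative fence: only carries the property we care about here. */
struct fence {
	enum fence_kind kind;
	bool signalled;
};

/* The rule being discussed: a move_notify() implementation may wait on
 * kernel-internal memory fences (eviction/preempt style), which are
 * guaranteed to signal without depending on userspace or allocation.
 * Waiting on a dma-fence would be illegal, because that fence's
 * signalling path may itself depend on memory management making
 * progress. Returns false when the wait would be illegal. */
static bool move_notify_wait(struct fence *f)
{
	if (f->kind == DMA_FENCE)
		return false; /* would deadlock the memory manager */

	/* ... block until the memory fence signals; no new deps ... */
	f->signalled = true;
	return true;
}
```

The stub only encodes the invariant; in a real driver the distinction would be enforced by which fence objects are ever reachable from the move_notify() path.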
Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
On 2020-07-21 15:59, Christian König wrote: Am 21.07.20 um 12:47 schrieb Thomas Hellström (Intel): ... Yes, we can't do magic. As soon as an indefinite batch makes it to such hardware we've lost. But since we can break out while the batch is stuck in the scheduler waiting, what I believe we *can* do with this approach is to avoid deadlocks due to locally unknown dependencies, which has some bearing on this documentation patch, and also to allow memory allocation in dma-fence (not memory-fence) critical sections, like gpu fault- and error handlers without resorting to using memory pools. Avoiding deadlocks is only the tip of the iceberg here. When you allow the kernel to depend on user space to proceed with some operation there are a lot more things which need consideration. E.g. what happens when a userspace process which has submitted stuff to the kernel is killed? Are the prepared commands sent to the hardware or aborted as well? What do we do with other processes waiting for that stuff? How do we do resource accounting? When processes need to block when submitting to the hardware stuff which is not ready we have a process we can punish for blocking resources. But how is kernel memory used for a submission accounted? How do we avoid denial of service attacks here where somebody eats up all memory by doing submissions which can't finish? Hmm. Are these problems really unique to user-space controlled dependencies? Couldn't you hit the same or similar problems with mis-behaving shaders blocking timeline progress? /Thomas
Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
On 7/21/20 11:50 AM, Daniel Vetter wrote: On Tue, Jul 21, 2020 at 11:38 AM Thomas Hellström (Intel) wrote: On 7/21/20 10:55 AM, Christian König wrote: Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel): On 7/21/20 9:45 AM, Christian König wrote: Am 21.07.20 um 09:41 schrieb Daniel Vetter: On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote: Hi, On 7/9/20 2:33 PM, Daniel Vetter wrote: Comes up every few years, gets somewhat tedious to discuss, let's write this down once and for all. What I'm not sure about is whether the text should be more explicit in flat out mandating the amdkfd eviction fences for long running compute workloads or workloads where userspace fencing is allowed. Although (in my humble opinion) it might be possible to completely untangle kernel-introduced fences for resource management and dma-fences used for completion- and dependency tracking and lift a lot of restrictions for the dma-fences, including prohibiting infinite ones, I think this makes sense describing the current state. Yeah I think a future patch needs to type up how we want to make that happen (for some cross driver consistency) and what needs to be considered. Some of the necessary parts are already there (with like the preemption fences amdkfd has as an example), but I think some clear docs on what's required from both hw, drivers and userspace would be really good. I'm currently writing that up, but probably still need a few days for this. Great! 
I put down some (very) initial thoughts a couple of weeks ago building on eviction fences for various hardware complexity levels here: https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt I don't think that this will ever be possible. What Daniel describes in his text is that indefinite fences are a bad idea for memory management, and I think that this is a fixed fact. In other words the whole concept of submitting work to the kernel which depends on some user space interaction doesn't work and never will. Well the idea here is that memory management will *never* depend on indefinite fences: As soon as someone waits on a memory manager fence (be it eviction, shrinker or mmu notifier) it breaks out of any dma-fence dependencies and/or user-space interaction. The text tries to describe what's required to be able to do that (save for non-preemptible gpus where someone submits a forever-running shader). Yeah I think that part of your text is good to describe how to untangle memory fences from synchronization fences given how much the hw can do. So while I think this is possible (until someone comes up with a case where it wouldn't work of course), I guess Daniel has a point in that it won't happen because of inertia and there might be better options. Yeah it's just I don't see much chance for splitting dma-fence itself. That's also why I'm not positive on the "no hw preemption, only scheduler" case: You still have a dma_fence for the batch itself, which means still no userspace controlled synchronization or other form of indefinite batches allowed. So not getting us any closer to enabling the compute use cases people want. 
Yes, we can't do magic. As soon as an indefinite batch makes it to such hardware we've lost. But since we can break out while the batch is stuck in the scheduler waiting, what I believe we *can* do with this approach is to avoid deadlocks due to locally unknown dependencies, which has some bearing on this documentation patch, and also to allow memory allocation in dma-fence (not memory-fence) critical sections, like gpu fault- and error handlers without resorting to using memory pools. But again. I'm not saying we should actually implement this. Better to consider it and reject it than not consider it at all. /Thomas
Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
On 7/21/20 10:55 AM, Christian König wrote: Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel): On 7/21/20 9:45 AM, Christian König wrote: Am 21.07.20 um 09:41 schrieb Daniel Vetter: On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote: Hi, On 7/9/20 2:33 PM, Daniel Vetter wrote: Comes up every few years, gets somewhat tedious to discuss, let's write this down once and for all. What I'm not sure about is whether the text should be more explicit in flat out mandating the amdkfd eviction fences for long running compute workloads or workloads where userspace fencing is allowed. Although (in my humble opinion) it might be possible to completely untangle kernel-introduced fences for resource management and dma-fences used for completion- and dependency tracking and lift a lot of restrictions for the dma-fences, including prohibiting infinite ones, I think this makes sense describing the current state. Yeah I think a future patch needs to type up how we want to make that happen (for some cross driver consistency) and what needs to be considered. Some of the necessary parts are already there (with like the preemption fences amdkfd has as an example), but I think some clear docs on what's required from both hw, drivers and userspace would be really good. I'm currently writing that up, but probably still need a few days for this. Great! I put down some (very) initial thoughts a couple of weeks ago building on eviction fences for various hardware complexity levels here: https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt I don't think that this will ever be possible. 
What Daniel describes in his text is that indefinite fences are a bad idea for memory management, and I think that this is a fixed fact. In other words the whole concept of submitting work to the kernel which depends on some user space interaction doesn't work and never will. Well the idea here is that memory management will *never* depend on indefinite fences: As soon as someone waits on a memory manager fence (be it eviction, shrinker or mmu notifier) it breaks out of any dma-fence dependencies and/or user-space interaction. The text tries to describe what's required to be able to do that (save for non-preemptible gpus where someone submits a forever-running shader). So while I think this is possible (until someone comes up with a case where it wouldn't work of course), I guess Daniel has a point in that it won't happen because of inertia and there might be better options. /Thomas
Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
On 7/21/20 9:45 AM, Christian König wrote: Am 21.07.20 um 09:41 schrieb Daniel Vetter: On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote: Hi, On 7/9/20 2:33 PM, Daniel Vetter wrote: Comes up every few years, gets somewhat tedious to discuss, let's write this down once and for all. What I'm not sure about is whether the text should be more explicit in flat out mandating the amdkfd eviction fences for long running compute workloads or workloads where userspace fencing is allowed. Although (in my humble opinion) it might be possible to completely untangle kernel-introduced fences for resource management and dma-fences used for completion- and dependency tracking and lift a lot of restrictions for the dma-fences, including prohibiting infinite ones, I think this makes sense describing the current state. Yeah I think a future patch needs to type up how we want to make that happen (for some cross driver consistency) and what needs to be considered. Some of the necessary parts are already there (with like the preemption fences amdkfd has as an example), but I think some clear docs on what's required from both hw, drivers and userspace would be really good. I'm currently writing that up, but probably still need a few days for this. Great! I put down some (very) initial thoughts a couple of weeks ago building on eviction fences for various hardware complexity levels here: https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt /Thomas
Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
Hi, On 7/9/20 2:33 PM, Daniel Vetter wrote: Comes up every few years, gets somewhat tedious to discuss, let's write this down once and for all. What I'm not sure about is whether the text should be more explicit in flat out mandating the amdkfd eviction fences for long running compute workloads or workloads where userspace fencing is allowed. Although (in my humble opinion) it might be possible to completely untangle kernel-introduced fences for resource management and dma-fences used for completion- and dependency tracking and lift a lot of restrictions for the dma-fences, including prohibiting infinite ones, I think this makes sense describing the current state. Reviewed-by: Thomas Hellstrom
Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations
On 6/4/20 10:12 AM, Daniel Vetter wrote: Two in one go:
- it is allowed to call dma_fence_wait() while holding a dma_resv_lock(). This is fundamental to how eviction works with ttm, so required.
- it is allowed to call dma_fence_wait() from memory reclaim contexts, specifically from shrinker callbacks (which i915 does), and from mmu notifier callbacks (which amdgpu does, and which i915 sometimes also does, and probably always should, but that's kinda a debate). Also for stuff like HMM we really need to be able to do this, or things get real dicey.
Consequence is that any critical path necessary to get to a dma_fence_signal for a fence must never a) call dma_resv_lock nor b) allocate memory with GFP_KERNEL. Also by implication of dma_resv_lock(), no userspace faulting allowed. That's some supremely obnoxious limitations, which is why we need to sprinkle the right annotations to all relevant paths. The one big locking context we're leaving out here is mmu notifiers, added in

commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter
Date: Mon Aug 26 22:14:21 2019 +0200

    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end

That one covers a lot of other callsites, and it's also allowed to wait on dma-fences from mmu notifiers. But there's no ready-made functions exposed to prime this, so I've left it out for now. v2: Also track against mmu notifier context. v3: kerneldoc to spec the cross-driver contract. Note that currently i915 throws in a hard-coded 10s timeout on foreign fences (not sure why that was done, but it's there), which is why that rule is worded with SHOULD instead of MUST. Also some of the mmu_notifier/shrinker rules might surprise SoC drivers, I haven't fully audited them all. Which is infeasible anyway, we'll need to run them with lockdep and dma-fence annotations and see what goes boom. 
v4: A spelling fix from Mika

Cc: Mika Kuoppala
Cc: Thomas Hellstrom
Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson
Cc: Maarten Lankhorst
Cc: Christian König
Signed-off-by: Daniel Vetter
---
 Documentation/driver-api/dma-buf.rst |  6
 drivers/dma-buf/dma-fence.c          | 41
 drivers/dma-buf/dma-resv.c           |  4 +++
 include/linux/dma-fence.h            |  1 +
 4 files changed, 52 insertions(+)

I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that). But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this, Reviewed-by: Thomas Hellström
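The contract this patch annotates (no dma_resv_lock and no GFP_KERNEL allocation on any path leading to dma_fence_signal()) is what the in-tree dma_fence_begin_signalling()/dma_fence_end_signalling() pair lets lockdep check. The stub below only models the cookie/nesting behaviour in plain userspace C to show where a driver places the calls; the real functions prime lockdep against fs_reclaim and dma_resv_lock and cannot be reimplemented outside the kernel:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins for the kernel's signalling annotations: here they
 * only track nesting depth, whereas the real ones acquire/release a
 * lockdep map so illegal locks and allocations inside are reported. */
static int signalling_depth;

static bool dma_fence_begin_signalling(void)
{
	/* Cookie tells end_signalling whether we were the outermost entry. */
	return signalling_depth++ == 0;
}

static void dma_fence_end_signalling(bool cookie)
{
	if (cookie)
		signalling_depth = 0;
	else
		signalling_depth--;
}

/* Shape of a driver submission path: everything between publishing the
 * fence and signalling it is a critical section that must not take
 * dma_resv_lock, allocate with GFP_KERNEL, or fault in userspace pages. */
static int submit_job(void)
{
	bool cookie = dma_fence_begin_signalling();

	/* ... write the job to the hw ring; no blocking allocations ... */

	dma_fence_end_signalling(cookie);
	return 0;
}
```

Placing the markers this way is what makes "what goes boom" discoverable by lockdep instead of by deadlock reports from the field.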
Re: [PATCH 1/6] drm/ttm: Add unampping of the entire device address space
On 6/10/20 11:19 PM, Andrey Grodzovsky wrote: On 6/10/20 4:30 PM, Thomas Hellström (Intel) wrote: On 6/10/20 5:30 PM, Daniel Vetter wrote: On Wed, Jun 10, 2020 at 04:05:04PM +0200, Christian König wrote: Am 10.06.20 um 15:54 schrieb Andrey Grodzovsky: On 6/10/20 6:15 AM, Thomas Hellström (Intel) wrote: On 6/9/20 7:21 PM, Koenig, Christian wrote: Am 09.06.2020 18:37 schrieb "Grodzovsky, Andrey": On 6/5/20 2:40 PM, Christian König wrote:
> Am 05.06.20 um 16:29 schrieb Andrey Grodzovsky:
>>
>> On 5/11/20 2:45 AM, Christian König wrote:
>>> Am 09.05.20 um 20:51 schrieb Andrey Grodzovsky:
>>>> Signed-off-by: Andrey Grodzovsky
>>>> ---
>>>>  drivers/gpu/drm/ttm/ttm_bo.c    | 22 +-
>>>>  include/drm/ttm/ttm_bo_driver.h |  2 ++
>>>>  2 files changed, 23 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> index c5b516f..eae61cc 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> @@ -1750,9 +1750,29 @@ void ttm_bo_unmap_virtual(struct ttm_buffer_object *bo)
>>>>  	ttm_bo_unmap_virtual_locked(bo);
>>>>  	ttm_mem_io_unlock(man);
>>>>  }
>>>> +EXPORT_SYMBOL(ttm_bo_unmap_virtual);
>>>> +void ttm_bo_unmap_virtual_address_space(struct ttm_bo_device *bdev)
>>>> +{
>>>> +	struct ttm_mem_type_manager *man;
>>>> +	int i;
>>>> -EXPORT_SYMBOL(ttm_bo_unmap_virtual);
>>>
>>>> +	for (i = 0; i < TTM_NUM_MEM_TYPES; i++) {
>>>> +		man = &bdev->man[i];
>>>> +		if (man->has_type && man->use_type)
>>>> +			ttm_mem_io_lock(man, false);
>>>> +	}
>>>
>>> You should drop that it will just result in a deadlock warning for Nouveau and has no effect at all.
>>>
>>> Apart from that looks good to me,
>>> Christian.
>>
>> As I am considering to re-include this in V2 of the patchsets, can you clarify please why this will have no effect at all ?
>
> The locks are exclusive for Nouveau to allocate/free the io address space.
>
> Since we don't do this here we don't need the locks.
>
> Christian.
So basically calling unmap_mapping_range doesn't require any extra locking around it, and whatever locks are taken within the function should be enough? I think so, yes. Christian. Yes, that's true. However, without the bo reservation, nothing stops a PTE from being immediately re-faulted back again, even while unmap_mapping_range() is running. Can you explain more on this - specifically, which function reserves the BO, and why BO reservation would prevent a re-fault of the PTE? Thomas is talking about ttm_bo_reserve()/ttm_bo_unreserve(), but we don't need this because we unmap everything - the whole device is gone - and don't just manipulate a single BO. So the device removed flag needs to be advertised before this function is run. I indeed intend to call this right after calling drm_dev_unplug from amdgpu_pci_remove, while adding drm_dev_enter/exit in ttm_bo_vm_fault (or in an amdgpu-specific wrapper, since I don't see how I can access struct drm_device from ttm_bo_vm_fault), and this in my understanding should stop a PTE from being re-faulted back, as you pointed out - so again I don't see how bo reservation would prevent it, so it looks like I am missing something... (perhaps with a memory barrier pair). drm_dev_unplug and drm_dev_enter/exit are RCU synchronized, and so I don't think they require any extra memory barriers for visibility of the removed flag being set. As far as I can see that should be perfectly sufficient. Only if you have a drm_dev_enter/exit pair in your fault handler. Otherwise you're still open to the races Thomas described. But aside from that, the drm_dev_unplug stuff has all the barriers and stuff to make sure nothing escapes. Failure to drm_dev_enter could then also trigger the special case where we put a dummy page in place. -Daniel Hmm, yes, indeed advertising the flag before the call to unmap_mapping_range isn't enough, since there might be fault handlers running that haven't picked up the flag when unmap_mapping_range is launched.
If you mean those fault handlers that were in progress when the flag (drm_dev_unplug) was set in amdgpu_pci_remove, then as long as I wr
Re: [PATCH 1/6] drm/ttm: Add unampping of the entire device address space
On 6/10/20 5:30 PM, Daniel Vetter wrote: [SNIP]
For the special case of syncing a full address-space unmap_mapping_range() with the fault handlers, regardless of the reason for that full address-space unmap_mapping_range(), one could either traverse the address space (drm_vma_manager) and grab *all* bo reservations around the unmap_mapping_range(), or grab the i_mmap_lock in read mode in the fault handler (it's taken in write mode in unmap_mapping_range()). While the latter may seem like a simple solution, one should probably consider the overhead, both in run-time and in scaling ability. /Thomas ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
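The ordering Andrey and Daniel discuss above - advertise the removed flag first, then rely on a drm_dev_enter()/drm_dev_exit() pair in the fault handler - can be sketched as a small userspace model. The real primitives are SRCU based; here the flag is a plain release/acquire atomic, and all names (dev_enter, fault_handler, pci_remove) are illustrative stand-ins, not the kernel API.

```c
/* Userspace model of the unplug vs. fault-handler ordering discussed
 * above.  The real kernel code uses drm_dev_unplug()/drm_dev_enter(),
 * which are SRCU based; an atomic flag is used here purely for
 * illustration. */
#include <stdatomic.h>
#include <stdbool.h>
#include <errno.h>
#include <assert.h>

static atomic_bool unplugged;

/* Model of drm_dev_enter(): returns true while the device is still there. */
static bool dev_enter(void)
{
	return !atomic_load_explicit(&unplugged, memory_order_acquire);
}

static void dev_exit(void)
{
	/* SRCU read unlock in the real code. */
}

/* Model of a ttm_bo_vm_fault()-style handler. */
static int fault_handler(void)
{
	if (!dev_enter())
		return -ENODEV;	/* or: insert the dummy page instead */
	/* ... set up the PTE against real device memory ... */
	dev_exit();
	return 0;
}

/* Model of amdgpu_pci_remove(): advertise the flag, then unmap. */
static void pci_remove(void)
{
	atomic_store_explicit(&unplugged, true, memory_order_release);
	/* unmap_mapping_range() would run here; new faults now bail out. */
}
```

A fault before pci_remove() succeeds; one afterwards bails out with -ENODEV, modeling the dummy-page/SIGBUS special case Daniel mentions.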
Re: [PATCH 1/6] drm/ttm: Add unampping of the entire device address space
On 6/10/20 3:54 PM, Andrey Grodzovsky wrote: [SNIP] So basically calling unmap_mapping_range doesn't require any extra locking around it and whatever locks are taken within the function should be enough ? I think so, yes. Christian. Yes, that's true.
However, without the bo reservation, nothing stops a PTE from being immediately re-faulted back again, even while unmap_mapping_range() is running. Can you explain more on this - specifically, which function reserves the BO, and why BO reservation would prevent a re-fault of the PTE? The bo reservation is taken in the TTM fault handler and temporarily blocks inserting a new PTE. So typically the bo reservation is held around unmap_mapping_range() and the buffer object operation that triggered it (typically a move or change of backing store). /Thomas
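Thomas's point - the reservation serializes the fault handler against the move - can be shown with a single-threaded sketch. The flag-based bo_reserve()/bo_unreserve() below merely stand in for the real ttm_bo_reserve()/ttm_bo_unreserve(); everything here is an illustration, not TTM code.

```c
/* Single-threaded model of the locking rule above: the bo reservation
 * is held both by the fault handler while it inserts a PTE and by the
 * eviction path around unmap_mapping_range(), so the two can never
 * interleave. */
#include <assert.h>
#include <stdbool.h>

struct bo {
	bool reserved;	/* models the reservation lock */
	bool mapped;	/* models a PTE pointing at the bo's pages */
};

static void bo_reserve(struct bo *bo)
{
	assert(!bo->reserved);	/* a real lock would block here instead */
	bo->reserved = true;
}

static void bo_unreserve(struct bo *bo)
{
	bo->reserved = false;
}

/* Fault handler: may only insert a PTE with the reservation held. */
static void fault_handler(struct bo *bo)
{
	bo_reserve(bo);
	bo->mapped = true;
	bo_unreserve(bo);
}

/* Move/eviction path: unmap and switch backing store under the same lock. */
static void evict(struct bo *bo)
{
	bo_reserve(bo);
	bo->mapped = false;	/* unmap_mapping_range() for this bo */
	/* ... move the backing store; concurrent faults would block ... */
	bo_unreserve(bo);
}
```

Because both paths take the same reservation, a fault racing with an eviction either completes before the unmap or sees the new backing store afterwards, never a stale PTE.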
Re: [PATCH 02/18] dma-buf: minor doc touch-ups
On 6/4/20 10:12 AM, Daniel Vetter wrote: Just some tiny edits: - fix link to struct dma_fence - give slightly more meaningful title - the polling here is about implicit fences, explicit fences (in sync_file or drm_syncobj) also have their own polling Signed-off-by: Daniel Vetter Reviewed-by: Thomas Hellstrom
Re: [PATCH 01/18] mm: Track mmu notifiers in fs_reclaim_acquire/release
Hi, Daniel, Please see below. On 6/4/20 10:12 AM, Daniel Vetter wrote: fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end"). But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means really only GFP_ATOMIC is a safe choice; GFP_NOIO isn't good enough to avoid potential mmu notifier recursion. I was pondering whether we should just do the general annotation, but there's always the risk of false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decided to only annotate for that specific case. Furthermore, even if we'd create a lockdep map for direct reclaim, we'd still need to explicitly pull in the mmu notifier map - there are a lot more places that do pte invalidation than just direct reclaim; these two contexts aren't the same. Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquisition in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places, since they're called from many other places than page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map. With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
Cc: Andrew Morton Cc: Jason Gunthorpe Cc: linux...@kvack.org Cc: linux-r...@vger.kernel.org Cc: Maarten Lankhorst Cc: Christian König Signed-off-by: Daniel Vetter
---
This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel
---
mm/mmu_notifier.c | 7 ---
mm/page_alloc.c | 23 ++-
2 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 06852b896fa6..5d578b9122f8 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	lockdep_assert_held_write(&mm->mmap_sem);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
-	if (IS_ENABLED(CONFIG_LOCKDEP)) {
-		fs_reclaim_acquire(GFP_KERNEL);
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-		fs_reclaim_release(GFP_KERNEL);
-	}
-
 	if (!mm->notifier_subscriptions) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..f8a222db4a53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -57,6 +57,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 static struct lockdep_map __fs_reclaim_map =
 	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
-static bool __need_fs_reclaim(gfp_t gfp_mask)
+static bool __need_reclaim(gfp_t gfp_mask)
 {
 	gfp_mask = current_gfp_context(gfp_mask);
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 	if (current->flags & PF_MEMALLOC)
 		return false;
-	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
-		return false;
-
 	if (gfp_mask & __GFP_NOLOCKDEP)
 		return false;
@@ -4158,15 +4155,23 @@ void __fs_reclaim_release(void)
 void fs_reclaim_acquire(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_acquire();
+	if (__need_reclaim(gfp_mask)) {
+		if (!(gfp_mask & __GFP_FS))

Hmm. Shouldn't this be "if (gfp_mask & __GFP_FS)" or am I misunderstanding?

+			__fs_reclaim_acquire();

#ifdef CONFIG_MMU_NOTIFIER?

+
+		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_acquire);

 void fs_reclaim_release(gfp_t gfp_mask)
 {
-	if
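Reduced to its control flow, the hunk Thomas comments on is meant to work as below. This is a userspace model with made-up bit values, written with the test un-negated as Thomas suggests; the real function also has the PF_MEMALLOC and __GFP_NOLOCKDEP early-outs shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;
#define __GFP_RECLAIM	(1u << 0)	/* illustrative bit values, not the kernel's */
#define __GFP_FS	(1u << 1)

static bool fs_map_acquired;
static bool mmu_map_acquired;

/* Model of __need_reclaim(): any reclaim-capable allocation qualifies. */
static bool need_reclaim(gfp_t gfp_mask)
{
	return (gfp_mask & __GFP_RECLAIM) != 0;
}

/* Model of the intended fs_reclaim_acquire() logic: the mmu notifier
 * map is primed for every reclaim-capable allocation, the fs_reclaim
 * map only when __GFP_FS allows entering fs reclaim. */
static void fs_reclaim_acquire(gfp_t gfp_mask)
{
	if (need_reclaim(gfp_mask)) {
		if (gfp_mask & __GFP_FS)	/* the test Thomas flags above */
			fs_map_acquired = true;
		/* lock_map_acquire/release pair in the real code */
		mmu_map_acquired = true;
	}
}
```

With the negated test from the patch, a GFP_KERNEL allocation would skip the fs_reclaim map and a GFP_NOFS one would take it, which is the inversion Thomas is pointing at.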
Re: [PATCH 5/6] drm/ttm: Add destroy flag in TTM BO eviction interface
On 5/9/20 8:51 PM, Andrey Grodzovsky wrote: This will allow to invalidate, destroy backing storage and notify users of BOs when device is unpluged. Signed-off-by: Andrey Grodzovsky Please add a motivation in the commit message and use imperative wording ("Allow to invalidate..." instead of "This will allow to") s/unpluged/unplugged/
---
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_drm.c | 2 +-
drivers/gpu/drm/qxl/qxl_object.c | 4 +--
drivers/gpu/drm/radeon/radeon_object.c | 2 +-
drivers/gpu/drm/ttm/ttm_bo.c | 41 ++---
drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 6 ++---
include/drm/ttm/ttm_bo_api.h | 2 +-
8 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
index b9bc1b0..9d57b8c 100644
--- a/include/drm/ttm/ttm_bo_api.h
+++ b/include/drm/ttm/ttm_bo_api.h
@@ -597,7 +597,7 @@ int ttm_bo_clean_mm(struct ttm_bo_device *bdev, unsigned mem_type);
 * -ERESTARTSYS: The call was interrupted by a signal while waiting to
 * evict a buffer.
 */

Please also update the function documentation.

-int ttm_bo_evict_mm(struct ttm_bo_device *bdev, unsigned mem_type);
+int ttm_bo_evict_mm(struct ttm_bo_device *bdev, unsigned mem_type, bool destroy);
 /**
 * ttm_kmap_obj_virtual

Thanks, Thomas
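For the kerneldoc update requested above, something along these lines could work; the @destroy wording is assumed for illustration, not taken from the patch:

```c
/**
 * ttm_bo_evict_mm - evict all buffers of a particular memory type
 * @bdev: the ttm device
 * @mem_type: the memory type to evict from
 * @destroy: if true, also invalidate and destroy the backing storage
 *           of the evicted buffers, e.g. when the device has been
 *           unplugged (assumed wording)
 *
 * Returns:
 * -ERESTARTSYS: The call was interrupted by a signal while waiting to
 * evict a buffer.
 */
int ttm_bo_evict_mm(struct ttm_bo_device *bdev, unsigned mem_type, bool destroy);
```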
Re: [PATCH 1/6] drm/ttm: Add unampping of the entire device address space
On 6/9/20 7:21 PM, Koenig, Christian wrote: [SNIP] So basically calling unmap_mapping_range doesn't require any extra locking around it and whatever locks are taken within the function should be enough ? I think so, yes. Christian. Yes, that's true. However, without the bo reservation, nothing stops a PTE from being immediately re-faulted back again. Even while unmap_mapping_range() is running.
So the device removed flag needs to be advertised before this function is run (perhaps with a memory barrier pair). That should probably be added to the function documentation. (Other than that, please add a commit message if respinning.) /Thomas
Re: [PATCH] dma-fence: basic lockdep annotations
critical for reaching completion of fences. One example would be a scheduler kthread which picks up jobs and pushes them into hardware, where the interrupt handler or another completion thread calls dma_fence_signal(). But if the scheduler thread hangs, then all the fences hang, hence we need to manually annotate it. cross-release aimed to solve this by chaining cross-release dependencies, but the dependency from the scheduler thread to the completion interrupt handler goes through hw where cross-release code can't observe it. In short, without manual annotations and careful review of the start and end of critical sections, cross-release dependency tracking doesn't work. We need explicit annotations. v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise. v3: Kerneldoc. v4: Some spelling fixes from Mika v5: Amend commit message to explain in detail why cross-release isn't the solution. Cc: Mika Kuoppala Cc: Thomas Hellstrom Cc: linux-me...@vger.kernel.org Cc: linaro-mm-...@lists.linaro.org Cc: linux-r...@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-...@lists.freedesktop.org Cc: Chris Wilson Cc: Maarten Lankhorst Cc: Christian König Signed-off-by: Daniel Vetter --- Reviewed-by: Thomas Hellström
Re: [PATCH 03/18] dma-fence: basic lockdep annotations
On 6/4/20 10:12 AM, Daniel Vetter wrote: ...
Thread A:
	mutex_lock(A);
	mutex_unlock(A);
	dma_fence_signal();

Thread B:
	mutex_lock(A);
	dma_fence_wait();
	mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A. Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly to avoid false positives. Just realized, isn't that example actually a true positive, or at least a great candidate for a true positive, since if another thread reenters that signaling path, it will block on that mutex, and the fence would never be signaled unless there is another signaling path? Although I agree the conclusion is sound: these annotations cannot be sprinkled mindlessly over the code. /Thomas v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise. v3: Kerneldoc.
v4: Some spelling fixes from Mika Cc: Mika Kuoppala Cc: Thomas Hellstrom Cc: linux-me...@vger.kernel.org Cc: linaro-mm-...@lists.linaro.org Cc: linux-r...@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-...@lists.freedesktop.org Cc: Chris Wilson Cc: Maarten Lankhorst Cc: Christian König Signed-off-by: Daniel Vetter
---
Documentation/driver-api/dma-buf.rst | 12 +-
drivers/dma-buf/dma-fence.c | 161 +++
include/linux/dma-fence.h | 12 ++
3 files changed, 182 insertions(+), 3 deletions(-)

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 63dec76d1d8d..05d856131140 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects
 .. kernel-doc:: drivers/dma-buf/dma-buf.c
    :doc: cpu access
-Fence Poll Support
-~~
+Implicit Fence Poll Support
+~~~
 .. kernel-doc:: drivers/dma-buf/dma-buf.c
-   :doc: fence polling
+   :doc: implicit fence polling
 Kernel Functions and Structures Reference
 ~
@@ -133,6 +133,12 @@ DMA Fences
 .. kernel-doc:: drivers/dma-buf/dma-fence.c
    :doc: DMA fences overview
+DMA Fence Signalling Annotations
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: fence signalling annotation
+
 DMA Fences Functions Reference
 ~~

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 656e9ac2d028..0005bc002529 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
+/**
+ * DOC: fence signalling annotation
+ *
+ * Proving correctness of all the kernel code around &dma_fence through code
+ * review and testing is tricky for a few reasons:
+ *
+ * * It is a cross-driver contract, and therefore all drivers must follow the
+ * same rules for lock nesting order, calling contexts for various functions
+ * and anything else significant for in-kernel interfaces. But it is also
+ * impossible to test all drivers in a single machine, hence brute-force N vs.
+ * N testing of all combinations is impossible. Even just limiting to the
+ * possible combinations is infeasible.
+ *
+ * * There is an enormous amount of driver code involved. For render drivers
+ * there's the tail of command submission, after fences are published,
+ * scheduler code, interrupt and workers to process job completion,
+ * and timeout, gpu reset and gpu hang recovery code. Plus for integration
+ * with core mm we have &mmu_notifier, respectively &mmu_interval_notifier.
+ * For modesetting drivers there's the commit tail functions
+ * between when fences for an atomic modeset are published, and when the
+ * corresponding vblank completes, including any interrupt processing and
+ * related workers. Auditing all that code, across all drivers, is not
+ * feasible.
+ *
+ * * Due to how many other subsystems are involved and the locking hierarchies
+ * this pulls in there is extremely thin wiggle-room for driver-specific
+ * differences. &dma_fence interacts with almost all of the core memory
+ * handling through page fault handlers via &dma_resv, dma_resv_lock() and
+ * dma_resv_unlock(). On the other side it also interacts through all
+ * allocation sites through &mmu_notifier
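The nesting rule discussed in this thread - dma_fence_wait() may nest inside a signalling section, but not while another lock taken inside that section is still held - can be captured in a toy checker. All names and the splat flag are illustrative; real lockdep derives this from its read/write lock maps.

```c
/* Toy model of the lockdep rule: taking a lock inside a
 * begin/end_signalling section and then calling the wait while the
 * lock is still held is reported as a (would-be) deadlock. */
#include <assert.h>
#include <stdbool.h>

static int in_signalling;	/* nesting depth of begin/end sections */
static int locks_held;		/* locks taken while inside a section  */
static bool lockdep_splat;

static void fence_begin_signalling(void) { in_signalling++; }
static void fence_end_signalling(void)   { in_signalling--; }

static void my_lock(void)
{
	if (in_signalling)
		locks_held++;
}

static void my_unlock(void)
{
	if (locks_held)
		locks_held--;
}

/* The wait may nest inside a signalling section, but not while another
 * lock acquired inside the section is still held. */
static void fence_wait(void)
{
	if (in_signalling && locks_held)
		lockdep_splat = true;	/* lockdep would complain here */
}
```

A bare wait inside a section passes; the same wait with a section-local lock still held trips the model's splat, matching the complaint described in the commit message.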
Re: [RFC 02/17] dma-fence: basic lockdep annotations
On 2020-05-12 10:59, Daniel Vetter wrote: Design is similar to the lockdep annotations for workers, but with some twists: - We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only. - We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f
Author: Peter Zijlstra
Date: Wed Aug 23 13:13:11 2017 +0200
    locking/lockdep/selftests: Add mixed read-write ABBA tests
- To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that. - The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default. - To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be spinlocks. That bollocks. The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts. The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling.
The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code. The main class of deadlocks this is supposed to catch are:
Thread A:
	mutex_lock(A);
	mutex_unlock(A);
	dma_fence_signal();

Thread B:
	mutex_lock(A);
	dma_fence_wait();
	mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A. Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly to avoid false positives. v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise. Cc: linux-me...@vger.kernel.org Cc: linaro-mm-...@lists.linaro.org Cc: linux-r...@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-...@lists.freedesktop.org Cc: Chris Wilson Cc: Maarten Lankhorst Cc: Christian König Signed-off-by: Daniel Vetter LGTM. Perhaps add some in-code documentation on how the new functions are meant to be called.
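As a first stab at the in-code usage documentation Thomas asks for, the call pattern could be sketched like this. The stubs only mirror the shape of dma_fence_begin_signalling() (which per the patch returns a cookie) and dma_fence_end_signalling(); they are placeholders for illustration, not the kernel functions.

```c
/* Sketch of the intended call pattern for the new annotations: the
 * whole stretch from publishing a fence to signalling it is wrapped in
 * a begin/end_signalling section.  All functions here are stubs that
 * only model the shape of the API. */
#include <assert.h>
#include <stdbool.h>

static bool fence_begin_signalling(void)
{
	return true;	/* the cookie handed back to end_signalling() */
}

static void fence_end_signalling(bool cookie)
{
	(void)cookie;
}

static void fence_signal(int *fence)
{
	*fence = 1;	/* stands in for dma_fence_signal() */
}

/* Typical annotated completion path: everything between publishing the
 * fence and signalling it is critical for completion, so it sits
 * inside the section. */
static int run_job(void)
{
	int fence = 0;
	bool cookie = fence_begin_signalling();

	/* ... push the job to hardware, process the completion ... */
	fence_signal(&fence);
	fence_end_signalling(cookie);
	return fence;
}
```

The point of the pattern is simply that any lock taken between begin and signal becomes part of the audited signalling path.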
Otherwise for patch 2 and 3, Reviewed-by: Thomas Hellstrom