[PATCH] drm/i915: 2 GiB of relocations ought to be enough for anybody*
From: Tvrtko Ursulin

The kernel test robot reports that i915 can hit a warning in
kvmalloc_node, whose purpose is to disallow crazy-sized kernel
allocations. The check was added in 7661809d493b ("mm: don't allow
oversized kvmalloc() calls"):

	/* Don't even allow crazy sizes */
	if (WARN_ON_ONCE(size > INT_MAX))
		return NULL;

This would be kind of okay since i915 at one point dropped the need for
making a shadow copy of the relocation list, but then it got re-added in
fd1500fcd442 ("Revert "drm/i915/gem: Drop relocation slowpath".") a year
after Linus added the above warning.

It is plausible that the issue was not seen until now because triggering
it with the gem_exec_reloc test requires a combination of relatively old
generation hardware with at least 8GiB of RAM installed. Probably even
more, depending on runtime checks.

Let's cap what we allow userspace to pass in using the matching limit.
There should be no issue for real userspace since we are talking about a
"crazy" number of relocations which have no practical purpose.

*) Well IGT tests might get upset but they can be easily adjusted.
Signed-off-by: Tvrtko Ursulin
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-lkp/202405151008.6ddd1aaf-oliver.s...@intel.com
Cc: Kees Cook
Cc: Kent Overstreet
Cc: Joonas Lahtinen
Cc: Rodrigo Vivi
---
 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index d3a771afb083..4b34bf4fde77 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1533,7 +1533,7 @@ static int eb_relocate_vma(struct i915_execbuffer *eb, struct eb_vma *ev)
 		u64_to_user_ptr(entry->relocs_ptr);
 	unsigned long remain = entry->relocation_count;
 
-	if (unlikely(remain > N_RELOC(ULONG_MAX)))
+	if (unlikely(remain > N_RELOC(INT_MAX)))
 		return -EINVAL;
 
 	/*
@@ -1641,7 +1641,7 @@ static int check_relocations(const struct drm_i915_gem_exec_object2 *entry)
 	if (size == 0)
 		return 0;
 
-	if (size > N_RELOC(ULONG_MAX))
+	if (size > N_RELOC(INT_MAX))
 		return -EINVAL;
 
 	addr = u64_to_user_ptr(entry->relocs_ptr);
-- 
2.44.0
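To make the new limit concrete, here is a minimal userspace sketch of the
capped check. It assumes N_RELOC(x) divides a byte count by the size of one
relocation entry, and it uses a stand-in struct (reloc_entry) laid out like
the 32-byte uAPI relocation entry; reloc_count_ok() is a hypothetical helper
name, not something from the driver.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the uAPI relocation entry (assumed to be 32 bytes). */
struct reloc_entry {
	uint32_t target_handle;
	uint32_t delta;
	uint64_t offset;
	uint64_t presumed_offset;
	uint32_t read_domains;
	uint32_t write_domain;
};

static_assert(sizeof(struct reloc_entry) == 32, "unexpected entry size");

/* Number of whole entries that fit in x bytes (assumed macro shape). */
#define N_RELOC(x) ((x) / sizeof(struct reloc_entry))

/*
 * Hypothetical helper: true if a shadow copy of 'count' entries stays
 * within INT_MAX bytes, which is the size kvmalloc is willing to serve.
 */
static bool reloc_count_ok(uint64_t count)
{
	return count <= N_RELOC(INT_MAX);
}
```

With 32-byte entries the cap works out to 67108863 relocations per object,
a little under 2 GiB of relocation data.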
[PATCH 2/2] drm/amdgpu: Use drm_print_memory_stats helper from fdinfo
From: Tvrtko Ursulin

Convert fdinfo memory stats to use the common drm_print_memory_stats
helper.

This achieves alignment with the common keys as documented in
drm-usage-stats.rst, specifically adding the drm-total- key which the
driver was missing until now.

Additionally I made the code stop skipping total size for objects which
currently do not have a backing store, and I added resident, active and
purgeable reporting.

Legacy keys have been preserved, with the outlook of potentially removing
only drm-memory- when the time is right.

The example output now looks like this:

 pos:                          0
 flags:                        0212
 mnt_id:                       24
 ino:                          1239
 drm-driver:                   amdgpu
 drm-client-id:                4
 drm-pdev:                     :04:00.0
 pasid:                        32771
 drm-total-cpu:                0
 drm-shared-cpu:               0
 drm-active-cpu:               0
 drm-resident-cpu:             0
 drm-purgeable-cpu:            0
 drm-total-gtt:                2392 KiB
 drm-shared-gtt:               0
 drm-active-gtt:               0
 drm-resident-gtt:             2392 KiB
 drm-purgeable-gtt:            0
 drm-total-vram:               44564 KiB
 drm-shared-vram:              31952 KiB
 drm-active-vram:              0
 drm-resident-vram:            44564 KiB
 drm-purgeable-vram:           0
 drm-memory-vram:              44564 KiB
 drm-memory-gtt:               2392 KiB
 drm-memory-cpu:               0 KiB
 amd-memory-visible-vram:      44564 KiB
 amd-evicted-vram:             0 KiB
 amd-evicted-visible-vram:     0 KiB
 amd-requested-vram:           44564 KiB
 amd-requested-visible-vram:   11952 KiB
 amd-requested-gtt:            2392 KiB
 drm-engine-compute:           46464671 ns

v2:
 * Track purgeable via AMDGPU_GEM_CREATE_DISCARDABLE.
Signed-off-by: Tvrtko Ursulin
Cc: Alex Deucher
Cc: Christian König
Cc: Daniel Vetter
Cc: Rob Clark
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 48 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 96 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 35 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h    | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     | 20 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h     | 3 +-
 6 files changed, 122 insertions(+), 81 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
index c7df7fa3459f..00a4ab082459 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
@@ -59,18 +59,21 @@ void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 	struct amdgpu_fpriv *fpriv = file->driver_priv;
 	struct amdgpu_vm *vm = >vm;
 
-	struct amdgpu_mem_stats stats;
+	struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST + 1] = { };
 	ktime_t usage[AMDGPU_HW_IP_NUM];
-	unsigned int hw_ip;
+	const char *pl_name[] = {
+		[TTM_PL_VRAM] = "vram",
+		[TTM_PL_TT] = "gtt",
+		[TTM_PL_SYSTEM] = "cpu",
+	};
+	unsigned int hw_ip, i;
 	int ret;
 
-	memset(, 0, sizeof(stats));
-
 	ret = amdgpu_bo_reserve(vm->root.bo, false);
 	if (ret)
 		return;
 
-	amdgpu_vm_get_memory(vm, );
+	amdgpu_vm_get_memory(vm, stats, ARRAY_SIZE(stats));
 	amdgpu_bo_unreserve(vm->root.bo);
 
 	amdgpu_ctx_mgr_usage(>ctx_mgr, usage);
@@ -82,24 +85,35 @@ void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 	 */
 	drm_printf(p, "pasid:\t%u\n", fpriv->vm.pasid);
-	drm_printf(p, "drm-memory-vram:\t%llu KiB\n", stats.vram/1024UL);
-	drm_printf(p, "drm-memory-gtt: \t%llu KiB\n", stats.gtt/1024UL);
-	drm_printf(p, "drm-memory-cpu: \t%llu KiB\n", stats.cpu/1024UL);
+
+	for (i = 0; i < TTM_PL_PRIV; i++)
+		drm_print_memory_stats(p,
+				       [i].drm,
+				       DRM_GEM_OBJECT_RESIDENT |
+				       DRM_GEM_OBJECT_PURGEABLE,
+				       pl_name[i]);
+
+	/* Legacy amdgpu keys, alias to drm-resident-memory-: */
+	drm_printf(p, "drm-memory-vram:\t%llu KiB\n",
+		   stats[TTM_PL_VRAM].total/1024UL);
+	drm_printf(p, "drm-memory-gtt: \t%llu KiB\n",
+		   stats[TTM_PL_TT].total/1024UL);
+	drm_printf(p, "drm-memory-cpu: \t%llu KiB\n",
+		   stats[TTM_PL_SYSTEM].total/1024UL);
+
+	/* Amdgpu specific memory accounting keys: */
 	drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
-		   stats.visible_vram/1024UL);
+		   stats[TTM_PL_VRAM].visible/1024UL);
 	drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
-		   stats.evicted_vram/1024UL);
+		   stats[TTM_PL_VRAM].evicted/1024UL);
 	drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
-		   stats.evicted_visible_vram/1024UL);
+		   stats[TTM_PL_VRAM].evicted_visible/1024UL);
[PATCH 1/2] Documentation/gpu: Document the situation with unqualified drm-memory-
From: Tvrtko Ursulin

Currently it is not well defined what drm-memory- is compared to other
categories.

In practice the only driver which emits these keys is amdgpu, which uses
them to expose the current resident buffer object memory (including
shared).

To prevent any confusion, document that drm-memory- is deprecated and an
alias for drm-resident-memory-.

While at it also clarify that the reserved sub-string 'memory' refers to
the memory region component, and also clarify the intended semantics of
other memory categories.

v2:
 * Also mark drm-memory- as deprecated.
 * Add some more text describing memory categories. (Alex)

v3:
 * Semantics of the amdgpu drm-memory is actually as drm-resident.

Signed-off-by: Tvrtko Ursulin
Cc: Alex Deucher
Cc: Christian König
Cc: Rob Clark
---
 Documentation/gpu/drm-usage-stats.rst | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst
index 6dc299343b48..45d9b76a5748 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -128,7 +128,9 @@ Memory
 
 Each possible memory type which can be used to store buffer objects by the
 GPU in question shall be given a stable and unique name to be returned as the
-string here. The name "memory" is reserved to refer to normal system memory.
+string here.
+
+The region name "memory" is reserved to refer to normal system memory.
 
 Value shall reflect the amount of storage currently consumed by the buffer
 objects belong to this client, in the respective memory region.
@@ -136,6 +138,9 @@ objects belong to this client, in the respective memory region.
 Default unit shall be bytes with optional unit specifiers of 'KiB' or 'MiB'
 indicating kibi- or mebi-bytes.
 
+This key is deprecated and is an alias for drm-resident-. Only one of
+the two should be present in the output.
+
 - drm-shared-: [KiB|MiB]
 
 The total size of buffers that are shared with another file (e.g., have more
@@ -143,20 +148,34 @@ than a single handle).
 
 - drm-total-: [KiB|MiB]
 
-The total size of buffers that including shared and private memory.
+The total size of all created buffers including shared and private memory. The
+backing store for the buffers does not have to be currently instantiated to be
+counted under this category.
 
 - drm-resident-: [KiB|MiB]
 
-The total size of buffers that are resident in the specified region.
+The total size of buffers that are resident (have their backing store present or
+instantiated) in the specified region.
+
+This is an alias for drm-memory- and only one of the two should be
+present in the output.
 
 - drm-purgeable-: [KiB|MiB]
 
 The total size of buffers that are purgeable.
 
+For example drivers which implement a form of 'madvise' like functionality can
+here count buffers which have instantiated backing store, but have been marked
+with an equivalent of MADV_DONTNEED.
+
 - drm-active-: [KiB|MiB]
 
 The total size of buffers that are active on one or more engines.
 
+One practical example of this can be presence of unsignaled fences in an GEM
+buffer reservation object. Therefore the active category is a subset of
+resident.
+
 Implementation Details
 ==
 
-- 
2.44.0
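The value format described in the documentation patch above (bytes by
default, with an optional 'KiB' or 'MiB' specifier) can be consumed from
userspace roughly as follows. This is only a sketch:
fdinfo_value_to_bytes() is a hypothetical helper name, and it handles just
the value part of a key such as "44564 KiB", not the key itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert the value part of a memory key into
 * bytes. Returns 0 on success, -1 if there are no digits or the unit
 * is not one of the documented specifiers.
 */
static int fdinfo_value_to_bytes(const char *value, uint64_t *bytes)
{
	char *end;
	unsigned long long n;

	assert(value && bytes);

	n = strtoull(value, &end, 10);
	if (end == value)
		return -1;		/* no digits found */

	while (*end == ' ' || *end == '\t')
		end++;

	if (*end == '\0' || *end == '\n')
		*bytes = n;		/* default unit is bytes */
	else if (!strncmp(end, "KiB", 3))
		*bytes = n * 1024ULL;
	else if (!strncmp(end, "MiB", 3))
		*bytes = n * 1024ULL * 1024ULL;
	else
		return -1;		/* unknown unit specifier */

	return 0;
}
```

A monitoring tool that sees both drm-memory- and drm-resident- for the same
region would, per the text above, treat them as the same quantity rather
than add them up.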
Re: [RFC v2 0/2] Discussion around eviction improvements
On 16/05/2024 20:21, Alex Deucher wrote:
> On Thu, May 16, 2024 at 8:18 AM Tvrtko Ursulin wrote:
>>
>> From: Tvrtko Ursulin
>>
>> Reduced re-spin of my previous series after Christian corrected a few
>> misconceptions that I had. So lets see if what remains makes sense or is still
>> misguided.
>>
>> To summarise, the series address the following two issues:
>>
>>  * Migration rate limiting does not work, at least not for the common case
>>    where userspace configures VRAM+GTT. It thinks it can stop migration attempts
>>    by playing with bo->allowed_domains vs bo->preferred domains but, both from
>>    the code, and from empirical experiments, I see that not working at all. When
>>    both masks are identical fiddling with them achieves nothing. Even when they
>>    are not identical allowed has a fallback GTT placement which means that when
>>    over the migration budget ttm_bo_validate with bo->allowed_domains can cause
>>    migration from GTT to VRAM.
>>
>>  * Driver thinks it will be re-validating evicted buffers on the next submission
>>    but it does not for the very common case of VRAM+GTT because it only checks
>>    if current placement is *none* of the preferred placements.
>
> For APUs at least, we should never migrate because GTT and VRAM are
> both system memory so are effectively equal performance-wise.  Maybe

I was curious about this but thought there could be a reason why VRAM
carve-out is a fixed small-ish size. It cannot be made 1:1 with RAM or
some other solution?

> this regressed when Christian reworked ttm to better handle migrating
> buffers back to VRAM after suspend on dGPUs?

I will leave this to Christian to answer but for what this series is
concerned I'd say it is orthogonal to that.

Here we have two fixes not limited to APU use cases, just so it happens
fixing the migration throttling improves things there too. And that even
despite the first patch triggering *more* migration attempts. Because the
second patch then correctly curbs them.

First patch should help with transient overcommit on discrete, allowing
things to get back into VRAM as soon as there is space.

Second patch tries to make migration throttling work as intended.

Volunteers for testing on discrete? :)

>> These two patches appear to have a positive result for a memory intensive game
>> like Assassin's Creed Valhalla. On an APU like Steam Deck the game has a working
>> set around 5 GiB, while the VRAM is configured to 1 GiB. Correctly respecting
>> the migration budget appears to keep buffer blits at bay and improves the
>> minimum frame rate, ie. makes things smoother.
>>
>> From the game's built-in benchmark, average of three runs each:
>>
>>                          FPS
>>          migrated KiB    min    avg    max    min-1%  min-0.1%
>> because  20784781        10.00  37.00  89.67  22.00   12.33
>> patched  4227688         13.67  37.00  81.33  23.33   15.00

Hmm! s/because/before/ here obviously!

Regards,

Tvrtko

>> Disclaimers that I have is that more runs would be needed to be more confident
>> about the results. And more games. And APU versus discrete.
>>
>> Cc: Christian König
>> Cc: Friedrich Vock
>>
>> Tvrtko Ursulin (2):
>>   drm/amdgpu: Re-validate evicted buffers
>>   drm/amdgpu: Actually respect buffer migration budget
>>
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 112 +++--
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |  21 -
>>  2 files changed, 103 insertions(+), 30 deletions(-)
>>
>> --
>> 2.44.0
[RFC v2 0/2] Discussion around eviction improvements
From: Tvrtko Ursulin

Reduced re-spin of my previous series after Christian corrected a few
misconceptions that I had. So let's see if what remains makes sense or is
still misguided.

To summarise, the series addresses the following two issues:

 * Migration rate limiting does not work, at least not for the common case
   where userspace configures VRAM+GTT. It thinks it can stop migration
   attempts by playing with bo->allowed_domains vs bo->preferred_domains
   but, both from the code and from empirical experiments, I see that not
   working at all. When both masks are identical fiddling with them
   achieves nothing. Even when they are not identical, allowed has a
   fallback GTT placement, which means that when over the migration budget
   ttm_bo_validate with bo->allowed_domains can cause migration from GTT
   to VRAM.

 * Driver thinks it will be re-validating evicted buffers on the next
   submission but it does not for the very common case of VRAM+GTT because
   it only checks if current placement is *none* of the preferred
   placements.

These two patches appear to have a positive result for a memory intensive
game like Assassin's Creed Valhalla. On an APU like Steam Deck the game
has a working set around 5 GiB, while the VRAM is configured to 1 GiB.
Correctly respecting the migration budget appears to keep buffer blits at
bay and improves the minimum frame rate, i.e. makes things smoother.

From the game's built-in benchmark, average of three runs each:

                         FPS
         migrated KiB    min    avg    max    min-1%  min-0.1%
before   20784781        10.00  37.00  89.67  22.00   12.33
patched  4227688         13.67  37.00  81.33  23.33   15.00

The disclaimers I have are that more runs would be needed to be more
confident about the results. And more games. And APU versus discrete.

Cc: Christian König
Cc: Friedrich Vock

Tvrtko Ursulin (2):
  drm/amdgpu: Re-validate evicted buffers
  drm/amdgpu: Actually respect buffer migration budget

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 112 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |  21 -
 2 files changed, 103 insertions(+), 30 deletions(-)

-- 
2.44.0
[RFC 2/2] drm/amdgpu: Actually respect buffer migration budget
From: Tvrtko Ursulin

Current code appears to live in a misconception that playing with buffer
allowed and preferred placements can always control the decision on
whether backing store migration will be attempted or not.

That is however not the case when userspace sets buffer placements of
VRAM+GTT, which is what radv does since commit 862b6a9a ("radv: Improve
spilling on discrete GPUs."), with the end result of completely ignoring
the migration budget.

Fix this by validating against a local singleton placement set to the
current backing store location. This way, when migration budget has been
depleted, we can prevent ttm_bo_validate from seeing any other than the
current placement.

For the case of implicit GTT allowed domain added in amdgpu_bo_create
when userspace only sets VRAM the behaviour should be the same. On the
first pass the re-validation will attempt to migrate away from the
fallback GTT domain, and if that did not succeed the buffer will remain
in the fallback placement.

Signed-off-by: Tvrtko Ursulin
Cc: Christian König
Cc: Friedrich Vock
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 112 +++--
 1 file changed, 85 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index ec888fc6ead8..08e7631f3a2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -32,6 +32,7 @@
 #include 
 #include 
+#include 
 #include 
 #include "amdgpu_cs.h"
@@ -775,6 +776,56 @@ void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes,
 	spin_unlock(>mm_stats.lock);
 }
 
+static bool
+amdgpu_cs_bo_move_under_budget(struct amdgpu_cs_parser *p,
+			       struct amdgpu_bo *abo)
+{
+	struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
+
+	/*
+	 * Don't move this buffer if we have depleted our allowance
+	 * to move it. Don't move anything if the threshold is zero.
+	 */
+	if (p->bytes_moved >= p->bytes_moved_threshold)
+		return false;
+
+	if ((!abo->tbo.base.dma_buf ||
+	     list_empty(>tbo.base.dma_buf->attachments)) &&
+	    (!amdgpu_gmc_vram_full_visible(>gmc) &&
+	     (abo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED)) &&
+	    p->bytes_moved_vis >= p->bytes_moved_vis_threshold) {
+		/*
+		 * And don't move a CPU_ACCESS_REQUIRED BO to limited
+		 * visible VRAM if we've depleted our allowance to do
+		 * that.
+		 */
+		return false;
+	}
+
+	return true;
+}
+
+static bool
+amdgpu_bo_fill_current_placement(struct amdgpu_bo *abo,
+				 struct ttm_placement *placement,
+				 struct ttm_place *place)
+{
+	struct ttm_placement *bo_placement = >placement;
+	int i;
+
+	for (i = 0; i < bo_placement->num_placement; i++) {
+		if (bo_placement->placement[i].mem_type ==
+		    abo->tbo.resource->mem_type) {
+			*place = bo_placement->placement[i];
+			placement->num_placement = 1;
+			placement->placement = place;
+			return true;
+		}
+	}
+
+	return false;
+}
+
 static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
 {
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
@@ -784,46 +835,53 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
 		.no_wait_gpu = false,
 		.resv = bo->tbo.base.resv
 	};
-	uint32_t domain;
+	bool allow_move;
 	int r;
 	if (bo->tbo.pin_count)
 		return 0;
-	/* Don't move this buffer if we have depleted our allowance
-	 * to move it. Don't move anything if the threshold is zero.
-	 */
-	if (p->bytes_moved < p->bytes_moved_threshold &&
-	    (!bo->tbo.base.dma_buf ||
-	    list_empty(>tbo.base.dma_buf->attachments))) {
-		if (!amdgpu_gmc_vram_full_visible(>gmc) &&
-		    (bo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED)) {
-			/* And don't move a CPU_ACCESS_REQUIRED BO to limited
-			 * visible VRAM if we've depleted our allowance to do
-			 * that.
-			 */
-			if (p->bytes_moved_vis < p->bytes_moved_vis_threshold)
-				domain = bo->preferred_domains;
-			else
-				domain = bo->allowed_domains;
-		} else {
-			domain = bo->preferred_domains;
-		}
-
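The "local singleton placement" idea from the commit message above can be
modelled outside the kernel as follows. This is a simplified sketch, not
the driver code: the types (enum mem_type, struct placement) and
pick_placement() are hypothetical stand-ins for the TTM structures, and the
choice to keep the full list when the current location is not in it is an
assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative memory types; not the real TTM_PL_* values. */
enum mem_type { MEM_SYSTEM, MEM_GTT, MEM_VRAM };

#define MAX_PLACES 3

struct placement {
	unsigned int num;
	enum mem_type place[MAX_PLACES];
};

/*
 * Pick the placement list to validate against. Once the move budget is
 * spent, narrow the list to the buffer's current location so validation
 * has no other placement to migrate the buffer to.
 */
static struct placement
pick_placement(const struct placement *wanted, enum mem_type current,
	       uint64_t bytes_moved, uint64_t threshold)
{
	struct placement out = *wanted;
	unsigned int i;

	assert(wanted->num <= MAX_PLACES);

	/* Under budget: moves are allowed, use the full list. */
	if (bytes_moved < threshold)
		return out;

	for (i = 0; i < wanted->num; i++) {
		if (wanted->place[i] == current) {
			out.num = 1;
			out.place[0] = current;
			break;
		}
	}

	/* Sketch assumption: keep the full list if current is not in it. */
	return out;
}
```

Note that a threshold of zero means the budget is always considered spent,
which matches the "don't move anything if the threshold is zero" comment in
the patch.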
[RFC 1/2] drm/amdgpu: Re-validate evicted buffers
From: Tvrtko Ursulin

Currently the driver appears to be thinking that it will be attempting to
re-validate the evicted buffers on the next submission if they are not in
their preferred placement.

That however appears not to be true for the very common case of buffers
with allowed placements of VRAM+GTT. Simply because the check can only
detect if the current placement is *none* of the preferred ones, happily
leaving VRAM+GTT buffers in the GTT placement "forever".

Fix it by extending the VRAM+GTT special case to the re-validation logic.

Signed-off-by: Tvrtko Ursulin
Cc: Christian König
Cc: Friedrich Vock
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 6bddd43604bc..e53ff914b62e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1248,10 +1248,25 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va,
 	 * next command submission.
 	 */
 	if (amdgpu_vm_is_bo_always_valid(vm, bo)) {
-		uint32_t mem_type = bo->tbo.resource->mem_type;
+		unsigned current_domain =
+			amdgpu_mem_type_to_domain(bo->tbo.resource->mem_type);
+		bool move_to_evict = false;
 
-		if (!(bo->preferred_domains &
-		      amdgpu_mem_type_to_domain(mem_type)))
+		if (!(bo->preferred_domains & current_domain)) {
+			move_to_evict = true;
+		} else if ((bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK) ==
+			   (AMDGPU_GEM_DOMAIN_VRAM | AMDGPU_GEM_DOMAIN_GTT) &&
+			   current_domain != AMDGPU_GEM_DOMAIN_VRAM) {
+			/*
+			 * If userspace has provided a list of possible
+			 * placements equal to VRAM+GTT, we assume VRAM is *the*
+			 * preferred placement and so try to move it back there
+			 * on the next submission.
+			 */
+			move_to_evict = true;
+		}
+
+		if (move_to_evict)
 			amdgpu_vm_bo_evicted(_va->base);
 		else
 			amdgpu_vm_bo_idle(_va->base);
-- 
2.44.0
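The difference between the old and the new check can be seen in isolation
with a small bitmask sketch. The domain bits (DOM_GTT, DOM_VRAM) and the
function name wants_revalidation() are illustrative only; the real values
and helpers live in the amdgpu headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative domain bits; not the real AMDGPU_GEM_DOMAIN_* values. */
#define DOM_GTT  (1u << 0)
#define DOM_VRAM (1u << 1)

/*
 * Should a buffer be queued for re-validation on the next submission?
 * 'preferred' is a mask of domains, 'current' is exactly one domain.
 */
static bool wants_revalidation(uint32_t preferred, uint32_t current)
{
	/* current must be a single domain bit */
	assert(current && !(current & (current - 1)));

	/* Old check: current placement is none of the preferred ones. */
	if (!(preferred & current))
		return true;

	/* New special case: VRAM+GTT preferred, but not currently in VRAM. */
	if (preferred == (DOM_VRAM | DOM_GTT) && current != DOM_VRAM)
		return true;

	return false;
}
```

With only the old check, a VRAM+GTT buffer sitting in GTT never qualifies,
because GTT is one of its preferred domains. The added case is what sends
it back towards VRAM.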
Re: [PATCH v4 8/8] drm/xe/client: Print runtime to fdinfo
On 15/05/2024 22:42, Lucas De Marchi wrote:
> Print the accumulated runtime for client when printing fdinfo.
> Each time a query is done it first does 2 things:
>
> 1) loop through all the exec queues for the current client and
>    accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
>    that, being read from the context image.
>
> 2) Read a "GPU timestamp" that can be used for considering "how much GPU
>    time has passed" and that has the same unit/refclock as the one
>    recording the runtime. RING_TIMESTAMP is used for that via MMIO.
>
> Since for all current platforms RING_TIMESTAMP follows the same
> refclock, just read it once, using any first engine available.
>
> This is exported to userspace as 2 numbers in fdinfo:
>
> 	drm-cycles-:
> 	drm-total-cycles-:
>
> Userspace is expected to collect at least 2 samples, which allows to
> know the client engine busyness as per:
>
> 	            RUNTIME1 - RUNTIME0
> 	busyness = ---------------------
> 	                  T1 - T0
>
> Since drm-cycles- always starts at 0, it's also possible to know
> if and engine was ever used by a client.
>
> It's expected that userspace will read any 2 samples every few seconds.
> Given the update frequency of the counters involved and that
> CTX_TIMESTAMP is 32-bits, the counter for each exec_queue can wrap
> around (assuming 100% utilization) after ~200s. The wraparound is not
> perceived by userspace since it's just accumulated for all the
> exec_queues in a 64-bit counter) but the measurement will not be
> accurate if the samples are too far apart.
>
> This could be mitigated by adding a workqueue to accumulate the counters
> every so often, but it's additional complexity for something that is
> done already by userspace every few seconds in tools like gputop (from
> igt), htop, nvtop, etc, with none of them really defaulting to 1 sample
> per minute or more.
>
> Signed-off-by: Lucas De Marchi
> ---
>  Documentation/gpu/drm-usage-stats.rst       |  21 +++-
>  Documentation/gpu/xe/index.rst              |   1 +
>  Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
>  drivers/gpu/drm/xe/xe_drm_client.c          | 121 +++-
>  4 files changed, 150 insertions(+), 3 deletions(-)
>  create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst
>
> diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst
> index 6dc299343b48..a80f95ca1b2f 100644
> --- a/Documentation/gpu/drm-usage-stats.rst
> +++ b/Documentation/gpu/drm-usage-stats.rst
> @@ -112,6 +112,19 @@ larger value within a reasonable period. Upon observing a value lower than what
>  was previously read, userspace is expected to stay with that larger previous
>  value until a monotonic update is seen.
>  
> +- drm-total-cycles-:
> +
> +Engine identifier string must be the same as the one specified in the
> +drm-cycles- tag and shall contain the total number cycles for the given
> +engine.
> +
> +This is a timestamp in GPU unspecified unit that matches the update rate
> +of drm-cycles-. For drivers that implement this interface, the engine
> +utilization can be calculated entirely on the GPU clock domain, without
> +considering the CPU sleep time between 2 samples.
> +
> +A driver may implement either this key or drm-maxfreq-, but not both.
> +
>  - drm-maxfreq-: [Hz|MHz|KHz]
>  
>  Engine identifier string must be the same as the one specified in the
> @@ -121,6 +134,9 @@ percentage utilization of the engine, whereas drm-engine- only reflects
>  time active without considering what frequency the engine is operating as a
>  percentage of its maximum frequency.
>  
> +A driver may implement either this key or drm-total-cycles-, but not
> +both.
> +

For the spec part:

Acked-by: Tvrtko Ursulin

Some minor comments and questions below.

>  Memory
>  ^^
>  
> @@ -168,5 +184,6 @@ be documented above and where possible, aligned with other drivers.
>  Driver specific implementations
>  ---
>  
> -:ref:`i915-usage-stats`
> -:ref:`panfrost-usage-stats`
> +* :ref:`i915-usage-stats`
> +* :ref:`panfrost-usage-stats`
> +* :ref:`xe-usage-stats`
> diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
> index c224ecaee81e..3f07aa3b5432 100644
> --- a/Documentation/gpu/xe/index.rst
> +++ b/Documentation/gpu/xe/index.rst
> @@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver.
>     xe_firmware
>     xe_tile
>     xe_debugging
> +   xe-drm-usage-stats.rst
> diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst b/Documentation/gpu/xe/xe-drm-usage-stats.rst
> new file mode 100644
> index ..482d503ae68a
> --- /dev/null
> +++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst
> @@ -0,0 +1,10 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +.. _xe-usage-stats:
> +
> +
> +Xe DRM client usage stats implementation
> +
> +
> +.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c
> +   :doc: DRM Client usage stats
> diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.
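As an illustration of the busyness formula quoted above, a userspace tool
could compute utilization from two fdinfo samples roughly like this. The
struct and function names are hypothetical, and the sketch assumes both
counters have already been parsed into 64-bit values for one engine class.

```c
#include <assert.h>
#include <stdint.h>

/* One fdinfo sample for a single engine class (hypothetical layout). */
struct sample {
	uint64_t cycles;	/* value of the drm-cycles- key */
	uint64_t total_cycles;	/* value of the drm-total-cycles- key */
};

/*
 * Busyness in percent between an older sample s0 and a newer sample s1.
 * Returns -1.0 if no GPU time elapsed between the two samples.
 */
static double busyness_percent(const struct sample *s0,
			       const struct sample *s1)
{
	uint64_t run, elapsed;

	assert(s0 && s1);

	run = s1->cycles - s0->cycles;			/* RUNTIME1 - RUNTIME0 */
	elapsed = s1->total_cycles - s0->total_cycles;	/* T1 - T0 */

	if (elapsed == 0)
		return -1.0;

	return 100.0 * (double)run / (double)elapsed;
}
```

Because both deltas are in the same GPU clock units, no CPU timestamps are
needed, which is the point the documentation hunk makes about staying
entirely in the GPU clock domain.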
Re: [RFC 2/5] drm/amdgpu: Actually respect buffer migration budget
On 15/05/2024 15:31, Christian König wrote:
> On 15.05.24 at 12:59, Tvrtko Ursulin wrote:
>> On 15/05/2024 08:20, Christian König wrote:
>>> On 08.05.24 at 20:09, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin
>>>>
>>>> Current code appears to live in a misconception that playing with buffer
>>>> allowed and preferred placements can control the decision on whether
>>>> backing store migration will be attempted or not.
>>>>
>>>> Both from code inspection and from empirical experiments I see that not
>>>> being true, and that both allowed and preferred placement are typically
>>>> set to the same bitmask.
>>>
>>> That's not correct for the use case handled here, but see below.
>>
>> Which part is not correct, that bo->preferred_domains and
>> bo->allower_domains are the same bitmask?
>
> Sorry totally forgot to explain that.
>
> This rate limit here was specially made for OpenGL applications which
> over commit VRAM. In those case preferred_domains will be VRAM only and
> allowed_domains will be VRAM|GTT.
>
> RADV always uses VRAM|GTT for both (which is correct).

Got it, thanks!

>>>> As such, when the code decides to throttle the migration for a client, it
>>>> is in fact not achieving anything. Buffers can still be either migrated or
>>>> not migrated based on the external (to this function and facility) logic.
>>>>
>>>> Fix it by not changing the buffer object placements if the migration
>>>> budget has been spent.
>>>>
>>>> FIXME:
>>>> Is it still required to call validate is the question..
>>>>
>>>> Signed-off-by: Tvrtko Ursulin
>>>> Cc: Christian König
>>>> Cc: Friedrich Vock
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 12 +---
>>>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> index 22708954ae68..d07a1dd7c880 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> @@ -784,6 +784,7 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
>>>>  		.no_wait_gpu = false,
>>>>  		.resv = bo->tbo.base.resv
>>>>  	};
>>>> +	bool migration_allowed = true;
>>>>  	struct ttm_resource *old_res;
>>>>  	uint32_t domain;
>>>>  	int r;
>>>> @@ -805,19 +806,24 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
>>>>  			 * visible VRAM if we've depleted our allowance to do
>>>>  			 * that.
>>>>  			 */
>>>> -			if (p->bytes_moved_vis < p->bytes_moved_vis_threshold)
>>>> +			if (p->bytes_moved_vis < p->bytes_moved_vis_threshold) {
>>>>  				domain = bo->preferred_domains;
>>>> -			else
>>>> +			} else {
>>>>  				domain = bo->allowed_domains;
>>>> +				migration_allowed = false;
>>>> +			}
>>>>  		} else {
>>>>  			domain = bo->preferred_domains;
>>>>  		}
>>>>  	} else {
>>>>  		domain = bo->allowed_domains;
>>>> +		migration_allowed = false;
>>>>  	}
>>>>  
>>>>  retry:
>>>> -	amdgpu_bo_placement_from_domain(bo, domain);
>>>> +	if (migration_allowed)
>>>> +		amdgpu_bo_placement_from_domain(bo, domain);
>>>
>>> That's completely invalid. Calling amdgpu_bo_placement_from_domain() is
>>> a mandatory prerequisite for calling ttm_bo_validate();
>>>
>>> E.g. the usually code fow is:
>>>
>>> /* This initializes bo->placement */
>>> amdgpu_bo_placement_from_domain()
>>>
>>> /* Eventually modify bo->placement to fit special requirements */
>>>
>>> /* Apply the placement to the BO */
>>> ttm_bo_validate(>tbo, >placement, )
>>>
>>> To sum it up bo->placement should be a variable on the stack instead,
>>> but we never bothered to clean that up.
>>
>> I am not clear if you agree or not that the current method of trying to
>> avoid migration doesn't really do anything?
>
> I totally agree, but the approach you taken to fix it is just quite
> broken. You can't leave bo->placement uninitialized and expect that
> ttm_bo_validate() won't move the BO.

Yep, that much was clear, sorry that I did not explicitly acknowledge it
but just moved on to discussing how to fix it properly.

>> On stack placements sounds plausible to force migration avoidance by
>> putting a single current object placement in that list, if that is what
>> you have in mind? Or a specialized flag/version of
>> amdgpu_bo_placement_from_domain with an bool input like
>> "allow_placement_change"?
>
> A very rough idea with no guarantee that it actually works:
>
> Add a TTM_PL_FLAG_RATE_LIMITED with all the TTM code to actually figure
> out how many bytes have been moved and how many bytes the current
> operation can move etc...
>
> Friedrich's patches actually looked like quite a step into the right
> direction for that already, so I would start from there.
>
> Then always feed amdgpu_bo_placement_from_domain() with the
> allowed_domains in the CS path and VM validation.
>
> Finally extend amdgpu_bo_placement_from_domain() to take a closer look
> at bo->preferred_domains, similar to how we do for the
> TTM_PL_FLAG_FALLBACK already and set the TTM_PL_FLAG_RATE_LIMITED flag
> as appropriate.

Two things which I kind of don't like with the placement flag idea is
Re: [RFC 2/5] drm/amdgpu: Actually respect buffer migration budget
On 15/05/2024 08:20, Christian König wrote:
> On 08.05.24 at 20:09, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin
>>
>> Current code appears to live in a misconception that playing with buffer
>> allowed and preferred placements can control the decision on whether
>> backing store migration will be attempted or not.
>>
>> Both from code inspection and from empirical experiments I see that not
>> being true, and that both allowed and preferred placement are typically
>> set to the same bitmask.
>
> That's not correct for the use case handled here, but see below.

Which part is not correct, that bo->preferred_domains and
bo->allowed_domains are the same bitmask?

>> As such, when the code decides to throttle the migration for a client, it
>> is in fact not achieving anything. Buffers can still be either migrated or
>> not migrated based on the external (to this function and facility) logic.
>>
>> Fix it by not changing the buffer object placements if the migration
>> budget has been spent.
>>
>> FIXME:
>> Is it still required to call validate is the question..
>>
>> Signed-off-by: Tvrtko Ursulin
>> Cc: Christian König
>> Cc: Friedrich Vock
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 12 +---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 22708954ae68..d07a1dd7c880 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -784,6 +784,7 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
>>  		.no_wait_gpu = false,
>>  		.resv = bo->tbo.base.resv
>>  	};
>> +	bool migration_allowed = true;
>>  	struct ttm_resource *old_res;
>>  	uint32_t domain;
>>  	int r;
>> @@ -805,19 +806,24 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
>>  			 * visible VRAM if we've depleted our allowance to do
>>  			 * that.
>>  			 */
>> -			if (p->bytes_moved_vis < p->bytes_moved_vis_threshold)
>> +			if (p->bytes_moved_vis < p->bytes_moved_vis_threshold) {
>>  				domain = bo->preferred_domains;
>> -			else
>> +			} else {
>>  				domain = bo->allowed_domains;
>> +				migration_allowed = false;
>> +			}
>>  		} else {
>>  			domain = bo->preferred_domains;
>>  		}
>>  	} else {
>>  		domain = bo->allowed_domains;
>> +		migration_allowed = false;
>>  	}
>>  
>>  retry:
>> -	amdgpu_bo_placement_from_domain(bo, domain);
>> +	if (migration_allowed)
>> +		amdgpu_bo_placement_from_domain(bo, domain);
>
> That's completely invalid. Calling amdgpu_bo_placement_from_domain() is
> a mandatory prerequisite for calling ttm_bo_validate();
>
> E.g. the usually code fow is:
>
> /* This initializes bo->placement */
> amdgpu_bo_placement_from_domain()
>
> /* Eventually modify bo->placement to fit special requirements */
>
> /* Apply the placement to the BO */
> ttm_bo_validate(>tbo, >placement, )
>
> To sum it up bo->placement should be a variable on the stack instead,
> but we never bothered to clean that up.

I am not clear if you agree or not that the current method of trying to
avoid migration doesn't really do anything?

On stack placements sounds plausible to force migration avoidance by
putting a single current object placement in that list, if that is what
you have in mind? Or a specialized flag/version of
amdgpu_bo_placement_from_domain with a bool input like
"allow_placement_change"?

Regards,

Tvrtko

> Regards,
> Christian.
>
>> +
>>  	r = ttm_bo_validate(>tbo, >placement, );
>>  
>>  	if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {
Re: [RFC 1/5] drm/amdgpu: Fix migration rate limiting accounting
On 15/05/2024 08:14, Christian König wrote:
> On 08.05.24 at 20:09, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin
>>
>> The logic assumed any migration attempt worked and therefore would
>> over-account the amount of data migrated during buffer re-validation. As a
>> consequence client can be unfairly penalised by incorrectly considering
>> its migration budget spent.
>>
>> Fix it by looking at the before and after buffer object backing store and
>> only account if there was a change.
>>
>> FIXME:
>> I think this needs a better solution to account for migrations between
>> VRAM visible and non-visible portions.
>>
>> Signed-off-by: Tvrtko Ursulin
>> Cc: Christian König
>> Cc: Friedrich Vock
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 26 +-
>>  1 file changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index ec888fc6ead8..22708954ae68 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -784,12 +784,15 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
>>  		.no_wait_gpu = false,
>>  		.resv = bo->tbo.base.resv
>>  	};
>> +	struct ttm_resource *old_res;
>>  	uint32_t domain;
>>  	int r;
>>
>>  	if (bo->tbo.pin_count)
>>  		return 0;
>>
>> +	old_res = bo->tbo.resource;
>> +
>>  	/* Don't move this buffer if we have depleted our allowance
>>  	 * to move it. Don't move anything if the threshold is zero.
>>  	 */
>> @@ -817,16 +820,29 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
>>  	amdgpu_bo_placement_from_domain(bo, domain);
>>  	r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>>
>> -	p->bytes_moved += ctx.bytes_moved;
>> -	if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
>> -	    amdgpu_res_cpu_visible(adev, bo->tbo.resource))
>> -		p->bytes_moved_vis += ctx.bytes_moved;
>> -
>>  	if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {
>>  		domain = bo->allowed_domains;
>>  		goto retry;
>>  	}
>>
>> +	if (!r) {
>> +		struct ttm_resource *new_res = bo->tbo.resource;
>> +		bool moved = true;
>> +
>> +		if (old_res == new_res)
>> +			moved = false;
>> +		else if (old_res && new_res &&
>> +			 old_res->mem_type == new_res->mem_type)
>> +			moved = false;
>
> The old resource might already be destroyed after you return from
> validation. So this here won't work.
>
> Apart from that even when a migration attempt fails the moved bytes should
> be accounted.
>
> When the validation attempt doesn't cause any moves then the bytecount
> here would be zero.
>
> So as far as I can see that is as fair as you can get.

Right, I think I suffered a bit of tunnel vision here and completely
ignored the _ctx_.moved_bytes part. Scratch this one too then.

Regards,

Tvrtko

> Regards,
> Christian.
>
> PS: Looks like our mail servers are once more not very reliable. If you
> get mails from me multiple times please just ignore it.
>
>> +
>> +		if (moved) {
>> +			p->bytes_moved += ctx.bytes_moved;
>> +			if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
>> +			    amdgpu_res_cpu_visible(adev, bo->tbo.resource))
>> +				p->bytes_moved_vis += ctx.bytes_moved;
>> +		}
>> +	}
>> +
>>  	return r;
>>  }
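The conclusion of this exchange can be sketched in a few lines of standalone C. The structs and the helper below are hypothetical stand-ins, not the driver's types; the only thing taken from the thread is the argument that the per-validation byte count is already zero when nothing moved, so adding it unconditionally (including after a failed attempt) cannot over-account.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-validation operation context. */
struct sketch_ctx {
	uint64_t bytes_moved; /* bytes moved by this validation attempt */
};

/* Hypothetical stand-in for the per-submission accounting state. */
struct sketch_parser {
	uint64_t bytes_moved;
	uint64_t bytes_moved_vis;
};

/*
 * Account one validation attempt regardless of its return code.
 * No "did it really move?" check is needed: if nothing moved,
 * ctx->bytes_moved is zero and the totals stay unchanged.
 */
static void account_validation(struct sketch_parser *p,
			       const struct sketch_ctx *ctx,
			       bool vram_fully_visible,
			       bool resource_cpu_visible)
{
	p->bytes_moved += ctx->bytes_moved;
	if (!vram_fully_visible && resource_cpu_visible)
		p->bytes_moved_vis += ctx->bytes_moved;
}
```

This also avoids comparing against the old resource pointer, which the reply notes may already be destroyed once validation returns.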
Re: [RFC 0/5] Discussion around eviction improvements
On 13/05/2024 14:49, Tvrtko Ursulin wrote: On 09/05/2024 13:40, Tvrtko Ursulin wrote: On 08/05/2024 19:09, Tvrtko Ursulin wrote: From: Tvrtko Ursulin Last few days I was looking at the situation with VRAM over subscription, what happens versus what perhaps should happen. Browsing through the driver and running some simple experiments. I ended up with this patch series which, as a disclaimer, may be completely wrong but as I found some suspicious things, to me at least, I thought it was a good point to stop and request some comments. To perhaps summarise what are the main issues I think I found: * Migration rate limiting does not bother knowing if actual migration happened and so can over-account and unfairly penalise. * Migration rate limiting does not even work, at least not for the common case where userspace configures VRAM+GTT. It thinks it can stop migration attempts by playing with bo->allowed_domains vs bo->preferred domains but, both from the code, and from empirical experiments, I see that not working at all. Both masks are identical so fiddling with them achieves nothing. * Idea of the fallback placement only works when VRAM has free space. As soon as it does not, ttm_resource_compatible is happy to leave the buffers in the secondary placement forever. * Driver thinks it will be re-validating evicted buffers on the next submission but it does not for the very common case of VRAM+GTT because it only checks if current placement is *none* of the preferred placements. All those problems are addressed in individual patches. End result of this series appears to be driver which will try harder to move buffers back into VRAM, but will be (more) correctly throttled in doing so by the existing rate limiting logic. I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a change but that could be a good thing too. At least I did not break anything, perhaps.. 
On one occasion I did see the rate limiting logic get confused while for a period of a few minutes it went to a mode where it was constantly giving a high migration budget. But that recovered itself when I switched clients and did not come back so I don't know. If there is something wrong there I don't think it would be caused by any patches in this series. Since yesterday I also briefly tested with Far Cry New Dawn. One run each so possibly doesn't mean anything apart from that there isn't a regression aka migration throttling is keeping things at bay even with increased requests to migrate things back to VRAM: min/avg/max fps before: 36/44/54, after: 37/45/55. Cyberpunk 2077 from yesterday was similarly close: 26.96/29.59/30.40 before, 29.70/30.00/30.32 after. I guess the real story is proper DGPU where misplaced buffers have a real cost. I found one game which regresses spectacularly badly with this series - Assassin's Creed Valhalla. The built-in benchmark at least. The game appears to have a working set much larger than the other games I tested, around 5GiB total during the benchmark. And for some reason migration throttling totally fails to put it in check. I will be investigating this shortly. I think that the conclusion is everything I attempted to add relating to TTM_PL_PREFERRED does not really work as I initially thought it did. Therefore please imagine this series as only containing patches 1, 2 and 5. (And FWIW it was quite annoying to get to the bottom of since for some reason the system exhibits some sort of a latching behaviour, where on some boots and/or some minutes of runtime things were fine, and then it would latch onto a mode where the TTM_PL_PREFERRED induced breakage would show. And sometimes this breakage would appear straight away. Odd.) I still need to test though if the subset of patches manage to achieve some positive improvement on their own.
It is possible, as patch 5 marks more buffers for re-validation so once overcommit subsides they would get promoted to preferred placement straight away. And 1&2 are notionally fixes for migration throttling so at least in broad sense should be still valid as discussion points. Regards, Tvrtko Series is probably rough but should be good enough for discussion. I am curious to hear if I identified at least something correctly as a real problem. It would also be good to hear what are the suggested games to check and see whether there is any improvement. Cc: Christian König Cc: Friedrich Vock Tvrtko Ursulin (5): drm/amdgpu: Fix migration rate limiting accounting drm/amdgpu: Actually respect buffer migration budget drm/ttm: Add preferred placement flag drm/amdgpu: Use preferred placement for VRAM+GTT drm/amdgpu: Re-validate evicted buffers drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 +++-- drivers/gpu/drm/
Re: [RFC 0/5] Discussion around eviction improvements
On 09/05/2024 13:40, Tvrtko Ursulin wrote: On 08/05/2024 19:09, Tvrtko Ursulin wrote: From: Tvrtko Ursulin Last few days I was looking at the situation with VRAM over subscription, what happens versus what perhaps should happen. Browsing through the driver and running some simple experiments. I ended up with this patch series which, as a disclaimer, may be completely wrong but as I found some suspicious things, to me at least, I thought it was a good point to stop and request some comments. To perhaps summarise what are the main issues I think I found: * Migration rate limiting does not bother knowing if actual migration happened and so can over-account and unfairly penalise. * Migration rate limiting does not even work, at least not for the common case where userspace configures VRAM+GTT. It thinks it can stop migration attempts by playing with bo->allowed_domains vs bo->preferred domains but, both from the code, and from empirical experiments, I see that not working at all. Both masks are identical so fiddling with them achieves nothing. * Idea of the fallback placement only works when VRAM has free space. As soon as it does not, ttm_resource_compatible is happy to leave the buffers in the secondary placement forever. * Driver thinks it will be re-validating evicted buffers on the next submission but it does not for the very common case of VRAM+GTT because it only checks if current placement is *none* of the preferred placements. All those problems are addressed in individual patches. End result of this series appears to be driver which will try harder to move buffers back into VRAM, but will be (more) correctly throttled in doing so by the existing rate limiting logic. I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a change but that could be a good thing too. At least I did not break anything, perhaps.. 
On one occasion I did see the rate limiting logic get confused while for a period of a few minutes it went to a mode where it was constantly giving a high migration budget. But that recovered itself when I switched clients and did not come back so I don't know. If there is something wrong there I don't think it would be caused by any patches in this series. Since yesterday I also briefly tested with Far Cry New Dawn. One run each so possibly doesn't mean anything apart from that there isn't a regression aka migration throttling is keeping things at bay even with increased requests to migrate things back to VRAM: min/avg/max fps before: 36/44/54, after: 37/45/55. Cyberpunk 2077 from yesterday was similarly close: 26.96/29.59/30.40 before, 29.70/30.00/30.32 after. I guess the real story is proper DGPU where misplaced buffers have a real cost. I found one game which regresses spectacularly badly with this series - Assassin's Creed Valhalla. The built-in benchmark at least. The game appears to have a working set much larger than the other games I tested, around 5GiB total during the benchmark. And for some reason migration throttling totally fails to put it in check. I will be investigating this shortly. Regards, Tvrtko Series is probably rough but should be good enough for discussion. I am curious to hear if I identified at least something correctly as a real problem. It would also be good to hear what are the suggested games to check and see whether there is any improvement.
Cc: Christian König Cc: Friedrich Vock Tvrtko Ursulin (5): drm/amdgpu: Fix migration rate limiting accounting drm/amdgpu: Actually respect buffer migration budget drm/ttm: Add preferred placement flag drm/amdgpu: Use preferred placement for VRAM+GTT drm/amdgpu: Re-validate evicted buffers drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++-- drivers/gpu/drm/ttm/ttm_resource.c | 13 +--- include/drm/ttm/ttm_placement.h | 3 ++ 5 files changed, 65 insertions(+), 18 deletions(-)
Re: [PATCH 12/12] accel/ivpu: Share NPU busy time in sysfs
On 13/05/2024 11:22, Jacek Lawrynowicz wrote: Hi, On 10.05.2024 18:55, Jeffrey Hugo wrote: On 5/8/2024 7:29 AM, Jacek Lawrynowicz wrote: From: Tomasz Rusinowicz The driver tracks the time spent by NPU executing jobs and shares it through sysfs `npu_busy_time_us` file. It can be then used by user space applications to monitor device utilization. NPU is considered 'busy' starting with a first job submitted to firmware and ending when there is no more jobs pending/executing. Signed-off-by: Tomasz Rusinowicz Signed-off-by: Jacek Lawrynowicz This feels like something that would normally be handled by perf. Why not use that mechanism? Yeah, probably but we had several request to provide easy to use interface for this metric that could be integrated in various user space apps/tools that do not use ftrace. Probably more Perf/PMU aka performance counters? Which would be scriptable via $kernel/tools/perf or directly via perf_event_open(2) and read(2). Note it is not easy to get right and in the i915 implementation (see i915_pmu.c) we have a known issue with PCI hot unplug and use after free which needs input from perf core folks. Regards, Tvrtko
Re: [RFC 0/5] Discussion around eviction improvements
On 08/05/2024 19:09, Tvrtko Ursulin wrote: From: Tvrtko Ursulin Last few days I was looking at the situation with VRAM over subscription, what happens versus what perhaps should happen. Browsing through the driver and running some simple experiments. I ended up with this patch series which, as a disclaimer, may be completely wrong but as I found some suspicious things, to me at least, I thought it was a good point to stop and request some comments. To perhaps summarise what are the main issues I think I found: * Migration rate limiting does not bother knowing if actual migration happened and so can over-account and unfairly penalise. * Migration rate limiting does not even work, at least not for the common case where userspace configures VRAM+GTT. It thinks it can stop migration attempts by playing with bo->allowed_domains vs bo->preferred domains but, both from the code, and from empirical experiments, I see that not working at all. Both masks are identical so fiddling with them achieves nothing. * Idea of the fallback placement only works when VRAM has free space. As soon as it does not, ttm_resource_compatible is happy to leave the buffers in the secondary placement forever. * Driver thinks it will be re-validating evicted buffers on the next submission but it does not for the very common case of VRAM+GTT because it only checks if current placement is *none* of the preferred placements. All those problems are addressed in individual patches. End result of this series appears to be driver which will try harder to move buffers back into VRAM, but will be (more) correctly throttled in doing so by the existing rate limiting logic. I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a change but that could be a good thing too. At least I did not break anything, perhaps.. On one occassion I did see the rate limiting logic get confused while for a period of few minutes it went to a mode where it was constantly giving a high migration budget. 
But that recovered itself when I switched clients and did not come back so I don't know. If there is something wrong there I don't think it would be caused by any patches in this series. Since yesterday I also briefly tested with Far Cry New Dawn. One run each so possibly doesn't mean anything apart from that there isn't a regression aka migration throttling is keeping things at bay even with increased requests to migrate things back to VRAM: min/avg/max fps before: 36/44/54, after: 37/45/55. Cyberpunk 2077 from yesterday was similarly close: 26.96/29.59/30.40 before, 29.70/30.00/30.32 after. I guess the real story is proper DGPU where misplaced buffers have a real cost. Regards, Tvrtko Series is probably rough but should be good enough for discussion. I am curious to hear if I identified at least something correctly as a real problem. It would also be good to hear what are the suggested games to check and see whether there is any improvement. Cc: Christian König Cc: Friedrich Vock Tvrtko Ursulin (5): drm/amdgpu: Fix migration rate limiting accounting drm/amdgpu: Actually respect buffer migration budget drm/ttm: Add preferred placement flag drm/amdgpu: Use preferred placement for VRAM+GTT drm/amdgpu: Re-validate evicted buffers drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++-- drivers/gpu/drm/ttm/ttm_resource.c | 13 +--- include/drm/ttm/ttm_placement.h | 3 ++ 5 files changed, 65 insertions(+), 18 deletions(-)
Re: [PATCH v2 6/6] drm/xe/client: Print runtime to fdinfo
On 08/05/2024 21:53, Lucas De Marchi wrote: On Wed, May 08, 2024 at 09:23:17AM GMT, Tvrtko Ursulin wrote: On 07/05/2024 22:35, Lucas De Marchi wrote: On Fri, Apr 26, 2024 at 11:47:37AM GMT, Tvrtko Ursulin wrote: On 24/04/2024 00:56, Lucas De Marchi wrote: Print the accumulated runtime for client when printing fdinfo. Each time a query is done it first does 2 things: 1) loop through all the exec queues for the current client and accumulate the runtime, per engine class. CTX_TIMESTAMP is used for that, being read from the context image. 2) Read a "GPU timestamp" that can be used for considering "how much GPU time has passed" and that has the same unit/refclock as the one recording the runtime. RING_TIMESTAMP is used for that via MMIO. Since for all current platforms RING_TIMESTAMP follows the same refclock, just read it once, using any first engine. This is exported to userspace as 2 numbers in fdinfo: drm-cycles-: drm-total-cycles-: Userspace is expected to collect at least 2 samples, which allows to know the client engine busyness as per: busyness = (RUNTIME1 - RUNTIME0) / (T1 - T0) Another thing to point out is that it's expected that userspace will read any 2 samples every few seconds. Given the update frequency of the counters involved and that CTX_TIMESTAMP is 32-bits, the counter for each exec_queue can wrap around (assuming 100% utilization) after ~200s. The wraparound is not perceived by userspace (since it's just accumulated for all the exec_queues in a 64-bit counter), but the measurement will not be accurate if the samples are too far apart. This could be mitigated by adding a workqueue to accumulate the counters every so often, but it's additional complexity for something that is done already by userspace every few seconds in tools like gputop (from igt), htop, nvtop, etc with none of them really defaulting to 1 sample per minute or more.
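As a rough illustration of the busyness calculation described above, the following self-contained C sketch computes utilisation from two samples and shows how a 32-bit counter can be folded into a 64-bit total. The struct and function names are invented for this example, and parsing of the actual fdinfo text is deliberately left out; treat it as a sketch under those assumptions, not as how any existing tool or the driver does it.

```c
#include <stdint.h>

/* One reading of the two fdinfo values for a single engine class. */
struct fdinfo_sample {
	uint64_t cycles;       /* value of the drm-cycles-* key */
	uint64_t total_cycles; /* value of the drm-total-cycles-* key */
};

/*
 * Busyness in percent between two samples, or -1.0 when it cannot be
 * computed (no GPU time elapsed, or a counter went backwards).
 */
static double busyness_percent(const struct fdinfo_sample *s0,
			       const struct fdinfo_sample *s1)
{
	if (s1->total_cycles <= s0->total_cycles || s1->cycles < s0->cycles)
		return -1.0;

	return 100.0 * (double)(s1->cycles - s0->cycles) /
	       (double)(s1->total_cycles - s0->total_cycles);
}

/*
 * Fold a 32-bit hardware counter into a 64-bit running total.
 * Unsigned subtraction yields the correct delta across at most one
 * wraparound, which is why samples must not be too far apart.
 */
static uint64_t accumulate_u32(uint64_t total, uint32_t prev, uint32_t now)
{
	return total + (uint32_t)(now - prev);
}
```

Because both values are deltas in the same GPU clock domain, the CPU-side time between the two reads does not enter the calculation.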
Signed-off-by: Lucas De Marchi --- Documentation/gpu/drm-usage-stats.rst | 16 ++- Documentation/gpu/xe/index.rst | 1 + Documentation/gpu/xe/xe-drm-usage-stats.rst | 10 ++ drivers/gpu/drm/xe/xe_drm_client.c | 138 +++- 4 files changed, 162 insertions(+), 3 deletions(-) create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst index 6dc299343b48..421766289b78 100644 --- a/Documentation/gpu/drm-usage-stats.rst +++ b/Documentation/gpu/drm-usage-stats.rst @@ -112,6 +112,17 @@ larger value within a reasonable period. Upon observing a value lower than what was previously read, userspace is expected to stay with that larger previous value until a monotonic update is seen. +- drm-total-cycles-: + +Engine identifier string must be the same as the one specified in the +drm-cycles- tag and shall contain the total number cycles for the given +engine. + +This is a timestamp in GPU unspecified unit that matches the update rate +of drm-cycles-. For drivers that implement this interface, the engine +utilization can be calculated entirely on the GPU clock domain, without +considering the CPU sleep time between 2 samples. Two opens. 1) Do we need to explicity document that drm-total-cycles and drm-maxfreq are mutually exclusive? so userspace has a fallback mechanism to calculate utilization depending on what keys are available? No, to document all three at once do not make sense. Or at least are not expected. Or you envisage someone might legitimately emit all three? I don't see what would be the semantics. When we have cycles+maxfreq the latter is in Hz. And when we have cycles+total then it is unitless. All three? I don't follow what you mean here. *cycles* is actually a unit. The engine spent 10 cycles running this context (drm-cycles). In the same period there were 100 cycles available (drm-total-cycles). Current frequency is X MHz. Max frequency is Y MHz. 
For me all of them make sense if one wants to mix them together. For xe it doesn't make sense because the counter backing drm-cycles and drm-total-cycles is unrelated to the engine frequency. I can add something in the doc that we do not expect to see all of them together until we see a usecase. Each driver may implement a subset. I still don't quite see how a combination of cycles, total cycles and maxfreq makes sense together. It would require a driver where cycle period is equal to 1 / maxfreq, which also means total-cycles would be equal to maxfreq, making one of them redundant. So both for drivers like xe where cycle period is unrelated to maxfreq (or even the fatally misguided curfreq) it doesn't make sense, and for a driver like the above it is not needed. What use case am I missing? We need to document this properly so userspace knows how to do the right thing depending on what keys they discover. 2) Should drm-
Re: [RFC 1/5] drm/amdgpu: Fix migration rate limiting accounting
On 08/05/2024 20:08, Friedrich Vock wrote: On 08.05.24 20:09, Tvrtko Ursulin wrote: From: Tvrtko Ursulin The logic assumed any migration attempt worked and therefore would over- account the amount of data migrated during buffer re-validation. As a consequence client can be unfairly penalised by incorrectly considering its migration budget spent. If the migration failed but data was still moved (which I think could be the case when we try evicting everything but it still doesn't work?), shouldn't the eviction movements count towards the ratelimit too? Possibly, which path would that be? I mean there are definitely more migration which *should not* be counted which I think your mini-series approaches more accurately. What this patch achieves, in its current RFC form, is reduces the "false-positive" migration budget depletions. So larger improvements aside, point of the series was to illustrate that even the things which were said to be working do not seem to. See cover letter to see what I thought does not work either well or at all. Fix it by looking at the before and after buffer object backing store and only account if there was a change. FIXME: I think this needs a better solution to account for migrations between VRAM visible and non-visible portions. FWIW, I have some WIP patches (not posted on any MLs yet though) that attempt to solve this issue (+actually enforcing ratelimits) by moving the ratelimit accounting/enforcement to TTM entirely. By moving the accounting to TTM we can count moved bytes when we move them, and don't have to rely on comparing resources to determine whether moving actually happened. This should address your FIXME as well. Yep, I've seen them. They are not necessarily conflicting with this series, potentialy TTM placement flag aside. *If* something like this can be kept small and still manage to fix up a few simple things which do not appear to work at all at the moment. 
For the larger re-work it is quite, well, large and it is not easy to be certain the end result would work as expected. IMO it would be best to sketch out a larger series which brings some practical and measurable change in behaviour before committing to merge things piecemeal. For instance I have a niggling feeling the runtime games driver plays with placements and domains are not great and wonder if things could be cleaner if simplified by letting TTM manage things more, more explicitly, and having the list of placements more static. Thinking about it seems a step too far for now though. Regards, Tvrtko Regards, Friedrich Signed-off-by: Tvrtko Ursulin Cc: Christian König Cc: Friedrich Vock --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 26 +- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index ec888fc6ead8..22708954ae68 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -784,12 +784,15 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo) .no_wait_gpu = false, .resv = bo->tbo.base.resv }; + struct ttm_resource *old_res; uint32_t domain; int r; if (bo->tbo.pin_count) return 0; + old_res = bo->tbo.resource; + /* Don't move this buffer if we have depleted our allowance * to move it. Don't move anything if the threshold is zero.
*/ @@ -817,16 +820,29 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo) amdgpu_bo_placement_from_domain(bo, domain); r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx); - p->bytes_moved += ctx.bytes_moved; - if (!amdgpu_gmc_vram_full_visible(&adev->gmc) && - amdgpu_res_cpu_visible(adev, bo->tbo.resource)) - p->bytes_moved_vis += ctx.bytes_moved; - if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) { domain = bo->allowed_domains; goto retry; } + if (!r) { + struct ttm_resource *new_res = bo->tbo.resource; + bool moved = true; + + if (old_res == new_res) + moved = false; + else if (old_res && new_res && + old_res->mem_type == new_res->mem_type) + moved = false; + + if (moved) { + p->bytes_moved += ctx.bytes_moved; + if (!amdgpu_gmc_vram_full_visible(&adev->gmc) && + amdgpu_res_cpu_visible(adev, bo->tbo.resource)) + p->bytes_moved_vis += ctx.bytes_moved; + } + } + return r; }
[RFC 1/5] drm/amdgpu: Fix migration rate limiting accounting
From: Tvrtko Ursulin The logic assumed any migration attempt worked and therefore would over- account the amount of data migrated during buffer re-validation. As a consequence client can be unfairly penalised by incorrectly considering its migration budget spent. Fix it by looking at the before and after buffer object backing store and only account if there was a change. FIXME: I think this needs a better solution to account for migrations between VRAM visible and non-visible portions. Signed-off-by: Tvrtko Ursulin Cc: Christian König Cc: Friedrich Vock --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 26 +- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index ec888fc6ead8..22708954ae68 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -784,12 +784,15 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo) .no_wait_gpu = false, .resv = bo->tbo.base.resv }; + struct ttm_resource *old_res; uint32_t domain; int r; if (bo->tbo.pin_count) return 0; + old_res = bo->tbo.resource; + /* Don't move this buffer if we have depleted our allowance * to move it. Don't move anything if the threshold is zero. 
*/ @@ -817,16 +820,29 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo) amdgpu_bo_placement_from_domain(bo, domain); r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx); - p->bytes_moved += ctx.bytes_moved; - if (!amdgpu_gmc_vram_full_visible(&adev->gmc) && - amdgpu_res_cpu_visible(adev, bo->tbo.resource)) - p->bytes_moved_vis += ctx.bytes_moved; - if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) { domain = bo->allowed_domains; goto retry; } + if (!r) { + struct ttm_resource *new_res = bo->tbo.resource; + bool moved = true; + + if (old_res == new_res) + moved = false; + else if (old_res && new_res && + old_res->mem_type == new_res->mem_type) + moved = false; + + if (moved) { + p->bytes_moved += ctx.bytes_moved; + if (!amdgpu_gmc_vram_full_visible(&adev->gmc) && + amdgpu_res_cpu_visible(adev, bo->tbo.resource)) + p->bytes_moved_vis += ctx.bytes_moved; + } + } + return r; } -- 2.44.0
[RFC 2/5] drm/amdgpu: Actually respect buffer migration budget
From: Tvrtko Ursulin Current code appears to live in a misconception that playing with buffer allowed and preferred placements can control the decision on whether backing store migration will be attempted or not. Both from code inspection and from empirical experiments I see that not being true, and that both allowed and preferred placement are typically set to the same bitmask. As such, when the code decides to throttle the migration for a client, it is in fact not achieving anything. Buffers can still be either migrated or not migrated based on the external (to this function and facility) logic. Fix it by not changing the buffer object placements if the migration budget has been spent. FIXME: Is it still required to call validate is the question.. Signed-off-by: Tvrtko Ursulin Cc: Christian König Cc: Friedrich Vock --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 22708954ae68..d07a1dd7c880 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -784,6 +784,7 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo) .no_wait_gpu = false, .resv = bo->tbo.base.resv }; + bool migration_allowed = true; struct ttm_resource *old_res; uint32_t domain; int r; @@ -805,19 +806,24 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo) * visible VRAM if we've depleted our allowance to do * that. 
*/ - if (p->bytes_moved_vis < p->bytes_moved_vis_threshold) + if (p->bytes_moved_vis < p->bytes_moved_vis_threshold) { domain = bo->preferred_domains; - else + } else { domain = bo->allowed_domains; + migration_allowed = false; + } } else { domain = bo->preferred_domains; } } else { domain = bo->allowed_domains; + migration_allowed = false; } retry: - amdgpu_bo_placement_from_domain(bo, domain); + if (migration_allowed) + amdgpu_bo_placement_from_domain(bo, domain); + r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx); if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) { -- 2.44.0
[RFC 0/5] Discussion around eviction improvements
From: Tvrtko Ursulin

For the last few days I have been looking at the situation with VRAM oversubscription: what happens versus what perhaps should happen. I browsed through the driver and ran some simple experiments.

I ended up with this patch series which, as a disclaimer, may be completely wrong. But since I found some things that look suspicious, to me at least, I thought it was a good point to stop and request some comments.

To summarise the main issues I think I found:

 * Migration rate limiting does not check whether a migration actually happened, so it can over-account and unfairly penalise.

 * Migration rate limiting does not even work, at least not for the common case where userspace configures VRAM+GTT. It thinks it can stop migration attempts by playing with bo->allowed_domains vs bo->preferred_domains but, both from the code and from empirical experiments, I see that not working at all. Both masks are identical, so fiddling with them achieves nothing.

 * The idea of the fallback placement only works when VRAM has free space. As soon as it does not, ttm_resource_compatible is happy to leave the buffers in the secondary placement forever.

 * The driver thinks it will be re-validating evicted buffers on the next submission, but it does not for the very common case of VRAM+GTT, because it only checks whether the current placement is *none* of the preferred placements.

All those problems are addressed in individual patches.

The end result of this series appears to be a driver which will try harder to move buffers back into VRAM, but will be (more) correctly throttled in doing so by the existing rate limiting logic.

I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a change, but that could be a good thing too. At least I did not break anything, perhaps..

On one occasion I did see the rate limiting logic get confused: for a period of a few minutes it went into a mode where it was constantly giving a high migration budget. But that recovered itself when I switched clients and did not come back, so I don't know. If there is something wrong there I don't think it would be caused by any patches in this series.

The series is probably rough but should be good enough for discussion. I am curious to hear if I identified at least something correctly as a real problem.

It would also be good to hear which games are suggested for checking whether there is any improvement.

Cc: Christian König
Cc: Friedrich Vock

Tvrtko Ursulin (5):
  drm/amdgpu: Fix migration rate limiting accounting
  drm/amdgpu: Actually respect buffer migration budget
  drm/ttm: Add preferred placement flag
  drm/amdgpu: Use preferred placement for VRAM+GTT
  drm/amdgpu: Re-validate evicted buffers

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 38 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     | 21 ++--
 drivers/gpu/drm/ttm/ttm_resource.c         | 13 +---
 include/drm/ttm/ttm_placement.h            |  3 ++
 5 files changed, 65 insertions(+), 18 deletions(-)

-- 
2.44.0
[RFC 5/5] drm/amdgpu: Re-validate evicted buffers
From: Tvrtko Ursulin

Currently the driver appears to be thinking that it will be attempting to re-validate the evicted buffers on the next submission if they are not in their preferred placement.

That however appears not to be true for the very common case of buffers with allowed placements of VRAM+GTT. Simply because the check can only detect if the current placement is *none* of the preferred ones, happily leaving VRAM+GTT buffers in the GTT placement "forever".

Fix it by extending the VRAM+GTT special case to the re-validation logic.

Signed-off-by: Tvrtko Ursulin
Cc: Christian König
Cc: Friedrich Vock
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 6bddd43604bc..e53ff914b62e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1248,10 +1248,25 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va,
 	 * next command submission.
 	 */
 	if (amdgpu_vm_is_bo_always_valid(vm, bo)) {
-		uint32_t mem_type = bo->tbo.resource->mem_type;
+		unsigned current_domain =
+			amdgpu_mem_type_to_domain(bo->tbo.resource->mem_type);
+		bool move_to_evict = false;
 
-		if (!(bo->preferred_domains &
-		      amdgpu_mem_type_to_domain(mem_type)))
+		if (!(bo->preferred_domains & current_domain)) {
+			move_to_evict = true;
+		} else if ((bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK) ==
+			   (AMDGPU_GEM_DOMAIN_VRAM | AMDGPU_GEM_DOMAIN_GTT) &&
+			   current_domain != AMDGPU_GEM_DOMAIN_VRAM) {
+			/*
+			 * If userspace has provided a list of possible
+			 * placements equal to VRAM+GTT, we assume VRAM is *the*
+			 * preferred placement and so try to move it back there
+			 * on the next submission.
+			 */
+			move_to_evict = true;
+		}
+
+		if (move_to_evict)
 			amdgpu_vm_bo_evicted(&bo_va->base);
 		else
 			amdgpu_vm_bo_idle(&bo_va->base);
-- 
2.44.0
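To make the decision in this patch easier to reason about in isolation, here is a minimal userspace sketch of the same check written as a pure function. It is only an illustration: the domain bit values and the mask are stand-ins chosen for the example, and needs_revalidation() is a hypothetical name, not a function in the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in domain bits for this sketch only, not the driver's definitions. */
#define DOMAIN_GTT   (1u << 1)
#define DOMAIN_VRAM  (1u << 2)
#define DOMAIN_MASK  (DOMAIN_GTT | DOMAIN_VRAM)

/*
 * Hypothetical helper mirroring the patched logic: returns true when the
 * buffer should go on the evicted list so that the next submission tries
 * to re-validate it.
 */
static bool needs_revalidation(uint32_t preferred_domains,
			       uint32_t current_domain)
{
	/* Existing check: current placement is none of the preferred ones. */
	if (!(preferred_domains & current_domain))
		return true;

	/* Added check: a VRAM+GTT buffer that currently sits outside VRAM. */
	if ((preferred_domains & DOMAIN_MASK) == (DOMAIN_VRAM | DOMAIN_GTT) &&
	    current_domain != DOMAIN_VRAM)
		return true;

	return false;
}
```

With only the first check, a VRAM+GTT buffer sitting in GTT is never flagged, which is the "forever" case the commit message describes. The second check is what flags it.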
[RFC 3/5] drm/ttm: Add preferred placement flag
From: Tvrtko Ursulin

Currently the fallback placement flag can achieve a hint that buffer should be migrated back to the non-fallback placement, however that only works while there is no memory pressure. As soon as we reach full VRAM utilisation, or worse overcommit, the logic is happy to leave buffers in the fallback placement. Consequence of this is that once buffers are evicted they never get considered to be migrated back until the memory pressure subsides, leaving a potentially active client not able to bring its buffers back in.

Add a "preferred" placement flag which drivers can set when they want some extra effort to be attempted for bringing a buffer back in.

QQQ:
Is the current "desired" flag unfortunately named perhaps? I ended up understanding it as more like "would be nice if possible but absolutely don't bother under memory pressure".

Signed-off-by: Tvrtko Ursulin
Cc: Christian König
Cc: Friedrich Vock
---
 drivers/gpu/drm/ttm/ttm_resource.c | 13 +
 include/drm/ttm/ttm_placement.h    |  3 +++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index 4a66b851b67d..59f3d1bcc11f 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -305,6 +305,8 @@ bool ttm_resource_compatible(struct ttm_resource *res,
 			     struct ttm_placement *placement,
 			     bool evicting)
 {
+	const u32 incompatible_flag = evicting ? TTM_PL_FLAG_DESIRED :
+						 TTM_PL_FLAG_FALLBACK;
 	struct ttm_buffer_object *bo = res->bo;
 	struct ttm_device *bdev = bo->bdev;
 	unsigned i;
@@ -316,11 +318,14 @@ bool ttm_resource_compatible(struct ttm_resource *res,
 		const struct ttm_place *place = &placement->placement[i];
 		struct ttm_resource_manager *man;
 
-		if (res->mem_type != place->mem_type)
-			continue;
+		if (res->mem_type != place->mem_type) {
+			if (place->flags & TTM_PL_FLAG_PREFERRED)
+				return false;
+			else
+				continue;
+		}
 
-		if (place->flags & (evicting ? TTM_PL_FLAG_DESIRED :
-				    TTM_PL_FLAG_FALLBACK))
+		if (place->flags & incompatible_flag)
 			continue;
 
 		if (place->flags & TTM_PL_FLAG_CONTIGUOUS &&
diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index b510a4812609..8ea0865e9cc8 100644
--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -70,6 +70,9 @@
 /* Placement is only used during eviction */
 #define TTM_PL_FLAG_FALLBACK	(1 << 4)
 
+/* Placement is only used during eviction */
+#define TTM_PL_FLAG_PREFERRED	(1 << 5)
+
 /**
  * struct ttm_place
  *
-- 
2.44.0
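As an aid to reading the hunk above, the following is a simplified userspace model of the patched loop, as I read the diff. It is a sketch under stated assumptions: the flag bits, struct place and resource_compatible() are stand-ins invented for the example, and the contiguous and resource manager checks of the real function are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in flag bits and types for this sketch only. */
#define FLAG_DESIRED   (1u << 0)
#define FLAG_FALLBACK  (1u << 1)
#define FLAG_PREFERRED (1u << 2)

struct place {
	uint32_t mem_type;
	uint32_t flags;
};

/*
 * Simplified model: is a resource currently in cur_mem_type compatible
 * with the given placement list?
 */
static bool resource_compatible(uint32_t cur_mem_type,
				const struct place *places, size_t num,
				bool evicting)
{
	const uint32_t incompatible_flag = evicting ? FLAG_DESIRED :
						      FLAG_FALLBACK;
	size_t i;

	for (i = 0; i < num; i++) {
		const struct place *p = &places[i];

		if (cur_mem_type != p->mem_type) {
			/* Not in a preferred placement: report incompatible. */
			if (p->flags & FLAG_PREFERRED)
				return false;
			continue;
		}

		if (p->flags & incompatible_flag)
			continue;

		return true;
	}

	return false;
}
```

In this model, with evicting set, a list of {VRAM, GTT as fallback} treats a GTT-resident buffer as compatible, while marking the VRAM entry as preferred makes the same buffer incompatible. Note the result depends on the preferred entry appearing before the entry that matches the current placement.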
[RFC 4/5] drm/amdgpu: Use preferred placement for VRAM+GTT
From: Tvrtko Ursulin

Now that TTM has the preferred placement flag, extend the current workaround which assumes the GTT placement as fallback in the presence of the additional VRAM placement.

By marking the VRAM placement as preferred we will make the buffer re-validation phase actually attempt to migrate them back to VRAM.

Without it, TTM core logic is happy to leave them in GTT placement "forever".

Signed-off-by: Tvrtko Ursulin
Cc: Christian König
Cc: Friedrich Vock
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 50b7e7c0ce50..9be767357e86 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -128,8 +128,8 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
 	struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
 	struct ttm_placement *placement = &abo->placement;
 	struct ttm_place *places = abo->placements;
+	int c = 0, vram_index = -1;
 	u64 flags = abo->flags;
-	u32 c = 0;
 
 	if (domain & AMDGPU_GEM_DOMAIN_VRAM) {
 		unsigned int visible_pfn = adev->gmc.visible_vram_size >> PAGE_SHIFT;
@@ -158,7 +158,7 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
 		    flags & AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS)
 			places[c].flags |= TTM_PL_FLAG_CONTIGUOUS;
 
-		c++;
+		vram_index = c++;
 	}
 
 	if (domain & AMDGPU_GEM_DOMAIN_DOORBELL) {
@@ -180,8 +180,10 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
 		 * When GTT is just an alternative to VRAM make sure that we
 		 * only use it as fallback and still try to fill up VRAM first.
 		 */
-		if (domain & abo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM)
+		if (vram_index >= 0) {
 			places[c].flags |= TTM_PL_FLAG_FALLBACK;
+			places[vram_index].flags |= TTM_PL_FLAG_PREFERRED;
+		}
 		c++;
 	}
 
-- 
2.44.0
Re: [PATCH v3 5/6] drm/xe: Add helper to accumulate exec queue runtime
On 07/05/2024 23:45, Lucas De Marchi wrote:

From: Umesh Nerlige Ramappa

Add a helper to accumulate per-client runtime of all its exec queues. This is called every time a sched job is finished.

v2:
  - Use guc_exec_queue_free_job() and execlist_job_free() to accumulate runtime when job is finished since xe_sched_job_completed() is not a notification that job finished.
  - Stop trying to update runtime from xe_exec_queue_fini() - that is redundant and may happen after xef is closed, leading to a use-after-free
  - Do not special case the first timestamp read: the default LRC sets CTX_TIMESTAMP to zero, so even the first sample should be a valid one.
  - Handle the parallel submission case by multiplying the runtime by width.

Signed-off-by: Umesh Nerlige Ramappa
Signed-off-by: Lucas De Marchi
---
 drivers/gpu/drm/xe/xe_device_types.h |  9 +++
 drivers/gpu/drm/xe/xe_exec_queue.c   | 35 
 drivers/gpu/drm/xe/xe_exec_queue.h   |  1 +
 drivers/gpu/drm/xe/xe_execlist.c     |  1 +
 drivers/gpu/drm/xe/xe_guc_submit.c   |  2 ++
 5 files changed, 48 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 906b98fb973b..de078bdf0ab9 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -560,6 +560,15 @@ struct xe_file {
 		struct mutex lock;
 	} exec_queue;
 
+	/**
+	 * @runtime: hw engine class runtime in ticks for this drm client
+	 *
+	 * Only stats from xe_exec_queue->lrc[0] are accumulated. For multi-lrc
+	 * case, since all jobs run in parallel on the engines, only the stats
+	 * from lrc[0] are sufficient.
+	 */
+	u64 runtime[XE_ENGINE_CLASS_MAX];
+
 	/** @client: drm client */
 	struct xe_drm_client *client;
 };
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 395de93579fa..86eb22e22c95 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -769,6 +769,41 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
 		q->lrc[0].fence_ctx.next_seqno - 1;
 }
 
+/**
+ * xe_exec_queue_update_runtime() - Update runtime for this exec queue from hw
+ * @q: The exec queue
+ *
+ * Update the timestamp saved by HW for this exec queue and save runtime
+ * calculated by using the delta from last update. On multi-lrc case, only the
+ * first is considered.
+ */
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
+{
+	struct xe_file *xef;
+	struct xe_lrc *lrc;
+	u32 old_ts, new_ts;
+
+	/*
+	 * Jobs that are run during driver load may use an exec_queue, but are
+	 * not associated with a user xe file, so avoid accumulating busyness
+	 * for kernel specific work.
+	 */
+	if (!q->vm || !q->vm->xef)
+		return;
+
+	xef = q->vm->xef;
+
+	/*
+	 * Only sample the first LRC. For parallel submission, all of them are
+	 * scheduled together and we compensate that below by multiplying by
+	 * width
+	 */
+	lrc = &q->lrc[0];
+
+	new_ts = xe_lrc_update_timestamp(lrc, &old_ts);
+	xef->runtime[q->class] += (new_ts - old_ts) * q->width;

I think in theory this could be introducing a systematic error depending on how firmware handles things and tick resolution. Or even regardless of the firmware, if the timestamps are saved on context exit by the GPU hw itself and parallel contexts do not exit 100% aligned. Undershoot would be I think fine, but systematic overshoot under constant 100% parallel load from multiple clients could constantly show >100% class utilisation. Probably not a concern in practice but worthy of a comment?

Regards,

Tvrtko

+}
+
 void xe_exec_queue_kill(struct xe_exec_queue *q)
 {
 	struct xe_exec_queue *eq = q, *next;
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
index 48f6da53a292..e0f07d28ee1a 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue.h
@@ -75,5 +75,6 @@ struct dma_fence *xe_exec_queue_last_fence_get(struct xe_exec_queue *e,
 					       struct xe_vm *vm);
 void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct xe_vm *vm,
 				  struct dma_fence *fence);
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_execlist.c b/drivers/gpu/drm/xe/xe_execlist.c
index dece2785933c..a316431025c7 100644
--- a/drivers/gpu/drm/xe/xe_execlist.c
+++ b/drivers/gpu/drm/xe/xe_execlist.c
@@ -307,6 +307,7 @@ static void execlist_job_free(struct drm_sched_job *drm_job)
 {
 	struct xe_sched_job *job = to_xe_sched_job(drm_job);
 
+	xe_exec_queue_update_runtime(job->q);
 	xe_sched_job_put(job);
 }

diff --git
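The accumulation step under review can be tried out in userspace with a small sketch like the one below. It only illustrates the arithmetic: struct client_runtime and accumulate_runtime() are hypothetical names, not part of the xe driver, and the sketch widens the delta to 64 bits before multiplying by the width, which is a choice made for this example.

```c
#include <stdint.h>

/* Hypothetical per-client accumulator for one engine class. */
struct client_runtime {
	uint64_t ticks;   /* 64-bit running total */
	uint32_t last_ts; /* last 32-bit hardware timestamp sampled */
};

/*
 * Fold a new 32-bit timestamp sample into the 64-bit total. The subtraction
 * is done in 32 bits, so a single counter wrap between two samples still
 * yields the right delta. The width scales it for parallel submission.
 */
static void accumulate_runtime(struct client_runtime *rt, uint32_t new_ts,
			       uint32_t width)
{
	uint32_t delta = new_ts - rt->last_ts;

	rt->ticks += (uint64_t)delta * width;
	rt->last_ts = new_ts;
}
```

Multiplying by the width assumes every parallel context ran for exactly as long as the sampled one. If they did not, the total over-reports, which is the systematic overshoot the review comment is asking about.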
Re: [PATCH v2 6/6] drm/xe/client: Print runtime to fdinfo
On 07/05/2024 22:35, Lucas De Marchi wrote:
On Fri, Apr 26, 2024 at 11:47:37AM GMT, Tvrtko Ursulin wrote:
On 24/04/2024 00:56, Lucas De Marchi wrote:

Print the accumulated runtime for client when printing fdinfo. Each time a query is done it first does 2 things:

1) loop through all the exec queues for the current client and accumulate the runtime, per engine class. CTX_TIMESTAMP is used for that, being read from the context image.

2) Read a "GPU timestamp" that can be used for considering "how much GPU time has passed" and that has the same unit/refclock as the one recording the runtime. RING_TIMESTAMP is used for that via MMIO.

Since for all current platforms RING_TIMESTAMP follows the same refclock, just read it once, using any first engine.

This is exported to userspace as 2 numbers in fdinfo:

	drm-cycles-:
	drm-total-cycles-:

Userspace is expected to collect at least 2 samples, which allows to know the client engine busyness as per:

		    RUNTIME1 - RUNTIME0
	busyness = ---------------------
			  T1 - T0

Another thing to point out is that it's expected that userspace will read any 2 samples every few seconds. Given the update frequency of the counters involved and that CTX_TIMESTAMP is 32-bits, the counter for each exec_queue can wrap around (assuming 100% utilization) after ~200s. The wraparound is not perceived by userspace since it's just accumulated for all the exec_queues in a 64-bit counter, but the measurement will not be accurate if the samples are too far apart.

This could be mitigated by adding a workqueue to accumulate the counters every so often, but it's additional complexity for something that is done already by userspace every few seconds in tools like gputop (from igt), htop, nvtop, etc with none of them really defaulting to 1 sample per minute or more.

Signed-off-by: Lucas De Marchi
---
 Documentation/gpu/drm-usage-stats.rst       |  16 ++-
 Documentation/gpu/xe/index.rst              |   1 +
 Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
 drivers/gpu/drm/xe/xe_drm_client.c          | 138 +++-
 4 files changed, 162 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst

diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst
index 6dc299343b48..421766289b78 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -112,6 +112,17 @@ larger value within a reasonable period. Upon observing a value lower than what
 was previously read, userspace is expected to stay with that larger previous
 value until a monotonic update is seen.
 
+- drm-total-cycles-:
+
+Engine identifier string must be the same as the one specified in the
+drm-cycles- tag and shall contain the total number cycles for the given
+engine.
+
+This is a timestamp in GPU unspecified unit that matches the update rate
+of drm-cycles-. For drivers that implement this interface, the engine
+utilization can be calculated entirely on the GPU clock domain, without
+considering the CPU sleep time between 2 samples.

Two opens.

1) Do we need to explicitly document that drm-total-cycles and drm-maxfreq are mutually exclusive?

so userspace has a fallback mechanism to calculate utilization depending on what keys are available?

No, documenting all three at once does not make sense. Or at least they are not expected together. Or do you envisage someone might legitimately emit all three? I don't see what the semantics would be. When we have cycles+maxfreq the latter is in Hz. And when we have cycles+total then it is unitless. All three?

2) Should drm-total-cycles for now be documented as driver specific?

you mean to call it xe-total-cycles?

Yes but it is not an ask, just an open.

I have added some more people in the cc who were involved with driver fdinfo implementations if they will have an opinion.

I would say potentially yes, and promote it to common if more than one driver would use it.

For instance I see panfrost has the driver specific drm-curfreq (although isn't documenting it fully in panfrost.rst). And I have to say it is somewhat questionable to expose the current frequency per fdinfo per engine but not my call.

aren't all of Documentation/gpu/drm-usage-stats.rst optional that driver may or may not implement? When you say driver-specific I'd think more of the ones not using as prefix as e.g. amd-*.

I think drm-cycles + drm-total-cycles is just an alternative implementation for engine utilization. Like drm-cycles + drm-maxfreq already is an alternative to drm-engine and is not implemented by e.g. amdgpu/i915.

I will submit a new version of the entire patch series to get the ball rolling, but let's keep this open for now.

<...>

+static void show_runtime(struct drm_printer *p, struct drm_file *file)
+{
+	struct xe_file *xef = file->driver_priv;
+	struct xe_device *xe = x
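For readers following the busyness formula in the quoted commit message, a userspace tool could compute it from two fdinfo readings roughly as sketched below. This is an illustration under assumptions: struct cycles_sample and busyness_percent() are hypothetical names, and returning -1.0 for unusable samples is a choice made for the example, not documented behaviour.

```c
#include <stdint.h>

/* One fdinfo reading for a single engine class (hypothetical names). */
struct cycles_sample {
	uint64_t cycles;       /* value read from the drm-cycles- key */
	uint64_t total_cycles; /* value read from the drm-total-cycles- key */
};

/*
 * Engine busyness between two samples in percent, or -1.0 if the samples
 * cannot be used (no GPU time elapsed, or a counter went backwards).
 */
static double busyness_percent(const struct cycles_sample *s0,
			       const struct cycles_sample *s1)
{
	if (s1->total_cycles <= s0->total_cycles || s1->cycles < s0->cycles)
		return -1.0;

	return 100.0 * (double)(s1->cycles - s0->cycles) /
	       (double)(s1->total_cycles - s0->total_cycles);
}
```

Both deltas come from the GPU clock domain, so the CPU-side sampling interval does not enter the calculation. Samples taken too far apart still lose accuracy for the wraparound reason given in the commit message.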
Re: drm scheduler and wq flavours
On 07/05/2024 00:23, Matthew Brost wrote:
On Thu, May 02, 2024 at 03:33:50PM +0100, Tvrtko Ursulin wrote:

Hi all,

Continuing after the brief IRC discussion yesterday regarding work queues being prone to deadlocks or not, I had a browse around the code base and ended up a bit confused.

When drm_sched_init documents and allocates an *ordered* wq, if no custom one was provided, could someone remind me whether the ordered property was fundamental for something to work correctly? Like run_job vs free_job ordering?

Before the work queue (kthread design), run_job & free_job were ordered. It was decided to not break this existing behavior.

Simply for extra paranoia, or do you remember if there was a reason identified?

I ask because it appears different drivers do different things and at the moment it looks like we have all possible combos of ordered/unordered, bound and unbound, shared or not shared with the timeout wq, or even unbound for the timeout wq.

The drivers worth looking at in this respect are probably nouveau, panthor, pvr and xe.

Nouveau also talks about a dependency between run_job and free_job and goes on to create two unordered wqs.

Then xe looks a bit funky with the workaround/hack for lockdep where it creates 512 work queues and hands them over to user queues in round-robin fashion. (Instead of default 1:1.) Which I suspect is a problem which should be applicable for any 1:1 driver given a thorough enough test suite.

I think lockdep ran out of chains or something when executing some wild IGT with 1:1. Yes, any driver with a wild enough test would likely hit this lockdep splat too. Using a pool probably is not a bad idea either.

I wonder what is different between that and having a single shared unbound queue and letting the kernel manage the concurrency? Both this..

So anyway.. ordered vs unordered - drm sched dictated or at driver's choice?

Default ordered, driver can override with unordered.

.. and this, go back to my original question - whether the default queue must be ordered or not, or under which circumstances drivers can choose unordered. I think in drm_sched_init, where kerneldoc says it will create an ordered queue, it would be good to document the rules.

Regards,

Tvrtko
Re: [PATCH] Documentation/gpu: Document the situation with unqualified drm-memory-
On 03/05/2024 16:58, Alex Deucher wrote: On Fri, May 3, 2024 at 11:33 AM Daniel Vetter wrote: On Fri, May 03, 2024 at 01:58:38PM +0100, Tvrtko Ursulin wrote: [And I forgot dri-devel.. doing well!] On 03/05/2024 13:40, Tvrtko Ursulin wrote: [Correcting Christian's email] On 03/05/2024 13:36, Tvrtko Ursulin wrote: From: Tvrtko Ursulin Currently it is not well defined what is drm-memory- compared to other categories. In practice the only driver which emits these keys is amdgpu and in them exposes the total memory use (including shared). Document that drm-memory- and drm-total-memory- are aliases to prevent any confusion in the future. While at it also clarify that the reserved sub-string 'memory' refers to the memory region component. Signed-off-by: Tvrtko Ursulin Cc: Alex Deucher Cc: Christian König Mea culpa, I copied the mistake from 77d17c4cd0bf52eacfad88e63e8932eb45d643c5. :) Regards, Tvrtko Cc: Rob Clark --- Documentation/gpu/drm-usage-stats.rst | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst index 6dc299343b48..ef5c0a0aa477 100644 --- a/Documentation/gpu/drm-usage-stats.rst +++ b/Documentation/gpu/drm-usage-stats.rst @@ -128,7 +128,9 @@ Memory Each possible memory type which can be used to store buffer objects by the GPU in question shall be given a stable and unique name to be returned as the -string here. The name "memory" is reserved to refer to normal system memory. +string here. + +The region name "memory" is reserved to refer to normal system memory. Value shall reflect the amount of storage currently consumed by the buffer objects belong to this client, in the respective memory region. @@ -136,6 +138,9 @@ objects belong to this client, in the respective memory region. Default unit shall be bytes with optional unit specifiers of 'KiB' or 'MiB' indicating kibi- or mebi-bytes. +This is an alias for drm-total- and only one of the two should be +present. 
This feels a bit awkward and seems to needlessly complicate fdinfo uapi.

- Could we just patch amdgpu to follow everyone else, and avoid the special case? If there's no tool that relies on the special amdgpu prefix then that would be a lot easier.

- If that's not on the table, could we make everyone (with a suitable helper or something) just print both variants, so that we again have consistent fdinfo output? Or does that break a different set of existing tools?

- Finally maybe could we get away with fixing amd by adding the common format there, deprecating the old, fixing the tools that would break and then maybe if we're lucky, remove the old one from amdgpu in a year or so?

I'm not really understanding what amdgpu is doing wrong. It seems to be following the documentation. Is the idea that we would like to deprecate drm-memory- in favor of drm-total-? If that's the case, I think the 3rd option is probably the best. We have a lot of tools and customers using this. It would have also been nice to have "memory" in the string for the newer ones to avoid conflicts with other things that might be a total or shared in the future, but I guess that ship has sailed. We should also note that drm-memory- is deprecated. While we are here, maybe we should clarify the semantics of resident, purgeable, and active. For example, isn't resident just a duplicate of total? If the memory was not resident, it would be in a different region.

Amdgpu isn't doing anything wrong. It just appears that when the format was discussed no one noticed (me included) that the two keys are not clearly described. And it looks like there also wasn't a plan to handle the unclear duality in the future.

For me deprecating sounds fine, the 3rd option. I understand we would only make amdgpu emit both sets of keys and then remove drm-memory- in due time.

With regards to key naming, yeah, memory in the name would have been nice. We had a lot of discussion on this topic but the ship has indeed sailed. It is probably workable for anything new that might come to add its own prefix. As long as it does not clash with the memory categories it should be fine.

In terms of resident semantics, think of it as VIRT vs RES in top(1). It is for drivers which allocate backing store lazily, on first use.

Purgeable is for drivers which have a form of MADV_DONTNEED, i.e. buffers which currently have backing store but which userspace has indicated can be dropped without preserving the content on memory pressure.

Active is when the reservation object says there is activity on the buffer.

Regards,

Tvrtko

Alex

Uapi that's "either do $foo or on this one driver, do $bar" is just guaranteed to fragment the ecosystem, so imo that should be the absolute last resort.
-Sima

+
 - drm-shared-: [KiB|MiB]

 The total size of buffers that are shared with another file (e.g., have more
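While both spellings exist, a userspace tool has to accept either key. The sketch below shows one way a parser could treat them as aliases. It is illustrative only: parse_total_memory() is a hypothetical helper, the sample lines in the note after the code are made up for the example, and the explicit skip of a "drm-total-cycles-" key is an assumption based on the engine key proposed in the xe fdinfo thread, which would otherwise collide with the "drm-total-" prefix.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one fdinfo line of the form "<key><region>: <value> [KiB|MiB]",
 * where <key> is "drm-total-" or the older "drm-memory-" spelling.
 * On success, stores the region name and the size in bytes.
 */
static bool parse_total_memory(const char *line, char *region,
			       size_t region_len, uint64_t *bytes)
{
	static const char *const prefixes[] = { "drm-total-", "drm-memory-" };
	unsigned long long value;
	char unit[4] = "";
	const char *p = NULL, *colon;
	size_t i, len;

	/* Assumption: a proposed engine key shares the "drm-total-" prefix. */
	if (!strncmp(line, "drm-total-cycles-", strlen("drm-total-cycles-")))
		return false;

	for (i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++) {
		len = strlen(prefixes[i]);
		if (!strncmp(line, prefixes[i], len)) {
			p = line + len;
			break;
		}
	}
	if (!p)
		return false;

	colon = strchr(p, ':');
	if (!colon || colon == p || (size_t)(colon - p) >= region_len)
		return false;

	memcpy(region, p, (size_t)(colon - p));
	region[colon - p] = '\0';

	/* The unit is optional; sscanf returns 1 when only a number is there. */
	if (sscanf(colon + 1, "%llu %3s", &value, unit) < 1)
		return false;

	if (!strcmp(unit, "KiB"))
		value *= 1024ULL;
	else if (!strcmp(unit, "MiB"))
		value *= 1024ULL * 1024ULL;
	else if (unit[0] != '\0')
		return false;

	*bytes = value;
	return true;
}
```

With this helper, "drm-memory-vram: 2048 KiB" and "drm-total-vram: 2048 KiB" produce the same region and byte count, so a tool keeps working while a driver emits both sets of keys during a deprecation period.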
Re: [PATCH] Documentation/gpu: Document the situation with unqualified drm-memory-
On 03/05/2024 14:39, Alex Deucher wrote: On Fri, May 3, 2024 at 8:58 AM Tvrtko Ursulin wrote: [And I forgot dri-devel.. doing well!] On 03/05/2024 13:40, Tvrtko Ursulin wrote: [Correcting Christian's email] On 03/05/2024 13:36, Tvrtko Ursulin wrote: From: Tvrtko Ursulin Currently it is not well defined what is drm-memory- compared to other categories. In practice the only driver which emits these keys is amdgpu and in them exposes the total memory use (including shared). Document that drm-memory- and drm-total-memory- are aliases to prevent any confusion in the future. While at it also clarify that the reserved sub-string 'memory' refers to the memory region component. Signed-off-by: Tvrtko Ursulin Cc: Alex Deucher Cc: Christian König Mea culpa, I copied the mistake from 77d17c4cd0bf52eacfad88e63e8932eb45d643c5. :) I'm not following. What is the mistake from that commit? Just the spelling of Christian's last name in the email address, nothing in the code itself. I failed to spot both that when copying the email for git commit, and also failed to cc dri-devel so I am having a bad day. Regards, Tvrtko Regards, Tvrtko Cc: Rob Clark --- Documentation/gpu/drm-usage-stats.rst | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst index 6dc299343b48..ef5c0a0aa477 100644 --- a/Documentation/gpu/drm-usage-stats.rst +++ b/Documentation/gpu/drm-usage-stats.rst @@ -128,7 +128,9 @@ Memory Each possible memory type which can be used to store buffer objects by the GPU in question shall be given a stable and unique name to be returned as the -string here. The name "memory" is reserved to refer to normal system memory. +string here. + +The region name "memory" is reserved to refer to normal system memory. Is this supposed to mean drm-memory-memory? That was my impression, but that seems sort of weird. Maybe we should just drop that sentence. 
Alex Value shall reflect the amount of storage currently consumed by the buffer objects belong to this client, in the respective memory region. @@ -136,6 +138,9 @@ objects belong to this client, in the respective memory region. Default unit shall be bytes with optional unit specifiers of 'KiB' or 'MiB' indicating kibi- or mebi-bytes. +This is an alias for drm-total- and only one of the two should be +present. + - drm-shared-: [KiB|MiB] The total size of buffers that are shared with another file (e.g., have more @@ -145,6 +150,9 @@ than a single handle). The total size of buffers that including shared and private memory. +This is an alias for drm-memory- and only one of the two should be +present. + - drm-resident-: [KiB|MiB] The total size of buffers that are resident in the specified region.
Re: [PATCH] Documentation/gpu: Document the situation with unqualified drm-memory-
[And I forgot dri-devel.. doing well!] On 03/05/2024 13:40, Tvrtko Ursulin wrote: [Correcting Christian's email] On 03/05/2024 13:36, Tvrtko Ursulin wrote: From: Tvrtko Ursulin Currently it is not well defined what is drm-memory- compared to other categories. In practice the only driver which emits these keys is amdgpu and in them exposes the total memory use (including shared). Document that drm-memory- and drm-total-memory- are aliases to prevent any confusion in the future. While at it also clarify that the reserved sub-string 'memory' refers to the memory region component. Signed-off-by: Tvrtko Ursulin Cc: Alex Deucher Cc: Christian König Mea culpa, I copied the mistake from 77d17c4cd0bf52eacfad88e63e8932eb45d643c5. :) Regards, Tvrtko Cc: Rob Clark --- Documentation/gpu/drm-usage-stats.rst | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst index 6dc299343b48..ef5c0a0aa477 100644 --- a/Documentation/gpu/drm-usage-stats.rst +++ b/Documentation/gpu/drm-usage-stats.rst @@ -128,7 +128,9 @@ Memory Each possible memory type which can be used to store buffer objects by the GPU in question shall be given a stable and unique name to be returned as the -string here. The name "memory" is reserved to refer to normal system memory. +string here. + +The region name "memory" is reserved to refer to normal system memory. Value shall reflect the amount of storage currently consumed by the buffer objects belong to this client, in the respective memory region. @@ -136,6 +138,9 @@ objects belong to this client, in the respective memory region. Default unit shall be bytes with optional unit specifiers of 'KiB' or 'MiB' indicating kibi- or mebi-bytes. +This is an alias for drm-total- and only one of the two should be +present. + - drm-shared-: [KiB|MiB] The total size of buffers that are shared with another file (e.g., have more @@ -145,6 +150,9 @@ than a single handle). 
The total size of buffers that including shared and private memory. +This is an alias for drm-memory- and only one of the two should be +present. + - drm-resident-: [KiB|MiB] The total size of buffers that are resident in the specified region.
drm scheduler and wq flavours
Hi all,

Continuing after the brief IRC discussion yesterday regarding work queues being prone to deadlocks or not, I had a browse around the code base and ended up a bit confused.

When drm_sched_init documents and allocates an *ordered* wq, if no custom one was provided, could someone remind me whether the ordered property was fundamental for something to work correctly? Like run_job vs free_job ordering?

I ask because it appears different drivers do different things and at the moment it looks like we have all possible combos of ordered/unordered, bound and unbound, shared or not shared with the timeout wq, or even unbound for the timeout wq.

The drivers worth looking at in this respect are probably nouveau, panthor, pvr and xe.

Nouveau also talks about a dependency between run_job and free_job and goes on to create two unordered wqs.

Then xe looks a bit funky with the workaround/hack for lockdep where it creates 512 work queues and hands them over to user queues in round-robin fashion. (Instead of default 1:1.) Which I suspect is a problem which should be applicable for any 1:1 driver given a thorough enough test suite.

So anyway.. ordered vs unordered - drm sched dictated or at driver's choice?

Regards,

Tvrtko
Re: [PATCH] drm/sysfs: Add drm class-wide attribute to get active device clients
Hi,

On 24/04/2024 15:48, Adrián Larumbe wrote:

Hi Tvrtko,

On 15.04.2024 13:50, Tvrtko Ursulin wrote:

On 05/04/2024 18:59, Rob Clark wrote:

On Wed, Apr 3, 2024 at 11:37 AM Adrián Larumbe wrote:

Up to this day, all fdinfo-based GPU profilers must traverse the entire
/proc directory structure to find open DRM clients with fdinfo file
descriptors. This is inefficient and time-consuming.

This patch adds a new device class attribute that will install a sysfs
file per DRM device, which can be queried by profilers to get a list of
PIDs for their open clients. This file isn't human-readable, and it's
meant to be queried only by GPU profilers like gputop and nvtop.

Cc: Boris Brezillon
Cc: Tvrtko Ursulin
Cc: Christopher Healy
Signed-off-by: Adrián Larumbe

It does seem like a good idea.. idk if there is some precedent to prefer
binary vs ascii in sysfs, but having a way to avoid walking _all_
processes is a good idea.

I naturally second that it is a needed feature, but I do not think
binary format is justified. AFAIR it should be used for things like
hw/fw standardised tables or firmware images, not when exporting a
simple list of PIDs. It also precludes easy shell/script access and the
benefit of avoiding parsing a short list is I suspect completely dwarfed
by needing to parse all the related fdinfo etc.

I'd rather keep it as a binary file for the sake of easily parsing the
number list on the client side, in gputop or nvtop. For textual access,
there's already a debugfs file that presents the same information, so I
thought it was best not to duplicate that functionality and restrict
sysfs to serving the very specific use case of UM profilers having to
access the DRM client list.

I should mention I did something controversial here, which is a
semantically binary attribute through the regular attribute interface.
I guess if I keep it as a binary attribute in the end, I should switch
over to the binary attribute API.

Another reason why I implemented it as a binary file is that we can only
send back at most a whole page. If a PID takes 4 bytes, that's usually
1024 clients at most, which is probably enough for any UM profiler, but
will decrease even more if we turn it into an ASCII readable file.

I'm afraid I still think there is no reason for a binary file, even less
so artificially limited to 1024 clients. Any consumer will have to parse
text fdinfo so a binary list of pids is not adding any real cost.

I did some research into sysfs binary attributes, and while some sources
mention that it's often used for dumping or loading of driver FW, none
of them claim it cannot be used for other purposes.

---
 drivers/gpu/drm/drm_internal.h       |  2 +-
 drivers/gpu/drm/drm_privacy_screen.c |  2 +-
 drivers/gpu/drm/drm_sysfs.c          | 89 ++--
 3 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/drm_internal.h b/drivers/gpu/drm/drm_internal.h
index 2215baef9a3e..9a399b03d11c 100644
--- a/drivers/gpu/drm/drm_internal.h
+++ b/drivers/gpu/drm/drm_internal.h
@@ -145,7 +145,7 @@ bool drm_master_internal_acquire(struct drm_device *dev);
 void drm_master_internal_release(struct drm_device *dev);
 
 /* drm_sysfs.c */
-extern struct class *drm_class;
+extern struct class drm_class;
 
 int drm_sysfs_init(void);
 void drm_sysfs_destroy(void);
diff --git a/drivers/gpu/drm/drm_privacy_screen.c b/drivers/gpu/drm/drm_privacy_screen.c
index 6cc39e30781f..2fbd24ba5818 100644
--- a/drivers/gpu/drm/drm_privacy_screen.c
+++ b/drivers/gpu/drm/drm_privacy_screen.c
@@ -401,7 +401,7 @@ struct drm_privacy_screen *drm_privacy_screen_register(
 	mutex_init(&priv->lock);
 	BLOCKING_INIT_NOTIFIER_HEAD(&priv->notifier_head);
 
-	priv->dev.class = drm_class;
+	priv->dev.class = &drm_class;
 	priv->dev.type = &drm_privacy_screen_type;
 	priv->dev.parent = parent;
 	priv->dev.release = drm_privacy_screen_device_release;
diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index a953f69a34b6..56ca9e22c720 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -58,8 +58,6 @@ static struct device_type drm_sysfs_device_connector = {
 	.name = "drm_connector",
 };
 
-struct class *drm_class;
-
 #ifdef CONFIG_ACPI
 static bool drm_connector_acpi_bus_match(struct device *dev)
 {
@@ -128,6 +126,62 @@ static const struct component_ops typec_connector_ops = {
 
 static CLASS_ATTR_STRING(version, S_IRUGO, "drm 1.1.0 20060810");
 
+static ssize_t clients_show(struct device *cd, struct device_attribute *attr, char *buf)
+{
+	struct drm_minor *minor = cd->driver_data;
+	struct drm_device *ddev = minor->dev;
+	struct drm_file *priv;
+	ssize_t offset = 0;
+	void *pid_buf;
+
+	if (minor->type != DRM_MINOR_RENDER)
+		return 0;

Why this?

I return nothing in case of a non-render node be
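To put the page-size argument from this thread into numbers, here is a minimal sketch comparing how many PIDs fit in one sysfs page as raw 4-byte values versus newline-separated decimal text. The 4096-byte page, the 7-digit worst case (a pid_max of 4194304) and both function names are assumptions for illustration; none of this is taken from the patch.

```c
#include <stddef.h>

/* Assumed sysfs buffer size: one 4 KiB page (PAGE_SIZE varies by arch). */
#define SYSFS_PAGE_BYTES 4096u

/* Raw binary list: each PID is stored as a fixed-width 32-bit value. */
static size_t max_pids_binary(size_t page_bytes)
{
	return page_bytes / 4;
}

/*
 * Text list: each PID is printed in decimal followed by '\n'.
 * max_digits is the assumed worst-case digit count (7 for pid_max 4194304).
 */
static size_t max_pids_text(size_t page_bytes, size_t max_digits)
{
	return page_bytes / (max_digits + 1);
}
```

Under these assumptions the binary form holds 1024 PIDs and the text form at least 512 in the worst case, so the text format costs capacity but remains in the same order of magnitude.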
Re: [PATCH v4 8/8] drm/v3d: Add modparam for turning off Big/Super Pages
On 28/04/2024 13:40, Maíra Canal wrote:

Add a modparam for turning off Big/Super Pages to make sure that if a
user doesn't want Big/Super Pages enabled, they can disable them by
setting the modparam to false.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_drv.c   | 7 +++
 drivers/gpu/drm/v3d/v3d_gemfs.c | 5 +
 2 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index 28b7ddce7747..1a6e01235df6 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -36,6 +36,13 @@
 #define DRIVER_MINOR 0
 #define DRIVER_PATCHLEVEL 0
 
+/* Only expose the `super_pages` modparam if THP is enabled. */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+bool super_pages = true;
+module_param_named(super_pages, super_pages, bool, 0400);
+MODULE_PARM_DESC(super_pages, "Enable/Disable Super Pages support.");
+#endif
+
 static int v3d_get_param_ioctl(struct drm_device *dev, void *data,
 			       struct drm_file *file_priv)
 {
diff --git a/drivers/gpu/drm/v3d/v3d_gemfs.c b/drivers/gpu/drm/v3d/v3d_gemfs.c
index 31cf5bd11e39..0ade02bb7209 100644
--- a/drivers/gpu/drm/v3d/v3d_gemfs.c
+++ b/drivers/gpu/drm/v3d/v3d_gemfs.c
@@ -11,6 +11,7 @@ void v3d_gemfs_init(struct v3d_dev *v3d)
 	char huge_opt[] = "huge=within_size";
 	struct file_system_type *type;
 	struct vfsmount *gemfs;
+	extern bool super_pages;
 
 	/*
 	 * By creating our own shmemfs mountpoint, we can pass in
@@ -20,6 +21,10 @@ void v3d_gemfs_init(struct v3d_dev *v3d)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		goto err;
 
+	/* The user doesn't want to enable Super Pages */
+	if (!super_pages)
+		goto err;
+
 	type = get_fs_type("tmpfs");
 	if (!type)
 		goto err;

Reviewed-by: Tvrtko Ursulin

Regards,

Tvrtko
Re: [PATCH v2 3/6] drm/xe: Add helper to accumulate exec queue runtime
On 26/04/2024 19:59, Umesh Nerlige Ramappa wrote:

On Fri, Apr 26, 2024 at 11:49:32AM +0100, Tvrtko Ursulin wrote:

On 24/04/2024 00:56, Lucas De Marchi wrote:

From: Umesh Nerlige Ramappa

Add a helper to accumulate per-client runtime of all its exec queues.
Currently that is done in 2 places:

1. when the exec_queue is destroyed
2. when the sched job is completed

Signed-off-by: Umesh Nerlige Ramappa
Signed-off-by: Lucas De Marchi
---
 drivers/gpu/drm/xe/xe_device_types.h | 9 +++
 drivers/gpu/drm/xe/xe_exec_queue.c   | 37
 drivers/gpu/drm/xe/xe_exec_queue.h   | 1 +
 drivers/gpu/drm/xe/xe_sched_job.c    | 2 ++
 4 files changed, 49 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 2e62450d86e1..33d3bf93a2f1 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -547,6 +547,15 @@ struct xe_file {
 		struct mutex lock;
 	} exec_queue;
 
+	/**
+	 * @runtime: hw engine class runtime in ticks for this drm client
+	 *
+	 * Only stats from xe_exec_queue->lrc[0] are accumulated. For multi-lrc
+	 * case, since all jobs run in parallel on the engines, only the stats
+	 * from lrc[0] are sufficient.

Out of curiosity, doesn't this mean multi-lrc jobs will be incorrectly
accounted for? (When capacity is considered.)

TBH, I am not sure what the user would like to see here for multi-lrc.
If reporting the capacity, then we may need to use width as a
multiplication factor for multi-lrc. How was this done in i915?

IMO the user has to see the real utilisation - so if there are two VCS
and both are busy, 100% should be reported and not 50%. The latter would
be misleading, either with or without cross-checking with physical
utilisation.

In i915 with execlists this works correctly and with GuC you would
probably know the answer better than me.

Regards,

Tvrtko

Regards,
Umesh

Regards,

Tvrtko

+	 */
+	u64 runtime[XE_ENGINE_CLASS_MAX];
+
 	/** @client: drm client */
 	struct xe_drm_client *client;
 };
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 395de93579fa..b7b6256cb96a 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -214,6 +214,8 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
 {
 	int i;
 
+	xe_exec_queue_update_runtime(q);
+
 	for (i = 0; i < q->width; ++i)
 		xe_lrc_finish(q->lrc + i);
 	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & EXEC_QUEUE_FLAG_VM || !q->vm))
@@ -769,6 +771,41 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
 		q->lrc[0].fence_ctx.next_seqno - 1;
 }
 
+/**
+ * xe_exec_queue_update_runtime() - Update runtime for this exec queue from hw
+ * @q: The exec queue
+ *
+ * Update the timestamp saved by HW for this exec queue and save runtime
+ * calculated by using the delta from last update. On multi-lrc case, only the
+ * first is considered.
+ */
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
+{
+	struct xe_file *xef;
+	struct xe_lrc *lrc;
+	u32 old_ts, new_ts;
+
+	/*
+	 * Jobs that are run during driver load may use an exec_queue, but are
+	 * not associated with a user xe file, so avoid accumulating busyness
+	 * for kernel specific work.
+	 */
+	if (!q->vm || !q->vm->xef)
+		return;
+
+	xef = q->vm->xef;
+	lrc = &q->lrc[0];
+
+	new_ts = xe_lrc_update_timestamp(lrc, &old_ts);
+
+	/*
+	 * Special case the very first timestamp: we don't want the
+	 * initial delta to be a huge value
+	 */
+	if (old_ts)
+		xef->runtime[q->class] += new_ts - old_ts;
+}
+
 void xe_exec_queue_kill(struct xe_exec_queue *q)
 {
 	struct xe_exec_queue *eq = q, *next;
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
index 02ce8d204622..45b72daa2db3 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue.h
@@ -66,5 +66,6 @@ struct dma_fence *xe_exec_queue_last_fence_get(struct xe_exec_queue *e,
 					       struct xe_vm *vm);
 void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct xe_vm *vm,
 				  struct dma_fence *fence);
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
index cd8a2fba5438..6a081a4fa190 100644
--- a/drivers/gpu/drm/xe/xe_sched_job.c
+++ b/drivers/gpu/drm/xe/xe_sched_job.c
@@ -242,6 +242,8 @@ bool xe_sched_job_completed(struct xe_sched_job *job)
 {
 	struct xe_lrc *lrc = job->q->lrc;
 
+	xe_exec_queue_update_runtime(job->q);
+
 	/*
 	 * Can safely check just LRC[0] seqno as that is last seqno written when
 	 * parallel handshake is done.
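The delta accumulation in the quoted helper relies on 32-bit unsigned subtraction staying correct across a single counter wrap. A minimal userspace sketch of that arithmetic follows; `struct client_runtime` and `accumulate_runtime()` are hypothetical stand-ins for `xef->runtime[class]` and the body of the helper, not driver code.

```c
#include <stdint.h>

/* Hypothetical per-client accumulator, standing in for xef->runtime[class]. */
struct client_runtime {
	uint64_t ticks;
};

/*
 * Add the ticks elapsed between two 32-bit timestamp samples.
 * Unsigned subtraction is modulo 2^32, so the delta is still correct if the
 * counter wrapped once between old_ts and new_ts.
 */
static void accumulate_runtime(struct client_runtime *rt,
			       uint32_t old_ts, uint32_t new_ts)
{
	/* Mirror the patch: skip the very first sample, where old_ts is 0. */
	if (old_ts)
		rt->ticks += (uint32_t)(new_ts - old_ts);
}
```

Note that the scheme only tolerates one wrap between updates; if the counter wraps twice before the next sample, the lost ticks cannot be recovered from the two timestamps alone.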
Re: [PATCH] MAINTAINERS: Move the drm-intel repo location to fd.o GitLab
On 26/04/2024 16:47, Lucas De Marchi wrote:

On Wed, Apr 24, 2024 at 01:41:59PM GMT, Ryszard Knop wrote:

The drm-intel repo is moving from the classic fd.o git host to GitLab.
Update its location with a URL matching other fd.o GitLab kernel trees.

Signed-off-by: Ryszard Knop

Acked-by: Lucas De Marchi

Also Cc'ing maintainers

Thanks,

Acked-by: Tvrtko Ursulin

Regards,

Tvrtko

---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index d6327dc12cb1..fbf7371a0bb0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10854,7 +10854,7 @@ W: https://drm.pages.freedesktop.org/intel-docs/
 Q: http://patchwork.freedesktop.org/project/intel-gfx/
 B: https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html
 C: irc://irc.oftc.net/intel-gfx
-T: git git://anongit.freedesktop.org/drm-intel
+T: git https://gitlab.freedesktop.org/drm/i915/kernel.git
 F: Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon
 F: Documentation/gpu/i915.rst
 F: drivers/gpu/drm/ci/xfails/i915*
--
2.44.0
Re: [PATCH v2 3/6] drm/xe: Add helper to accumulate exec queue runtime
On 24/04/2024 00:56, Lucas De Marchi wrote:

From: Umesh Nerlige Ramappa

Add a helper to accumulate per-client runtime of all its exec queues.
Currently that is done in 2 places:

1. when the exec_queue is destroyed
2. when the sched job is completed

Signed-off-by: Umesh Nerlige Ramappa
Signed-off-by: Lucas De Marchi
---
 drivers/gpu/drm/xe/xe_device_types.h | 9 +++
 drivers/gpu/drm/xe/xe_exec_queue.c   | 37
 drivers/gpu/drm/xe/xe_exec_queue.h   | 1 +
 drivers/gpu/drm/xe/xe_sched_job.c    | 2 ++
 4 files changed, 49 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 2e62450d86e1..33d3bf93a2f1 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -547,6 +547,15 @@ struct xe_file {
 		struct mutex lock;
 	} exec_queue;
 
+	/**
+	 * @runtime: hw engine class runtime in ticks for this drm client
+	 *
+	 * Only stats from xe_exec_queue->lrc[0] are accumulated. For multi-lrc
+	 * case, since all jobs run in parallel on the engines, only the stats
+	 * from lrc[0] are sufficient.

Out of curiosity, doesn't this mean multi-lrc jobs will be incorrectly
accounted for? (When capacity is considered.)

Regards,

Tvrtko

+	 */
+	u64 runtime[XE_ENGINE_CLASS_MAX];
+
 	/** @client: drm client */
 	struct xe_drm_client *client;
 };
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 395de93579fa..b7b6256cb96a 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -214,6 +214,8 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
 {
 	int i;
 
+	xe_exec_queue_update_runtime(q);
+
 	for (i = 0; i < q->width; ++i)
 		xe_lrc_finish(q->lrc + i);
 	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & EXEC_QUEUE_FLAG_VM || !q->vm))
@@ -769,6 +771,41 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
 		q->lrc[0].fence_ctx.next_seqno - 1;
 }
 
+/**
+ * xe_exec_queue_update_runtime() - Update runtime for this exec queue from hw
+ * @q: The exec queue
+ *
+ * Update the timestamp saved by HW for this exec queue and save runtime
+ * calculated by using the delta from last update. On multi-lrc case, only the
+ * first is considered.
+ */
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
+{
+	struct xe_file *xef;
+	struct xe_lrc *lrc;
+	u32 old_ts, new_ts;
+
+	/*
+	 * Jobs that are run during driver load may use an exec_queue, but are
+	 * not associated with a user xe file, so avoid accumulating busyness
+	 * for kernel specific work.
+	 */
+	if (!q->vm || !q->vm->xef)
+		return;
+
+	xef = q->vm->xef;
+	lrc = &q->lrc[0];
+
+	new_ts = xe_lrc_update_timestamp(lrc, &old_ts);
+
+	/*
+	 * Special case the very first timestamp: we don't want the
+	 * initial delta to be a huge value
+	 */
+	if (old_ts)
+		xef->runtime[q->class] += new_ts - old_ts;
+}
+
 void xe_exec_queue_kill(struct xe_exec_queue *q)
 {
 	struct xe_exec_queue *eq = q, *next;
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
index 02ce8d204622..45b72daa2db3 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue.h
@@ -66,5 +66,6 @@ struct dma_fence *xe_exec_queue_last_fence_get(struct xe_exec_queue *e,
 					       struct xe_vm *vm);
 void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct xe_vm *vm,
 				  struct dma_fence *fence);
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_sched_job.c b/drivers/gpu/drm/xe/xe_sched_job.c
index cd8a2fba5438..6a081a4fa190 100644
--- a/drivers/gpu/drm/xe/xe_sched_job.c
+++ b/drivers/gpu/drm/xe/xe_sched_job.c
@@ -242,6 +242,8 @@ bool xe_sched_job_completed(struct xe_sched_job *job)
 {
 	struct xe_lrc *lrc = job->q->lrc;
 
+	xe_exec_queue_update_runtime(job->q);
+
 	/*
 	 * Can safely check just LRC[0] seqno as that is last seqno written when
 	 * parallel handshake is done.
Re: [PATCH v2 6/6] drm/xe/client: Print runtime to fdinfo
On 24/04/2024 00:56, Lucas De Marchi wrote:

Print the accumulated runtime for the client when printing fdinfo. Each
time a query is done it first does 2 things:

1) loop through all the exec queues for the current client and
   accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
   that, being read from the context image.

2) Read a "GPU timestamp" that can be used for considering "how much GPU
   time has passed" and that has the same unit/refclock as the one
   recording the runtime. RING_TIMESTAMP is used for that via MMIO.

Since for all current platforms RING_TIMESTAMP follows the same
refclock, just read it once, using any first engine.

This is exported to userspace as 2 numbers in fdinfo:

    drm-cycles-:
    drm-total-cycles-:

Userspace is expected to collect at least 2 samples, which allows it to
know the client engine busyness as per:

                RUNTIME1 - RUNTIME0
    busyness = ---------------------
                     T1 - T0

Another thing to point out is that it's expected that userspace will
read any 2 samples every few seconds. Given the update frequency of the
counters involved and that CTX_TIMESTAMP is 32 bits wide, the counter
for each exec_queue can wrap around (assuming 100% utilization) after
~200s. The wraparound is not perceived by userspace (since it's just
accumulated for all the exec_queues in a 64-bit counter), but the
measurement will not be accurate if the samples are too far apart.

This could be mitigated by adding a workqueue to accumulate the counters
every so often, but it's additional complexity for something that is
done already by userspace every few seconds in tools like gputop (from
igt), htop, nvtop, etc with none of them really defaulting to 1 sample
per minute or more.
Signed-off-by: Lucas De Marchi
---
 Documentation/gpu/drm-usage-stats.rst       | 16 ++-
 Documentation/gpu/xe/index.rst              | 1 +
 Documentation/gpu/xe/xe-drm-usage-stats.rst | 10 ++
 drivers/gpu/drm/xe/xe_drm_client.c          | 138 +++-
 4 files changed, 162 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst

diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst
index 6dc299343b48..421766289b78 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -112,6 +112,17 @@ larger value within a reasonable period. Upon observing a value lower than what
 was previously read, userspace is expected to stay with that larger previous
 value until a monotonic update is seen.
 
+- drm-total-cycles-:
+
+Engine identifier string must be the same as the one specified in the
+drm-cycles- tag and shall contain the total number cycles for the given
+engine.
+
+This is a timestamp in GPU unspecified unit that matches the update rate
+of drm-cycles-. For drivers that implement this interface, the engine
+utilization can be calculated entirely on the GPU clock domain, without
+considering the CPU sleep time between 2 samples.

Two opens.

1) Do we need to explicitly document that drm-total-cycles and
drm-maxfreq are mutually exclusive?

2) Should drm-total-cycles for now be documented as driver specific?

I have added some more people in the cc who were involved with driver
fdinfo implementations, in case they have an opinion.

I would say potentially yes, and promote it to common if more than one
driver would use it.

For instance I see panfrost has the driver specific drm-curfreq
(although it isn't documenting it fully in panfrost.rst). And I have to
say it is somewhat questionable to expose the current frequency per
fdinfo per engine, but that is not my call.

+
 - drm-maxfreq-: [Hz|MHz|KHz]
 
 Engine identifier string must be the same as the one specified in the
@@ -168,5 +179,6 @@ be documented above and where possible, aligned with other drivers.
 Driver specific implementations
 ---
 
-:ref:`i915-usage-stats`
-:ref:`panfrost-usage-stats`
+* :ref:`i915-usage-stats`
+* :ref:`panfrost-usage-stats`
+* :ref:`xe-usage-stats`
diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index c224ecaee81e..3f07aa3b5432 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver.
    xe_firmware
    xe_tile
    xe_debugging
+   xe-drm-usage-stats.rst
diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst b/Documentation/gpu/xe/xe-drm-usage-stats.rst
new file mode 100644
index ..ccb48733cbe3
--- /dev/null
+++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+.. _xe-usage-stats:
+
+===
+Xe DRM client usage stats implemenation
+===
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c
+   :doc: DRM Client usage stats
diff --git
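The two-sample busyness calculation described in the quoted commit message can be sketched in userspace roughly as follows. The struct and function names are illustrative assumptions; parsing the fdinfo keys themselves is left out, and this is not how any particular tool implements it.

```c
#include <stdint.h>

/* One fdinfo sample; field names are illustrative, not from the patch. */
struct cycles_sample {
	uint64_t cycles;        /* value read from the drm-cycles- key */
	uint64_t total_cycles;  /* value read from the drm-total-cycles- key */
};

/*
 * Fraction of elapsed GPU time during which the client was running on the
 * engine class, computed from two samples taken in the same clock domain.
 * Returns 0.0 when no GPU time has elapsed, to avoid dividing by zero.
 */
static double engine_busyness(const struct cycles_sample *s0,
			      const struct cycles_sample *s1)
{
	uint64_t busy = s1->cycles - s0->cycles;
	uint64_t elapsed = s1->total_cycles - s0->total_cycles;

	if (elapsed == 0)
		return 0.0;
	return (double)busy / (double)elapsed;
}
```

Because both counters advance on the GPU reference clock, the result does not depend on how long the sampling process slept on the CPU between the two reads.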
Re: [PATCH v3 8/8] drm/v3d: Add modparam for turning off Big/Super Pages
On 21/04/2024 22:44, Maíra Canal wrote:

Add a modparam for turning off Big/Super Pages to make sure that if a
user doesn't want Big/Super Pages enabled, they can disable them by
setting the modparam to false.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_drv.c   | 8
 drivers/gpu/drm/v3d/v3d_gemfs.c | 5 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index 3debf37e7d9b..bc8c8905112a 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -36,6 +36,14 @@
 #define DRIVER_MINOR 0
 #define DRIVER_PATCHLEVEL 0
 
+bool super_pages = true;
+
+/* Only expose the `super_pages` modparam if THP is enabled. */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE

I would have put bool super_pages in here so it can get compiled out.

+module_param_named(super_pages, super_pages, bool, 0400);
+MODULE_PARM_DESC(super_pages, "Enable/Disable Super Pages support.");
+#endif
+
 static int v3d_get_param_ioctl(struct drm_device *dev, void *data,
 			       struct drm_file *file_priv)
 {
diff --git a/drivers/gpu/drm/v3d/v3d_gemfs.c b/drivers/gpu/drm/v3d/v3d_gemfs.c
index 31cf5bd11e39..5fa08263cff2 100644
--- a/drivers/gpu/drm/v3d/v3d_gemfs.c
+++ b/drivers/gpu/drm/v3d/v3d_gemfs.c
@@ -11,6 +11,11 @@ void v3d_gemfs_init(struct v3d_dev *v3d)
 	char huge_opt[] = "huge=within_size";
 	struct file_system_type *type;
 	struct vfsmount *gemfs;
+	extern bool super_pages;
+
+	/* The user doesn't want to enable Super Pages */
+	if (!super_pages)
+		goto err;

And if this hunk is moved after the CONFIG_TRANSPARENT_HUGEPAGE check
just below, I hope the compiler can be happy with that.

Regards,

Tvrtko

 	/*
 	 * By creating our own shmemfs mountpoint, we can pass in
Re: [PATCH v3 7/8] drm/v3d: Use gemfs/THP in BO creation if available
On 21/04/2024 22:44, Maíra Canal wrote:

Although Big/Super Pages could appear naturally, it would be quite hard
to have 1MB or 64KB allocated contiguously naturally. Therefore, we can
force the creation of large pages allocated contiguously by using a
mountpoint with "huge=within_size" enabled.

As V3D has a mountpoint with "huge=within_size" (if user has THP
enabled), use this mountpoint for BO creation if available. This will
allow us to create large pages allocated contiguously and make use of
Big/Super Pages.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_bo.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_bo.c b/drivers/gpu/drm/v3d/v3d_bo.c
index 79e31c5299b1..16ac26c31c6b 100644
--- a/drivers/gpu/drm/v3d/v3d_bo.c
+++ b/drivers/gpu/drm/v3d/v3d_bo.c
@@ -94,6 +94,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
 	struct v3d_dev *v3d = to_v3d_dev(obj->dev);
 	struct v3d_bo *bo = to_v3d_bo(obj);
 	struct sg_table *sgt;
+	u64 align;
 	int ret;
 
 	/* So far we pin the BO in the MMU for its lifetime, so use
@@ -103,6 +104,15 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
 	if (IS_ERR(sgt))
 		return PTR_ERR(sgt);
 
+	if (!v3d->gemfs)
+		align = SZ_4K;
+	else if (obj->size >= SZ_1M)
+		align = SZ_1M;
+	else if (obj->size >= SZ_64K)
+		align = SZ_64K;
+	else
+		align = SZ_4K;

V3d has one GPU address space, right? I wonder if one day fragmentation
could become an issue but it's a problem for another day.

Patch looks fine to me.

Reviewed-by: Tvrtko Ursulin

Regards,

Tvrtko

+
 	spin_lock(&v3d->mm_lock);
 	/* Allocate the object's space in the GPU's page tables.
 	 * Inserting PTEs will happen later, but the offset is for the
@@ -110,7 +120,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
 	 */
 	ret = drm_mm_insert_node_generic(&v3d->mm, &bo->node,
 					 obj->size >> V3D_MMU_PAGE_SHIFT,
-					 SZ_4K >> V3D_MMU_PAGE_SHIFT, 0, 0);
+					 align >> V3D_MMU_PAGE_SHIFT, 0, 0);
 	spin_unlock(&v3d->mm_lock);
 	if (ret)
 		return ret;
@@ -130,10 +140,17 @@ struct v3d_bo *v3d_bo_create(struct drm_device *dev, struct drm_file *file_priv,
 			     size_t unaligned_size)
 {
 	struct drm_gem_shmem_object *shmem_obj;
+	struct v3d_dev *v3d = to_v3d_dev(dev);
 	struct v3d_bo *bo;
 	int ret;
 
-	shmem_obj = drm_gem_shmem_create(dev, unaligned_size);
+	/* Let the user opt out of allocating the BOs with THP */
+	if (v3d->gemfs)
+		shmem_obj = drm_gem_shmem_create_with_mnt(dev, unaligned_size,
+							  v3d->gemfs);
+	else
+		shmem_obj = drm_gem_shmem_create(dev, unaligned_size);
+
 	if (IS_ERR(shmem_obj))
 		return ERR_CAST(shmem_obj);
 	bo = to_v3d_bo(&shmem_obj->base);
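The alignment choice made in the quoted v3d_bo_create_finish() hunk can be isolated as a small pure function, which makes the thresholds easy to check. This is only a userspace sketch: `bo_alignment()` and the `has_gemfs` flag are hypothetical names standing in for the `v3d->gemfs` test in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define SZ_4K  0x1000u
#define SZ_64K 0x10000u
#define SZ_1M  0x100000u

/*
 * Pick the GPU address alignment for a buffer object of obj_size bytes.
 * has_gemfs stands in for "v3d->gemfs != NULL" in the quoted hunk: without
 * the THP mountpoint there is no point in over-aligning.
 */
static size_t bo_alignment(bool has_gemfs, size_t obj_size)
{
	if (!has_gemfs)
		return SZ_4K;
	if (obj_size >= SZ_1M)
		return SZ_1M;
	if (obj_size >= SZ_64K)
		return SZ_64K;
	return SZ_4K;
}
```

Aligning large objects to 1 MiB or 64 KiB in the GPU address space is what later allows their PTEs to be marked as super or big pages; the cost is the potential address-space fragmentation raised in the review comment.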
Re: [PATCH v3 6/8] drm/v3d: Support Big/Super Pages when writing out PTEs
On 22/04/2024 10:57, Tvrtko Ursulin wrote:

On 21/04/2024 22:44, Maíra Canal wrote:

The V3D MMU also supports 64KB and 1MB pages, called big and super
pages, respectively. In order to set a 64KB page or 1MB page in the MMU,
we need to make sure that page table entries for all 4KB pages within a
big/super page are correctly configured.

In order to create a big/super page, we need a contiguous memory region.
That's why we use a separate mountpoint with THP enabled. In order to
place the page table entries in the MMU, we iterate over the 16 4KB
pages (for big pages) or 256 4KB pages (for super pages) and insert the
PTE.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_drv.h | 1 +
 drivers/gpu/drm/v3d/v3d_mmu.c | 52 ++-
 2 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h
index 17236ee23490..79d8a1a059aa 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -18,6 +18,7 @@ struct platform_device;
 struct reset_control;
 
 #define V3D_MMU_PAGE_SHIFT 12
+#define V3D_PAGE_FACTOR (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT)
 
 #define V3D_MAX_QUEUES (V3D_CPU + 1)
 
diff --git a/drivers/gpu/drm/v3d/v3d_mmu.c b/drivers/gpu/drm/v3d/v3d_mmu.c
index 14f3af40d6f6..2e0b31e373b2 100644
--- a/drivers/gpu/drm/v3d/v3d_mmu.c
+++ b/drivers/gpu/drm/v3d/v3d_mmu.c
@@ -25,9 +25,16 @@
  * superpage bit set.
  */
 #define V3D_PTE_SUPERPAGE BIT(31)
+#define V3D_PTE_BIGPAGE BIT(30)
 #define V3D_PTE_WRITEABLE BIT(29)
 #define V3D_PTE_VALID BIT(28)
 
+static bool v3d_mmu_is_aligned(u32 page, u32 page_address, size_t alignment)
+{
+	return IS_ALIGNED(page, alignment >> V3D_MMU_PAGE_SHIFT) &&
+		IS_ALIGNED(page_address, alignment >> V3D_MMU_PAGE_SHIFT);
+}
+
 static int v3d_mmu_flush_all(struct v3d_dev *v3d)
 {
 	int ret;
@@ -87,19 +94,38 @@ void v3d_mmu_insert_ptes(struct v3d_bo *bo)
 	struct drm_gem_shmem_object *shmem_obj = &bo->base;
 	struct v3d_dev *v3d = to_v3d_dev(shmem_obj->base.dev);
 	u32 page = bo->node.start;
-	u32 page_prot = V3D_PTE_WRITEABLE | V3D_PTE_VALID;
-	struct sg_dma_page_iter dma_iter;
-
-	for_each_sgtable_dma_page(shmem_obj->sgt, &dma_iter, 0) {
-		dma_addr_t dma_addr = sg_page_iter_dma_address(&dma_iter);
-		u32 page_address = dma_addr >> V3D_MMU_PAGE_SHIFT;
-		u32 pte = page_prot | page_address;
-		u32 i;
-
-		BUG_ON(page_address + (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT) >=
-		       BIT(24));
-		for (i = 0; i < PAGE_SIZE >> V3D_MMU_PAGE_SHIFT; i++)
-			v3d->pt[page++] = pte + i;
+	struct scatterlist *sgl;
+	unsigned int count;
+
+	for_each_sgtable_dma_sg(shmem_obj->sgt, sgl, count) {
+		dma_addr_t dma_addr = sg_dma_address(sgl);
+		u32 pfn = dma_addr >> V3D_MMU_PAGE_SHIFT;
+		unsigned int len = sg_dma_len(sgl);
+
+		while (len > 0) {
+			u32 page_prot = V3D_PTE_WRITEABLE | V3D_PTE_VALID;
+			u32 page_address = page_prot | pfn;
+			unsigned int i, page_size;
+
+			BUG_ON(pfn + V3D_PAGE_FACTOR >= BIT(24));
+
+			if (len >= SZ_1M && v3d_mmu_is_aligned(page, page_address, SZ_1M)) {
+				page_size = SZ_1M;
+				page_address |= V3D_PTE_SUPERPAGE;
+			} else if (len >= SZ_64K && v3d_mmu_is_aligned(page, page_address, SZ_64K)) {
+				page_size = SZ_64K;
+				page_address |= V3D_PTE_BIGPAGE;
+			} else {
+				page_size = SZ_4K;
+			}
+
+			for (i = 0; i < page_size >> V3D_MMU_PAGE_SHIFT; i++) {
+				v3d->pt[page++] = page_address + i;
+				pfn++;
+			}
+
+			len -= page_size;
+		}
 	}
 
 	WARN_ON_ONCE(page - bo->node.start !=

It looks correct to me.

Reviewed-by: Tvrtko Ursulin

Ooops muscle memory strikes again! I guess reviewing patches for 10+
years can do that.. :)

Reviewed-by: Tvrtko Ursulin

Regards,

Tvrtko
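The page-size selection performed by the quoted v3d_mmu_insert_ptes() loop can be modelled in userspace as below. This is a simplified sketch under stated assumptions: the helper names are hypothetical, the alignment check looks at the raw frame number rather than the PTE value with protection bits that the patch passes in, and `len` is assumed to be a multiple of 4 KiB.

```c
#include <stdbool.h>
#include <stdint.h>

#define MMU_PAGE_SHIFT 12
#define SZ_4K  0x1000u
#define SZ_64K 0x10000u
#define SZ_1M  0x100000u

/* True if both the GPU page index and the physical frame number are aligned. */
static bool both_aligned(uint32_t gpu_page, uint32_t pfn, uint32_t size)
{
	uint32_t pages = size >> MMU_PAGE_SHIFT;

	return (gpu_page % pages) == 0 && (pfn % pages) == 0;
}

/* Largest page size usable for the next mapping, given len bytes remaining. */
static uint32_t pick_page_size(uint32_t gpu_page, uint32_t pfn, uint32_t len)
{
	if (len >= SZ_1M && both_aligned(gpu_page, pfn, SZ_1M))
		return SZ_1M;
	if (len >= SZ_64K && both_aligned(gpu_page, pfn, SZ_64K))
		return SZ_64K;
	return SZ_4K;
}

/*
 * Count the 4 KiB PTE slots written for one contiguous DMA segment.
 * Assumes len is a multiple of 4 KiB, as it is for page-backed segments.
 */
static uint32_t count_ptes(uint32_t gpu_page, uint32_t pfn, uint32_t len)
{
	uint32_t written = 0;

	while (len > 0) {
		uint32_t size = pick_page_size(gpu_page, pfn, len);
		uint32_t n = size >> MMU_PAGE_SHIFT;

		gpu_page += n;
		pfn += n;
		written += n;
		len -= size;
	}
	return written;
}
```

Whatever page size is chosen, every 4 KiB slot still gets an entry (16 for a big page, 256 for a super page), which matches the description in the commit message; the larger sizes only change the flag bits set on those entries.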
Re: [PATCH v3 6/8] drm/v3d: Support Big/Super Pages when writing out PTEs
On 21/04/2024 22:44, Maíra Canal wrote:

The V3D MMU also supports 64KB and 1MB pages, called big and super
pages, respectively. In order to set a 64KB page or 1MB page in the MMU,
we need to make sure that page table entries for all 4KB pages within a
big/super page are correctly configured.

In order to create a big/super page, we need a contiguous memory region.
That's why we use a separate mountpoint with THP enabled. In order to
place the page table entries in the MMU, we iterate over the 16 4KB
pages (for big pages) or 256 4KB pages (for super pages) and insert the
PTE.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_drv.h | 1 +
 drivers/gpu/drm/v3d/v3d_mmu.c | 52 ++-
 2 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h
index 17236ee23490..79d8a1a059aa 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -18,6 +18,7 @@ struct platform_device;
 struct reset_control;
 
 #define V3D_MMU_PAGE_SHIFT 12
+#define V3D_PAGE_FACTOR (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT)
 
 #define V3D_MAX_QUEUES (V3D_CPU + 1)
 
diff --git a/drivers/gpu/drm/v3d/v3d_mmu.c b/drivers/gpu/drm/v3d/v3d_mmu.c
index 14f3af40d6f6..2e0b31e373b2 100644
--- a/drivers/gpu/drm/v3d/v3d_mmu.c
+++ b/drivers/gpu/drm/v3d/v3d_mmu.c
@@ -25,9 +25,16 @@
  * superpage bit set.
  */
 #define V3D_PTE_SUPERPAGE BIT(31)
+#define V3D_PTE_BIGPAGE BIT(30)
 #define V3D_PTE_WRITEABLE BIT(29)
 #define V3D_PTE_VALID BIT(28)
 
+static bool v3d_mmu_is_aligned(u32 page, u32 page_address, size_t alignment)
+{
+	return IS_ALIGNED(page, alignment >> V3D_MMU_PAGE_SHIFT) &&
+		IS_ALIGNED(page_address, alignment >> V3D_MMU_PAGE_SHIFT);
+}
+
 static int v3d_mmu_flush_all(struct v3d_dev *v3d)
 {
 	int ret;
@@ -87,19 +94,38 @@ void v3d_mmu_insert_ptes(struct v3d_bo *bo)
 	struct drm_gem_shmem_object *shmem_obj = &bo->base;
 	struct v3d_dev *v3d = to_v3d_dev(shmem_obj->base.dev);
 	u32 page = bo->node.start;
-	u32 page_prot = V3D_PTE_WRITEABLE | V3D_PTE_VALID;
-	struct sg_dma_page_iter dma_iter;
-
-	for_each_sgtable_dma_page(shmem_obj->sgt, &dma_iter, 0) {
-		dma_addr_t dma_addr = sg_page_iter_dma_address(&dma_iter);
-		u32 page_address = dma_addr >> V3D_MMU_PAGE_SHIFT;
-		u32 pte = page_prot | page_address;
-		u32 i;
-
-		BUG_ON(page_address + (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT) >=
-		       BIT(24));
-		for (i = 0; i < PAGE_SIZE >> V3D_MMU_PAGE_SHIFT; i++)
-			v3d->pt[page++] = pte + i;
+	struct scatterlist *sgl;
+	unsigned int count;
+
+	for_each_sgtable_dma_sg(shmem_obj->sgt, sgl, count) {
+		dma_addr_t dma_addr = sg_dma_address(sgl);
+		u32 pfn = dma_addr >> V3D_MMU_PAGE_SHIFT;
+		unsigned int len = sg_dma_len(sgl);
+
+		while (len > 0) {
+			u32 page_prot = V3D_PTE_WRITEABLE | V3D_PTE_VALID;
+			u32 page_address = page_prot | pfn;
+			unsigned int i, page_size;
+
+			BUG_ON(pfn + V3D_PAGE_FACTOR >= BIT(24));
+
+			if (len >= SZ_1M && v3d_mmu_is_aligned(page, page_address, SZ_1M)) {
+				page_size = SZ_1M;
+				page_address |= V3D_PTE_SUPERPAGE;
+			} else if (len >= SZ_64K && v3d_mmu_is_aligned(page, page_address, SZ_64K)) {
+				page_size = SZ_64K;
+				page_address |= V3D_PTE_BIGPAGE;
+			} else {
+				page_size = SZ_4K;
+			}
+
+			for (i = 0; i < page_size >> V3D_MMU_PAGE_SHIFT; i++) {
+				v3d->pt[page++] = page_address + i;
+				pfn++;
+			}
+
+			len -= page_size;
+		}
 	}
 
 	WARN_ON_ONCE(page - bo->node.start !=

It looks correct to me.

Reviewed-by: Tvrtko Ursulin

Regards,

Tvrtko
Re: [PATCH v3 4/5] drm/v3d: Decouple stats calculation from printing
On 20/04/2024 22:32, Maíra Canal wrote:

Create a function to decouple the stats calculation from the printing. This will be useful in the next step when we add a seqcount to protect the stats.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_drv.c | 18 ++
 drivers/gpu/drm/v3d/v3d_drv.h | 4
 drivers/gpu/drm/v3d/v3d_sysfs.c | 11 +++
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index 52e3ba9df46f..2ec359ed2def 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -142,6 +142,15 @@ v3d_postclose(struct drm_device *dev, struct drm_file *file)
 	kfree(v3d_priv);
 }
 
+void v3d_get_stats(const struct v3d_stats *stats, u64 timestamp,
+		   u64 *active_runtime, u64 *jobs_completed)
+{
+	*active_runtime = stats->enabled_ns;
+	if (stats->start_ns)
+		*active_runtime += timestamp - stats->start_ns;
+	*jobs_completed = stats->jobs_completed;
+}
+
 static void v3d_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 {
 	struct v3d_file_priv *file_priv = file->driver_priv;
@@ -150,20 +159,21 @@ static void v3d_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 	for (queue = 0; queue < V3D_MAX_QUEUES; queue++) {
 		struct v3d_stats *stats = &file_priv->stats[queue];
+		u64 active_runtime, jobs_completed;
+
+		v3d_get_stats(stats, timestamp, &active_runtime, &jobs_completed);
 
 		/* Note that, in case of a GPU reset, the time spent during an
 		 * attempt of executing the job is not computed in the runtime.
 		 */
 		drm_printf(p, "drm-engine-%s: \t%llu ns\n",
-			   v3d_queue_to_string(queue),
-			   stats->start_ns ? stats->enabled_ns + timestamp - stats->start_ns
-					   : stats->enabled_ns);
+			   v3d_queue_to_string(queue), active_runtime);
 
 		/* Note that we only count jobs that completed. Therefore, jobs
 		 * that were resubmitted due to a GPU reset are not computed.
 		 */
 		drm_printf(p, "v3d-jobs-%s: \t%llu jobs\n",
-			   v3d_queue_to_string(queue), stats->jobs_completed);
+			   v3d_queue_to_string(queue), jobs_completed);
 	}
 }
 
diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h
index 5a198924d568..ff06dc1cc078 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -510,6 +510,10 @@ struct drm_gem_object *v3d_prime_import_sg_table(struct drm_device *dev,
 /* v3d_debugfs.c */
 void v3d_debugfs_init(struct drm_minor *minor);
 
+/* v3d_drv.c */
+void v3d_get_stats(const struct v3d_stats *stats, u64 timestamp,
+		   u64 *active_runtime, u64 *jobs_completed);
+
 /* v3d_fence.c */
 extern const struct dma_fence_ops v3d_fence_ops;
 struct dma_fence *v3d_fence_create(struct v3d_dev *v3d, enum v3d_queue queue);
diff --git a/drivers/gpu/drm/v3d/v3d_sysfs.c b/drivers/gpu/drm/v3d/v3d_sysfs.c
index 6a8e7acc8b82..d610e355964f 100644
--- a/drivers/gpu/drm/v3d/v3d_sysfs.c
+++ b/drivers/gpu/drm/v3d/v3d_sysfs.c
@@ -15,18 +15,15 @@ gpu_stats_show(struct device *dev, struct device_attribute *attr, char *buf)
 	struct v3d_dev *v3d = to_v3d_dev(drm);
 	enum v3d_queue queue;
 	u64 timestamp = local_clock();
-	u64 active_runtime;
 	ssize_t len = 0;
 
 	len += sysfs_emit(buf, "queue\ttimestamp\tjobs\truntime\n");
 
 	for (queue = 0; queue < V3D_MAX_QUEUES; queue++) {
 		struct v3d_stats *stats = &v3d->queue[queue].stats;
+		u64 active_runtime, jobs_completed;
 
-		if (stats->start_ns)
-			active_runtime = timestamp - stats->start_ns;
-		else
-			active_runtime = 0;
+		v3d_get_stats(stats, timestamp, &active_runtime, &jobs_completed);
 
 		/* Each line will display the queue name, timestamp, the number
 		 * of jobs sent to that queue and the runtime, as can be seem here:
@@ -40,9 +37,7 @@ gpu_stats_show(struct device *dev, struct device_attribute *attr, char *buf)
 		 */
 		len += sysfs_emit_at(buf, len, "%s\t%llu\t%llu\t%llu\n",
 				     v3d_queue_to_string(queue),
-				     timestamp,
-				     stats->jobs_completed,
-				     stats->enabled_ns + active_runtime);
+				     timestamp, jobs_completed, active_runtime);
 	}
 
 	return len;

Reviewed-by: Tvrtko Ursulin

Regards,

Tvrtko
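The calculation that this patch factors out can be tried in isolation. The following is a small user-space sketch, assuming a plain struct with the same three counters; `struct stats` and `get_stats` are stand-in names, not the kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct v3d_stats: start_ns is 0 while the queue is idle. */
struct stats {
	uint64_t start_ns;
	uint64_t enabled_ns;
	uint64_t jobs_completed;
};

/* Report the accumulated runtime, plus the in-flight portion of a job that
 * is still running at `timestamp`. */
static void get_stats(const struct stats *s, uint64_t timestamp,
		      uint64_t *active_runtime, uint64_t *jobs_completed)
{
	assert(s && active_runtime && jobs_completed);

	*active_runtime = s->enabled_ns;
	if (s->start_ns)
		*active_runtime += timestamp - s->start_ns;
	*jobs_completed = s->jobs_completed;
}
```

Having one helper means the fdinfo and sysfs paths cannot drift apart, and a later locking change only has to touch one place.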
Re: [PATCH v2 4/4] drm/v3d: Fix race-condition between sysfs/fdinfo and interrupt handler
On 17/04/2024 01:53, Maíra Canal wrote:

In V3D, the conclusion of a job is indicated by a IRQ. When a job finishes, then we update the local and the global GPU stats of that queue. But, while the GPU stats are being updated, a user might be reading the stats from sysfs or fdinfo.

For example, on `gpu_stats_show()`, we could think about a scenario where `v3d->queue[queue].start_ns != 0`, then an interruption happens, we update

interrupt

the value of `v3d->queue[queue].start_ns` to 0, we come back to `gpu_stats_show()` to calculate `active_runtime` and now, `active_runtime = timestamp`.

In this simple example, the user would see a spike in the queue usage, that didn't matches reality.

match

In order to address this issue properly, use a seqcount to protect read and write sections of the code.

Fixes: 09a93cc4f7d1 ("drm/v3d: Implement show_fdinfo() callback for GPU usage stats")
Reported-by: Tvrtko Ursulin
Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_drv.c | 10 ++
 drivers/gpu/drm/v3d/v3d_drv.h | 21 +
 drivers/gpu/drm/v3d/v3d_gem.c | 7 +--
 drivers/gpu/drm/v3d/v3d_sched.c | 7 +++
 drivers/gpu/drm/v3d/v3d_sysfs.c | 11 +++
 5 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index 52e3ba9df46f..cf15fa142968 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -121,6 +121,7 @@ v3d_open(struct drm_device *dev, struct drm_file *file)
 				      1, NULL);
 
 		memset(&v3d_priv->stats[i], 0, sizeof(v3d_priv->stats[i]));
+		seqcount_init(&v3d_priv->stats[i].lock);
 	}
 
 	v3d_perfmon_open_file(v3d_priv);
@@ -150,20 +151,21 @@ static void v3d_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 	for (queue = 0; queue < V3D_MAX_QUEUES; queue++) {
 		struct v3d_stats *stats = &file_priv->stats[queue];
+		u64 active_runtime, jobs_completed;
+
+		v3d_get_stats(stats, timestamp, &active_runtime, &jobs_completed);
 
 		/* Note that, in case of a GPU reset, the time spent during an
 		 * attempt of executing the job is not computed in the runtime.
 		 */
 		drm_printf(p, "drm-engine-%s: \t%llu ns\n",
-			   v3d_queue_to_string(queue),
-			   stats->start_ns ? stats->enabled_ns + timestamp - stats->start_ns
-					   : stats->enabled_ns);
+			   v3d_queue_to_string(queue), active_runtime);
 
 		/* Note that we only count jobs that completed. Therefore, jobs
 		 * that were resubmitted due to a GPU reset are not computed.
 		 */
 		drm_printf(p, "v3d-jobs-%s: \t%llu jobs\n",
-			   v3d_queue_to_string(queue), stats->jobs_completed);
+			   v3d_queue_to_string(queue), jobs_completed);
 	}
 }
 
diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h
index 5a198924d568..5211df7c7317 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -40,8 +40,29 @@ struct v3d_stats {
 	u64 start_ns;
 	u64 enabled_ns;
 	u64 jobs_completed;
+
+	/*
+	 * This seqcount is used to protect the access to the GPU stats
+	 * variables. It must be used as, while we are reading the stats,
+	 * IRQs can happen and the stats can be updated.
+	 */
+	seqcount_t lock;
 };
 
+static inline void v3d_get_stats(const struct v3d_stats *stats, u64 timestamp,
+				 u64 *active_runtime, u64 *jobs_completed)
+{
+	unsigned int seq;
+
+	do {
+		seq = read_seqcount_begin(&stats->lock);
+		*active_runtime = stats->enabled_ns;
+		if (stats->start_ns)
+			*active_runtime += timestamp - stats->start_ns;
+		*jobs_completed = stats->jobs_completed;
+	} while (read_seqcount_retry(&stats->lock, seq));
+}

Patch reads clean and obviously correct to me.

Reviewed-by: Tvrtko Ursulin

The only possible discussion point I see is whether v3d_get_stats could have been introduced first to avoid mixing pure refactors with functionality, and whether it deserves to be in a header, or could be a function call in v3d_drv.c just as well. No strong opinion from me, since it is your driver your preference.

Regards,

Tvrtko

+
 struct v3d_queue_state {
 	struct drm_gpu_scheduler sched;
 
diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
index d14589d3ae6c..da8faf3b9011 100644
--- a/drivers/gpu/drm/v3d/v3d_gem.c
+++ b/drivers/gpu/drm/v3d/v3d_gem.c
@@ -247,8 +247,11 @@ v3d_gem_init(struct drm_device *dev)
 	int ret, i;
 
 	for (i = 0; i < V3D_MAX_QUEUES; i++) {
-
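The retry loop in the quoted `v3d_get_stats()` follows the usual sequence-counter pattern. As a rough user-space analogue (not the kernel's `seqcount_t` implementation), the same idea can be sketched with C11 atomics. Everything here is an assumption made for illustration: the `stats_*` names, the single-writer requirement, and the use of relaxed atomic fields so that the sketch stays free of data races.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Sketch only: seq is even while the data is stable, odd during a write. */
struct stats {
	atomic_uint seq;
	_Atomic uint64_t start_ns;
	_Atomic uint64_t enabled_ns;
	_Atomic uint64_t jobs_completed;
};

static void stats_init(struct stats *s)
{
	atomic_init(&s->seq, 0u);
	atomic_init(&s->start_ns, 0);
	atomic_init(&s->enabled_ns, 0);
	atomic_init(&s->jobs_completed, 0);
}

/* Writer side. Assumes exactly one writer at a time (in the driver this
 * role is played by the job start/completion paths). */
static void stats_write_begin(struct stats *s)
{
	unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

static void stats_write_end(struct stats *s)
{
	unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	atomic_store_explicit(&s->seq, seq + 1, memory_order_release);
}

static void stats_job_start(struct stats *s, uint64_t now)
{
	stats_write_begin(s);
	atomic_store_explicit(&s->start_ns, now, memory_order_relaxed);
	stats_write_end(s);
}

static void stats_job_end(struct stats *s, uint64_t now)
{
	uint64_t start, enabled, jobs;

	stats_write_begin(s);
	start = atomic_load_explicit(&s->start_ns, memory_order_relaxed);
	enabled = atomic_load_explicit(&s->enabled_ns, memory_order_relaxed);
	jobs = atomic_load_explicit(&s->jobs_completed, memory_order_relaxed);
	atomic_store_explicit(&s->enabled_ns, enabled + (now - start),
			      memory_order_relaxed);
	atomic_store_explicit(&s->jobs_completed, jobs + 1, memory_order_relaxed);
	atomic_store_explicit(&s->start_ns, 0, memory_order_relaxed);
	stats_write_end(s);
}

/* Reader side: take a snapshot and retry if a write overlapped with it. */
static void stats_read(struct stats *s, uint64_t timestamp,
		       uint64_t *active_runtime, uint64_t *jobs_completed)
{
	unsigned int seq0, seq1;
	uint64_t start, enabled, jobs;

	do {
		seq0 = atomic_load_explicit(&s->seq, memory_order_acquire);
		start = atomic_load_explicit(&s->start_ns, memory_order_relaxed);
		enabled = atomic_load_explicit(&s->enabled_ns, memory_order_relaxed);
		jobs = atomic_load_explicit(&s->jobs_completed, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		seq1 = atomic_load_explicit(&s->seq, memory_order_relaxed);
	} while ((seq0 & 1u) || seq0 != seq1);

	*active_runtime = enabled + (start ? timestamp - start : 0);
	*jobs_completed = jobs;
}
```

The point of the pattern is that the reader never sees `start_ns` from before an update combined with `enabled_ns` from after it, which is the inconsistent view the commit message describes. Readers never block the writer; they simply retry.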
Re: Proposal to add CRIU support to DRM render nodes
On 01/04/2024 18:58, Felix Kuehling wrote:
On 2024-04-01 12:56, Tvrtko Ursulin wrote:
On 01/04/2024 17:37, Felix Kuehling wrote:
On 2024-04-01 11:09, Tvrtko Ursulin wrote:
On 28/03/2024 20:42, Felix Kuehling wrote:
On 2024-03-28 12:03, Tvrtko Ursulin wrote:

Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node.

It would probably be good to consider designing things without that dependency, so that checkpointing an application which does not use /dev/kfd is possible, or if the kernel does not even have the KFD support compiled in.

Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information.

It could perhaps mean no more than adding some GPU discovery code into CRIU, which should be flexible enough to account for things like re-assigned minor numbers due to driver reload.

Do you mean adding GPU discovery to the core CRIU, or to the plugin? I was thinking this is still part of the plugin.

Yes I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also if/when a new DRM driver might want to implement this the code could be moved to some common plugin area.

I am not sure how feasible that would be though. The "gpu id" concept and its matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance, or how does it work?

The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change.

Doesn't it read the saved gpu id from the image file while doing restore and try to open the render node to match it? Maybe I am misreading the code.. But if it does, does it imply that in practice it could be stable across reboots? Or that it is not possible to restore to a different instance of maybe the same GPU model installed in a system?

Ah, the idea is that when you restore on a different system, you may get different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore it on GPU 2 on the same system. That's why we need to translate GPU IDs in restored applications. User mode still uses the old GPU IDs, but the kernel mode driver translates them to the actual GPU IDs of the GPUs that the process was restored on.

I see.. I think. Normal flow is ppd->user_gpu_id set during client init, but for restored clients it gets overridden during restore so that any further ioctls do not instantly fail.

And then in amdgpu_plugin_restore_file, when it is opening the render node, it relies on the kfd topology to have filled in (more or less) the target_gpu_id corresponding to the render node gpu id of the target GPU - the one associated with the new kfd gpu_id?

I am digging into this because I am trying to see if some part of GPU discovery could somehow be decoupled.. to offer you to work on at least that until you start to tackle the main body of the feature. But it looks properly tangled up.

Do you have any suggestions on what I could help with? Maybe developing some sort of drm device enumeration library if you see a way that would be useful in decoupling the device discovery from kfd. We would need to define what sort of information you would need to be queryable from it.

This also highlights another aspect of those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID.

I am not familiar with this. This is not SR-IOV but some other kind of partitioning? Would you have any links where I could read more?

Right, the bare-metal driver can partition a PF spatially without SRIOV. SRIOV can also use spatial partitioning and expose each partition through its own VF, but that's not useful for bare metal. Spatial partitioning is new in MI300. There is some high-level info in this whitepaper: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.

From the outside (userspace) this looks simply like multiple DRM render nodes or how exactly?

Regards,

Tvrtko

Rega
Re: [PATCH 1/5] drm/v3d: Don't increment `enabled_ns` twice
Hi,

On 15/04/2024 12:17, Chema Casanova wrote:
On 3/4/24 at 22:24, Maíra Canal wrote:

The commit 509433d8146c ("drm/v3d: Expose the total GPU usage stats on sysfs") introduced the calculation of global GPU stats. For the regards, it used the already existing infrastructure provided by commit 09a93cc4f7d1 ("drm/v3d: Implement show_fdinfo() callback for GPU usage stats"). While adding global GPU stats calculation ability, the author forgot to delete the existing one.

Currently, the value of `enabled_ns` is incremented twice by the end of the job, when it should be added just once. Therefore, delete the leftovers from commit 509433d8146c ("drm/v3d: Expose the total GPU usage stats on sysfs").

Fixes: 509433d8146c ("drm/v3d: Expose the total GPU usage stats on sysfs")
Reported-by: Tvrtko Ursulin
Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_irq.c | 4
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_irq.c b/drivers/gpu/drm/v3d/v3d_irq.c
index 2e04f6cb661e..ce6b2fb341d1 100644
--- a/drivers/gpu/drm/v3d/v3d_irq.c
+++ b/drivers/gpu/drm/v3d/v3d_irq.c
@@ -105,7 +105,6 @@ v3d_irq(int irq, void *arg)
 		struct v3d_file_priv *file = v3d->bin_job->base.file->driver_priv;
 		u64 runtime = local_clock() - file->start_ns[V3D_BIN];
 
-		file->enabled_ns[V3D_BIN] += local_clock() - file->start_ns[V3D_BIN];
 		file->jobs_sent[V3D_BIN]++;
 		v3d->queue[V3D_BIN].jobs_sent++;
 
@@ -126,7 +125,6 @@ v3d_irq(int irq, void *arg)
 		struct v3d_file_priv *file = v3d->render_job->base.file->driver_priv;
 		u64 runtime = local_clock() - file->start_ns[V3D_RENDER];
 
-		file->enabled_ns[V3D_RENDER] += local_clock() - file->start_ns[V3D_RENDER];
 		file->jobs_sent[V3D_RENDER]++;
 		v3d->queue[V3D_RENDER].jobs_sent++;
 
@@ -147,7 +145,6 @@ v3d_irq(int irq, void *arg)
 		struct v3d_file_priv *file = v3d->csd_job->base.file->driver_priv;
 		u64 runtime = local_clock() - file->start_ns[V3D_CSD];
 
-		file->enabled_ns[V3D_CSD] += local_clock() - file->start_ns[V3D_CSD];
 		file->jobs_sent[V3D_CSD]++;
 		v3d->queue[V3D_CSD].jobs_sent++;
 
@@ -195,7 +192,6 @@ v3d_hub_irq(int irq, void *arg)
 		struct v3d_file_priv *file = v3d->tfu_job->base.file->driver_priv;
 		u64 runtime = local_clock() - file->start_ns[V3D_TFU];
 
-		file->enabled_ns[V3D_TFU] += local_clock() - file->start_ns[V3D_TFU];
 		file->jobs_sent[V3D_TFU]++;
 		v3d->queue[V3D_TFU].jobs_sent++;
 

Thanks for fixing this. I see that I included this error in my first refactoring of the original patch.

Not sure if it would be worth creating a simple test like https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/commit/2f81ed3aed873c7cc2f6d0e1117fa4fb02033246 for i915? Just a thought.

Regards,

Tvrtko
Re: [PATCH] drm/sysfs: Add drm class-wide attribute to get active device clients
On 05/04/2024 18:59, Rob Clark wrote:
On Wed, Apr 3, 2024 at 11:37 AM Adrián Larumbe wrote:

Up to this day, all fdinfo-based GPU profilers must traverse the entire /proc directory structure to find open DRM clients with fdinfo file descriptors. This is inefficient and time-consuming.

This patch adds a new device class attribute that will install a sysfs file per DRM device, which can be queried by profilers to get a list of PIDs for their open clients. This file isn't human-readable, and it's meant to be queried only by GPU profilers like gputop and nvtop.

Cc: Boris Brezillon
Cc: Tvrtko Ursulin
Cc: Christopher Healy
Signed-off-by: Adrián Larumbe

It does seem like a good idea.. idk if there is some precedent to prefer binary vs ascii in sysfs, but having a way to avoid walking _all_ processes is a good idea.

I naturally second that it is a needed feature, but I do not think binary format is justified. AFAIR it should be used for things like hw/fw standardised tables or firmware images, not when exporting a simple list of PIDs. It also precludes easy shell/script access, and the benefit of avoiding parsing a short list is, I suspect, completely dwarfed by needing to parse all the related fdinfo etc.

---
 drivers/gpu/drm/drm_internal.h | 2 +-
 drivers/gpu/drm/drm_privacy_screen.c | 2 +-
 drivers/gpu/drm/drm_sysfs.c | 89 ++--
 3 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/drm_internal.h b/drivers/gpu/drm/drm_internal.h
index 2215baef9a3e..9a399b03d11c 100644
--- a/drivers/gpu/drm/drm_internal.h
+++ b/drivers/gpu/drm/drm_internal.h
@@ -145,7 +145,7 @@ bool drm_master_internal_acquire(struct drm_device *dev);
 void drm_master_internal_release(struct drm_device *dev);
 
 /* drm_sysfs.c */
-extern struct class *drm_class;
+extern struct class drm_class;
 
 int drm_sysfs_init(void);
 void drm_sysfs_destroy(void);
diff --git a/drivers/gpu/drm/drm_privacy_screen.c b/drivers/gpu/drm/drm_privacy_screen.c
index 6cc39e30781f..2fbd24ba5818 100644
--- a/drivers/gpu/drm/drm_privacy_screen.c
+++ b/drivers/gpu/drm/drm_privacy_screen.c
@@ -401,7 +401,7 @@ struct drm_privacy_screen *drm_privacy_screen_register(
 	mutex_init(&priv->lock);
 	BLOCKING_INIT_NOTIFIER_HEAD(&priv->notifier_head);
 
-	priv->dev.class = drm_class;
+	priv->dev.class = &drm_class;
 	priv->dev.type = &drm_privacy_screen_type;
 	priv->dev.parent = parent;
 	priv->dev.release = drm_privacy_screen_device_release;
diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index a953f69a34b6..56ca9e22c720 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -58,8 +58,6 @@ static struct device_type drm_sysfs_device_connector = {
 	.name = "drm_connector",
 };
 
-struct class *drm_class;
-
 #ifdef CONFIG_ACPI
 static bool drm_connector_acpi_bus_match(struct device *dev)
 {
@@ -128,6 +126,62 @@ static const struct component_ops typec_connector_ops = {
 
 static CLASS_ATTR_STRING(version, S_IRUGO, "drm 1.1.0 20060810");
 
+static ssize_t clients_show(struct device *cd, struct device_attribute *attr, char *buf)
+{
+	struct drm_minor *minor = cd->driver_data;
+	struct drm_device *ddev = minor->dev;
+	struct drm_file *priv;
+	ssize_t offset = 0;
+	void *pid_buf;
+
+	if (minor->type != DRM_MINOR_RENDER)
+		return 0;

Why this?

+
+	pid_buf = kvmalloc(PAGE_SIZE, GFP_KERNEL);

I don't quite get the kvmalloc for just one page (or why even a temporary buffer and not write into buf directly?).

+	if (!pid_buf)
+		return 0;
+
+	mutex_lock(&ddev->filelist_mutex);
+	list_for_each_entry_reverse(priv, &ddev->filelist, lhead) {
+		struct pid *pid;
+
+		if (drm_WARN_ON(ddev, (PAGE_SIZE - offset) < sizeof(pid_t)))
+			break;

Feels bad.. I would suggest exploring implementing a read callback (instead of show) and handling arbitrary size output.

+
+		rcu_read_lock();
+		pid = rcu_dereference(priv->pid);
+		(*(pid_t *)(pid_buf + offset)) = pid_vnr(pid);
+		rcu_read_unlock();
+
+		offset += sizeof(pid_t);
+	}
+	mutex_unlock(&ddev->filelist_mutex);
+
+	if (offset < PAGE_SIZE)
+		(*(pid_t *)(pid_buf + offset)) = 0;

Either NULL terminated or PAGE_SIZE/sizeof(pid) entries and not NULL terminated feels weird. If I got that right.

For me everything points towards going for text output.

+
+	memcpy(buf, pid_buf, offset);
+
+	kvfree(pid_buf);
+
+	return offset;
+
+}
+static DEVICE_ATTR_RO(clients);

Shouldn't BIN_ATTR_RO be used for binary files in sysfs?

Regards,

Tvrtko

P.S. Or maybe it is time for drmfs? Where each client gets a directory and drivers can populate files. Such as per client logging s
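To make the "text output" suggestion in the review concrete, here is a hedged user-space sketch of emitting a PID list as newline-separated decimal text into a bounded buffer. It is not a proposed kernel implementation: `format_pids` is a hypothetical name, PIDs are plain `int`s here, and in the kernel the bounded append would more likely go through a helper such as `sysfs_emit_at()` (which the v3d sysfs code quoted elsewhere in this archive already uses).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Append each PID as "<pid>\n". An entry that does not fit completely is
 * dropped rather than truncated. Returns the number of bytes written (not
 * counting the terminating NUL); *emitted receives how many PIDs fitted. */
static size_t format_pids(char *buf, size_t size, const int *pids, size_t n,
			  size_t *emitted)
{
	size_t off = 0;
	size_t i;

	assert(buf && size > 0);
	buf[0] = '\0';

	for (i = 0; i < n; i++) {
		int ret = snprintf(buf + off, size - off, "%d\n", pids[i]);

		if (ret < 0 || (size_t)ret >= size - off) {
			buf[off] = '\0';	/* discard the partial entry */
			break;
		}
		off += (size_t)ret;
	}

	if (emitted)
		*emitted = i;

	return off;
}
```

With text output there is no terminator ambiguity of the kind the review points out: the returned length delimits the data, and a shell user can read the file with `cat`. Handling lists larger than one buffer would still need the read-callback approach suggested above.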
Re: [PATCH v2 4/6] drm/gem: Create shmem GEM object in a given mountpoint
On 05/04/2024 19:29, Maíra Canal wrote:

Create a function `drm_gem_shmem_create_with_mnt()`, similar to `drm_gem_shmem_create()`, that has a mountpoint as a argument. This function will create a shmem GEM object in a given tmpfs mountpoint.

This function will be useful for drivers that have a special mountpoint with flags enabled.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/drm_gem_shmem_helper.c | 30 ++
 include/drm/drm_gem_shmem_helper.h | 3 +++
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
index 13bcdbfd..10b7c4c769a3 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -49,7 +49,8 @@ static const struct drm_gem_object_funcs drm_gem_shmem_funcs = {
 };
 
 static struct drm_gem_shmem_object *
-__drm_gem_shmem_create(struct drm_device *dev, size_t size, bool private)
+__drm_gem_shmem_create(struct drm_device *dev, size_t size, bool private,
+		       struct vfsmount *gemfs)
 {
 	struct drm_gem_shmem_object *shmem;
 	struct drm_gem_object *obj;
@@ -76,7 +77,7 @@ __drm_gem_shmem_create(struct drm_device *dev, size_t size, bool private)
 		drm_gem_private_object_init(dev, obj, size);
 		shmem->map_wc = false; /* dma-buf mappings use always writecombine */
 	} else {
-		ret = drm_gem_object_init(dev, obj, size);
+		ret = drm_gem_object_init_with_mnt(dev, obj, size, gemfs);
 	}
 	if (ret) {
 		drm_gem_private_object_fini(obj);
@@ -123,10 +124,31 @@ __drm_gem_shmem_create(struct drm_device *dev, size_t size, bool private)
  */
 struct drm_gem_shmem_object *drm_gem_shmem_create(struct drm_device *dev, size_t size)
 {
-	return __drm_gem_shmem_create(dev, size, false);
+	return __drm_gem_shmem_create(dev, size, false, NULL);
 }
 EXPORT_SYMBOL_GPL(drm_gem_shmem_create);
 
+/**
+ * drm_gem_shmem_create_with_mnt - Allocate an object with the given size in a
+ * given mountpoint
+ * @dev: DRM device
+ * @size: Size of the object to allocate
+ * @gemfs: tmpfs mount where the GEM object will be created
+ *
+ * This function creates a shmem GEM object in a given tmpfs mountpoint.
+ *
+ * Returns:
+ * A struct drm_gem_shmem_object * on success or an ERR_PTR()-encoded negative
+ * error code on failure.
+ */
+struct drm_gem_shmem_object *drm_gem_shmem_create_with_mnt(struct drm_device *dev,
+							   size_t size,
+							   struct vfsmount *gemfs)
+{
+	return __drm_gem_shmem_create(dev, size, false, gemfs);
+}
+EXPORT_SYMBOL_GPL(drm_gem_shmem_create_with_mnt);
+
 /**
  * drm_gem_shmem_free - Free resources associated with a shmem GEM object
  * @shmem: shmem GEM object to free
@@ -760,7 +782,7 @@ drm_gem_shmem_prime_import_sg_table(struct drm_device *dev,
 	size_t size = PAGE_ALIGN(attach->dmabuf->size);
 	struct drm_gem_shmem_object *shmem;
 
-	shmem = __drm_gem_shmem_create(dev, size, true);
+	shmem = __drm_gem_shmem_create(dev, size, true, NULL);
 	if (IS_ERR(shmem))
 		return ERR_CAST(shmem);
 
diff --git a/include/drm/drm_gem_shmem_helper.h b/include/drm/drm_gem_shmem_helper.h
index efbc9f27312b..d22e3fb53631 100644
--- a/include/drm/drm_gem_shmem_helper.h
+++ b/include/drm/drm_gem_shmem_helper.h
@@ -97,6 +97,9 @@ struct drm_gem_shmem_object {
 	container_of(obj, struct drm_gem_shmem_object, base)
 
 struct drm_gem_shmem_object *drm_gem_shmem_create(struct drm_device *dev, size_t size);
+struct drm_gem_shmem_object *drm_gem_shmem_create_with_mnt(struct drm_device *dev,
+							   size_t size,
+							   struct vfsmount *gemfs);
 void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem);
 
 void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem);

Reviewed-by: Tvrtko Ursulin

Regards,

Tvrtko
Re: [PATCH v2 2/6] drm/gem: Create a drm_gem_object_init_with_mnt() function
On 05/04/2024 19:29, Maíra Canal wrote:

For some applications, such as applications that uses huge pages, we might want to have a different mountpoint, for which we pass mount flags that better match our usecase.

Therefore, create a new function `drm_gem_object_init_with_mnt()` that allow us to define the tmpfs mountpoint where the GEM object will be created. If this parameter is NULL, then we fallback to `shmem_file_setup()`.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/drm_gem.c | 34 ++
 include/drm/drm_gem.h | 3 +++
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index d4bbc5d109c8..74ebe68e3d61 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -114,22 +114,32 @@ drm_gem_init(struct drm_device *dev)
 }
 
 /**
- * drm_gem_object_init - initialize an allocated shmem-backed GEM object
+ * drm_gem_object_init_with_mnt - initialize an allocated shmem-backed GEM
+ * object in a given shmfs mountpoint
+ *
  * @dev: drm_device the object should be initialized for
  * @obj: drm_gem_object to initialize
  * @size: object size
+ * @gemfs: tmpfs mount where the GEM object will be created. If NULL, use
+ * the usual tmpfs mountpoint (`shm_mnt`).
  *
  * Initialize an already allocated GEM object of the specified size with
  * shmfs backing store.
  */
-int drm_gem_object_init(struct drm_device *dev,
-			struct drm_gem_object *obj, size_t size)
+int drm_gem_object_init_with_mnt(struct drm_device *dev,
+				 struct drm_gem_object *obj, size_t size,
+				 struct vfsmount *gemfs)
 {
 	struct file *filp;
 
 	drm_gem_private_object_init(dev, obj, size);
 
-	filp = shmem_file_setup("drm mm object", size, VM_NORESERVE);
+	if (gemfs)
+		filp = shmem_file_setup_with_mnt(gemfs, "drm mm object", size,
+						 VM_NORESERVE);
+	else
+		filp = shmem_file_setup("drm mm object", size, VM_NORESERVE);
+
 	if (IS_ERR(filp))
 		return PTR_ERR(filp);
 
@@ -137,6 +147,22 @@ int drm_gem_object_init(struct drm_device *dev,
 
 	return 0;
 }
+EXPORT_SYMBOL(drm_gem_object_init_with_mnt);
+
+/**
+ * drm_gem_object_init - initialize an allocated shmem-backed GEM object
+ * @dev: drm_device the object should be initialized for
+ * @obj: drm_gem_object to initialize
+ * @size: object size
+ *
+ * Initialize an already allocated GEM object of the specified size with
+ * shmfs backing store.
+ */
+int drm_gem_object_init(struct drm_device *dev, struct drm_gem_object *obj,
+			size_t size)
+{
+	return drm_gem_object_init_with_mnt(dev, obj, size, NULL);
+}
 EXPORT_SYMBOL(drm_gem_object_init);

I would be tempted to static inline this one but see what other people think. (One wise kernel legend was once annoyed by trivial wrappers / function calls. But some others are then annoyed by static inlines.. so dunno.)

For either flavour:

Reviewed-by: Tvrtko Ursulin

Regards,

Tvrtko

 /**
diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
index bae4865b2101..2ebf6e10cc44 100644
--- a/include/drm/drm_gem.h
+++ b/include/drm/drm_gem.h
@@ -472,6 +472,9 @@ void drm_gem_object_release(struct drm_gem_object *obj);
 void drm_gem_object_free(struct kref *kref);
 int drm_gem_object_init(struct drm_device *dev,
 			struct drm_gem_object *obj, size_t size);
+int drm_gem_object_init_with_mnt(struct drm_device *dev,
+				 struct drm_gem_object *obj, size_t size,
+				 struct vfsmount *gemfs);
 void drm_gem_private_object_init(struct drm_device *dev,
 				 struct drm_gem_object *obj, size_t size);
 void drm_gem_private_object_fini(struct drm_gem_object *obj);
-- 
2.44.0
Re: [PATCH 5/5] drm/v3d: Fix race-condition between sysfs/fdinfo and interrupt handler
On 03/04/2024 21:24, Maíra Canal wrote:

In V3D, the conclusion of a job is indicated by a IRQ. When a job finishes, then we update the local and the global GPU stats of that queue. But, while the GPU stats are being updated, a user might be reading the stats from sysfs or fdinfo.

For example, on `gpu_stats_show()`, we could think about a scenario where `v3d->queue[queue].start_ns != 0`, then an interruption happens, we update the value of `v3d->queue[queue].start_ns` to 0, we come back to `gpu_stats_show()` to calculate `active_runtime` and now, `active_runtime = timestamp`.

In this simple example, the user would see a spike in the queue usage, that didn't matches reality.

In order to address this issue properly, use rw-locks to protect read and write sections of the code.

Fixes: 09a93cc4f7d1 ("drm/v3d: Implement show_fdinfo() callback for GPU usage stats")
Reported-by: Tvrtko Ursulin
Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_drv.c | 16
 drivers/gpu/drm/v3d/v3d_drv.h | 7 +++
 drivers/gpu/drm/v3d/v3d_gem.c | 7 +--
 drivers/gpu/drm/v3d/v3d_sched.c | 9 +
 drivers/gpu/drm/v3d/v3d_sysfs.c | 16
 5 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index cbb62be18aa5..60437718786c 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -119,7 +119,9 @@ v3d_open(struct drm_device *dev, struct drm_file *file)
 		drm_sched_entity_init(&v3d_priv->sched_entity[i],
 				      DRM_SCHED_PRIORITY_NORMAL, &sched,
 				      1, NULL);
+

Nitpick - if you want a blank line here probably add it in the patch which added the below memset.

 		memset(&v3d_priv->stats[i], 0, sizeof(v3d_priv->stats[i]));
+		rwlock_init(&v3d_priv->stats[i].rw_lock);
 	}
 
 	v3d_perfmon_open_file(v3d_priv);
@@ -149,20 +151,26 @@ static void v3d_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 	for (queue = 0; queue < V3D_MAX_QUEUES; queue++) {
 		struct v3d_stats *stats = &file_priv->stats[queue];
+		u64 active_time, jobs_sent;
+		unsigned long flags;
+
+		read_lock_irqsave(&stats->rw_lock, flags);

The context is never irq/bh here so you can use read_lock_irq.

However on the topic of lock type chosen, I think sort of established wisdom is that rwlocks are overkill for such short locked sections. More so, optimizing for multiple concurrent readers is not a huge use case for fdinfo reads.

I would go for a plain spinlock, or potentially even read/write_seqcount. Just because the latter has no atomics in the irq handler. Readers might retry now and then, but unless v3d typically sees thousands of interrupts per second it should not be a problem.

+		active_time = stats->start_ns ? stats->enabled_ns + timestamp - stats->start_ns
+					      : stats->enabled_ns;
+		jobs_sent = stats->jobs_sent;
+		read_unlock_irqrestore(&stats->rw_lock, flags);
 
 		/* Note that, in case of a GPU reset, the time spent during an
 		 * attempt of executing the job is not computed in the runtime.
 		 */
 		drm_printf(p, "drm-engine-%s: \t%llu ns\n",
-			   v3d_queue_to_string(queue),
-			   stats->start_ns ? stats->enabled_ns + timestamp - stats->start_ns
-					   : stats->enabled_ns);
+			   v3d_queue_to_string(queue), active_time);
 
 		/* Note that we only count jobs that completed. Therefore, jobs
 		 * that were resubmitted due to a GPU reset are not computed.
 		 */
 		drm_printf(p, "v3d-jobs-%s: \t%llu jobs\n",
-			   v3d_queue_to_string(queue), stats->jobs_sent);
+			   v3d_queue_to_string(queue), jobs_sent);
 	}
 }
 
diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h
index 0117593976ed..8fde2623f763 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -40,6 +40,13 @@ struct v3d_stats {
 	u64 start_ns;
 	u64 enabled_ns;
 	u64 jobs_sent;
+
+	/*
+	 * This lock is used to protect the access to the GPU stats variables.
+	 * It must be used as, while we are reading the stats, IRQs can happen
+	 * and the stats would be updated.
+	 */
+	rwlock_t rw_lock;
 };
 
 struct v3d_queue_state {
diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
index d14589d3ae6c..439088724a51 100644
--- a/drivers/gpu/drm/v3d/v3d_gem.c
+++ b/drivers/gpu/drm/v3d/v3d_gem.c
@@ -247,8 +247,11 @@ v3d_gem_init(struct drm_device *de
Re: [PATCH v2 6/6] drm/v3d: Enable big and super pages
On 05/04/2024 19:29, Maíra Canal wrote:

The V3D MMU also supports 64KB and 1MB pages, called big and super pages, respectively. In order to set a 64KB page or 1MB page in the MMU, we need to make sure that page table entries for all 4KB pages within a big/super page must be correctly configured.

In order to create a big/super page, we need a contiguous memory region. That's why we use a separate mountpoint with THP enabled. In order to place the page table entries in the MMU, we iterate over the 16 4KB pages (for big pages) or 256 4KB pages (for super pages) and insert the PTE.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_bo.c | 21 +--
 drivers/gpu/drm/v3d/v3d_drv.c | 8 ++
 drivers/gpu/drm/v3d/v3d_drv.h | 2 ++
 drivers/gpu/drm/v3d/v3d_gemfs.c | 6 +
 drivers/gpu/drm/v3d/v3d_mmu.c | 46 ++---
 5 files changed, 71 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_bo.c b/drivers/gpu/drm/v3d/v3d_bo.c
index 79e31c5299b1..cfe82232886a 100644
--- a/drivers/gpu/drm/v3d/v3d_bo.c
+++ b/drivers/gpu/drm/v3d/v3d_bo.c
@@ -94,6 +94,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
 	struct v3d_dev *v3d = to_v3d_dev(obj->dev);
 	struct v3d_bo *bo = to_v3d_bo(obj);
 	struct sg_table *sgt;
+	u64 align;
 	int ret;
 
 	/* So far we pin the BO in the MMU for its lifetime, so use
@@ -103,6 +104,15 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
 	if (IS_ERR(sgt))
 		return PTR_ERR(sgt);
 
+	if (!v3d->super_pages)
+		align = SZ_4K;
+	else if (obj->size >= SZ_1M)
+		align = SZ_1M;
+	else if (obj->size >= SZ_64K)
+		align = SZ_64K;
+	else
+		align = SZ_4K;
+
 	spin_lock(&v3d->mm_lock);
 	/* Allocate the object's space in the GPU's page tables.
 	 * Inserting PTEs will happen later, but the offset is for the
@@ -110,7 +120,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
 	 */
 	ret = drm_mm_insert_node_generic(&v3d->mm, &bo->node,
 					 obj->size >> V3D_MMU_PAGE_SHIFT,
-					 SZ_4K >> V3D_MMU_PAGE_SHIFT, 0, 0);
+					 align >> V3D_MMU_PAGE_SHIFT, 0, 0);
 	spin_unlock(&v3d->mm_lock);
 	if (ret)
 		return ret;
@@ -130,10 +140,17 @@ struct v3d_bo *v3d_bo_create(struct drm_device *dev, struct drm_file *file_priv,
 			     size_t unaligned_size)
 {
 	struct drm_gem_shmem_object *shmem_obj;
+	struct v3d_dev *v3d = to_v3d_dev(dev);
 	struct v3d_bo *bo;
 	int ret;
 
-	shmem_obj = drm_gem_shmem_create(dev, unaligned_size);
+	/* Let the user opt out of allocating the BOs with THP */
+	if (v3d->super_pages)
+		shmem_obj = drm_gem_shmem_create_with_mnt(dev, unaligned_size,
+							  v3d->gemfs);
+	else
+		shmem_obj = drm_gem_shmem_create(dev, unaligned_size);
+
 	if (IS_ERR(shmem_obj))
 		return ERR_CAST(shmem_obj);
 	bo = to_v3d_bo(&shmem_obj->base);
diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index 3debf37e7d9b..3dbd29560be4 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -36,6 +36,12 @@
 #define DRIVER_MINOR 0
 #define DRIVER_PATCHLEVEL 0
 
+static bool super_pages = true;
+module_param_named(super_pages, super_pages, bool, 0400);
+MODULE_PARM_DESC(super_pages, "Enable/Disable Super Pages support. Note: \
+			       To enable Super Pages, you need support to \
+			       enable THP.");

Maybe not expose the modparam unless CONFIG_TRANSPARENT_HUGEPAGE is enabled? Then you wouldn't have to explain the dependency in the description.

+
 static int v3d_get_param_ioctl(struct drm_device *dev, void *data,
 			       struct drm_file *file_priv)
 {
@@ -308,6 +314,8 @@ static int v3d_platform_drm_probe(struct platform_device *pdev)
 		return -ENOMEM;
 	}
 
+	v3d->super_pages = super_pages;
+
 	ret = v3d_gem_init(drm);
 	if (ret)
 		goto dma_free;
diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h
index 17236ee23490..0a7aacf51164 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -18,6 +18,7 @@ struct platform_device;
 struct reset_control;
 
 #define V3D_MMU_PAGE_SHIFT 12
+#define V3D_PAGE_FACTOR (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT)
 
 #define V3D_MAX_QUEUES (V3D_CPU + 1)
 
@@ -121,6 +122,7 @@ struct v3d_dev {
 	 * tmpfs instance used for shmem backed objects
 	 */
 	struct vfsmount *gemfs;
+	bool super_pages;

You could probably get away with not having to add this new bool by basing the runtime checks of v3d->gemfs != NULL. In v3d_gemfs_init you would
Re: [PATCH 4/5] drm/v3d: Create function to update a set of GPU stats
On 03/04/2024 21:24, Maíra Canal wrote:
Given a set of GPU stats, that is, a `struct v3d_stats` related to a queue in a given context, create a function that can update all this set of GPU stats.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_sched.c | 20
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index ea5f5a84b55b..754107b80f67 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -118,6 +118,16 @@ v3d_job_start_stats(struct v3d_job *job, enum v3d_queue queue)
 	global_stats->start_ns = now;
 }

+static void
+v3d_stats_update(struct v3d_stats *stats)
+{
+	u64 now = local_clock();
+
+	stats->enabled_ns += now - stats->start_ns;
+	stats->jobs_sent++;
+	stats->start_ns = 0;
+}
+
 void
 v3d_job_update_stats(struct v3d_job *job, enum v3d_queue queue)
 {
@@ -125,15 +135,9 @@ v3d_job_update_stats(struct v3d_job *job, enum v3d_queue queue)
 	struct v3d_file_priv *file = job->file->driver_priv;
 	struct v3d_stats *global_stats = &v3d->queue[queue].stats;
 	struct v3d_stats *local_stats = &file->stats[queue];
-	u64 now = local_clock();
-
-	local_stats->enabled_ns += now - local_stats->start_ns;
-	local_stats->jobs_sent++;
-	local_stats->start_ns = 0;

-	global_stats->enabled_ns += now - global_stats->start_ns;
-	global_stats->jobs_sent++;
-	global_stats->start_ns = 0;
+	v3d_stats_update(local_stats);
+	v3d_stats_update(global_stats);
 }

 static struct dma_fence *v3d_bin_job_run(struct drm_sched_job *sched_job)

Reviewed-by: Tvrtko Ursulin

Regards,

Tvrtko
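For readers who want to try the refactoring outside the kernel, the idea of the patch above can be sketched in plain C. This is only an illustration under stated assumptions: `struct stats`, `stats_start()`, `stats_update()` and `job_done()` are hypothetical userspace stand-ins, not driver symbols, and the timestamp is passed in as a parameter instead of being read from `local_clock()` so that the sketch stays deterministic.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for the per-queue stats in the patch. */
struct stats {
	uint64_t start_ns;   /* non-zero while a job is running */
	uint64_t enabled_ns; /* accumulated busy time */
	uint64_t jobs_sent;  /* number of completed jobs */
};

/* Record that a job started at time 'now'. */
static void stats_start(struct stats *s, uint64_t now)
{
	s->start_ns = now;
}

/*
 * Fold one finished job into a set of stats. Assumption: 'now' is
 * supplied by the caller; the patch reads the clock inside the helper.
 */
static void stats_update(struct stats *s, uint64_t now)
{
	s->enabled_ns += now - s->start_ns;
	s->jobs_sent++;
	s->start_ns = 0;
}

/* The per-file and the global stats both go through the same helper. */
static void job_done(struct stats *local, struct stats *global, uint64_t now)
{
	stats_update(local, now);
	stats_update(global, now);
}
```

Passing the timestamp in also means both sets of stats see exactly the same end time, which is a property of this sketch and not necessarily of the driver code.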
Re: [PATCH 3/5] drm/v3d: Create a struct to store the GPU stats
->file->driver_priv;
+	struct v3d_stats *global_stats = &v3d->queue[queue].stats;
+	struct v3d_stats *local_stats = &file->stats[queue];
 	u64 now = local_clock();

-	file->start_ns[queue] = now;
-	v3d->queue[queue].start_ns = now;
+	local_stats->start_ns = now;
+	global_stats->start_ns = now;
 }

 void
@@ -121,15 +123,17 @@ v3d_job_update_stats(struct v3d_job *job, enum v3d_queue queue)
 {
 	struct v3d_dev *v3d = job->v3d;
 	struct v3d_file_priv *file = job->file->driver_priv;
+	struct v3d_stats *global_stats = &v3d->queue[queue].stats;
+	struct v3d_stats *local_stats = &file->stats[queue];
 	u64 now = local_clock();

-	file->enabled_ns[queue] += now - file->start_ns[queue];
-	file->jobs_sent[queue]++;
-	file->start_ns[queue] = 0;
+	local_stats->enabled_ns += now - local_stats->start_ns;
+	local_stats->jobs_sent++;
+	local_stats->start_ns = 0;

-	v3d->queue[queue].enabled_ns += now - v3d->queue[queue].start_ns;
-	v3d->queue[queue].jobs_sent++;
-	v3d->queue[queue].start_ns = 0;
+	global_stats->enabled_ns += now - global_stats->start_ns;
+	global_stats->jobs_sent++;
+	global_stats->start_ns = 0;
 }

 static struct dma_fence *v3d_bin_job_run(struct drm_sched_job *sched_job)
diff --git a/drivers/gpu/drm/v3d/v3d_sysfs.c b/drivers/gpu/drm/v3d/v3d_sysfs.c
index d106845ba890..1eb5f3de6937 100644
--- a/drivers/gpu/drm/v3d/v3d_sysfs.c
+++ b/drivers/gpu/drm/v3d/v3d_sysfs.c
@@ -21,8 +21,10 @@ gpu_stats_show(struct device *dev, struct device_attribute *attr, char *buf)
 	len += sysfs_emit(buf, "queue\ttimestamp\tjobs\truntime\n");

 	for (queue = 0; queue < V3D_MAX_QUEUES; queue++) {
-		if (v3d->queue[queue].start_ns)
-			active_runtime = timestamp - v3d->queue[queue].start_ns;
+		struct v3d_stats *stats = &v3d->queue[queue].stats;
+
+		if (stats->start_ns)
+			active_runtime = timestamp - stats->start_ns;
 		else
 			active_runtime = 0;

@@ -39,8 +41,8 @@ gpu_stats_show(struct device *dev, struct device_attribute *attr, char *buf)
 		len += sysfs_emit_at(buf, len, "%s\t%llu\t%llu\t%llu\n",
 				     v3d_queue_to_string(queue),
 				     timestamp,
-				     v3d->queue[queue].jobs_sent,
-				     v3d->queue[queue].enabled_ns + active_runtime);
+				     stats->jobs_sent,
+				     stats->enabled_ns + active_runtime);
 	}

 	return len;

Reviewed-by: Tvrtko Ursulin

Regards,

Tvrtko
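The runtime value printed by the sysfs hunk above is the accumulated busy time plus the time spent so far in a job that is still running. A minimal sketch of that arithmetic, assuming a hypothetical `struct queue_stats` and helper `reported_runtime()` that are not part of the driver:

```c
#include <stdint.h>

/* Hypothetical mirror of the three counters kept per queue. */
struct queue_stats {
	uint64_t start_ns;   /* non-zero only while a job is in flight */
	uint64_t enabled_ns; /* busy time of all finished jobs */
	uint64_t jobs_sent;  /* finished job count */
};

/*
 * Runtime as shown to userspace: finished busy time, plus the portion
 * of the in-flight job (if any) that has elapsed at 'timestamp'.
 */
static uint64_t reported_runtime(const struct queue_stats *s,
				 uint64_t timestamp)
{
	uint64_t active_runtime = 0;

	if (s->start_ns)
		active_runtime = timestamp - s->start_ns;

	return s->enabled_ns + active_runtime;
}
```

With an idle queue (`start_ns == 0`) the function returns `enabled_ns` unchanged; with a job that started at 900 and a timestamp of 1000, it adds 100 to the accumulated value.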
Re: [PATCH 1/5] drm/v3d: Don't increment `enabled_ns` twice
On 03/04/2024 21:24, Maíra Canal wrote: The commit 509433d8146c ("drm/v3d: Expose the total GPU usage stats on sysfs") introduced the calculation of global GPU stats. For the regards, it used the already existing infrastructure provided by commit 09a93cc4f7d1 ("drm/v3d: Implement show_fdinfo() callback for GPU usage stats"). While adding global GPU stats calculation ability, the author forgot to delete the existing one. Currently, the value of `enabled_ns` is incremented twice by the end of the job, when it should be added just once. Therefore, delete the leftovers from commit 509433d8146c ("drm/v3d: Expose the total GPU usage stats on sysfs"). Fixes: 509433d8146c ("drm/v3d: Expose the total GPU usage stats on sysfs") Reported-by: Tvrtko Ursulin Signed-off-by: Maíra Canal --- drivers/gpu/drm/v3d/v3d_irq.c | 4 1 file changed, 4 deletions(-) diff --git a/drivers/gpu/drm/v3d/v3d_irq.c b/drivers/gpu/drm/v3d/v3d_irq.c index 2e04f6cb661e..ce6b2fb341d1 100644 --- a/drivers/gpu/drm/v3d/v3d_irq.c +++ b/drivers/gpu/drm/v3d/v3d_irq.c @@ -105,7 +105,6 @@ v3d_irq(int irq, void *arg) struct v3d_file_priv *file = v3d->bin_job->base.file->driver_priv; u64 runtime = local_clock() - file->start_ns[V3D_BIN]; - file->enabled_ns[V3D_BIN] += local_clock() - file->start_ns[V3D_BIN]; file->jobs_sent[V3D_BIN]++; v3d->queue[V3D_BIN].jobs_sent++; @@ -126,7 +125,6 @@ v3d_irq(int irq, void *arg) struct v3d_file_priv *file = v3d->render_job->base.file->driver_priv; u64 runtime = local_clock() - file->start_ns[V3D_RENDER]; - file->enabled_ns[V3D_RENDER] += local_clock() - file->start_ns[V3D_RENDER]; file->jobs_sent[V3D_RENDER]++; v3d->queue[V3D_RENDER].jobs_sent++; @@ -147,7 +145,6 @@ v3d_irq(int irq, void *arg) struct v3d_file_priv *file = v3d->csd_job->base.file->driver_priv; u64 runtime = local_clock() - file->start_ns[V3D_CSD]; - file->enabled_ns[V3D_CSD] += local_clock() - file->start_ns[V3D_CSD]; file->jobs_sent[V3D_CSD]++; v3d->queue[V3D_CSD].jobs_sent++; @@ -195,7 +192,6 @@ 
v3d_hub_irq(int irq, void *arg) struct v3d_file_priv *file = v3d->tfu_job->base.file->driver_priv; u64 runtime = local_clock() - file->start_ns[V3D_TFU]; - file->enabled_ns[V3D_TFU] += local_clock() - file->start_ns[V3D_TFU]; file->jobs_sent[V3D_TFU]++; v3d->queue[V3D_TFU].jobs_sent++; Reviewed-by: Tvrtko Ursulin Regards, Tvrtko
Re: Proposal to add CRIU support to DRM render nodes
On 01/04/2024 17:37, Felix Kuehling wrote: On 2024-04-01 11:09, Tvrtko Ursulin wrote: On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin. Yes I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also if/when a new DRM driver might want to implement this the code could be move to some common plugin area. I am not sure how feasible that would be though. The "gpu id" concept and it's matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance or how it works? The concept of the GPU ID is that it's stable while the system is up, even when devices get added and removed dynamically. 
It was baked into the API early on, but I don't think we ever fully validated device hot plug. I think the closest we're getting is with our latest MI GPUs and dynamic partition mode change. Doesn't it read the saved gpu id from the image file while doing restore and tries to open the render node to match it? Maybe I am misreading the code.. But if it does, does it imply that in practice it could be stable across reboots? Or that it is not possible to restore to a different instance of maybe the same GPU model installed in a system? This also highlights another aspect on those spatially partitioned GPUs. GPU IDs identify device partitions, not devices. Similarly, each partition has its own render node, and the KFD topology info in sysfs points to the render-minor number corresponding to each GPU ID. I am not familiar with this. This is not SR-IOV but some other kind of partitioning? Would you have any links where I could read more? Regards, Tvrtko Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds. I've been pretty far under-water lately. I hope I'll find time to work on this more, but it's probably going to be at least a few weeks. Got it. Regards, Tvrtko Regards, Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. 
I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects a
Re: drm-misc migration to Gitlab server
On 12/03/2024 13:56, Maxime Ripard wrote:
Hi,

On Tue, Feb 20, 2024 at 09:49:25AM +0100, Maxime Ripard wrote:
## Changing the default location repo

Dim gets its repos list in the drm-rerere nightly.conf file. We will need to change that file to match the gitlab repo, and drop the old cgit URLs to avoid people pushing to the wrong place once the transition is made.

I guess the next merge window is a good time to do so, it's usually a quiet time for us and a small disruption would be easier to handle. I'll be off-duty during that time too, so I'll have time to handle any complication.

## Updating the documentation

The documentation currently mentions the old process to request a drm-misc access. It will all go through Gitlab now, so it will change a few things. We will also need to update and move the issue template to the new repo to maintain consistency.

I would expect the transition (if everything goes smoothly) to occur in the merge-window time frame (11/03 -> 24/03).

The transition just happened. The main drm-misc repo is now on gitlab, with the old cgit repo being setup as a mirror. If there's any issue accessing that gitlab repo, please let me know.

No issues accessing the repo, just slight confusion about how to handle the workflow.

More specifically, if I have a patch which wants to be merged to drm-misc-next, is the mailing list based workflow still the way to go, or should I create a merge request, or should I apply for commit access via some new method other than adding permissions to my legacy fdo ssh account?

Regards,

Tvrtko
Re: Proposal to add CRIU support to DRM render nodes
On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node. In would be probably good to consider designing things without that dependency. So that checkpointing an application which does not use /dev/kfd is possible. Or if the kernel does not even have the KFD support compiled in. Yeah, if we want to support graphics apps that don't use KFD, we should definitely do that. Currently we get a lot of topology information from KFD, not even from the /dev/kfd device but from the sysfs nodes exposed by KFD. We'd need to get GPU device info from the render nodes instead. And if KFD is available, we may need to integrate both sources of information. It could perhaps mean no more than adding some GPU discovery code into CRIU. Which shuold be flexible enough to account for things like re-assigned minor numbers due driver reload. Do you mean adding GPU discovery to the core CRIU, or to the plugin. I was thinking this is still part of the plugin. Yes I agree. I was only thinking about adding some DRM device discovery code in a more decoupled fashion from the current plugin, for both the reason discussed above (decoupling a bit from reliance on kfd sysfs), and then also if/when a new DRM driver might want to implement this the code could be move to some common plugin area. I am not sure how feasible that would be though. The "gpu id" concept and it's matching in the current kernel code and CRIU plugin - is that value tied to the physical GPU instance or how it works? Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling. And also seeing how to extend to other DRM related anonymous fds. I've been pretty far under-water lately. 
I hope I'll find time to work on this more, but it's probably going to be at least a few weeks. Got it. Regards, Tvrtko Regards, Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. 
Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Thanks, that's good to hear! Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging
Re: [PATCH] dma-buf: Do not build debugfs related code when !CONFIG_DEBUG_FS
On 01/04/2024 13:45, Christian König wrote:
On 01.04.24 at 14:39, Tvrtko Ursulin wrote:
On 29/03/2024 00:00, T.J. Mercier wrote:
On Thu, Mar 28, 2024 at 7:53 AM Tvrtko Ursulin wrote:
From: Tvrtko Ursulin

There is no point in compiling in the list and mutex operations which are only used from the dma-buf debugfs code, if debugfs is not compiled in. Put the code in question behind some kconfig guards and so save some text and maybe even a pointer per object at runtime when not enabled.

Signed-off-by: Tvrtko Ursulin

Reviewed-by: T.J. Mercier

Thanks! How would patches to dma-buf typically be landed? Via what tree, I mean? drm-misc-next?

That should go through drm-misc-next. And feel free to add Reviewed-by: Christian König as well.

Thanks! Maarten, if I got it right you are handling the next drm-misc-next pull, could you merge this one please?

Regards,

Tvrtko
Re: [PATCH] dma-buf: Do not build debugfs related code when !CONFIG_DEBUG_FS
On 29/03/2024 00:00, T.J. Mercier wrote:
On Thu, Mar 28, 2024 at 7:53 AM Tvrtko Ursulin wrote:
From: Tvrtko Ursulin

There is no point in compiling in the list and mutex operations which are only used from the dma-buf debugfs code, if debugfs is not compiled in. Put the code in question behind some kconfig guards and so save some text and maybe even a pointer per object at runtime when not enabled.

Signed-off-by: Tvrtko Ursulin

Reviewed-by: T.J. Mercier

Thanks! How would patches to dma-buf typically be landed? Via what tree, I mean? drm-misc-next?

Regards,

Tvrtko
Re: Proposal to add CRIU support to DRM render nodes
Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present, correct? AFAICT at least, it relies on that to figure out the amdgpu DRM node.

It would probably be good to consider designing things without that dependency, so that checkpointing an application which does not use /dev/kfd is possible, or even when the kernel does not have the KFD support compiled in.

It could perhaps mean no more than adding some GPU discovery code into CRIU, which should be flexible enough to account for things like re-assigned minor numbers due to a driver reload.

Otherwise I am eagerly awaiting to hear more about the design specifics around dma-buf handling, and also to see how to extend this to other DRM related anonymous fds.

Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:
On 15/03/2024 02:33, Felix Kuehling wrote:
On 2024-03-12 5:45, Tvrtko Ursulin wrote:
On 11/03/2024 14:48, Tvrtko Ursulin wrote:
Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm applications once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic.

We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes.
In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Thanks, that's good to hear! Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkout and restore so I am not mentioning it as a curiosity. And there is also another partical conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources beloning to the client. 
I don't see an immediate design or code sharing opportunities though but just mentioning. I did spend some time reading your plugin and kernel implementation out of curiousity and have some comments and questions. With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) Btw is check-pointing guaranteeing all relevant activity is idled? For instance dma_resv objects are free of fences which
[PATCH] dma-buf: Do not build debugfs related code when !CONFIG_DEBUG_FS
From: Tvrtko Ursulin

There is no point in compiling in the list and mutex operations which are only used from the dma-buf debugfs code, if debugfs is not compiled in. Put the code in question behind some kconfig guards and so save some text and maybe even a pointer per object at runtime when not enabled.

Signed-off-by: Tvrtko Ursulin
Cc: Sumit Semwal
Cc: "Christian König"
Cc: linux-me...@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-ker...@vger.kernel.org
Cc: kernel-...@igalia.com
---
 drivers/dma-buf/dma-buf.c | 56
 include/linux/dma-buf.h   |  2 ++
 2 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..8892bc701a66 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -35,12 +35,35 @@

 static inline int is_dma_buf_file(struct file *);

-struct dma_buf_list {
-	struct list_head head;
-	struct mutex lock;
-};
+#if IS_ENABLED(CONFIG_DEBUG_FS)
+static DEFINE_MUTEX(debugfs_list_mutex);
+static LIST_HEAD(debugfs_list);

-static struct dma_buf_list db_list;
+static void __dma_buf_debugfs_list_add(struct dma_buf *dmabuf)
+{
+	mutex_lock(&debugfs_list_mutex);
+	list_add(&dmabuf->list_node, &debugfs_list);
+	mutex_unlock(&debugfs_list_mutex);
+}
+
+static void __dma_buf_debugfs_list_del(struct dma_buf *dmabuf)
+{
+	if (!dmabuf)
+		return;
+
+	mutex_lock(&debugfs_list_mutex);
+	list_del(&dmabuf->list_node);
+	mutex_unlock(&debugfs_list_mutex);
+}
+#else
+static void __dma_buf_debugfs_list_add(struct dma_buf *dmabuf)
+{
+}
+
+static void __dma_buf_debugfs_list_del(struct file *file)
+{
+}
+#endif

 static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
@@ -89,17 +112,10 @@ static void dma_buf_release(struct dentry *dentry)

 static int dma_buf_file_release(struct inode *inode, struct file *file)
 {
-	struct dma_buf *dmabuf;
-
 	if (!is_dma_buf_file(file))
 		return -EINVAL;

-	dmabuf = file->private_data;
-	if (dmabuf) {
-		mutex_lock(&db_list.lock);
-		list_del(&dmabuf->list_node);
-		mutex_unlock(&db_list.lock);
-	}
+	__dma_buf_debugfs_list_del(file->private_data);

 	return 0;
 }
@@ -672,9 +688,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 	file->f_path.dentry->d_fsdata = dmabuf;
 	dmabuf->file = file;

-	mutex_lock(&db_list.lock);
-	list_add(&dmabuf->list_node, &db_list.head);
-	mutex_unlock(&db_list.lock);
+	__dma_buf_debugfs_list_add(dmabuf);

 	return dmabuf;

@@ -1611,7 +1625,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 	size_t size = 0;
 	int ret;

-	ret = mutex_lock_interruptible(&db_list.lock);
+	ret = mutex_lock_interruptible(&debugfs_list_mutex);

 	if (ret)
 		return ret;
@@ -1620,7 +1634,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 	seq_printf(s, "%-8s\t%-8s\t%-8s\t%-8s\texp_name\t%-8s\tname\n",
 		   "size", "flags", "mode", "count", "ino");

-	list_for_each_entry(buf_obj, &db_list.head, list_node) {
+	list_for_each_entry(buf_obj, &debugfs_list, list_node) {

 		ret = dma_resv_lock_interruptible(buf_obj->resv, NULL);
 		if (ret)
@@ -1657,11 +1671,11 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)

 	seq_printf(s, "\nTotal %d objects, %zu bytes\n", count, size);

-	mutex_unlock(&db_list.lock);
+	mutex_unlock(&debugfs_list_mutex);
 	return 0;

 error_unlock:
-	mutex_unlock(&db_list.lock);
+	mutex_unlock(&debugfs_list_mutex);
 	return ret;
 }

@@ -1718,8 +1732,6 @@ static int __init dma_buf_init(void)
 	if (IS_ERR(dma_buf_mnt))
 		return PTR_ERR(dma_buf_mnt);

-	mutex_init(&db_list.lock);
-	INIT_LIST_HEAD(&db_list.head);
 	dma_buf_init_debugfs();
 	return 0;
 }
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 8ff4add71f88..36216d28d8bd 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -370,8 +370,10 @@ struct dma_buf {
 	 */
 	struct module *owner;

+#if IS_ENABLED(CONFIG_DEBUG_FS)
 	/** @list_node: node for dma_buf accounting and debugging. */
 	struct list_head list_node;
+#endif

 	/** @priv: exporter specific private data for this buffer object. */
 	void *priv;
--
2.44.0
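The pattern used by the patch above (real helpers when the debug facility is built, empty stubs otherwise, and the list linkage compiled out of the object) can be tried in userspace with a small sketch. Everything here is an assumption made for illustration: `DEMO_DEBUG_FS` stands in for `CONFIG_DEBUG_FS`, `struct buf` and the `track_*()` helpers are hypothetical, a hand-rolled singly linked list replaces `struct list_head`, and the mutex is left out because the sketch is single-threaded.

```c
#include <stddef.h>

/* Hypothetical switch standing in for CONFIG_DEBUG_FS. */
#ifndef DEMO_DEBUG_FS
#define DEMO_DEBUG_FS 1
#endif

struct buf {
	size_t size;
#if DEMO_DEBUG_FS
	struct buf *next; /* linkage exists only when tracking is built */
#endif
};

#if DEMO_DEBUG_FS
static struct buf *tracked_head;

/* Add a buffer to the debug list. */
static void track_add(struct buf *b)
{
	b->next = tracked_head;
	tracked_head = b;
}

/* Remove a buffer from the debug list; NULL is tolerated. */
static void track_del(struct buf *b)
{
	struct buf **pp;

	if (!b)
		return;

	for (pp = &tracked_head; *pp; pp = &(*pp)->next) {
		if (*pp == b) {
			*pp = b->next;
			break;
		}
	}
}

/* Count tracked buffers, as a debug dump would. */
static size_t track_count(void)
{
	const struct buf *b;
	size_t n = 0;

	for (b = tracked_head; b; b = b->next)
		n++;
	return n;
}
#else
/* Stubs keep the call sites free of preprocessor conditionals. */
static void track_add(struct buf *b) { (void)b; }
static void track_del(struct buf *b) { (void)b; }
static size_t track_count(void) { return 0; }
#endif
```

Callers invoke `track_add()` and `track_del()` unconditionally. Building with `-DDEMO_DEBUG_FS=0` turns both into no-ops and drops the `next` pointer from `struct buf`, which corresponds to the "pointer per object" saving mentioned in the commit message.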
Re: [PATCH] drm/i915/gem: Replace dev_priv with i915
On 28/03/2024 07:18, Andi Shyti wrote: Anyone using 'dev_priv' instead of 'i915' in a cleaned-up area should be fined and required to do community service for a few days. I thought I had cleaned up the 'gem/' directory in the past, but still, old aficionados of the 'dev_priv' name keep sneaking it in. Signed-off-by: Andi Shyti Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Tvrtko Ursulin --- drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 4 ++-- drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 6 +++--- drivers/gpu/drm/i915/gem/i915_gem_stolen.h | 8 drivers/gpu/drm/i915/gem/i915_gem_tiling.c | 18 +- drivers/gpu/drm/i915/gem/i915_gem_userptr.c| 6 +++--- .../gpu/drm/i915/gem/selftests/huge_pages.c| 14 +++--- 6 files changed, 28 insertions(+), 28 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index 3f20fe381199..42619fc05de4 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -2456,7 +2456,7 @@ static int eb_submit(struct i915_execbuffer *eb) * The engine index is returned. */ static unsigned int -gen8_dispatch_bsd_engine(struct drm_i915_private *dev_priv, +gen8_dispatch_bsd_engine(struct drm_i915_private *i915, struct drm_file *file) { struct drm_i915_file_private *file_priv = file->driver_priv; @@ -2464,7 +2464,7 @@ gen8_dispatch_bsd_engine(struct drm_i915_private *dev_priv, /* Check whether the file_priv has already selected one ring. 
*/ if ((int)file_priv->bsd_engine < 0) file_priv->bsd_engine = - get_random_u32_below(dev_priv->engine_uabi_class_count[I915_ENGINE_CLASS_VIDEO]); + get_random_u32_below(i915->engine_uabi_class_count[I915_ENGINE_CLASS_VIDEO]); return file_priv->bsd_engine; } diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c index 38b72d86560f..c5e1c718a6d2 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c @@ -654,7 +654,7 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915, /* Allocate a new GEM object and fill it with the supplied data */ struct drm_i915_gem_object * -i915_gem_object_create_shmem_from_data(struct drm_i915_private *dev_priv, +i915_gem_object_create_shmem_from_data(struct drm_i915_private *i915, const void *data, resource_size_t size) { struct drm_i915_gem_object *obj; @@ -663,8 +663,8 @@ i915_gem_object_create_shmem_from_data(struct drm_i915_private *dev_priv, resource_size_t offset; int err; - GEM_WARN_ON(IS_DGFX(dev_priv)); - obj = i915_gem_object_create_shmem(dev_priv, round_up(size, PAGE_SIZE)); + GEM_WARN_ON(IS_DGFX(i915)); + obj = i915_gem_object_create_shmem(i915, round_up(size, PAGE_SIZE)); if (IS_ERR(obj)) return obj; diff --git a/drivers/gpu/drm/i915/gem/i915_gem_stolen.h b/drivers/gpu/drm/i915/gem/i915_gem_stolen.h index 258381d1c054..dfe0db8bb1b9 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_stolen.h +++ b/drivers/gpu/drm/i915/gem/i915_gem_stolen.h @@ -14,14 +14,14 @@ struct drm_i915_gem_object; #define i915_stolen_fb drm_mm_node -int i915_gem_stolen_insert_node(struct drm_i915_private *dev_priv, +int i915_gem_stolen_insert_node(struct drm_i915_private *i915, struct drm_mm_node *node, u64 size, unsigned alignment); -int i915_gem_stolen_insert_node_in_range(struct drm_i915_private *dev_priv, +int i915_gem_stolen_insert_node_in_range(struct drm_i915_private *i915, struct drm_mm_node *node, u64 size, unsigned alignment, u64 start, u64 end); 
-void i915_gem_stolen_remove_node(struct drm_i915_private *dev_priv, +void i915_gem_stolen_remove_node(struct drm_i915_private *i915, struct drm_mm_node *node); struct intel_memory_region * i915_gem_stolen_smem_setup(struct drm_i915_private *i915, u16 type, @@ -31,7 +31,7 @@ i915_gem_stolen_lmem_setup(struct drm_i915_private *i915, u16 type, u16 instance); struct drm_i915_gem_object * -i915_gem_object_create_stolen(struct drm_i915_private *dev_priv, +i915_gem_object_create_stolen(struct drm_i915_private *i915, resource_size_t size); bool i915_gem_object_is_stolen(const struct drm_i915_gem_object *obj); diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c index a049ca0b7980..d9eb84c1d2f1 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c +++ b/drivers/gpu/drm/i915/
Re: [PATCH v6 0/3] Disable automatic load CCS load balancing
On 20/03/2024 15:06, Andi Shyti wrote: Ping! Any thoughts here? I only casually observed the discussion after I saw Matt suggested further simplifications. As I understood it, you will bring back the uabi engine games when adding the dynamic behaviour and that is fine by me. Regards, Tvrtko On Wed, Mar 13, 2024 at 09:19:48PM +0100, Andi Shyti wrote: Hi, this series does basically two things: 1. Disables automatic load balancing as advised by the hardware workaround. 2. Assigns all the CCS slices to one single user engine. The user will then be able to query only one CCS engine. From v5 I have created a new file, gt/intel_gt_ccs_mode.c where I added the intel_gt_apply_ccs_mode(). In the upcoming patches, this file will contain the implementation for dynamic CCS mode setting. Thanks Tvrtko, Matt, John and Joonas for your reviews! Andi Changelog: v5 -> v6 (thanks Matt for the suggestions in v6) - Remove the refactoring and the for_each_available_engine() macro and instead do not create the intel_engine_cs structure at all. - In patch 1 just a trivial reordering of the bit definitions. v4 -> v5 - Use the workaround framework to do all the CCS balancing settings in order to always apply the modes also when the engine resets. Put everything in its own specific function to be executed for the first CCS engine encountered. (Thanks Matt) - Calculate the CCS ID for the CCS mode as the first available CCS among all the engines (Thanks Matt) - Create the intel_gt_ccs_mode.c file to host the CCS configuration. We will have it ready for the next series. - Fix a selftest that was failing because it could not set CCS2. - Add the for_each_available_engine() macro to exclude CCS1+ and start using it in the hangcheck selftest. v3 -> v4 - Reword correctly the comment in the workaround - Fix a buffer overflow (Thanks Joonas) - Handle properly the fused engines when setting the CCS mode. v2 -> v3 - Simplified the algorithm for creating the list of the exported uabi engines. 
(Patch 1) (Thanks, Tvrtko) - Consider the fused engines when creating the uabi engine list (Patch 2) (Thanks, Matt) - Patch 4 now uses the refactoring from patch 1, with a cleaner outcome. v1 -> v2 - In Patch 1 use the correct workaround number (thanks Matt). - In Patch 2 do not add the extra CCS engines to the exposed UABI engine list and adapt the engine counting accordingly (thanks Tvrtko). - Reword the commit of Patch 2 (thanks John). Andi Shyti (3): drm/i915/gt: Disable HW load balancing for CCS drm/i915/gt: Do not generate the command streamer for all the CCS drm/i915/gt: Enable only one CCS for compute workload drivers/gpu/drm/i915/Makefile | 1 + drivers/gpu/drm/i915/gt/intel_engine_cs.c | 20 --- drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.c | 39 + drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.h | 13 +++ drivers/gpu/drm/i915/gt/intel_gt_regs.h | 6 drivers/gpu/drm/i915/gt/intel_workarounds.c | 30 ++-- 6 files changed, 103 insertions(+), 6 deletions(-) create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.c create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.h -- 2.43.0
Re: [PATCH 2/5] drm/gem: Add a mountpoint parameter to drm_gem_object_init()
On 18/03/2024 15:05, Christian König wrote: Am 18.03.24 um 15:24 schrieb Maíra Canal: Not that the CC list wasn't big enough, but I'm adding MM folks in the CC list. On 3/18/24 11:04, Christian König wrote: Am 18.03.24 um 14:28 schrieb Maíra Canal: Hi Christian, On 3/18/24 10:10, Christian König wrote: Am 18.03.24 um 13:42 schrieb Maíra Canal: Hi Christian, On 3/12/24 10:48, Christian König wrote: Am 12.03.24 um 14:09 schrieb Tvrtko Ursulin: On 12/03/2024 10:37, Christian König wrote: Am 12.03.24 um 11:31 schrieb Tvrtko Ursulin: On 12/03/2024 10:23, Christian König wrote: Am 12.03.24 um 10:30 schrieb Tvrtko Ursulin: On 12/03/2024 08:59, Christian König wrote: Am 12.03.24 um 09:51 schrieb Tvrtko Ursulin: Hi Maira, On 11/03/2024 10:05, Maíra Canal wrote: For some applications, such as using huge pages, we might want to have a different mountpoint, for which we pass in mount flags that better match our usecase. Therefore, add a new parameter to drm_gem_object_init() that allow us to define the tmpfs mountpoint where the GEM object will be created. If this parameter is NULL, then we fallback to shmem_file_setup(). One strategy for reducing churn, and so the number of drivers this patch touches, could be to add a lower level drm_gem_object_init() (which takes vfsmount, call it __drm_gem_object_init(), or drm__gem_object_init_mnt(), and make drm_gem_object_init() call that one with a NULL argument. I would even go a step further into the other direction. The shmem backed GEM object is just some special handling as far as I can see. So I would rather suggest to rename all drm_gem_* function which only deal with the shmem backed GEM object into drm_gem_shmem_*. That makes sense although it would be very churny. I at least would be on the fence regarding the cost vs benefit. Yeah, it should clearly not be part of this patch here. Also the explanation why a different mount point helps with something isn't very satisfying. 
Not satisfying as you think it is not detailed enough to say driver wants to use huge pages for performance? Or not satisying as you question why huge pages would help? That huge pages are beneficial is clear to me, but I'm missing the connection why a different mount point helps with using huge pages. Ah right, same as in i915, one needs to mount a tmpfs instance passing huge=within_size or huge=always option. Default is 'never', see man 5 tmpfs. Thanks for the explanation, I wasn't aware of that. Mhm, shouldn't we always use huge pages? Is there a reason for a DRM device to not use huge pages with the shmem backend? AFAIU, according to b901bb89324a ("drm/i915/gemfs: enable THP"), back then the understanding was within_size may overallocate, meaning there would be some space wastage, until the memory pressure makes the thp code split the trailing huge page. I haven't checked if that still applies. Other than that I don't know if some drivers/platforms could have problems if they have some limitations or hardcoded assumptions when they iterate the sg list. Yeah, that was the whole point behind my question. As far as I can see this isn't driver specific, but platform specific. I might be wrong here, but I think we should then probably not have that handling in each individual driver, but rather centralized in the DRM code. I don't see a point in enabling THP for all shmem drivers. A huge page is only useful if the driver is going to use it. On V3D, for example, I only need huge pages because I need the memory contiguously allocated to implement Super Pages. Otherwise, if we don't have the Super Pages support implemented in the driver, I would be creating memory pressure without any performance gain. Well that's the point I'm disagreeing with. THP doesn't seem to create much extra memory pressure for this use case. 
As far as I can see, the background for the option is that files in tmpfs usually have a varying size, so it usually isn't beneficial to allocate a huge page just to find that the shmem file is much smaller than what's needed. But GEM objects have a fixed size. So we know offhand whether we need 4KiB or 1GiB and can therefore directly allocate huge pages if they are available and the object is large enough to be backed by them. If the memory pressure is so high that we don't have huge pages available the shmem code falls back to standard pages anyway. The matter is: how do we define the point where the memory pressure is high? Well as driver developers/maintainers we simply don't do that. This is the job of the shmem code. For example, notice that in this implementation of Super Pages for the V3D driver, I only use a Super Page if the BO is bigger than 2MB. I'm doing that because the Raspberry Pi only has 4GB of RAM available for the GPU. If I created huge pages for every BO allocation (and initially, I tried that), I would end up with hangs in some applications. Yeah, that
Re: Proposal to add CRIU support to DRM render nodes
On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. 
After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Thanks, that's good to hear! Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkout and restore so I am not mentioning it as a curiosity. And there is also another partical conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources beloning to the client. I don't see an immediate design or code sharing opportunities though but just mentioning. I did spend some time reading your plugin and kernel implementation out of curiousity and have some comments and questions. With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. 
Resume (final fixups after the VMAs are sorted out, resume execution) Btw is check-pointing guaranteeing all relevant activity is idled? For instance dma_resv objects are free of fences which would need to restored for things to continue executing sensibly? Or how is that handled? In our compute use cases, we suspend user mode queues. This can include CWSR (compute-wave-save-restore) where the state of in-flight waves is stored in memory and can be reloaded and resumed from memory later. We don't use any fences other than "eviction fences", that are signaled after the queues are suspended. And those fences are never handed to user mode. So we don't need to worry about any fence state in the checkpoint. If we extended this to support the kernel mode command submission APIs, I would expect that we'd wait for all current submissions to complete, and stop new ones from being sent to the HW before taking the checkpoint. When we take the checkpoint in the CRIU plugin, the CPU threads are already frozen and cannot submit any more work. If we wai
Re: [PATCH 5/5] drm/v3d: Enable super pages
Hi Maira, On 11/03/2024 10:06, Maíra Canal wrote: The V3D MMU also supports 1MB pages, called super pages. In order to set a 1MB page in the MMU, we need to make sure that the page table entries for all 4KB pages within a super page are correctly configured. Therefore, if the BO is larger than 2MB, we allocate it in a separate mountpoint that uses THP. This will allow us to create a contiguous memory region to create our super pages. In order to place the page table entries in the MMU, we iterate over the 256 4KB pages and insert the PTE. Signed-off-by: Maíra Canal --- drivers/gpu/drm/v3d/v3d_bo.c | 19 +-- drivers/gpu/drm/v3d/v3d_drv.c | 7 +++ drivers/gpu/drm/v3d/v3d_drv.h | 6 -- drivers/gpu/drm/v3d/v3d_gemfs.c | 6 ++ drivers/gpu/drm/v3d/v3d_mmu.c | 24 ++-- 5 files changed, 56 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/v3d/v3d_bo.c b/drivers/gpu/drm/v3d/v3d_bo.c index a07ede668cc1..cb8e49a33be7 100644 --- a/drivers/gpu/drm/v3d/v3d_bo.c +++ b/drivers/gpu/drm/v3d/v3d_bo.c @@ -94,6 +94,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj) struct v3d_dev *v3d = to_v3d_dev(obj->dev); struct v3d_bo *bo = to_v3d_bo(obj); struct sg_table *sgt; + u64 align; int ret; /* So far we pin the BO in the MMU for its lifetime, so use @@ -103,6 +104,9 @@ v3d_bo_create_finish(struct drm_gem_object *obj) if (IS_ERR(sgt)) return PTR_ERR(sgt); + bo->huge_pages = (obj->size >= SZ_2M && v3d->super_pages); + align = bo->huge_pages ? SZ_1M : SZ_4K; + spin_lock(&v3d->mm_lock); /* Allocate the object's space in the GPU's page tables. 
* Inserting PTEs will happen later, but the offset is for the @@ -110,7 +114,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj) */ ret = drm_mm_insert_node_generic(&v3d->mm, &bo->node, obj->size >> V3D_MMU_PAGE_SHIFT, -GMP_GRANULARITY >> V3D_MMU_PAGE_SHIFT, 0, 0); +align >> V3D_MMU_PAGE_SHIFT, 0, 0); spin_unlock(&v3d->mm_lock); if (ret) return ret; @@ -130,10 +134,21 @@ struct v3d_bo *v3d_bo_create(struct drm_device *dev, struct drm_file *file_priv, size_t unaligned_size) { struct drm_gem_shmem_object *shmem_obj; + struct v3d_dev *v3d = to_v3d_dev(dev); struct v3d_bo *bo; + size_t size; int ret; - shmem_obj = drm_gem_shmem_create(dev, unaligned_size); + size = PAGE_ALIGN(unaligned_size); + + /* To avoid memory fragmentation, we only use THP if the BO is bigger +* than two Super Pages (1MB). +*/ + if (size >= SZ_2M && v3d->super_pages) + shmem_obj = drm_gem_shmem_create_with_mnt(dev, size, v3d->gemfs); + else + shmem_obj = drm_gem_shmem_create(dev, size); + if (IS_ERR(shmem_obj)) return ERR_CAST(shmem_obj); bo = to_v3d_bo(&shmem_obj->base); diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c index 3debf37e7d9b..96f4d8227407 100644 --- a/drivers/gpu/drm/v3d/v3d_drv.c +++ b/drivers/gpu/drm/v3d/v3d_drv.c @@ -36,6 +36,11 @@ #define DRIVER_MINOR 0 #define DRIVER_PATCHLEVEL 0 +static bool super_pages = true; +module_param_named(super_pages, super_pages, bool, 0400); +MODULE_PARM_DESC(super_pages, "Enable/Disable Super Pages support. 
Note: \ + To enable Super Pages, you need support to THP."); + static int v3d_get_param_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) { @@ -308,6 +313,8 @@ static int v3d_platform_drm_probe(struct platform_device *pdev) return -ENOMEM; } + v3d->super_pages = super_pages; + ret = v3d_gem_init(drm); if (ret) goto dma_free; diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h index d2ce8222771a..795087663739 100644 --- a/drivers/gpu/drm/v3d/v3d_drv.h +++ b/drivers/gpu/drm/v3d/v3d_drv.h @@ -17,9 +17,8 @@ struct clk; struct platform_device; struct reset_control; -#define GMP_GRANULARITY (128 * 1024) - #define V3D_MMU_PAGE_SHIFT 12 +#define V3D_PAGE_FACTOR (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT) #define V3D_MAX_QUEUES (V3D_CPU + 1) @@ -123,6 +122,7 @@ struct v3d_dev { * tmpfs instance used for shmem backed objects */ struct vfsmount *gemfs; + bool super_pages; One not very important comment just in passing: Does v3d->super_pages == !!v3d->gemfs always holds at runtime? Thinking if you really need to add v3d->super_pages, or could just infer from v3d->gemfs, maybe via a wrapper or whatever pattern is used in v3d. struct work_struct overflow_mem_work; @@ -211,6 +211,8 @@ struct v3d_bo { struct list_head
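For readers following the size and alignment arithmetic in the patch above, here is a minimal userspace sketch, not the driver code itself. It assumes the thresholds stated in the commit message (4KB MMU pages, 1MB super pages, super pages only for BOs of at least 2MB). The SZ_* constants and MMU_PAGE_SHIFT are local stand-ins for the kernel definitions, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the kernel's size constants (assumption). */
#define SZ_4K          0x1000u
#define SZ_1M          0x100000u
#define SZ_2M          0x200000u
#define MMU_PAGE_SHIFT 12 /* 4KB MMU pages, as V3D_MMU_PAGE_SHIFT in the patch */

/* Hypothetical helper: does this BO qualify for super pages? */
static bool bo_uses_super_pages(size_t size, bool super_pages_enabled)
{
	return super_pages_enabled && size >= SZ_2M;
}

/* Hypothetical helper: node alignment expressed in 4KB MMU pages. */
static uint32_t bo_align_in_mmu_pages(size_t size, bool super_pages_enabled)
{
	uint32_t align = bo_uses_super_pages(size, super_pages_enabled) ?
			 SZ_1M : SZ_4K;

	return align >> MMU_PAGE_SHIFT;
}

/* Number of 4KB page table entries that make up one 1MB super page. */
static uint32_t ptes_per_super_page(void)
{
	return SZ_1M >> MMU_PAGE_SHIFT;
}
```

With these numbers, a 2MB BO is aligned to 256 MMU pages and each super page covers 256 PTEs, which matches the "iterate over the 256 4KB pages" wording in the commit message.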
Re: [PATCH 2/5] drm/gem: Add a mountpoint parameter to drm_gem_object_init()
On 12/03/2024 10:37, Christian König wrote: Am 12.03.24 um 11:31 schrieb Tvrtko Ursulin: On 12/03/2024 10:23, Christian König wrote: Am 12.03.24 um 10:30 schrieb Tvrtko Ursulin: On 12/03/2024 08:59, Christian König wrote: Am 12.03.24 um 09:51 schrieb Tvrtko Ursulin: Hi Maira, On 11/03/2024 10:05, Maíra Canal wrote: For some applications, such as using huge pages, we might want to have a different mountpoint, for which we pass in mount flags that better match our usecase. Therefore, add a new parameter to drm_gem_object_init() that allow us to define the tmpfs mountpoint where the GEM object will be created. If this parameter is NULL, then we fallback to shmem_file_setup(). One strategy for reducing churn, and so the number of drivers this patch touches, could be to add a lower level drm_gem_object_init() (which takes vfsmount, call it __drm_gem_object_init(), or drm__gem_object_init_mnt(), and make drm_gem_object_init() call that one with a NULL argument. I would even go a step further into the other direction. The shmem backed GEM object is just some special handling as far as I can see. So I would rather suggest to rename all drm_gem_* function which only deal with the shmem backed GEM object into drm_gem_shmem_*. That makes sense although it would be very churny. I at least would be on the fence regarding the cost vs benefit. Yeah, it should clearly not be part of this patch here. Also the explanation why a different mount point helps with something isn't very satisfying. Not satisfying as you think it is not detailed enough to say driver wants to use huge pages for performance? Or not satisying as you question why huge pages would help? That huge pages are beneficial is clear to me, but I'm missing the connection why a different mount point helps with using huge pages. Ah right, same as in i915, one needs to mount a tmpfs instance passing huge=within_size or huge=always option. Default is 'never', see man 5 tmpfs. 
Thanks for the explanation, I wasn't aware of that. Mhm, shouldn't we always use huge pages? Is there a reason for a DRM device to not use huge pages with the shmem backend? AFAIU, according to b901bb89324a ("drm/i915/gemfs: enable THP"), back then the understanding was that within_size may overallocate, meaning there would be some space wastage, until the memory pressure makes the THP code split the trailing huge page. I haven't checked if that still applies. Other than that I don't know if some drivers/platforms could have problems if they have some limitations or hardcoded assumptions when they iterate the sg list. The Cc is plenty large so perhaps someone else will have additional information. :) Regards, Tvrtko I mean it would make this patch here even smaller. Regards, Christian. Regards, Tvrtko
Re: [PATCH 2/5] drm/gem: Add a mountpoint parameter to drm_gem_object_init()
On 12/03/2024 10:23, Christian König wrote: Am 12.03.24 um 10:30 schrieb Tvrtko Ursulin: On 12/03/2024 08:59, Christian König wrote: Am 12.03.24 um 09:51 schrieb Tvrtko Ursulin: Hi Maira, On 11/03/2024 10:05, Maíra Canal wrote: For some applications, such as using huge pages, we might want to have a different mountpoint, for which we pass in mount flags that better match our usecase. Therefore, add a new parameter to drm_gem_object_init() that allows us to define the tmpfs mountpoint where the GEM object will be created. If this parameter is NULL, then we fallback to shmem_file_setup(). One strategy for reducing churn, and so the number of drivers this patch touches, could be to add a lower level drm_gem_object_init() (which takes vfsmount, call it __drm_gem_object_init(), or drm_gem_object_init_mnt()), and make drm_gem_object_init() call that one with a NULL argument. I would even go a step further into the other direction. The shmem backed GEM object is just some special handling as far as I can see. So I would rather suggest to rename all drm_gem_* functions which only deal with the shmem backed GEM object into drm_gem_shmem_*. That makes sense although it would be very churny. I at least would be on the fence regarding the cost vs benefit. Yeah, it should clearly not be part of this patch here. Also the explanation why a different mount point helps with something isn't very satisfying. Not satisfying as you think it is not detailed enough to say driver wants to use huge pages for performance? Or not satisfying as you question why huge pages would help? That huge pages are beneficial is clear to me, but I'm missing the connection why a different mount point helps with using huge pages. Ah right, same as in i915, one needs to mount a tmpfs instance passing huge=within_size or huge=always option. Default is 'never', see man 5 tmpfs. Regards, Tvrtko
Re: [PATCH 0/5] drm/i915: cleanup dead code
On 11/03/2024 19:27, Lucas De Marchi wrote: On Mon, Mar 11, 2024 at 05:43:00PM +, Tvrtko Ursulin wrote: On 06/03/2024 19:36, Lucas De Marchi wrote: Remove platforms that never had their PCI IDs added to the driver and are of course marked with requiring force_probe. Note that most of the code for those platforms is actually used by subsequent ones, so it's not a huge amount of code being removed. I had PVC and xehpsdv back in October but could not collect all acks. :( Last two patches from https://patchwork.freedesktop.org/series/124705/. oh... I was actually surprised we still had xehpsdv while removing a WA for PVC, which made me look into removing these platforms. rebasing your series and comparing yours..my-v2, where my-v2 only has patches 2 and 4, I have the diff below. I think it's small enough that I can just take your commits and squash delta. Is that ok to you? my version is a little bit more aggressive, also doing some renames s/xehpsdv/xehp/ and dropping some more code (engine_mask_apply_copy_fuses(), unused registers, default ctx, fw ranges). Right, yeah I see I missed some case combos in the comments when grepping and more. diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h index 8a8fcd4fceac..bc26dc126104 100644 --- a/Documentation/gpu/rfc/i915_vm_bind.h +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -93,12 +93,11 @@ struct drm_i915_gem_timeline_fence { * Multiple VA mappings can be created to the same section of the object * (aliasing). * - * The @start, @offset and @length must be 4K page aligned. However the DG2 - * and XEHPSDV has 64K page size for device local memory and has compact page - * table. On those platforms, for binding device local-memory objects, the - * @start, @offset and @length must be 64K aligned. Also, UMDs should not mix - * the local memory 64K page and the system memory 4K page bindings in the same - * 2M range. + * The @start, @offset and @length must be 4K page aligned. 
However the DG2 has + * 64K page size for device local memory and has compact page table. On that + * platform, for binding device local-memory objects, the @start, @offset and + * @length must be 64K aligned. Also, UMDs should not mix the local memory 64K + * page and the system memory 4K page bindings in the same 2M range. * * Error code -EINVAL will be returned if @start, @offset and @length are not * properly aligned. In version 1 (See I915_PARAM_VM_BIND_VERSION), error code diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h index 1495b6074492..d3300ae3053f 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h +++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h @@ -386,7 +386,7 @@ struct drm_i915_gem_object { * and kernel mode driver for caching policy control after GEN12. * In the meantime platform specific tables are created to translate * i915_cache_level into pat index, for more details check the macros - * defined i915/i915_pci.c, e.g. TGL_CACHELEVEL. + * defined i915/i915_pci.c, e.g. MTL_CACHELEVEL. Why this? 
* For backward compatibility, this field contains values exactly match * the entries of enum i915_cache_level for pre-GEN12 platforms (See * LEGACY_CACHELEVEL), so that the PTE encode functions for these diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c index fa46d2308b0e..1bd0e041e15c 100644 --- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c +++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c @@ -500,11 +500,11 @@ gen8_ppgtt_insert_pte(struct i915_ppgtt *ppgtt, } static void -xehpsdv_ppgtt_insert_huge(struct i915_address_space *vm, - struct i915_vma_resource *vma_res, - struct sgt_dma *iter, - unsigned int pat_index, - u32 flags) +xehp_ppgtt_insert_huge(struct i915_address_space *vm, + struct i915_vma_resource *vma_res, + struct sgt_dma *iter, + unsigned int pat_index, + u32 flags) { const gen8_pte_t pte_encode = vm->pte_encode(0, pat_index, flags); unsigned int rem = sg_dma_len(iter->sg); @@ -741,8 +741,8 @@ static void gen8_ppgtt_insert(struct i915_address_space *vm, struct sgt_dma iter = sgt_dma(vma_res); if (vma_res->bi.page_sizes.sg > I915_GTT_PAGE_SIZE) { - if (GRAPHICS_VER_FULL(vm->i915) >= IP_VER(12, 50)) - xehpsdv_ppgtt_insert_huge(vm, vma_res, , pat_index, flags); + if (GRAPHICS_VER_FULL(vm->i915) >= IP_VER(12, 55)) + xehp_ppgtt_insert_huge(vm, vma_
Re: Proposal to add CRIU support to DRM render nodes
On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. 
One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature on the overall, although I cannot answer on the last question here. I forgot to finish this thought. I cannot answer / don't know of any concrete plans, but I think feature is pretty cool and if amdgpu gets it working I wouldn't be surprised if other drivers would get interested. Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkout and restore so I am not mentioning it as a curiosity. And there is also another partical conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources beloning to the client. I don't see an immediate design or code sharing opportunities though but just mentioning. I did spend some time reading your plugin and kernel implementation out of curiousity and have some comments and questions. With that out of the way, some considerations for a possible DRM CRIU API (either generic of AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) Btw is check-pointing guaranteeing all relevant activity is idled? For instance dma_resv objects are free of fences which would need to restored for things to continue executing sensibly? 
Or how is that handled? For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md Potential objections to a KFD-style CRIU API in DRM render nodes, I'll address each of them in more detail below: * Opaque information in the checkpoint data that user mode can't interpret or do anything with * A second API for creating objects (e.g. BOs) that is separate from the regular BO creation API * Kernel mode would need to be involved in restoring BO sharing relationships rather than replaying BO creation, export and import from user mode # Opaque information in the checkpoint This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the pl
Re: [PATCH 2/5] drm/gem: Add a mountpoint parameter to drm_gem_object_init()
On 12/03/2024 08:59, Christian König wrote: On 12.03.24 at 09:51, Tvrtko Ursulin wrote: Hi Maira, On 11/03/2024 10:05, Maíra Canal wrote: For some applications, such as using huge pages, we might want to have a different mountpoint, for which we pass in mount flags that better match our usecase. Therefore, add a new parameter to drm_gem_object_init() that allows us to define the tmpfs mountpoint where the GEM object will be created. If this parameter is NULL, then we fall back to shmem_file_setup(). One strategy for reducing churn, and so the number of drivers this patch touches, could be to add a lower level drm_gem_object_init() (which takes vfsmount, call it __drm_gem_object_init(), or drm_gem_object_init_mnt()), and make drm_gem_object_init() call that one with a NULL argument. I would even go a step further into the other direction. The shmem backed GEM object is just some special handling as far as I can see. So I would rather suggest to rename all drm_gem_* functions which only deal with the shmem backed GEM object into drm_gem_shmem_*. That makes sense although it would be very churny. I at least would be on the fence regarding the cost vs benefit. Also the explanation why a different mount point helps with something isn't very satisfying. Not satisfying as in you think it is not detailed enough to say the driver wants to use huge pages for performance? Or not satisfying as in you question why huge pages would help? Regards, Tvrtko
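The churn-reducing idea discussed above (a lower level initialiser that takes the mount, with the existing entry point forwarding NULL) can be sketched in plain userspace C. This is only a minimal illustration under stated assumptions: `struct vfsmount_stub`, `struct gem_object_stub`, `gem_object_init_mnt()` and `gem_object_init()` are hypothetical stand-ins invented here, not the kernel's types or the functions proposed in the series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct vfsmount_stub {
	const char *options;
};

struct gem_object_stub {
	size_t size;
	const struct vfsmount_stub *backing; /* NULL means the default shm mount */
};

/* Lower level initialiser: takes an explicit mount, which may be NULL. */
static int gem_object_init_mnt(struct gem_object_stub *obj, size_t size,
			       const struct vfsmount_stub *mnt)
{
	assert(obj != NULL);

	if (size == 0)
		return -1;

	obj->size = size;
	obj->backing = mnt;
	return 0;
}

/* Existing entry point keeps its signature, so current callers are untouched. */
static int gem_object_init(struct gem_object_stub *obj, size_t size)
{
	return gem_object_init_mnt(obj, size, NULL);
}
```

With this shape only the drivers that want a private mount need to call the lower level function; every other caller compiles unchanged.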
Re: [PATCH 3/5] drm/v3d: Introduce gemfs
Hi, On 11/03/2024 10:06, Maíra Canal wrote: Create a separate "tmpfs" kernel mount for V3D. This will allow us to move away from the shmemfs `shm_mnt` and gives the flexibility to do things like set our own mount options. Here, the interest is to use "huge=", which should allow us to enable the use of THP for our shmem-backed objects. Signed-off-by: Maíra Canal --- drivers/gpu/drm/v3d/Makefile| 3 ++- drivers/gpu/drm/v3d/v3d_drv.h | 9 +++ drivers/gpu/drm/v3d/v3d_gem.c | 3 +++ drivers/gpu/drm/v3d/v3d_gemfs.c | 46 + 4 files changed, 60 insertions(+), 1 deletion(-) create mode 100644 drivers/gpu/drm/v3d/v3d_gemfs.c diff --git a/drivers/gpu/drm/v3d/Makefile b/drivers/gpu/drm/v3d/Makefile index b7d673f1153b..fcf710926057 100644 --- a/drivers/gpu/drm/v3d/Makefile +++ b/drivers/gpu/drm/v3d/Makefile @@ -13,7 +13,8 @@ v3d-y := \ v3d_trace_points.o \ v3d_sched.o \ v3d_sysfs.o \ - v3d_submit.o + v3d_submit.o \ + v3d_gemfs.o v3d-$(CONFIG_DEBUG_FS) += v3d_debugfs.o diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h index 1950c723dde1..d2ce8222771a 100644 --- a/drivers/gpu/drm/v3d/v3d_drv.h +++ b/drivers/gpu/drm/v3d/v3d_drv.h @@ -119,6 +119,11 @@ struct v3d_dev { struct drm_mm mm; spinlock_t mm_lock; + /* +* tmpfs instance used for shmem backed objects +*/ + struct vfsmount *gemfs; + struct work_struct overflow_mem_work; struct v3d_bin_job *bin_job; @@ -519,6 +524,10 @@ void v3d_reset(struct v3d_dev *v3d); void v3d_invalidate_caches(struct v3d_dev *v3d); void v3d_clean_caches(struct v3d_dev *v3d); +/* v3d_gemfs.c */ +void v3d_gemfs_init(struct v3d_dev *v3d); +void v3d_gemfs_fini(struct v3d_dev *v3d); + /* v3d_submit.c */ void v3d_job_cleanup(struct v3d_job *job); void v3d_job_put(struct v3d_job *job); diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c index 66f4b78a6b2e..faefbe497e8d 100644 --- a/drivers/gpu/drm/v3d/v3d_gem.c +++ b/drivers/gpu/drm/v3d/v3d_gem.c @@ -287,6 +287,8 @@ v3d_gem_init(struct drm_device *dev) 
v3d_init_hw_state(v3d); v3d_mmu_set_page_table(v3d); + v3d_gemfs_init(v3d); + ret = v3d_sched_init(v3d); if (ret) { drm_mm_takedown(&v3d->mm); @@ -304,6 +306,7 @@ v3d_gem_destroy(struct drm_device *dev) struct v3d_dev *v3d = to_v3d_dev(dev); v3d_sched_fini(v3d); + v3d_gemfs_fini(v3d); /* Waiting for jobs to finish would need to be done before * unregistering V3D. diff --git a/drivers/gpu/drm/v3d/v3d_gemfs.c b/drivers/gpu/drm/v3d/v3d_gemfs.c new file mode 100644 index ..8518b7da6f73 --- /dev/null +++ b/drivers/gpu/drm/v3d/v3d_gemfs.c @@ -0,0 +1,46 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* Copyright (C) 2024 Raspberry Pi */ + +#include +#include + +#include "v3d_drv.h" + +void v3d_gemfs_init(struct v3d_dev *v3d) +{ + char huge_opt[] = "huge=always"; Using 'always' and not 'within_size' is deliberate? It can waste memory but indeed could be best for performance. I am just asking and perhaps I missed some prior discussion on this. Regards, Tvrtko + struct file_system_type *type; + struct vfsmount *gemfs; + + /* +* By creating our own shmemfs mountpoint, we can pass in +* mount flags that better match our usecase. However, we +* only do so on platforms which benefit from it. +*/ + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) + goto err; + + type = get_fs_type("tmpfs"); + if (!type) + goto err; + + gemfs = vfs_kern_mount(type, SB_KERNMOUNT, type->name, huge_opt); + if (IS_ERR(gemfs)) + goto err; + + v3d->gemfs = gemfs; + drm_info(&v3d->drm, "Using Transparent Hugepages\n"); + + return; + +err: + v3d->gemfs = NULL; + drm_notice(&v3d->drm, + "Transparent Hugepage support is recommended for optimal performance on this platform!\n"); +} + +void v3d_gemfs_fini(struct v3d_dev *v3d) +{ + if (v3d->gemfs) + kern_unmount(v3d->gemfs); +} -- 2.43.0
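To put a rough number on the "it can waste memory" remark about huge=always versus huge=within_size, here is a deliberately simplified worst-case model. It is a sketch, not a description of what shmem actually does at runtime: it assumes 4 KiB base pages and 2 MiB huge pages (neither is stated in the thread and both depend on the platform configuration), assumes every page of the object gets populated, and ignores any later splitting or reclaim of huge pages. The function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page sizes; the real values depend on the kernel configuration. */
#define BASE_PAGE_BYTES 4096ull
#define HUGE_PAGE_BYTES (2ull * 1024 * 1024)

static uint64_t round_up_to(uint64_t value, uint64_t align)
{
	assert(align != 0);
	return (value + align - 1) / align * align;
}

/* Worst case for "always": every part of the object is backed by huge pages. */
static uint64_t backing_huge_always(uint64_t object_bytes)
{
	return round_up_to(object_bytes, HUGE_PAGE_BYTES);
}

/*
 * Worst case for "within_size": huge pages only where they fit entirely
 * inside the object, base pages for the remaining tail.
 */
static uint64_t backing_huge_within_size(uint64_t object_bytes)
{
	uint64_t huge_part = object_bytes / HUGE_PAGE_BYTES * HUGE_PAGE_BYTES;

	return huge_part + round_up_to(object_bytes - huge_part, BASE_PAGE_BYTES);
}
```

Under these assumptions a single-page object is the extreme case: the "always" model charges a full huge page for it, while the "within_size" model charges one base page. For objects that are an exact multiple of the huge page size the two models agree.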
Re: [PATCH 2/5] drm/gem: Add a mountpoint parameter to drm_gem_object_init()
Hi Maira, On 11/03/2024 10:05, Maíra Canal wrote: For some applications, such as using huge pages, we might want to have a different mountpoint, for which we pass in mount flags that better match our usecase. Therefore, add a new parameter to drm_gem_object_init() that allows us to define the tmpfs mountpoint where the GEM object will be created. If this parameter is NULL, then we fall back to shmem_file_setup(). One strategy for reducing churn, and so the number of drivers this patch touches, could be to add a lower level drm_gem_object_init() (which takes vfsmount, call it __drm_gem_object_init(), or drm_gem_object_init_mnt()), and make drm_gem_object_init() call that one with a NULL argument. Regards, Tvrtko Cc: Russell King Cc: Lucas Stach Cc: Christian Gmeiner Cc: Inki Dae Cc: Seung-Woo Kim Cc: Kyungmin Park Cc: Krzysztof Kozlowski Cc: Alim Akhtar Cc: Patrik Jakobsson Cc: Sui Jingfeng Cc: Chun-Kuang Hu Cc: Philipp Zabel Cc: Matthias Brugger Cc: AngeloGioacchino Del Regno Cc: Rob Clark Cc: Abhinav Kumar Cc: Dmitry Baryshkov Cc: Sean Paul Cc: Marijn Suijten Cc: Karol Herbst Cc: Lyude Paul Cc: Danilo Krummrich Cc: Tomi Valkeinen Cc: Gerd Hoffmann Cc: Sandy Huang Cc: "Heiko Stübner" Cc: Andy Yan Cc: Thierry Reding Cc: Mikko Perttunen Cc: Jonathan Hunter Cc: Christian König Cc: Huang Rui Cc: Oleksandr Andrushchenko Cc: Karolina Stolarek Cc: Andi Shyti Signed-off-by: Maíra Canal --- drivers/gpu/drm/armada/armada_gem.c | 2 +- drivers/gpu/drm/drm_gem.c | 12 ++-- drivers/gpu/drm/drm_gem_dma_helper.c | 2 +- drivers/gpu/drm/drm_gem_shmem_helper.c| 2 +- drivers/gpu/drm/drm_gem_vram_helper.c | 2 +- drivers/gpu/drm/etnaviv/etnaviv_gem.c | 2 +- drivers/gpu/drm/exynos/exynos_drm_gem.c | 2 +- drivers/gpu/drm/gma500/gem.c | 2 +- drivers/gpu/drm/loongson/lsdc_ttm.c | 2 +- drivers/gpu/drm/mediatek/mtk_drm_gem.c| 2 +- drivers/gpu/drm/msm/msm_gem.c | 2 +- drivers/gpu/drm/nouveau/nouveau_gem.c | 2 +- drivers/gpu/drm/nouveau/nouveau_prime.c | 2 +- drivers/gpu/drm/omapdrm/omap_gem.c| 2 
+- drivers/gpu/drm/qxl/qxl_object.c | 2 +- drivers/gpu/drm/rockchip/rockchip_drm_gem.c | 2 +- drivers/gpu/drm/tegra/gem.c | 2 +- drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c | 2 +- drivers/gpu/drm/xen/xen_drm_front_gem.c | 2 +- include/drm/drm_gem.h | 3 ++- 20 files changed, 30 insertions(+), 21 deletions(-) diff --git a/drivers/gpu/drm/armada/armada_gem.c b/drivers/gpu/drm/armada/armada_gem.c index 26d10065d534..36a25e667341 100644 --- a/drivers/gpu/drm/armada/armada_gem.c +++ b/drivers/gpu/drm/armada/armada_gem.c @@ -226,7 +226,7 @@ static struct armada_gem_object *armada_gem_alloc_object(struct drm_device *dev, obj->obj.funcs = &armada_gem_object_funcs; - if (drm_gem_object_init(dev, &obj->obj, size)) { + if (drm_gem_object_init(dev, &obj->obj, size, NULL)) { kfree(obj); return NULL; } diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index 44a948b80ee1..ddd8777fcda5 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -118,18 +118,26 @@ drm_gem_init(struct drm_device *dev) * @dev: drm_device the object should be initialized for * @obj: drm_gem_object to initialize * @size: object size + * @gemfs: tmpfs mount where the GEM object will be created. If NULL, use + * the usual tmpfs mountpoint (`shm_mnt`). * * Initialize an already allocated GEM object of the specified size with * shmfs backing store. 
*/ int drm_gem_object_init(struct drm_device *dev, - struct drm_gem_object *obj, size_t size) + struct drm_gem_object *obj, size_t size, + struct vfsmount *gemfs) { struct file *filp; drm_gem_private_object_init(dev, obj, size); - filp = shmem_file_setup("drm mm object", size, VM_NORESERVE); + if (gemfs) + filp = shmem_file_setup_with_mnt(gemfs, "drm mm object", size, +VM_NORESERVE); + else + filp = shmem_file_setup("drm mm object", size, VM_NORESERVE); + if (IS_ERR(filp)) return PTR_ERR(filp); diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c b/drivers/gpu/drm/drm_gem_dma_helper.c index 870b90b78bc4..9ada5ac85dd6 100644 --- a/drivers/gpu/drm/drm_gem_dma_helper.c +++ b/drivers/gpu/drm/drm_gem_dma_helper.c @@ -95,7 +95,7 @@ __drm_gem_dma_create(struct drm_device *drm, size_t size, bool private) /* Always use writecombine for dma-buf mappings */ dma_obj->map_noncoherent = false; } else { - ret = drm_gem_object_init(drm, gem_obj, size); + ret =
Re: [PATCH 0/5] drm/i915: cleanup dead code
On 06/03/2024 19:36, Lucas De Marchi wrote: Remove platforms that never had their PCI IDs added to the driver and are of course marked with requiring force_probe. Note that most of the code for those platforms is actually used by subsequent ones, so it's not a huge amount of code being removed. I had PVC and xehpsdv back in October but could not collect all acks. :( Last two patches from https://patchwork.freedesktop.org/series/124705/. Regards, Tvrtko drivers/gpu/drm/xe/compat-i915-headers/i915_drv.h is also changed on the xe side, but that should be ok: the defines are there only for compat reasons while building the display side (and none of these platforms have display, so it's build-issue only). First patch is what motivated the others and was submitted alone @ 20240306144723.1826977-1-lucas.demar...@intel.com . While loooking at this WA I was wondering why we still had some of that code around. Build-tested only for now. Lucas De Marchi (5): drm/i915: Drop WA 16015675438 drm/i915: Drop dead code for xehpsdv drm/i915: Update IP_VER(12, 50) drm/i915: Drop dead code for pvc drm/i915: Remove special handling for !RCS_MASK() Documentation/gpu/rfc/i915_vm_bind.h | 11 +- .../gpu/drm/i915/gem/i915_gem_object_types.h | 2 +- .../gpu/drm/i915/gem/selftests/huge_pages.c | 4 +- .../i915/gem/selftests/i915_gem_client_blt.c | 8 +- drivers/gpu/drm/i915/gt/gen8_engine_cs.c | 5 +- drivers/gpu/drm/i915/gt/gen8_ppgtt.c | 40 ++-- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 43 +--- .../drm/i915/gt/intel_execlists_submission.c | 10 +- drivers/gpu/drm/i915/gt/intel_gsc.c | 15 -- drivers/gpu/drm/i915/gt/intel_gt.c| 4 +- drivers/gpu/drm/i915/gt/intel_gt_mcr.c| 52 + drivers/gpu/drm/i915/gt/intel_gt_mcr.h| 2 +- drivers/gpu/drm/i915/gt/intel_gt_regs.h | 59 -- drivers/gpu/drm/i915/gt/intel_gt_sysfs_pm.c | 21 +- drivers/gpu/drm/i915/gt/intel_gtt.c | 2 +- drivers/gpu/drm/i915/gt/intel_lrc.c | 51 + drivers/gpu/drm/i915/gt/intel_migrate.c | 22 +- drivers/gpu/drm/i915/gt/intel_mocs.c | 52 + 
drivers/gpu/drm/i915/gt/intel_rps.c | 6 +- drivers/gpu/drm/i915/gt/intel_sseu.c | 13 +- drivers/gpu/drm/i915/gt/intel_workarounds.c | 193 +- drivers/gpu/drm/i915/gt/uc/intel_guc.c| 6 +- drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c| 4 +- drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c | 2 +- drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h | 1 - .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 +- drivers/gpu/drm/i915/gt/uc/intel_uc.c | 4 - drivers/gpu/drm/i915/i915_debugfs.c | 12 -- drivers/gpu/drm/i915/i915_drv.h | 13 -- drivers/gpu/drm/i915/i915_getparam.c | 4 +- drivers/gpu/drm/i915/i915_gpu_error.c | 5 +- drivers/gpu/drm/i915/i915_hwmon.c | 6 - drivers/gpu/drm/i915/i915_pci.c | 61 +- drivers/gpu/drm/i915/i915_perf.c | 19 +- drivers/gpu/drm/i915/i915_query.c | 2 +- drivers/gpu/drm/i915/i915_reg.h | 4 +- drivers/gpu/drm/i915/intel_clock_gating.c | 26 +-- drivers/gpu/drm/i915/intel_device_info.c | 2 - drivers/gpu/drm/i915/intel_device_info.h | 2 - drivers/gpu/drm/i915/intel_step.c | 80 +--- drivers/gpu/drm/i915/intel_uncore.c | 159 +-- drivers/gpu/drm/i915/selftests/intel_uncore.c | 3 - .../gpu/drm/xe/compat-i915-headers/i915_drv.h | 6 - 43 files changed, 110 insertions(+), 928 deletions(-)
Re: Proposal to add CRIU support to DRM render nodes
Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on render nodes for more GPU memory management. In this email I'm providing some background why we are doing this, and outlining some of the problems we need to solve to checkpoint and restore render node state and shared memory (DMABuf) state. I have some thoughts on the API design, leaning on what we did for KFD, but would like to get feedback from the DRI community regarding that API and to what extent there is interest in making that generic. We are working on using DRM render nodes for virtual address mappings in ROCm applications to implement the CUDA11-style VM API and improve interoperability between graphics and compute. This uses DMABufs for sharing buffer objects between KFD and multiple render node devices, as well as between processes. In the long run this also provides a path to moving all or most memory management from the KFD ioctl API to libdrm. Once ROCm user mode starts using render nodes for virtual address management, that creates a problem for checkpointing and restoring ROCm applications with CRIU. Currently there is no support for checkpointing and restoring render node state, other than CPU virtual address mappings. Support will be needed for checkpointing GEM buffer objects and handles, their GPU virtual address mappings and memory sharing relationships between devices and processes. Eventually, if full CRIU support for graphics applications is desired, more state would need to be captured, including scheduler contexts and BO lists. Most of this state is driver-specific. After some internal discussions we decided to take our design process public as this potentially touches DRM GEM and DMABuf APIs and may have implications for other drivers in the future. 
One basic question before going into any API details: Is there a desire to have CRIU support for other DRM drivers? This sounds like a very interesting feature overall, although I cannot answer the last question here. Funnily enough, it has a tiny relation to an i915 feature I recently implemented on Mesa's request, which is to be able to "upload" the GPU context from the GPU hang error state and replay the hanging request. It is kind of (at a stretch) a very special tiny subset of checkpoint and restore, so I am just mentioning it as a curiosity. And there is also another partial conceptual intersect with the (at the moment not yet upstream) i915 online debugger. This part being in the area of discovering and enumerating GPU resources belonging to the client. I don't see any immediate design or code sharing opportunities though, just mentioning it. I did spend some time reading your plugin and kernel implementation out of curiosity and have some comments and questions. With that out of the way, some considerations for a possible DRM CRIU API (either generic or AMDGPU driver specific): The API goes through several phases during checkpoint and restore: Checkpoint: 1. Process-info (enumerates objects and sizes so user mode can allocate memory for the checkpoint, stops execution on the GPU) 2. Checkpoint (store object metadata for BOs, queues, etc.) 3. Unpause (resumes execution after the checkpoint is complete) Restore: 1. Restore (restore objects, VMAs are not in the right place at this time) 2. Resume (final fixups after the VMAs are sorted out, resume execution) Btw, does checkpointing guarantee that all relevant activity is idled? For instance, that dma_resv objects are free of fences which would need to be restored for things to continue executing sensibly? Or how is that handled? 
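The phase ordering listed above can be written down as a small state machine, which makes it easier to see which calls are legal at which point. The sketch below is an assumption-laden illustration only: the enum names, the states and the strictness of the ordering (for example, that Unpause is only valid after Checkpoint) are invented here and are not taken from the KFD ioctl API or the CRIU plugin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation and state names, modelled on the phase list above. */
enum criu_op {
	OP_PROCESS_INFO,
	OP_CHECKPOINT,
	OP_UNPAUSE,
	OP_RESTORE,
	OP_RESUME,
};

enum criu_state {
	ST_RUNNING,      /* GPU work executing normally */
	ST_PAUSED,       /* after Process-info: objects enumerated, GPU stopped */
	ST_CHECKPOINTED, /* after Checkpoint: object metadata stored */
	ST_EMPTY,        /* fresh process about to be restored */
	ST_RESTORED,     /* after Restore: objects exist, VMAs not yet fixed up */
	ST_INVALID,
};

/* Apply one operation; any out-of-order call yields ST_INVALID. */
static enum criu_state criu_step(enum criu_state state, enum criu_op op)
{
	switch (op) {
	case OP_PROCESS_INFO:
		return state == ST_RUNNING ? ST_PAUSED : ST_INVALID;
	case OP_CHECKPOINT:
		return state == ST_PAUSED ? ST_CHECKPOINTED : ST_INVALID;
	case OP_UNPAUSE:
		return state == ST_CHECKPOINTED ? ST_RUNNING : ST_INVALID;
	case OP_RESTORE:
		return state == ST_EMPTY ? ST_RESTORED : ST_INVALID;
	case OP_RESUME:
		return state == ST_RESTORED ? ST_RUNNING : ST_INVALID;
	}
	return ST_INVALID;
}

/* Run a sequence of operations, stopping at the first invalid transition. */
static enum criu_state criu_run(enum criu_state start,
				const enum criu_op *ops, size_t count)
{
	enum criu_state state = start;
	size_t i;

	assert(ops != NULL || count == 0);

	for (i = 0; i < count && state != ST_INVALID; i++)
		state = criu_step(state, ops[i]);

	return state;
}
```

In this model both the checkpoint sequence and the restore sequence end in the running state, which is where the question about idling matters: anything still in flight when Process-info returns would have to be part of the captured state.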
For some more background about our implementation in KFD, you can refer to this whitepaper: https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md Potential objections to a KFD-style CRIU API in DRM render nodes, I'll address each of them in more detail below: * Opaque information in the checkpoint data that user mode can't interpret or do anything with * A second API for creating objects (e.g. BOs) that is separate from the regular BO creation API * Kernel mode would need to be involved in restoring BO sharing relationships rather than replaying BO creation, export and import from user mode # Opaque information in the checkpoint This comes out of ABI compatibility considerations. Adding any new objects or attributes to the driver/HW state that needs to be checkpointed could potentially break the ABI of the CRIU checkpoint/restore ioctl if the plugin needs to parse that information. Therefore, much of the information in our KFD CRIU ioctl API is opaque. It is written by kernel mode in the checkpoint, it is consumed by kernel mode when restoring the checkpoint, but user mode doesn't care about the contents or binary
Re: [PATCH v3 1/1] drm/panfrost: Replace fdinfo's profiling debugfs knob with sysfs
On 06/03/2024 01:56, Adrián Larumbe wrote: Debugfs isn't always available in production builds that try to squeeze every single byte out of the kernel image, but we still need a way to toggle the timestamp and cycle counter registers so that jobs can be profiled for fdinfo's drm engine and cycle calculations. Drop the debugfs knob and replace it with a sysfs file that accomplishes the same functionality, and document its ABI in a separate file. Signed-off-by: Adrián Larumbe --- .../testing/sysfs-driver-panfrost-profiling | 10 + Documentation/gpu/panfrost.rst| 9 drivers/gpu/drm/panfrost/Makefile | 2 - drivers/gpu/drm/panfrost/panfrost_debugfs.c | 21 -- drivers/gpu/drm/panfrost/panfrost_debugfs.h | 14 --- drivers/gpu/drm/panfrost/panfrost_device.h| 2 +- drivers/gpu/drm/panfrost/panfrost_drv.c | 41 --- drivers/gpu/drm/panfrost/panfrost_job.c | 2 +- 8 files changed, 57 insertions(+), 44 deletions(-) create mode 100644 Documentation/ABI/testing/sysfs-driver-panfrost-profiling delete mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.c delete mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.h diff --git a/Documentation/ABI/testing/sysfs-driver-panfrost-profiling b/Documentation/ABI/testing/sysfs-driver-panfrost-profiling new file mode 100644 index ..1d8bb0978920 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-driver-panfrost-profiling @@ -0,0 +1,10 @@ +What: /sys/bus/platform/drivers/panfrost/.../profiling +Date: February 2024 +KernelVersion: 6.8.0 +Contact: Adrian Larumbe +Description: + Get/set drm fdinfo's engine and cycles profiling status. + Valid values are: + 0: Don't enable fdinfo job profiling sources. + 1: Enable fdinfo job profiling sources, this enables both the GPU's + timestamp and cycle counter registers. 
\ No newline at end of file diff --git a/Documentation/gpu/panfrost.rst b/Documentation/gpu/panfrost.rst index b80e41f4b2c5..51ba375fd80d 100644 --- a/Documentation/gpu/panfrost.rst +++ b/Documentation/gpu/panfrost.rst @@ -38,3 +38,12 @@ the currently possible format options: Possible `drm-engine-` key names are: `fragment`, and `vertex-tiler`. `drm-curfreq-` values convey the current operating frequency for that engine. + +Users must bear in mind that engine and cycle sampling are disabled by default, +because of power saving concerns. `fdinfo` users and benchmark applications which +query the fdinfo file must make sure to toggle the job profiling status of the +driver by writing into the appropriate sysfs node:: + +echo <N> > /sys/bus/platform/drivers/panfrost/[a-f0-9]*.gpu/profiling A late thought - how would it work to not output the inactive fdinfo keys when this knob is not enabled? Generic userspace like gputop already handles that and wouldn't show the stat. Which may be more user friendly than showing stats permanently at zero. It may be moot once you add the auto-toggle to gputop (or so) but perhaps worth considering. Regards, Tvrtko + +Where `N` is either `0` or `1`, depending on the desired enablement status. diff --git a/drivers/gpu/drm/panfrost/Makefile b/drivers/gpu/drm/panfrost/Makefile index 2c01c1e7523e..7da2b3f02ed9 100644 --- a/drivers/gpu/drm/panfrost/Makefile +++ b/drivers/gpu/drm/panfrost/Makefile @@ -12,6 +12,4 @@ panfrost-y := \ panfrost_perfcnt.o \ panfrost_dump.o -panfrost-$(CONFIG_DEBUG_FS) += panfrost_debugfs.o - obj-$(CONFIG_DRM_PANFROST) += panfrost.o diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.c b/drivers/gpu/drm/panfrost/panfrost_debugfs.c deleted file mode 100644 index 72d4286a6bf7.. --- a/drivers/gpu/drm/panfrost/panfrost_debugfs.c +++ /dev/null @@ -1,21 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* Copyright 2023 Collabora ltd. */ -/* Copyright 2023 Amazon.com, Inc. or its affiliates. 
*/ - -#include -#include -#include -#include -#include - -#include "panfrost_device.h" -#include "panfrost_gpu.h" -#include "panfrost_debugfs.h" - -void panfrost_debugfs_init(struct drm_minor *minor) -{ - struct drm_device *dev = minor->dev; - struct panfrost_device *pfdev = platform_get_drvdata(to_platform_device(dev->dev)); - - debugfs_create_atomic_t("profile", 0600, minor->debugfs_root, &pfdev->profile_mode); -} diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.h b/drivers/gpu/drm/panfrost/panfrost_debugfs.h deleted file mode 100644 index c5af5f35877f.. --- a/drivers/gpu/drm/panfrost/panfrost_debugfs.h +++ /dev/null @@ -1,14 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/* - * Copyright 2023 Collabora ltd. - * Copyright 2023 Amazon.com, Inc. or its affiliates. - */ - -#ifndef PANFROST_DEBUGFS_H -#define PANFROST_DEBUGFS_H - -#ifdef CONFIG_DEBUG_FS -void panfrost_debugfs_init(struct drm_minor *minor); -#endif - -#endif /*
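For a benchmark tool that wants to toggle the profiling knob from the patch above programmatically rather than with echo, a small userspace helper could look like the following. This is a hedged sketch: `set_profiling()` is a made-up name, and the caller has to supply the concrete sysfs path for its device, since the documentation only gives it as a glob pattern.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "0" or "1" to a profiling knob file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int set_profiling(const char *path, int enable)
{
	FILE *file;
	int ok;

	assert(path != NULL);

	file = fopen(path, "w");
	if (!file)
		return -1;

	ok = fputs(enable ? "1\n" : "0\n", file) >= 0;

	/* Buffered data is flushed on close, so a write error can surface here. */
	if (fclose(file) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A tool would typically enable the knob before sampling fdinfo and restore the previous value afterwards, given that the documentation says sampling is off by default for power saving reasons.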
[PATCH] MAINTAINERS: Update email address for Tvrtko Ursulin
From: Tvrtko Ursulin I will lose access to my @.*intel.com e-mail addresses soon so let me adjust the maintainers entry and update the mailmap too. While at it consolidate a few other of my old emails to point to the main one. Signed-off-by: Tvrtko Ursulin Cc: Daniel Vetter Cc: Dave Airlie Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi --- .mailmap| 5 + MAINTAINERS | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/.mailmap b/.mailmap index b99a238ee3bd..d67e351bce8e 100644 --- a/.mailmap +++ b/.mailmap @@ -608,6 +608,11 @@ TripleX Chung TripleX Chung Tsuneo Yoshioka Tudor Ambarus +Tvrtko Ursulin +Tvrtko Ursulin +Tvrtko Ursulin +Tvrtko Ursulin +Tvrtko Ursulin Tycho Andersen Tzung-Bi Shih Uwe Kleine-König diff --git a/MAINTAINERS b/MAINTAINERS index 19f6f8014f94..b940bfe2a692 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10734,7 +10734,7 @@ INTEL DRM I915 DRIVER (Meteor Lake, DG2 and older excluding Poulsbo, Moorestown M: Jani Nikula M: Joonas Lahtinen M: Rodrigo Vivi -M: Tvrtko Ursulin +M: Tvrtko Ursulin L: intel-...@lists.freedesktop.org S: Supported W: https://drm.pages.freedesktop.org/intel-docs/ -- 2.40.1
[PULL] drm-intel-gt-next
Hi Dave, Sima, Last drm-intel-gt-next pull request for 6.9. There are only two small fixes in there so it could also wait for the -next-fixes round if that would be preferred. One fix is for a kerneldoc warning and the other for a very unlikely userptr object creation failure where cleanup would oops. Regards, Tvrtko drm-intel-gt-next-2024-02-28: Driver Changes: Fixes: - Add some boring kerneldoc (Tvrtko Ursulin) - Check before removing mm notifier (Nirmoy Das) The following changes since commit eb927f01dfb6309c8a184593c2c0618c4000c481: drm/i915/gt: Restart the heartbeat timer when forcing a pulse (2024-02-14 17:17:35 -0800) are available in the Git repository at: git://anongit.freedesktop.org/drm/drm-intel tags/drm-intel-gt-next-2024-02-28 for you to fetch changes up to db7bbd13f08774cde0332c705f042e327fe21e73: drm/i915: Check before removing mm notifier (2024-02-28 13:11:32 +) Driver Changes: Fixes: - Add some boring kerneldoc (Tvrtko Ursulin) - Check before removing mm notifier (Nirmoy Das) Nirmoy Das (1): drm/i915: Check before removing mm notifier Tvrtko Ursulin (1): drm/i915: Add some boring kerneldoc drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++ include/uapi/drm/i915_drm.h | 4 2 files changed, 7 insertions(+)
Re: [PATCH] drm/i915: check before removing mm notifier
On 27/02/2024 09:26, Nirmoy Das wrote: Hi Tvrtko, On 2/27/2024 10:04 AM, Tvrtko Ursulin wrote: On 21/02/2024 11:52, Nirmoy Das wrote: Merged it to drm-intel-gt-next with s/check/Check Shouldn't this have had: Fixes: ed29c2691188 ("drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.") Cc: # v5.13+ ? Yes. Sorry, I missed that. Can we still add the tag? I've added them and force pushed the branch since the commit was still at the top. FYI + Jani, Joonas and Rodrigo Regards, Tvrtko Thanks, Nirmoy Regards, Tvrtko On 2/19/2024 1:50 PM, Nirmoy Das wrote: Error in mmu_interval_notifier_insert() can leave a NULL notifier.mm pointer. Catch that and return early. Cc: Andi Shyti Cc: Shawn Lee Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c index 0e21ce9d3e5a..61abfb505766 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c @@ -349,6 +349,9 @@ i915_gem_userptr_release(struct drm_i915_gem_object *obj) { GEM_WARN_ON(obj->userptr.page_ref); + if (!obj->userptr.notifier.mm) + return; + mmu_interval_notifier_remove(&obj->userptr.notifier); obj->userptr.notifier.mm = NULL; }
Re: [PATCH v2] drm/i915/guc: Use context hints for GT freq
On 27/02/2024 23:51, Vinay Belgaumkar wrote: Allow user to provide a low latency context hint. When set, KMD sends a hint to GuC which results in special handling for this context. SLPC will ramp the GT frequency aggressively every time it switches to this context. The down freq threshold will also be lower so GuC will ramp down the GT freq for this context more slowly. We also disable waitboost for this context as that will interfere with the strategy. We need to enable the use of SLPC Compute strategy during init, but it will apply only to contexts that set this bit during context creation. Userland can check whether this feature is supported using a new param- I915_PARAM_HAS_CONTEXT_FREQ_HINTS. This flag is true for all guc submission enabled platforms as they use SLPC for frequency management. The Mesa usage model for this flag is here - https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint v2: Rename flags as per review suggestions (Rodrigo, Tvrtko). Also, use flag bits in intel_context as it allows finer control for toggling per engine if needed (Tvrtko). 
Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Sushma Venkatesh Reddy Signed-off-by: Vinay Belgaumkar --- drivers/gpu/drm/i915/gem/i915_gem_context.c | 15 +++-- .../gpu/drm/i915/gem/i915_gem_context_types.h | 1 + drivers/gpu/drm/i915/gt/intel_context_types.h | 1 + drivers/gpu/drm/i915/gt/intel_rps.c | 5 + .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h | 21 +++ drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 17 +++ drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h | 1 + .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 6 ++ drivers/gpu/drm/i915/i915_getparam.c | 12 +++ include/uapi/drm/i915_drm.h | 15 + 10 files changed, 92 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c index dcbfe32fd30c..0799cb0b2803 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c @@ -879,6 +879,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv, struct i915_gem_proto_context *pc, struct drm_i915_gem_context_param *args) { + struct drm_i915_private *i915 = fpriv->i915; int ret = 0; switch (args->param) { @@ -904,6 +905,13 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv, pc->user_flags &= ~BIT(UCONTEXT_BANNABLE); break; + case I915_CONTEXT_PARAM_LOW_LATENCY: + if (intel_uc_uses_guc_submission(&to_gt(i915)->uc)) + pc->user_flags |= BIT(UCONTEXT_LOW_LATENCY); + else + ret = -EINVAL; + break; + case I915_CONTEXT_PARAM_RECOVERABLE: if (args->size) ret = -EINVAL; @@ -992,6 +1000,9 @@ static int intel_context_set_gem(struct intel_context *ce, if (sseu.slice_mask && !WARN_ON(ce->engine->class != RENDER_CLASS)) ret = intel_context_reconfigure_sseu(ce, sseu); + if (test_bit(UCONTEXT_LOW_LATENCY, &ctx->user_flags)) + set_bit(CONTEXT_LOW_LATENCY, &ce->flags); Does not need to be atomic so can use __set_bit as higher up in the function. 
+ return ret; } @@ -1630,6 +1641,8 @@ i915_gem_create_context(struct drm_i915_private *i915, if (vm) ctx->vm = vm; + ctx->user_flags = pc->user_flags; + Given how most ctx->something assignments are at the bottom of the function I would stick a comment here saying along the lines of "assign early for intel_context_set_gem called when creating engines". mutex_init(&ctx->engines_mutex); if (pc->num_user_engines >= 0) { i915_gem_context_set_user_engines(ctx); @@ -1652,8 +1665,6 @@ i915_gem_create_context(struct drm_i915_private *i915, * is no remap info, it will be a NOP. */ ctx->remap_slice = ALL_L3_SLICES(i915); - ctx->user_flags = pc->user_flags; - for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp); i++) ctx->hang_timestamp[i] = jiffies - CONTEXT_FAST_HANG_JIFFIES; diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h index 03bc7f9d191b..b6d97da63d1f 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h +++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h @@ -338,6 +338,7 @@ struct i915_gem_context { #define UCONTEXT_BANNABLE 2 #define UCONTEXT_RECOVERABLE 3 #define UCONTEXT_PERSISTENCE 4 +#define UCONTEXT_LOW_LATENCY 5 /** * @flags: small set of booleans diff --git a/drivers/gpu/drm/i915/g
Re: [PATCH] drm/i915: check before removing mm notifier
On 21/02/2024 11:52, Nirmoy Das wrote: Merged it to drm-intel-gt-next with s/check/Check Shouldn't this have had: Fixes: ed29c2691188 ("drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.") Cc: # v5.13+ ? Regards, Tvrtko On 2/19/2024 1:50 PM, Nirmoy Das wrote: Error in mmu_interval_notifier_insert() can leave a NULL notifier.mm pointer. Catch that and return early. Cc: Andi Shyti Cc: Shawn Lee Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c index 0e21ce9d3e5a..61abfb505766 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c @@ -349,6 +349,9 @@ i915_gem_userptr_release(struct drm_i915_gem_object *obj) { GEM_WARN_ON(obj->userptr.page_ref); + if (!obj->userptr.notifier.mm) + return; + mmu_interval_notifier_remove(&obj->userptr.notifier); obj->userptr.notifier.mm = NULL; }
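The shape of the fix above (bail out of the release path when the notifier was never successfully inserted) can be shown with a small userspace model. Everything here is a stand-in: the struct and function names are invented for the sketch, and the assertion inside `notifier_remove_stub()` merely models the assumption that the real removal routine relies on a valid mm pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's mmu notifier types. */
struct mm_stub {
	int unused;
};

struct notifier_stub {
	struct mm_stub *mm; /* stays NULL if insertion failed during creation */
};

struct userptr_object_stub {
	struct notifier_stub notifier;
	int removals; /* counts how often removal actually ran */
};

/* Models a removal routine that must not be called with a NULL mm. */
static void notifier_remove_stub(struct userptr_object_stub *obj)
{
	assert(obj->notifier.mm != NULL);
	obj->removals++;
}

/* Release path with the early return added by the patch. */
static void userptr_release_stub(struct userptr_object_stub *obj)
{
	if (!obj->notifier.mm)
		return;

	notifier_remove_stub(obj);
	obj->notifier.mm = NULL;
}
```

Because the pointer is cleared after removal, the guard also makes the release path safe to call more than once in this model.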
Re: [PATCH 2/2] drm/i915: Support replaying GPU hangs with captured context image
On 22/02/2024 21:07, Rodrigo Vivi wrote: On Wed, Feb 21, 2024 at 02:22:45PM +, Tvrtko Ursulin wrote: From: Tvrtko Ursulin When debugging GPU hangs Mesa developers are finding it useful to replay the captured error state against the simulator. But due various simulator limitations which prevent replicating all hangs, one step further is being able to replay against a real GPU. This is almost doable today with the missing part being able to upload the captured context image into the driver state prior to executing the uploaded hanging batch and all the buffers. To enable this last part we add a new context parameter called I915_CONTEXT_PARAM_CONTEXT_IMAGE. It follows the existing SSEU configuration pattern of being able to select which context to apply against, paired with the actual image and its size. Since this is adding a new concept of debug only uapi, we hide it behind a new kconfig option and also require activation with a module parameter. Together with a warning banner printed at driver load, all those combined should be sufficient to guard against inadvertently enabling the feature. In terms of implementation we allow the legacy context set param to be used since that removes the need to record the per context data in the proto context, while still allowing flexibility of specifying context images for any context. Mesa MR using the uapi can be seen at: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27594 v2: * Fix whitespace alignment as per checkpatch. * Added warning on userspace misuse. * Rebase for extracting ce->default_state shadowing. Signed-off-by: Tvrtko Ursulin Cc: Lionel Landwerlin Cc: Carlos Santa Cc: Rodrigo Vivi Reviewed-by: Rodrigo Vivi # v1 still valid for v2. Thanks for splitting the patch. Great, thanks! Now we need to hear from Lionel if he is still keen to have this. In which case some acks or tested by would be good. 
Regards,

Tvrtko
Re: [PATCH] drm/i915/guc: Add Compute context hint
On 26/02/2024 08:47, Tvrtko Ursulin wrote:
On 23/02/2024 19:25, Rodrigo Vivi wrote:
On Fri, Feb 23, 2024 at 10:31:41AM -0800, Belgaumkar, Vinay wrote:
On 2/23/2024 12:51 AM, Tvrtko Ursulin wrote:
On 22/02/2024 23:31, Belgaumkar, Vinay wrote:
On 2/22/2024 7:32 AM, Tvrtko Ursulin wrote:
On 21/02/2024 21:28, Rodrigo Vivi wrote:
On Wed, Feb 21, 2024 at 09:42:34AM +0000, Tvrtko Ursulin wrote:
On 21/02/2024 00:14, Vinay Belgaumkar wrote:

Allow user to provide a context hint. When this is set, KMD will send a hint to GuC which results in special handling for this context. SLPC will ramp the GT frequency aggressively every time it switches to this context. The down freq threshold will also be lower so GuC will ramp down the GT freq for this context more slowly. We also disable waitboost for this context as that will interfere with the strategy.

We need to enable the use of Compute strategy during SLPC init, but it will apply only to contexts that set this bit during context creation.

Userland can check whether this feature is supported using a new param - I915_PARAM_HAS_COMPUTE_CONTEXT. This flag is true for all guc submission enabled platforms since they use SLPC for freq management.

The Mesa usage model for this flag is here -
https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint

This allows for setting it for the whole application, correct? Upsides, downsides? Are there any plans for per context?

Currently there's no extension on a high level API (Vulkan/OpenGL/OpenCL/etc) that would allow the application to hint for power/freq/latency. So Mesa cannot decide when to hint. So their solution was to use .drirc and make per-application decision.

I would prefer a high level extension for a more granular and informative decision. We need to work with that goal, but for now I don't see any cons on this approach.

In principle yeah it doesn't harm to have the option. I am just not sure how useful this intermediate step is with its lack of intra-process granularity.

Cc: Rodrigo Vivi
Signed-off-by: Vinay Belgaumkar
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 8 +++
 .../gpu/drm/i915/gem/i915_gem_context_types.h | 1 +
 drivers/gpu/drm/i915/gt/intel_rps.c | 8 +++
 .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h | 21 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 17 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h | 1 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 7 +++
 drivers/gpu/drm/i915/i915_getparam.c | 11 ++
 include/uapi/drm/i915_drm.h | 15 +
 9 files changed, 89 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..ceab7dbe9b47 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -879,6 +879,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,
 			       struct i915_gem_proto_context *pc,
 			       struct drm_i915_gem_context_param *args)
 {
+	struct drm_i915_private *i915 = fpriv->i915;
 	int ret = 0;

 	switch (args->param) {
@@ -904,6 +905,13 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,
 			pc->user_flags &= ~BIT(UCONTEXT_BANNABLE);
 		break;

+	case I915_CONTEXT_PARAM_IS_COMPUTE:
+		if (!intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+			ret = -EINVAL;
+		else
+			pc->user_flags |= BIT(UCONTEXT_COMPUTE);
+		break;
+
 	case I915_CONTEXT_PARAM_RECOVERABLE:
 		if (args->size)
 			ret = -EINVAL;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 03bc7f9d191b..db86d6f6245f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -338,6 +338,7 @@ struct i915_gem_context {
 #define UCONTEXT_BANNABLE		2
 #define UCONTEXT_RECOVERABLE		3
 #define UCONTEXT_PERSISTENCE		4
+#define UCONTEXT_COMPUTE		5

What is the GuC behaviour when SLPC_CTX_FREQ_REQ_IS_COMPUTE is set for non-compute engines? Wondering if per intel_context is what we want instead. (Which could then be the i915_context_param_engines extension to mark individual contexts as compute strategy.)

Perhaps we should rename this? This is a freq-decision-strategy inside GuC that is there mostly targeting compute workloads that need lower latency with short burst execution. But the engine itself doesn't matter. It can be applied to any engine.

I have no idea if it makes sense for other engines, such as video, and what would be pros and cons in terms of PnP. But in the case we end up allowing it on any engine, then at least userspace name shouldn't be
Re: [PATCH] drm/i915/guc: Add Compute context hint
On 23/02/2024 19:25, Rodrigo Vivi wrote:
On Fri, Feb 23, 2024 at 10:31:41AM -0800, Belgaumkar, Vinay wrote:
On 2/23/2024 12:51 AM, Tvrtko Ursulin wrote:
On 22/02/2024 23:31, Belgaumkar, Vinay wrote:
On 2/22/2024 7:32 AM, Tvrtko Ursulin wrote:
On 21/02/2024 21:28, Rodrigo Vivi wrote:
On Wed, Feb 21, 2024 at 09:42:34AM +0000, Tvrtko Ursulin wrote:
On 21/02/2024 00:14, Vinay Belgaumkar wrote:

Allow user to provide a context hint. When this is set, KMD will send a hint to GuC which results in special handling for this context. SLPC will ramp the GT frequency aggressively every time it switches to this context. The down freq threshold will also be lower so GuC will ramp down the GT freq for this context more slowly. We also disable waitboost for this context as that will interfere with the strategy.

We need to enable the use of Compute strategy during SLPC init, but it will apply only to contexts that set this bit during context creation.

Userland can check whether this feature is supported using a new param - I915_PARAM_HAS_COMPUTE_CONTEXT. This flag is true for all guc submission enabled platforms since they use SLPC for freq management.

The Mesa usage model for this flag is here -
https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint

This allows for setting it for the whole application, correct? Upsides, downsides? Are there any plans for per context?

Currently there's no extension on a high level API (Vulkan/OpenGL/OpenCL/etc) that would allow the application to hint for power/freq/latency. So Mesa cannot decide when to hint. So their solution was to use .drirc and make per-application decision.

I would prefer a high level extension for a more granular and informative decision. We need to work with that goal, but for now I don't see any cons on this approach.

In principle yeah it doesn't harm to have the option. I am just not sure how useful this intermediate step is with its lack of intra-process granularity.

Cc: Rodrigo Vivi
Signed-off-by: Vinay Belgaumkar
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 8 +++
 .../gpu/drm/i915/gem/i915_gem_context_types.h | 1 +
 drivers/gpu/drm/i915/gt/intel_rps.c | 8 +++
 .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h | 21 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 17 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h | 1 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 7 +++
 drivers/gpu/drm/i915/i915_getparam.c | 11 ++
 include/uapi/drm/i915_drm.h | 15 +
 9 files changed, 89 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..ceab7dbe9b47 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -879,6 +879,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,
 			       struct i915_gem_proto_context *pc,
 			       struct drm_i915_gem_context_param *args)
 {
+	struct drm_i915_private *i915 = fpriv->i915;
 	int ret = 0;

 	switch (args->param) {
@@ -904,6 +905,13 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,
 			pc->user_flags &= ~BIT(UCONTEXT_BANNABLE);
 		break;

+	case I915_CONTEXT_PARAM_IS_COMPUTE:
+		if (!intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+			ret = -EINVAL;
+		else
+			pc->user_flags |= BIT(UCONTEXT_COMPUTE);
+		break;
+
 	case I915_CONTEXT_PARAM_RECOVERABLE:
 		if (args->size)
 			ret = -EINVAL;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 03bc7f9d191b..db86d6f6245f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -338,6 +338,7 @@ struct i915_gem_context {
 #define UCONTEXT_BANNABLE		2
 #define UCONTEXT_RECOVERABLE		3
 #define UCONTEXT_PERSISTENCE		4
+#define UCONTEXT_COMPUTE		5

What is the GuC behaviour when SLPC_CTX_FREQ_REQ_IS_COMPUTE is set for non-compute engines? Wondering if per intel_context is what we want instead. (Which could then be the i915_context_param_engines extension to mark individual contexts as compute strategy.)

Perhaps we should rename this? This is a freq-decision-strategy inside GuC that is there mostly targeting compute workloads that need lower latency with short burst execution. But the engine itself doesn't matter. It can be applied to any engine.

I have no idea if it makes sense for other engines, such as video, and what would be pros and cons in terms of PnP. But in the case we end up allowing it on any engine, then at least userspace name shouldn't be compute. :)

Yes, one of the suggestions from Daniele was t
Re: [PATCH] drm/i915/guc: Add Compute context hint
On 22/02/2024 23:31, Belgaumkar, Vinay wrote:
On 2/22/2024 7:32 AM, Tvrtko Ursulin wrote:
On 21/02/2024 21:28, Rodrigo Vivi wrote:
On Wed, Feb 21, 2024 at 09:42:34AM +0000, Tvrtko Ursulin wrote:
On 21/02/2024 00:14, Vinay Belgaumkar wrote:

Allow user to provide a context hint. When this is set, KMD will send a hint to GuC which results in special handling for this context. SLPC will ramp the GT frequency aggressively every time it switches to this context. The down freq threshold will also be lower so GuC will ramp down the GT freq for this context more slowly. We also disable waitboost for this context as that will interfere with the strategy.

We need to enable the use of Compute strategy during SLPC init, but it will apply only to contexts that set this bit during context creation.

Userland can check whether this feature is supported using a new param - I915_PARAM_HAS_COMPUTE_CONTEXT. This flag is true for all guc submission enabled platforms since they use SLPC for freq management.

The Mesa usage model for this flag is here -
https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint

This allows for setting it for the whole application, correct? Upsides, downsides? Are there any plans for per context?

Currently there's no extension on a high level API (Vulkan/OpenGL/OpenCL/etc) that would allow the application to hint for power/freq/latency. So Mesa cannot decide when to hint. So their solution was to use .drirc and make per-application decision.

I would prefer a high level extension for a more granular and informative decision. We need to work with that goal, but for now I don't see any cons on this approach.

In principle yeah it doesn't harm to have the option. I am just not sure how useful this intermediate step is with its lack of intra-process granularity.

Cc: Rodrigo Vivi
Signed-off-by: Vinay Belgaumkar
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 8 +++
 .../gpu/drm/i915/gem/i915_gem_context_types.h | 1 +
 drivers/gpu/drm/i915/gt/intel_rps.c | 8 +++
 .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h | 21 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 17 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h | 1 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 7 +++
 drivers/gpu/drm/i915/i915_getparam.c | 11 ++
 include/uapi/drm/i915_drm.h | 15 +
 9 files changed, 89 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..ceab7dbe9b47 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -879,6 +879,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,
 			       struct i915_gem_proto_context *pc,
 			       struct drm_i915_gem_context_param *args)
 {
+	struct drm_i915_private *i915 = fpriv->i915;
 	int ret = 0;

 	switch (args->param) {
@@ -904,6 +905,13 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,
 			pc->user_flags &= ~BIT(UCONTEXT_BANNABLE);
 		break;

+	case I915_CONTEXT_PARAM_IS_COMPUTE:
+		if (!intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+			ret = -EINVAL;
+		else
+			pc->user_flags |= BIT(UCONTEXT_COMPUTE);
+		break;
+
 	case I915_CONTEXT_PARAM_RECOVERABLE:
 		if (args->size)
 			ret = -EINVAL;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 03bc7f9d191b..db86d6f6245f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -338,6 +338,7 @@ struct i915_gem_context {
 #define UCONTEXT_BANNABLE		2
 #define UCONTEXT_RECOVERABLE		3
 #define UCONTEXT_PERSISTENCE		4
+#define UCONTEXT_COMPUTE		5

What is the GuC behaviour when SLPC_CTX_FREQ_REQ_IS_COMPUTE is set for non-compute engines? Wondering if per intel_context is what we want instead. (Which could then be the i915_context_param_engines extension to mark individual contexts as compute strategy.)

Perhaps we should rename this? This is a freq-decision-strategy inside GuC that is there mostly targeting compute workloads that need lower latency with short burst execution. But the engine itself doesn't matter. It can be applied to any engine.

I have no idea if it makes sense for other engines, such as video, and what would be pros and cons in terms of PnP. But in the case we end up allowing it on any engine, then at least userspace name shouldn't be compute. :)

Yes, one of the suggestions from Daniele was to have something along the lines of UCONTEXT_HIFREQ so we don't confuse it with the Compute Engine.

Okay, but additional qu
Re: [PATCH] drm/i915/guc: Add Compute context hint
On 21/02/2024 21:28, Rodrigo Vivi wrote:
On Wed, Feb 21, 2024 at 09:42:34AM +0000, Tvrtko Ursulin wrote:
On 21/02/2024 00:14, Vinay Belgaumkar wrote:

Allow user to provide a context hint. When this is set, KMD will send a hint to GuC which results in special handling for this context. SLPC will ramp the GT frequency aggressively every time it switches to this context. The down freq threshold will also be lower so GuC will ramp down the GT freq for this context more slowly. We also disable waitboost for this context as that will interfere with the strategy.

We need to enable the use of Compute strategy during SLPC init, but it will apply only to contexts that set this bit during context creation.

Userland can check whether this feature is supported using a new param - I915_PARAM_HAS_COMPUTE_CONTEXT. This flag is true for all guc submission enabled platforms since they use SLPC for freq management.

The Mesa usage model for this flag is here -
https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint

This allows for setting it for the whole application, correct? Upsides, downsides? Are there any plans for per context?

Currently there's no extension on a high level API (Vulkan/OpenGL/OpenCL/etc) that would allow the application to hint for power/freq/latency. So Mesa cannot decide when to hint. So their solution was to use .drirc and make per-application decision.

I would prefer a high level extension for a more granular and informative decision. We need to work with that goal, but for now I don't see any cons on this approach.

In principle yeah it doesn't harm to have the option. I am just not sure how useful this intermediate step is with its lack of intra-process granularity.

Cc: Rodrigo Vivi
Signed-off-by: Vinay Belgaumkar
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 8 +++
 .../gpu/drm/i915/gem/i915_gem_context_types.h | 1 +
 drivers/gpu/drm/i915/gt/intel_rps.c | 8 +++
 .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h | 21 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 17 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h | 1 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 7 +++
 drivers/gpu/drm/i915/i915_getparam.c | 11 ++
 include/uapi/drm/i915_drm.h | 15 +
 9 files changed, 89 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..ceab7dbe9b47 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -879,6 +879,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,
 			       struct i915_gem_proto_context *pc,
 			       struct drm_i915_gem_context_param *args)
 {
+	struct drm_i915_private *i915 = fpriv->i915;
 	int ret = 0;

 	switch (args->param) {
@@ -904,6 +905,13 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,
 			pc->user_flags &= ~BIT(UCONTEXT_BANNABLE);
 		break;

+	case I915_CONTEXT_PARAM_IS_COMPUTE:
+		if (!intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+			ret = -EINVAL;
+		else
+			pc->user_flags |= BIT(UCONTEXT_COMPUTE);
+		break;
+
 	case I915_CONTEXT_PARAM_RECOVERABLE:
 		if (args->size)
 			ret = -EINVAL;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 03bc7f9d191b..db86d6f6245f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -338,6 +338,7 @@ struct i915_gem_context {
 #define UCONTEXT_BANNABLE		2
 #define UCONTEXT_RECOVERABLE		3
 #define UCONTEXT_PERSISTENCE		4
+#define UCONTEXT_COMPUTE		5

What is the GuC behaviour when SLPC_CTX_FREQ_REQ_IS_COMPUTE is set for non-compute engines? Wondering if per intel_context is what we want instead. (Which could then be the i915_context_param_engines extension to mark individual contexts as compute strategy.)

Perhaps we should rename this? This is a freq-decision-strategy inside GuC that is there mostly targeting compute workloads that need lower latency with short burst execution. But the engine itself doesn't matter. It can be applied to any engine.

I have no idea if it makes sense for other engines, such as video, and what would be pros and cons in terms of PnP. But in the case we end up allowing it on any engine, then at least userspace name shouldn't be compute. :)

Or if we decide to call it compute and only apply to compute engines, then I would strongly suggest making the uapi per intel_context i.e. the set engines extension instead of the GEM context param. Othe
Re: [PATCH 0/1] Always record job cycle and timestamp information
On 21/02/2024 09:40, Adrián Larumbe wrote:

Hi,

I just wanted to make sure we're on the same page on this matter. So in Panfrost, and I guess in almost every other single driver out there, HW perf counters and their uapi interface are orthogonal to fdinfo's reporting on drm engine utilisation.

At the moment it seems like HW perfcounters and the way they're exposed to UM are very idiosyncratic and any attempt to unify their interface into a common set of ioctl's sounds like a gargantuan task I wouldn't like to be faced with.

I share the same feeling on this sub-topic.

As for fdinfo, I guess there's more room for coming up with common helpers that could handle the toggling of HW support for drm engine calculations, but I'd at least have to see how things are being done in let's say, Freedreno or Intel.

For Intel we don't need this ability, well at least for pre-GuC platforms. Stat collection is super cheap and permanently enabled there.

But let me copy Umesh because something at the back of my mind is telling me that perhaps there was something expensive about collecting these stats with the GuC backend? If so maybe a toggle would be beneficial there.

Right now there's a pressing need to get rid of the debugfs knob for fdinfo's drm engine profiling sources in Panfrost, after which I could perhaps draw up an RFC for how to generalise this onto other drivers.

There is a knob currently meaning fdinfo does not work by default? If that is so, I would have at least expected someone had submitted a patch for gputop to handle this toggle. It being kind of a common reference implementation I don't think it is great if it does not work out of the box.

The toggle as an idea sounds a bit annoying, but if there is no other realistic way maybe it is not too bad. As long as it is documented in the drm-usage-stats.rst, doesn't live in debugfs, and has some common plumbing implemented both on the kernel side and for the aforementioned gputop / igt_drm_fdinfo / igt_drm_clients. Where and how exactly TBD.

Regards,

Tvrtko

On 16.02.2024 17:43, Tvrtko Ursulin wrote:
On 16/02/2024 16:57, Daniel Vetter wrote:
On Wed, Feb 14, 2024 at 01:52:05PM +0000, Steven Price wrote:

Hi Adrián,

On 14/02/2024 12:14, Adrián Larumbe wrote:

A driver user expressed interest in being able to access engine usage stats through fdinfo when debugfs is not built into their kernel. In the current implementation, this wasn't possible, because it was assumed even for inflight jobs enabling the cycle counter and timestamp registers would incur additional power consumption, so both were kept disabled until toggled through debugfs.

A second read of the TRM made me think otherwise, but this is something that would be best clarified by someone from ARM's side.

I'm afraid I can't give a definitive answer. This will probably vary depending on implementation. The command register enables/disables "propagation" of the cycle/timestamp values. This propagation will cost some power (gates are getting toggled) but whether that power is completely in the noise of the GPU as a whole I can't say.

The out-of-tree kbase driver only enables the counters for jobs explicitly marked (BASE_JD_REQ_PERMON) or due to an explicit connection from a profiler.

I'd be happier moving the debugfs file to sysfs rather than assuming that the power consumption is small enough for all platforms.

Ideally we'd have some sort of kernel interface for a profiler to inform the kernel what it is interested in, but I can't immediately see how to make that useful across different drivers. kbase's profiling support is great with our profiling tools, but there's a very strong connection between the two.

Yeah I'm not sure whether a magic (worse probably per-driver massively different) file in sysfs is needed to enable gpu perf monitoring stats in fdinfo.

I get that we do have a bit of a gap because the linux perf pmu stuff is global, and you want per-process, and there's kinda no per-process support for perf stats for devices. But that's probably the direction we want to go, not so much fdinfo. At least for hardware performance counters and things like that.

Iirc the i915 pmu support had some integration for per-process support, you might want to chat with Tvrtko for kernel side and Lionel for more userspace side. At least if I'm not making a complete mess and my memory is vaguely related to reality. Adding them both.

Yeah there are two separate things, i915 PMU and i915 Perf/OA.

If my memory serves me right I indeed did have a per-process support for i915 PMU implemented as an RFC (or at least a branch somewhere) some years back. IIRC it only exposed the per engine GPU utilisation and did not find it very useful versus the complexity. (I think it at least required maintaining a map of drm clients per task.)

Our more useful profiling is using a custom Perf/OA interface (Observation Architecture) which is possibly similar to kbase mentioned
[PATCH 1/2] drm/i915: Shadow default engine context image in the context
From: Tvrtko Ursulin

To enable adding override of the default engine context image let us start shadowing the per engine state in the context.

Signed-off-by: Tvrtko Ursulin
Cc: Lionel Landwerlin
Cc: Carlos Santa
Cc: Rodrigo Vivi
---
 drivers/gpu/drm/i915/gt/intel_context_types.h   | 2 ++
 drivers/gpu/drm/i915/gt/intel_lrc.c             | 7 ++++---
 drivers/gpu/drm/i915/gt/intel_ring_submission.c | 7 ++++---
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 7eccbd70d89f..b179178680a5 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -99,6 +99,8 @@ struct intel_context {
 	struct i915_address_space *vm;
 	struct i915_gem_context __rcu *gem_context;

+	struct file *default_state;
+
 	/*
 	 * @signal_lock protects the list of requests that need signaling,
 	 * @signals. While there are any requests that need signaling,
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 7c367ba8d9dc..d4eb822d20ae 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1060,9 +1060,8 @@ void lrc_init_state(struct intel_context *ce,

 	set_redzone(state, engine);

-	if (engine->default_state) {
-		shmem_read(engine->default_state, 0,
-			   state, engine->context_size);
+	if (ce->default_state) {
+		shmem_read(ce->default_state, 0, state, engine->context_size);
 		__set_bit(CONTEXT_VALID_BIT, &ce->flags);
 		inhibit = false;
 	}
@@ -1174,6 +1173,8 @@ int lrc_alloc(struct intel_context *ce, struct intel_engine_cs *engine)

 	GEM_BUG_ON(ce->state);

+	ce->default_state = engine->default_state;
+
 	vma = __lrc_alloc_state(ce, engine);
 	if (IS_ERR(vma))
 		return PTR_ERR(vma);
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
index 92085ffd23de..8625e88e785f 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
@@ -474,8 +474,7 @@ static int ring_context_init_default_state(struct intel_context *ce,
 	if (IS_ERR(vaddr))
 		return PTR_ERR(vaddr);

-	shmem_read(ce->engine->default_state, 0,
-		   vaddr, ce->engine->context_size);
+	shmem_read(ce->default_state, 0, vaddr, ce->engine->context_size);

 	i915_gem_object_flush_map(obj);
 	__i915_gem_object_release_map(obj);
@@ -491,7 +490,7 @@ static int ring_context_pre_pin(struct intel_context *ce,
 	struct i915_address_space *vm;
 	int err = 0;

-	if (ce->engine->default_state &&
+	if (ce->default_state &&
 	    !test_bit(CONTEXT_VALID_BIT, &ce->flags)) {
 		err = ring_context_init_default_state(ce, ww);
 		if (err)
@@ -570,6 +569,8 @@ static int ring_context_alloc(struct intel_context *ce)
 {
 	struct intel_engine_cs *engine = ce->engine;

+	ce->default_state = engine->default_state;
+
 	/* One ringbuffer to rule them all */
 	GEM_BUG_ON(!engine->legacy.ring);
 	ce->ring = engine->legacy.ring;
--
2.40.1
[PATCH 2/2] drm/i915: Support replaying GPU hangs with captured context image
From: Tvrtko Ursulin

When debugging GPU hangs Mesa developers are finding it useful to replay the captured error state against the simulator. But due to various simulator limitations which prevent replicating all hangs, one step further is being able to replay against a real GPU.

This is almost doable today with the missing part being able to upload the captured context image into the driver state prior to executing the uploaded hanging batch and all the buffers.

To enable this last part we add a new context parameter called I915_CONTEXT_PARAM_CONTEXT_IMAGE. It follows the existing SSEU configuration pattern of being able to select which context to apply against, paired with the actual image and its size.

Since this is adding a new concept of debug only uapi, we hide it behind a new kconfig option and also require activation with a module parameter. Together with a warning banner printed at driver load, all those combined should be sufficient to guard against inadvertently enabling the feature.

In terms of implementation we allow the legacy context set param to be used since that removes the need to record the per context data in the proto context, while still allowing flexibility of specifying context images for any context.

Mesa MR using the uapi can be seen at:
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27594

v2:
 * Fix whitespace alignment as per checkpatch.
 * Added warning on userspace misuse.
 * Rebase for extracting ce->default_state shadowing.

Signed-off-by: Tvrtko Ursulin
Cc: Lionel Landwerlin
Cc: Carlos Santa
Cc: Rodrigo Vivi
Reviewed-by: Rodrigo Vivi # v1
---
 drivers/gpu/drm/i915/Kconfig.debug | 17 +++
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 113 ++
 drivers/gpu/drm/i915/gt/intel_context.c | 2 +
 drivers/gpu/drm/i915/gt/intel_context.h | 22
 drivers/gpu/drm/i915/gt/intel_context_types.h | 1 +
 drivers/gpu/drm/i915/gt/intel_lrc.c | 3 +-
 .../gpu/drm/i915/gt/intel_ring_submission.c | 3 +-
 drivers/gpu/drm/i915/i915_params.c | 5 +
 drivers/gpu/drm/i915/i915_params.h | 3 +-
 include/uapi/drm/i915_drm.h | 27 +
 10 files changed, 193 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.debug b/drivers/gpu/drm/i915/Kconfig.debug
index 5b7162076850..32e9f70e91ed 100644
--- a/drivers/gpu/drm/i915/Kconfig.debug
+++ b/drivers/gpu/drm/i915/Kconfig.debug
@@ -16,6 +16,23 @@ config DRM_I915_WERROR

 	  If in doubt, say "N".

+config DRM_I915_REPLAY_GPU_HANGS_API
+	bool "Enable GPU hang replay userspace API"
+	depends on DRM_I915
+	depends on EXPERT
+	default n
+	help
+	  Choose this option if you want to enable special and unstable
+	  userspace API used for replaying GPU hangs on a running system.
+
+	  This API is intended to be used by userspace graphics stack developers
+	  and provides no stability guarantees.
+
+	  The API needs to be activated at boot time using the
+	  enable_debug_only_api module parameter.
+
+	  If in doubt, say "N".
+
 config DRM_I915_DEBUG
 	bool "Enable additional driver debugging"
 	depends on DRM_I915
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..481aacbc1772 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -78,6 +78,7 @@
 #include "gt/intel_engine_user.h"
 #include "gt/intel_gpu_commands.h"
 #include "gt/intel_ring.h"
+#include "gt/shmem_utils.h"

 #include "pxp/intel_pxp.h"

@@ -949,6 +950,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,
 	case I915_CONTEXT_PARAM_NO_ZEROMAP:
 	case I915_CONTEXT_PARAM_BAN_PERIOD:
 	case I915_CONTEXT_PARAM_RINGSIZE:
+	case I915_CONTEXT_PARAM_CONTEXT_IMAGE:
 	default:
 		ret = -EINVAL;
 		break;
@@ -2092,6 +2094,95 @@ static int get_protected(struct i915_gem_context *ctx,
 	return 0;
 }

+static int set_context_image(struct i915_gem_context *ctx,
+			     struct drm_i915_gem_context_param *args)
+{
+	struct i915_gem_context_param_context_image user;
+	struct intel_context *ce;
+	struct file *shmem_state;
+	unsigned long lookup;
+	void *state;
+	int ret = 0;
+
+	if (!IS_ENABLED(CONFIG_DRM_I915_REPLAY_GPU_HANGS_API))
+		return -EINVAL;
+
+	if (!ctx->i915->params.enable_debug_only_api)
+		return -EINVAL;
+
+	if (args->size < sizeof(user))
+		return -EINVAL;
+
+	if (copy_from_user(&user, u64_to_user_ptr(args->value), sizeof(user)))
+		return -EFAULT;
+
+	if (user.mbz)
+		return -EINVAL;
+
+	if
[PATCH v2 0/2] GPU hang replay
From: Tvrtko Ursulin

Please see 2/2 for explanation and rationale.

v2:
 * Extracted shadowing of default state into a leading patch.

Tvrtko Ursulin (2):
  drm/i915: Shadow default engine context image in the context
  drm/i915: Support replaying GPU hangs with captured context image

 drivers/gpu/drm/i915/Kconfig.debug            |  17 +++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 113 ++
 drivers/gpu/drm/i915/gt/intel_context.c       |   2 +
 drivers/gpu/drm/i915/gt/intel_context.h       |  22
 drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
 drivers/gpu/drm/i915/gt/intel_lrc.c           |   8 +-
 .../gpu/drm/i915/gt/intel_ring_submission.c   |   8 +-
 drivers/gpu/drm/i915/i915_params.c            |   5 +
 drivers/gpu/drm/i915/i915_params.h            |   3 +-
 include/uapi/drm/i915_drm.h                   |  27 +
 10 files changed, 201 insertions(+), 7 deletions(-)

-- 
2.40.1
Re: [PATCH v2 2/2] drm/i915/gt: Enable only one CCS for compute workload
On 21/02/2024 12:08, Tvrtko Ursulin wrote: On 21/02/2024 11:19, Andi Shyti wrote: Hi Tvrtko, On Wed, Feb 21, 2024 at 08:19:34AM +, Tvrtko Ursulin wrote: On 21/02/2024 00:14, Andi Shyti wrote: On Tue, Feb 20, 2024 at 02:48:31PM +, Tvrtko Ursulin wrote: On 20/02/2024 14:35, Andi Shyti wrote: Enable only one CCS engine by default with all the compute sices slices Thanks! diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c index 833987015b8b..7041acc77810 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_user.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c @@ -243,6 +243,15 @@ void intel_engines_driver_register(struct drm_i915_private *i915) if (engine->uabi_class == I915_NO_UABI_CLASS) continue; + /* + * Do not list and do not count CCS engines other than the first + */ + if (engine->uabi_class == I915_ENGINE_CLASS_COMPUTE && + engine->uabi_instance > 0) { + i915->engine_uabi_class_count[engine->uabi_class]--; + continue; + } It's a bit ugly to decrement after increment, instead of somehow restructuring the loop to satisfy both cases more elegantly. yes, agree, indeed I had a hard time here to accept this change myself. But moving the check above where the counter was incremented it would have been much uglier. This check looks ugly everywhere you place it :-) One idea would be to introduce a separate local counter array for name_instance, so not use i915->engine_uabi_class_count[]. First one increments for every engine, second only for the exposed ones. That way feels wouldn't be too ugly. Ah... you mean that whenever we change the CCS mode, we update the indexes of the exposed engines from list of the real engines. Will try. My approach was to regenerate the list everytime the CCS mode was changed, but your suggestion looks a bit simplier. No, I meant just for this first stage of permanently single engine. For avoiding the decrement after increment. 
Something like this, but not compile tested even: diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c index 833987015b8b..4c33f30612c4 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_user.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c @@ -203,7 +203,8 @@ static void engine_rename(struct intel_engine_cs *engine, const char *name, u16 void intel_engines_driver_register(struct drm_i915_private *i915) { - u16 name_instance, other_instance = 0; + u16 class_instance[I915_LAST_UABI_ENGINE_CLASS + 2] = { }; + u16 uabi_class, other_instance = 0; struct legacy_ring ring = {}; struct list_head *it, *next; struct rb_node **p, *prev; @@ -222,15 +223,14 @@ void intel_engines_driver_register(struct drm_i915_private *i915) GEM_BUG_ON(engine->class >= ARRAY_SIZE(uabi_classes)); engine->uabi_class = uabi_classes[engine->class]; + if (engine->uabi_class == I915_NO_UABI_CLASS) { - name_instance = other_instance++; - } else { - GEM_BUG_ON(engine->uabi_class >= - ARRAY_SIZE(i915->engine_uabi_class_count)); - name_instance = - i915->engine_uabi_class_count[engine->uabi_class]++; - } - engine->uabi_instance = name_instance; + uabi_class = I915_LAST_UABI_ENGINE_CLASS + 1; + else + uabi_class = engine->uabi_class; + + GEM_BUG_ON(uabi_class >= ARRAY_SIZE(class_instance)); + engine->uabi_instance = class_instance[uabi_class]++; /* * Replace the internal name with the final user and log facing @@ -238,11 +238,15 @@ void intel_engines_driver_register(struct drm_i915_private *i915) */ engine_rename(engine, intel_engine_class_repr(engine->class), - name_instance); + engine->uabi_instance); - if (engine->uabi_class == I915_NO_UABI_CLASS) + if (uabi_class == I915_NO_UABI_CLASS) continue; Here you just add the ccs skip condition. Anyway.. I rushed it a bit so see what you think. 
Regards, Tvrtko + GEM_BUG_ON(uabi_class >= + ARRAY_SIZE(i915->engine_uabi_class_count)); + i915->engine_uabi_class_count[uabi_class]++; + rb_link_node(>uabi_node, prev, p); rb_insert_color(>uabi_node, >uabi_engines); In any case, I'm working on a patch that is splitting this function in two parts
Re: [PATCH v2 2/2] drm/i915/gt: Enable only one CCS for compute workload
On 21/02/2024 11:19, Andi Shyti wrote: Hi Tvrtko, On Wed, Feb 21, 2024 at 08:19:34AM +, Tvrtko Ursulin wrote: On 21/02/2024 00:14, Andi Shyti wrote: On Tue, Feb 20, 2024 at 02:48:31PM +, Tvrtko Ursulin wrote: On 20/02/2024 14:35, Andi Shyti wrote: Enable only one CCS engine by default with all the compute sices slices Thanks! diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c index 833987015b8b..7041acc77810 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_user.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c @@ -243,6 +243,15 @@ void intel_engines_driver_register(struct drm_i915_private *i915) if (engine->uabi_class == I915_NO_UABI_CLASS) continue; + /* +* Do not list and do not count CCS engines other than the first +*/ + if (engine->uabi_class == I915_ENGINE_CLASS_COMPUTE && + engine->uabi_instance > 0) { + i915->engine_uabi_class_count[engine->uabi_class]--; + continue; + } It's a bit ugly to decrement after increment, instead of somehow restructuring the loop to satisfy both cases more elegantly. yes, agree, indeed I had a hard time here to accept this change myself. But moving the check above where the counter was incremented it would have been much uglier. This check looks ugly everywhere you place it :-) One idea would be to introduce a separate local counter array for name_instance, so not use i915->engine_uabi_class_count[]. First one increments for every engine, second only for the exposed ones. That way feels wouldn't be too ugly. Ah... you mean that whenever we change the CCS mode, we update the indexes of the exposed engines from list of the real engines. Will try. My approach was to regenerate the list everytime the CCS mode was changed, but your suggestion looks a bit simplier. No, I meant just for this first stage of permanently single engine. For avoiding the decrement after increment. 
Something like this, but not compile tested even: diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c index 833987015b8b..4c33f30612c4 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_user.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c @@ -203,7 +203,8 @@ static void engine_rename(struct intel_engine_cs *engine, const char *name, u16 void intel_engines_driver_register(struct drm_i915_private *i915) { - u16 name_instance, other_instance = 0; + u16 class_instance[I915_LAST_UABI_ENGINE_CLASS + 2] = { }; + u16 uabi_class, other_instance = 0; struct legacy_ring ring = {}; struct list_head *it, *next; struct rb_node **p, *prev; @@ -222,15 +223,14 @@ void intel_engines_driver_register(struct drm_i915_private *i915) GEM_BUG_ON(engine->class >= ARRAY_SIZE(uabi_classes)); engine->uabi_class = uabi_classes[engine->class]; + if (engine->uabi_class == I915_NO_UABI_CLASS) { - name_instance = other_instance++; - } else { - GEM_BUG_ON(engine->uabi_class >= - ARRAY_SIZE(i915->engine_uabi_class_count)); - name_instance = - i915->engine_uabi_class_count[engine->uabi_class]++; - } - engine->uabi_instance = name_instance; + uabi_class = I915_LAST_UABI_ENGINE_CLASS + 1; + else + uabi_class = engine->uabi_class; + + GEM_BUG_ON(uabi_class >= ARRAY_SIZE(class_instance)); + engine->uabi_instance = class_instance[uabi_class]++; /* * Replace the internal name with the final user and log facing @@ -238,11 +238,15 @@ void intel_engines_driver_register(struct drm_i915_private *i915) */ engine_rename(engine, intel_engine_class_repr(engine->class), - name_instance); + engine->uabi_instance); - if (engine->uabi_class == I915_NO_UABI_CLASS) + if (uabi_class == I915_NO_UABI_CLASS) continue; + GEM_BUG_ON(uabi_class >= + ARRAY_SIZE(i915->engine_uabi_class_count)); + i915->engine_uabi_class_count[uabi_class]++; + rb_link_node(>uabi_node, prev, p); rb_insert_color(>uabi_node, >uabi_engines); In any case, I'm working on a patch that is splitting 
this function in two parts and there is some refactoring happening here (for the first initialization and the dynamic update). Please
Re: [PATCH] drm/i915/guc: Add Compute context hint
On 21/02/2024 00:14, Vinay Belgaumkar wrote: Allow user to provide a context hint. When this is set, KMD will send a hint to GuC which results in special handling for this context. SLPC will ramp the GT frequency aggressively every time it switches to this context. The down freq threshold will also be lower so GuC will ramp down the GT freq for this context more slowly. We also disable waitboost for this context as that will interfere with the strategy. We need to enable the use of Compute strategy during SLPC init, but it will apply only to contexts that set this bit during context creation. Userland can check whether this feature is supported using a new param- I915_PARAM_HAS_COMPUTE_CONTEXT. This flag is true for all guc submission enabled platforms since they use SLPC for freq management. The Mesa usage model for this flag is here - https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint This allows for setting it for the whole application, correct? Upsides, downsides? Are there any plans for per context? 
Cc: Rodrigo Vivi Signed-off-by: Vinay Belgaumkar --- drivers/gpu/drm/i915/gem/i915_gem_context.c | 8 +++ .../gpu/drm/i915/gem/i915_gem_context_types.h | 1 + drivers/gpu/drm/i915/gt/intel_rps.c | 8 +++ .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h | 21 +++ drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 17 +++ drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h | 1 + .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 7 +++ drivers/gpu/drm/i915/i915_getparam.c | 11 ++ include/uapi/drm/i915_drm.h | 15 + 9 files changed, 89 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c index dcbfe32fd30c..ceab7dbe9b47 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c @@ -879,6 +879,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv, struct i915_gem_proto_context *pc, struct drm_i915_gem_context_param *args) { + struct drm_i915_private *i915 = fpriv->i915; int ret = 0; switch (args->param) { @@ -904,6 +905,13 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv, pc->user_flags &= ~BIT(UCONTEXT_BANNABLE); break; + case I915_CONTEXT_PARAM_IS_COMPUTE: + if (!intel_uc_uses_guc_submission(_gt(i915)->uc)) + ret = -EINVAL; + else + pc->user_flags |= BIT(UCONTEXT_COMPUTE); + break; + case I915_CONTEXT_PARAM_RECOVERABLE: if (args->size) ret = -EINVAL; diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h index 03bc7f9d191b..db86d6f6245f 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h +++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h @@ -338,6 +338,7 @@ struct i915_gem_context { #define UCONTEXT_BANNABLE 2 #define UCONTEXT_RECOVERABLE 3 #define UCONTEXT_PERSISTENCE 4 +#define UCONTEXT_COMPUTE 5 What is the GuC behaviour when SLPC_CTX_FREQ_REQ_IS_COMPUTE is set for non-compute engines? Wondering if per intel_context is what we want instead. 
(Which could then be the i915_context_param_engines extension to mark individual contexts as compute strategy.) /** * @flags: small set of booleans diff --git a/drivers/gpu/drm/i915/gt/intel_rps.c b/drivers/gpu/drm/i915/gt/intel_rps.c index 4feef874e6d6..1ed40cd61b70 100644 --- a/drivers/gpu/drm/i915/gt/intel_rps.c +++ b/drivers/gpu/drm/i915/gt/intel_rps.c @@ -24,6 +24,7 @@ #include "intel_pcode.h" #include "intel_rps.h" #include "vlv_sideband.h" +#include "../gem/i915_gem_context.h" #include "../../../platform/x86/intel_ips.h" #define BUSY_MAX_EI 20u /* ms */ @@ -1018,6 +1019,13 @@ void intel_rps_boost(struct i915_request *rq) struct intel_rps *rps = _ONCE(rq->engine)->gt->rps; if (rps_uses_slpc(rps)) { + const struct i915_gem_context *ctx; + + ctx = i915_request_gem_context(rq); + if (ctx && + test_bit(UCONTEXT_COMPUTE, >user_flags)) + return; + I think request and intel_context do not own a strong reference to GEM context. So at minimum you need a local one obtained under a RCU lock with kref_get_unless_zero, as do some other places do. However.. it may be simpler to just store the flag in intel_context->flags. If you carry it over at the time GEM context is assigned to intel_context, not only you simplify runtime rules, but you get the ability to not set the compute flags for video etc. It may even make
Re: [RFC] drm/i915: Support replaying GPU hangs with captured context image
On 20/02/2024 22:50, Rodrigo Vivi wrote: On Tue, Feb 13, 2024 at 01:14:34PM +, Tvrtko Ursulin wrote: From: Tvrtko Ursulin When debugging GPU hangs Mesa developers are finding it useful to replay the captured error state against the simulator. But due various simulator limitations which prevent replicating all hangs, one step further is being able to replay against a real GPU. This is almost doable today with the missing part being able to upload the captured context image into the driver state prior to executing the uploaded hanging batch and all the buffers. To enable this last part we add a new context parameter called I915_CONTEXT_PARAM_CONTEXT_IMAGE. It follows the existing SSEU configuration pattern of being able to select which context to apply against, paired with the actual image and its size. Since this is adding a new concept of debug only uapi, we hide it behind a new kconfig option and also require activation with a module parameter. Together with a warning banner printed at driver load, all those combined should be sufficient to guard against inadvertently enabling the feature. In terms of implementation the only trivial change is shadowing of the default state from engine to context. We also allow the legacy context set param to be used since that removes the need to record the per context data in the proto context, while still allowing flexibility of specifying context images for any context. Mesa MR using the uapi can be seen at: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27594 I just wonder if it would be better to split the default_state in a separate patch but from what I could see it looks correct. It definitely makes sense to split it. I was just a bit lazy while testing the waters. After all this is a very novel idea of debug only uapi outside debugfs so I wasn't too sure how it will be received. Stay tuned for v2. Regards, Tvrtko Also, I have to say that this approach is nice, clean and well protected. 
And much simpler then I imagined when I saw the idea around. Feel free to use: Reviewed-by: Rodrigo Vivi Signed-off-by: Tvrtko Ursulin Cc: Lionel Landwerlin Cc: Carlos Santa --- drivers/gpu/drm/i915/Kconfig.debug| 17 +++ drivers/gpu/drm/i915/gem/i915_gem_context.c | 106 ++ drivers/gpu/drm/i915/gt/intel_context.c | 2 + drivers/gpu/drm/i915/gt/intel_context.h | 22 drivers/gpu/drm/i915/gt/intel_context_types.h | 3 + drivers/gpu/drm/i915/gt/intel_lrc.c | 8 +- .../gpu/drm/i915/gt/intel_ring_submission.c | 8 +- drivers/gpu/drm/i915/i915_params.c| 5 + drivers/gpu/drm/i915/i915_params.h| 3 +- include/uapi/drm/i915_drm.h | 27 + 10 files changed, 194 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/i915/Kconfig.debug b/drivers/gpu/drm/i915/Kconfig.debug index 5b7162076850..32e9f70e91ed 100644 --- a/drivers/gpu/drm/i915/Kconfig.debug +++ b/drivers/gpu/drm/i915/Kconfig.debug @@ -16,6 +16,23 @@ config DRM_I915_WERROR If in doubt, say "N". +config DRM_I915_REPLAY_GPU_HANGS_API + bool "Enable GPU hang replay userspace API" + depends on DRM_I915 + depends on EXPERT + default n + help + Choose this option if you want to enable special and unstable + userspace API used for replaying GPU hangs on a running system. + + This API is intended to be used by userspace graphics stack developers + and provides no stability guarantees. + + The API needs to be activated at boot time using the + enable_debug_only_api module parameter. + + If in doubt, say "N". 
+ config DRM_I915_DEBUG bool "Enable additional driver debugging" depends on DRM_I915 diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c index dcbfe32fd30c..1cfd624bd978 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c @@ -78,6 +78,7 @@ #include "gt/intel_engine_user.h" #include "gt/intel_gpu_commands.h" #include "gt/intel_ring.h" +#include "gt/shmem_utils.h" #include "pxp/intel_pxp.h" @@ -949,6 +950,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv, case I915_CONTEXT_PARAM_NO_ZEROMAP: case I915_CONTEXT_PARAM_BAN_PERIOD: case I915_CONTEXT_PARAM_RINGSIZE: + case I915_CONTEXT_PARAM_CONTEXT_IMAGE: default: ret = -EINVAL; break; @@ -2092,6 +2094,88 @@ static int get_protected(struct i915_gem_context *ctx, return 0; } +static int set_context_image(struct i915_gem_context *ctx, +struct drm_i915_gem_context_param *args) +{ + struct i915_gem_context_param_context_image user; +
Re: [PATCH v2 2/2] drm/i915/gt: Enable only one CCS for compute workload
On 21/02/2024 00:14, Andi Shyti wrote:
> Hi Tvrtko,
>
> On Tue, Feb 20, 2024 at 02:48:31PM +, Tvrtko Ursulin wrote:
>> On 20/02/2024 14:35, Andi Shyti wrote:
>>> Enable only one CCS engine by default with all the compute sices
>>
>> slices
>
> Thanks!
>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c
>>> index 833987015b8b..7041acc77810 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
>>> @@ -243,6 +243,15 @@ void intel_engines_driver_register(struct drm_i915_private *i915)
>>>  		if (engine->uabi_class == I915_NO_UABI_CLASS)
>>>  			continue;
>>>  
>>> +		/*
>>> +		 * Do not list and do not count CCS engines other than the first
>>> +		 */
>>> +		if (engine->uabi_class == I915_ENGINE_CLASS_COMPUTE &&
>>> +		    engine->uabi_instance > 0) {
>>> +			i915->engine_uabi_class_count[engine->uabi_class]--;
>>> +			continue;
>>> +		}
>>
>> It's a bit ugly to decrement after increment, instead of somehow
>> restructuring the loop to satisfy both cases more elegantly.
>
> yes, agree, indeed I had a hard time here to accept this change
> myself.
>
> But moving the check above where the counter was incremented it
> would have been much uglier.
>
> This check looks ugly everywhere you place it :-)

One idea would be to introduce a separate local counter array for
name_instance, so as not to use i915->engine_uabi_class_count[] for it.
The first one increments for every engine, the second only for the
exposed ones. That way it feels like it wouldn't be too ugly.

> In any case, I'm working on a patch that is splitting this
> function in two parts and there is some refactoring happening
> here (for the first initialization and the dynamic update).
>
> Please let me know if it's OK with you or you want me to fix it
> in this run.
>
>> And I wonder if internally (in dmesg when engine name is logged) we don't
>> end up with ccs0 ccs0 ccs0 ccs0.. for all instances.
>
> I don't see this. Even in sysfs we see only one ccs. Where is it?

When you run this patch on something with two or more ccs-es, do the
"renamed ccs... to ccs.." debug logs not all log the new name as ccs0?

Regards,

Tvrtko

>>> +
>>>  		rb_link_node(&engine->uabi_node, prev, p);
>>>  		rb_insert_color(&engine->uabi_node, &i915->uabi_engines);
>
> [...]
>
>>> diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
>>> index 3baa2f54a86e..d5a5143971f5 100644
>>> --- a/drivers/gpu/drm/i915/i915_query.c
>>> +++ b/drivers/gpu/drm/i915/i915_query.c
>>> @@ -124,6 +124,7 @@ static int query_geometry_subslices(struct drm_i915_private *i915,
>>>  	return fill_topology_info(sseu, query_item, sseu->geometry_subslice_mask);
>>>  }
>>>  
>>> +
>>
>> Zap please.
>
> yes... yes... I noticed it after sending the patch :-)
>
> Thanks,
> Andi
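
To illustrate the two-counter idea from this reply, here is a minimal standalone sketch in plain C. It is not the i915 code: the sketch_* names, the class enum and the "exposed" field are hypothetical stand-ins invented for the example, and assert() only plays the role that GEM_BUG_ON has in the driver. One array hands out the naming instance for every engine, while a second array counts only the engines that end up listed, so no counter is decremented after being incremented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the uabi engine classes (not the uapi values). */
enum sketch_class {
	SKETCH_CLASS_RENDER,
	SKETCH_CLASS_COPY,
	SKETCH_CLASS_COMPUTE,
	SKETCH_CLASS_COUNT,
};

struct sketch_engine {
	enum sketch_class uabi_class;
	unsigned int instance;	/* naming instance, assigned to every engine */
	int exposed;		/* non-zero if the engine is listed to userspace */
};

/*
 * Assign instances and count the exposed engines in a single pass.
 * name_count[] advances for every engine, exposed_count[] only for the
 * engines that are listed, so nothing is ever decremented.
 */
static void sketch_register(struct sketch_engine *engines, size_t n,
			    unsigned int exposed_count[SKETCH_CLASS_COUNT])
{
	unsigned int name_count[SKETCH_CLASS_COUNT] = { 0 };
	size_t i;

	for (i = 0; i < SKETCH_CLASS_COUNT; i++)
		exposed_count[i] = 0;

	for (i = 0; i < n; i++) {
		struct sketch_engine *e = &engines[i];

		assert(e->uabi_class < SKETCH_CLASS_COUNT);
		e->instance = name_count[e->uabi_class]++;

		/* Do not list and do not count CCS engines other than the first. */
		e->exposed = !(e->uabi_class == SKETCH_CLASS_COMPUTE &&
			       e->instance > 0);
		if (e->exposed)
			exposed_count[e->uabi_class]++;
	}
}
```

With three compute engines in the input, each one still gets a distinct naming instance (0, 1, 2), which is what the question about repeated "ccs0" names is probing, while the exposed count for the compute class stays at one.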
Re: [PATCH v2 2/2] drm/i915/gt: Enable only one CCS for compute workload
On 20/02/2024 14:35, Andi Shyti wrote: Enable only one CCS engine by default with all the compute sices slices allocated to it. While generating the list of UABI engines to be exposed to the user, exclude any additional CCS engines beyond the first instance. This change can be tested with igt i915_query. Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") Signed-off-by: Andi Shyti Cc: Chris Wilson Cc: Joonas Lahtinen Cc: Matt Roper Cc: # v6.2+ --- drivers/gpu/drm/i915/gt/intel_engine_user.c | 9 + drivers/gpu/drm/i915/gt/intel_gt.c | 11 +++ drivers/gpu/drm/i915/gt/intel_gt_regs.h | 2 ++ drivers/gpu/drm/i915/i915_query.c | 1 + 4 files changed, 23 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c index 833987015b8b..7041acc77810 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_user.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c @@ -243,6 +243,15 @@ void intel_engines_driver_register(struct drm_i915_private *i915) if (engine->uabi_class == I915_NO_UABI_CLASS) continue; + /* +* Do not list and do not count CCS engines other than the first +*/ + if (engine->uabi_class == I915_ENGINE_CLASS_COMPUTE && + engine->uabi_instance > 0) { + i915->engine_uabi_class_count[engine->uabi_class]--; + continue; + } It's a bit ugly to decrement after increment, instead of somehow restructuring the loop to satisfy both cases more elegantly. And I wonder if internally (in dmesg when engine name is logged) we don't end up with ccs0 ccs0 ccs0 ccs0.. for all instances. 
+ rb_link_node(>uabi_node, prev, p); rb_insert_color(>uabi_node, >uabi_engines); diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index a425db5ed3a2..e19df4ef47f6 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -168,6 +168,14 @@ static void init_unused_rings(struct intel_gt *gt) } } +static void intel_gt_apply_ccs_mode(struct intel_gt *gt) +{ + if (!IS_DG2(gt->i915)) + return; + + intel_uncore_write(gt->uncore, XEHP_CCS_MODE, 0); +} + int intel_gt_init_hw(struct intel_gt *gt) { struct drm_i915_private *i915 = gt->i915; @@ -195,6 +203,9 @@ int intel_gt_init_hw(struct intel_gt *gt) intel_gt_init_swizzling(gt); + /* Configure CCS mode */ + intel_gt_apply_ccs_mode(gt); + /* * At least 830 can leave some of the unused rings * "active" (ie. head != tail) after resume which diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h index cf709f6c05ae..c148113770ea 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h @@ -1605,6 +1605,8 @@ #define GEN12_VOLTAGE_MASK REG_GENMASK(10, 0) #define GEN12_CAGF_MASK REG_GENMASK(19, 11) +#define XEHP_CCS_MODE _MMIO(0x14804) + #define GEN11_GT_INTR_DW(x) _MMIO(0x190018 + ((x) * 4)) #define GEN11_CSME (31) #define GEN12_HECI_2(30) diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c index 3baa2f54a86e..d5a5143971f5 100644 --- a/drivers/gpu/drm/i915/i915_query.c +++ b/drivers/gpu/drm/i915/i915_query.c @@ -124,6 +124,7 @@ static int query_geometry_subslices(struct drm_i915_private *i915, return fill_topology_info(sseu, query_item, sseu->geometry_subslice_mask); } + Zap please. static int query_engine_info(struct drm_i915_private *i915, struct drm_i915_query_item *query_item) Regards, Tvrtko
Re: [PATCH 2/2] drm/i915/gt: Set default CCS mode '1'
On 20/02/2024 14:20, Andi Shyti wrote: Since CCS automatic load balancing is disabled, we will impose a fixed balancing policy that involves setting all the CCS engines to work together on the same load. Erm *all* CSS engines work together.. Simultaneously, the user will see only 1 CCS rather than the actual number. As of now, this change affects only DG2. ... *one* CCS engine. Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") Signed-off-by: Andi Shyti Cc: Chris Wilson Cc: Joonas Lahtinen Cc: Matt Roper Cc: # v6.2+ --- drivers/gpu/drm/i915/gt/intel_gt.c | 11 +++ drivers/gpu/drm/i915/gt/intel_gt_regs.h | 2 ++ drivers/gpu/drm/i915/i915_drv.h | 17 + drivers/gpu/drm/i915/i915_query.c | 5 +++-- 4 files changed, 33 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index a425db5ed3a2..e19df4ef47f6 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -168,6 +168,14 @@ static void init_unused_rings(struct intel_gt *gt) } } +static void intel_gt_apply_ccs_mode(struct intel_gt *gt) +{ + if (!IS_DG2(gt->i915)) + return; + + intel_uncore_write(gt->uncore, XEHP_CCS_MODE, 0); +} + int intel_gt_init_hw(struct intel_gt *gt) { struct drm_i915_private *i915 = gt->i915; @@ -195,6 +203,9 @@ int intel_gt_init_hw(struct intel_gt *gt) intel_gt_init_swizzling(gt); + /* Configure CCS mode */ + intel_gt_apply_ccs_mode(gt); + /* * At least 830 can leave some of the unused rings * "active" (ie. 
head != tail) after resume which diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h index cf709f6c05ae..c148113770ea 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h @@ -1605,6 +1605,8 @@ #define GEN12_VOLTAGE_MASK REG_GENMASK(10, 0) #define GEN12_CAGF_MASK REG_GENMASK(19, 11) +#define XEHP_CCS_MODE _MMIO(0x14804) + #define GEN11_GT_INTR_DW(x) _MMIO(0x190018 + ((x) * 4)) #define GEN11_CSME (31) #define GEN12_HECI_2(30) diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index e81b3b2858ac..0853ffd3cb8d 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -396,6 +396,23 @@ static inline struct intel_gt *to_gt(const struct drm_i915_private *i915) (engine__); \ (engine__) = rb_to_uabi_engine(rb_next(&(engine__)->uabi_node))) +/* + * Exclude unavailable engines. + * + * Only the first CCS engine is utilized due to the disabling of CCS auto load + * balancing. As a result, all CCS engines operate collectively, functioning + * essentially as a single CCS engine, hence the count of active CCS engines is + * considered '1'. + * Currently, this applies to platforms with more than one CCS engine, + * specifically DG2. + */ +#define for_each_available_uabi_engine(engine__, i915__) \ + for_each_uabi_engine(engine__, i915__) \ + if ((IS_DG2(i915__)) && \ + ((engine__)->uabi_class == I915_ENGINE_CLASS_COMPUTE) && \ + ((engine__)->uabi_instance)) { } \ + else + I thought the plan was to simply not register the engine. Like that it would be a simpler patch. 
#define INTEL_INFO(i915) ((i915)->__info) #define RUNTIME_INFO(i915)(&(i915)->__runtime) #define DRIVER_CAPS(i915) (&(i915)->caps) diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c index fa3e937ed3f5..2d41bda626a6 100644 --- a/drivers/gpu/drm/i915/i915_query.c +++ b/drivers/gpu/drm/i915/i915_query.c @@ -124,6 +124,7 @@ static int query_geometry_subslices(struct drm_i915_private *i915, return fill_topology_info(sseu, query_item, sseu->geometry_subslice_mask); } + ! static int query_engine_info(struct drm_i915_private *i915, struct drm_i915_query_item *query_item) @@ -140,7 +141,7 @@ query_engine_info(struct drm_i915_private *i915, if (query_item->flags) return -EINVAL; - for_each_uabi_engine(engine, i915) + for_each_available_uabi_engine(engine, i915) num_uabi_engines++; len = struct_size(query_ptr, engines, num_uabi_engines); @@ -155,7 +156,7 @@ query_engine_info(struct drm_i915_private *i915, info_ptr = _ptr->engines[0]; - for_each_uabi_engine(engine, i915) { + for_each_available_uabi_engine(engine, i915) { info.engine.engine_class = engine->uabi_class; info.engine.engine_instance = engine->uabi_instance; info.flags = I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE; I thought you agreed that this still
Re: [PATCH] drm/i915: Fix possible null pointer dereference after drm_dbg_printer conversion
On 20/02/2024 10:36, Maxime Ripard wrote:
> On Tue, Feb 20, 2024 at 09:16:43AM +, Tvrtko Ursulin wrote:
>> On 19/02/2024 20:02, Rodrigo Vivi wrote:
>>> On Mon, Feb 19, 2024 at 01:14:23PM +, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin
>>>>
>>>> Request can be NULL if no guilty request was identified so simply use
>>>> engine->i915 instead.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin
>>>> Fixes: d50892a9554c ("drm/i915: switch from drm_debug_printer() to device specific drm_dbg_printer()")
>>>> Reported-by: Dan Carpenter
>>>> Cc: Jani Nikula
>>>> Cc: Luca Coelho
>>>> Cc: Maxime Ripard
>>>> Cc: Jani Nikula
>>>
>>> Reviewed-by: Rodrigo Vivi
>>
>> Thanks Rodrigo!
>>
>> Given how d50892a9554c landed via drm-misc-next, Maxime or Thomas - could
>> you take this via drm-misc-next-fixes, or via drm-misc-next if there will
>> be another pull request?
>
> There will be a drm-misc-next PR on Thursday

Could you pull this one into whichever branch is needed so it appears in that pull request?

Regards,

Tvrtko
Re: [PATCH 2/2] drm/i915/gt: Set default CCS mode '1'
On 20/02/2024 10:11, Andi Shyti wrote:
> Hi Tvrtko,
>
> On Mon, Feb 19, 2024 at 12:51:44PM +, Tvrtko Ursulin wrote:
>> On 19/02/2024 11:16, Tvrtko Ursulin wrote:
>>> On 15/02/2024 13:59, Andi Shyti wrote:
>
> ...
>
>>>> +/*
>>>> + * Exclude unavailable engines.
>>>> + *
>>>> + * Only the first CCS engine is utilized due to the disabling of CCS auto load
>>>> + * balancing. As a result, all CCS engines operate collectively, functioning
>>>> + * essentially as a single CCS engine, hence the count of active CCS engines is
>>>> + * considered '1'.
>>>> + * Currently, this applies to platforms with more than one CCS engine,
>>>> + * specifically DG2.
>>>> + */
>>>> +#define for_each_available_uabi_engine(engine__, i915__) \
>>>> +	for_each_uabi_engine(engine__, i915__) \
>>>> +		if ((IS_DG2(i915__)) && \
>>>> +		    ((engine__)->uabi_class == I915_ENGINE_CLASS_COMPUTE) && \
>>>> +		    ((engine__)->uabi_instance)) { } \
>>>> +		else
>>>> +
>>>
>>> If you don't want userspace to see some engines, just don't add them to the
>>> uabi list in intel_engines_driver_register or thereabouts?
>
> It will be dynamic. In the next series I am preparing, the user will be
> able to increase the number of CCS engines they want to use.

Oh tricky and new. Does it need to be at runtime or could it be at boot time?

If you are aiming to get the static single-CCS-only change into the 6.9
release, and you feel you are running out of time, you could always do a
simple solution for now: the one I mentioned, of simply not registering the
engines on the uabi list. Then you can refine more leisurely for the next
release.

Regards,

Tvrtko

>>> Similar as we do for gsc which uses I915_NO_UABI_CLASS, although for ccs you
>>> can choose a different approach, whatever is more elegant.
>>>
>>> That is also needed for i915->engine_uabi_class_count to be right, so
>>> userspace stats which rely on it are correct.
>
> Oh yes. Will update it.
>
>> I later realized it is more than that - everything that uses
>> intel_engine_lookup_user to look up class instance passed in from userspace
>> relies on the engine not being on the user list otherwise userspace could
>> bypass the fact engine query does not list it. Like PMU, Perf/OA, context
>> engine map and SSEU context query.
>
> Correct, will look into that, thank you!
>
> Andi
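
The difference between filtering in the query loop and not registering the engine at all can be sketched in a few lines of standalone C. Everything here is a hypothetical model written for this note: the sketch_* helpers, the flat array standing in for the uabi list and the class id are assumptions, not the driver's data structures or API. The point it tries to show is that a lookup by class:instance keeps finding an engine that only the query hides.

```c
#include <stddef.h>

/* Hypothetical class id, chosen only for this sketch. */
#define SKETCH_CLASS_COMPUTE 4

struct sketch_engine {
	int uabi_class;
	int uabi_instance;
};

/* Non-zero for CCS engines other than the first. */
static int sketch_is_extra_ccs(const struct sketch_engine *e)
{
	return e->uabi_class == SKETCH_CLASS_COMPUTE && e->uabi_instance > 0;
}

/* Stand-in for a lookup by class:instance over the registered list. */
static const struct sketch_engine *
sketch_lookup(const struct sketch_engine *list, size_t n,
	      int uabi_class, int uabi_instance)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (list[i].uabi_class == uabi_class &&
		    list[i].uabi_instance == uabi_instance)
			return &list[i];

	return NULL;
}

/* Approach A: every engine stays registered, only the query skips extras. */
static size_t sketch_query_count_filtered(const struct sketch_engine *list,
					  size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (!sketch_is_extra_ccs(&list[i]))
			count++;

	return count;
}

/* Approach B: extras are never copied onto the registered list. */
static size_t sketch_register_without_extras(const struct sketch_engine *all,
					     size_t n,
					     struct sketch_engine *out)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (!sketch_is_extra_ccs(&all[i]))
			out[count++] = all[i];

	return count;
}
```

Under approach A the filtered query reports fewer engines, yet sketch_lookup() on the full list still returns the second compute engine. Under approach B the same lookup on the reduced list returns NULL, which is the property the reply asks for.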
Re: [PATCH] drm/i915: Fix possible null pointer dereference after drm_dbg_printer conversion
On 19/02/2024 20:02, Rodrigo Vivi wrote:
> On Mon, Feb 19, 2024 at 01:14:23PM +, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin
>>
>> Request can be NULL if no guilty request was identified so simply use
>> engine->i915 instead.
>>
>> Signed-off-by: Tvrtko Ursulin
>> Fixes: d50892a9554c ("drm/i915: switch from drm_debug_printer() to device specific drm_dbg_printer()")
>> Reported-by: Dan Carpenter
>> Cc: Jani Nikula
>> Cc: Luca Coelho
>> Cc: Maxime Ripard
>> Cc: Jani Nikula
>
> Reviewed-by: Rodrigo Vivi

Thanks Rodrigo!

Given how d50892a9554c landed via drm-misc-next, Maxime or Thomas - could
you take this via drm-misc-next-fixes, or via drm-misc-next if there will be
another pull request?

Regards,

Tvrtko

>> ---
>>  drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>> index 5f8d86e25993..8d4bb95f8424 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>> @@ -96,8 +96,8 @@ static void heartbeat_commit(struct i915_request *rq,
>>  static void show_heartbeat(const struct i915_request *rq,
>>  			   struct intel_engine_cs *engine)
>>  {
>> -	struct drm_printer p = drm_dbg_printer(&rq->i915->drm, DRM_UT_DRIVER,
>> -					       "heartbeat");
>> +	struct drm_printer p =
>> +		drm_dbg_printer(&engine->i915->drm, DRM_UT_DRIVER, "heartbeat");
>>  
>>  	if (!rq) {
>>  		intel_engine_dump(engine, &p,
>> -- 
>> 2.40.1
[PATCH] drm/i915: Add some boring kerneldoc
From: Tvrtko Ursulin

Tooling appears very strict so let's pacify it by adding some comments, even if fields are completely self-explanatory.

Signed-off-by: Tvrtko Ursulin
Fixes: b11236486749 ("drm/i915: Add GuC submission interface version query")
Reported-by: Stephen Rothwell
Cc: Jose Souza
---
 include/uapi/drm/i915_drm.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index bd87386a8243..2ee338860b7e 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -3572,9 +3572,13 @@ struct drm_i915_query_memory_regions {
  * struct drm_i915_query_guc_submission_version - query GuC submission interface version
  */
 struct drm_i915_query_guc_submission_version {
+	/** @branch: Firmware branch version. */
 	__u32 branch;
+	/** @major: Firmware major version. */
 	__u32 major;
+	/** @minor: Firmware minor version. */
 	__u32 minor;
+	/** @patch: Firmware patch version. */
 	__u32 patch;
 };

--
2.40.1
[PATCH] drm/i915: Fix possible null pointer dereference after drm_dbg_printer conversion
From: Tvrtko Ursulin

Request can be NULL if no guilty request was identified so simply use engine->i915 instead.

Signed-off-by: Tvrtko Ursulin
Fixes: d50892a9554c ("drm/i915: switch from drm_debug_printer() to device specific drm_dbg_printer()")
Reported-by: Dan Carpenter
Cc: Jani Nikula
Cc: Luca Coelho
Cc: Maxime Ripard
Cc: Jani Nikula
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 5f8d86e25993..8d4bb95f8424 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -96,8 +96,8 @@ static void heartbeat_commit(struct i915_request *rq,
 static void show_heartbeat(const struct i915_request *rq,
 			   struct intel_engine_cs *engine)
 {
-	struct drm_printer p = drm_dbg_printer(&rq->i915->drm, DRM_UT_DRIVER,
-					       "heartbeat");
+	struct drm_printer p =
+		drm_dbg_printer(&engine->i915->drm, DRM_UT_DRIVER, "heartbeat");

 	if (!rq) {
 		intel_engine_dump(engine, &p,
--
2.40.1
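The pattern behind this fix can be shown in isolation. The sketch below is a hypothetical userspace reduction, not the driver code: the `hb_*` structures and `hb_show_heartbeat()` are invented for illustration, and a plain string buffer stands in for the DRM printer. It assumes, as the commit message states, that the request may be NULL while the engine is always valid, so the device pointer is taken from the engine before the NULL check on the request.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical reductions of the driver objects, for illustration only. */
struct hb_device {
	const char *name;
};

struct hb_engine {
	struct hb_device *i915;		/* always valid for a live engine */
};

struct hb_request {
	struct hb_device *i915;
	unsigned int seqno;
};

/*
 * Formats a heartbeat message into buf. The device is reached through the
 * engine, never through rq, because rq is NULL when no guilty request was
 * identified. Returns the snprintf() result.
 */
static int hb_show_heartbeat(const struct hb_request *rq,
			     const struct hb_engine *engine,
			     char *buf, size_t len)
{
	const struct hb_device *dev = engine->i915;

	if (!rq)
		return snprintf(buf, len, "%s: heartbeat: no request",
				dev->name);

	return snprintf(buf, len, "%s: heartbeat: request %u",
			dev->name, rq->seqno);
}
```

Reading the device through `rq->i915` in the first line of the function would dereference NULL on the no-request path, which is the situation the patch avoids.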