[PATCH] drm/i915: 2 GiB of relocations ought to be enough for anybody*

2024-05-21 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

The kernel test robot reports that i915 can hit a warning in kvmalloc_node()
whose purpose is to disallow crazy-sized kernel allocations. This was added in
7661809d493b ("mm: don't allow oversized kvmalloc() calls"):

   /* Don't even allow crazy sizes */
   if (WARN_ON_ONCE(size > INT_MAX))
   return NULL;

This would have been kind of okay since i915 at one point dropped the need for
making a shadow copy of the relocation list, but that need was re-added in
fd1500fcd442 ("Revert "drm/i915/gem: Drop relocation slowpath".") a year
after Linus added the above warning.

It is plausible that the issue was not seen until now because triggering it
via the gem_exec_reloc test requires a combination of relatively older
generation hardware and at least 8 GiB of RAM installed. Probably even more
depending on runtime checks.

Let's cap what we allow userspace to pass in by using the matching limit.
There should be no issue for real userspace since we are talking about a
"crazy" number of relocations which has no practical purpose.

*) Well IGT tests might get upset but they can be easily adjusted.
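
For scale, a quick standalone sketch of what the new cap amounts to (my
illustration, not part of the patch; it assumes the UAPI struct
drm_i915_gem_relocation_entry is 32 bytes and that the DRM UAPI headers are
available as <drm/i915_drm.h>):

#include <limits.h>
#include <stdio.h>
#include <drm/i915_drm.h>

/* Mirrors the driver's N_RELOC() helper: entries that fit in a byte budget. */
#define N_RELOC(x) ((x) / sizeof(struct drm_i915_gem_relocation_entry))

int main(void)
{
	/* With 32-byte entries this prints 67108863, i.e. ~67M relocations. */
	printf("max relocations per buffer: %zu\n", N_RELOC(INT_MAX));
	return 0;
}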

Signed-off-by: Tvrtko Ursulin 
Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-lkp/202405151008.6ddd1aaf-oliver.s...@intel.com
Cc: Kees Cook 
Cc: Kent Overstreet 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
---
 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index d3a771afb083..4b34bf4fde77 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1533,7 +1533,7 @@ static int eb_relocate_vma(struct i915_execbuffer *eb, struct eb_vma *ev)
u64_to_user_ptr(entry->relocs_ptr);
unsigned long remain = entry->relocation_count;
 
-   if (unlikely(remain > N_RELOC(ULONG_MAX)))
+   if (unlikely(remain > N_RELOC(INT_MAX)))
return -EINVAL;
 
/*
@@ -1641,7 +1641,7 @@ static int check_relocations(const struct drm_i915_gem_exec_object2 *entry)
if (size == 0)
return 0;
 
-   if (size > N_RELOC(ULONG_MAX))
+   if (size > N_RELOC(INT_MAX))
return -EINVAL;
 
addr = u64_to_user_ptr(entry->relocs_ptr);
-- 
2.44.0



[PATCH 2/2] drm/amdgpu: Use drm_print_memory_stats helper from fdinfo

2024-05-20 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Convert fdinfo memory stats to use the common drm_print_memory_stats
helper.

This achieves alignment with the common keys as documented in
drm-usage-stats.rst, specifically adding the drm-total- key which the driver
was missing until now.

Additionally I made the code stop skipping total size for objects which
currently do not have a backing store, and I added resident, active and
purgeable reporting.

Legacy keys have been preserved, with the outlook of potentially removing
only the drm-memory- key when the time is right.

The example output now looks like this:

 pos:   0
 flags: 0212
 mnt_id:24
 ino:   1239
 drm-driver:amdgpu
 drm-client-id: 4
 drm-pdev:  :04:00.0
 pasid: 32771
 drm-total-cpu: 0
 drm-shared-cpu:0
 drm-active-cpu:0
 drm-resident-cpu:  0
 drm-purgeable-cpu: 0
 drm-total-gtt: 2392 KiB
 drm-shared-gtt:0
 drm-active-gtt:0
 drm-resident-gtt:  2392 KiB
 drm-purgeable-gtt: 0
 drm-total-vram:44564 KiB
 drm-shared-vram:   31952 KiB
 drm-active-vram:   0
 drm-resident-vram: 44564 KiB
 drm-purgeable-vram:0
 drm-memory-vram:   44564 KiB
 drm-memory-gtt:2392 KiB
 drm-memory-cpu:0 KiB
 amd-memory-visible-vram:   44564 KiB
 amd-evicted-vram:  0 KiB
 amd-evicted-visible-vram:  0 KiB
 amd-requested-vram:44564 KiB
 amd-requested-visible-vram:11952 KiB
 amd-requested-gtt: 2392 KiB
 drm-engine-compute:46464671 ns
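
As a usage illustration (my sketch, not part of the patch), a monitoring tool
could pull the per-region totals out of such an fdinfo file like this; the fd
number "4" and the error handling are simplified for brevity:

#include <stdio.h>

int main(void)
{
	/* fdinfo of an example DRM file descriptor of the current process. */
	FILE *f = fopen("/proc/self/fdinfo/4", "r");
	char line[256];

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		unsigned long long kib;
		char region[32];

		/* Matches e.g. "drm-total-vram:	44564 KiB". */
		if (sscanf(line, "drm-total-%31[^:]: %llu", region, &kib) == 2)
			printf("total %s: %llu KiB\n", region, kib);
		else if (sscanf(line, "drm-resident-%31[^:]: %llu", region, &kib) == 2)
			printf("resident %s: %llu KiB\n", region, kib);
	}

	fclose(f);
	return 0;
}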

v2:
 * Track purgeable via AMDGPU_GEM_CREATE_DISCARDABLE.

Signed-off-by: Tvrtko Ursulin 
Cc: Alex Deucher 
Cc: Christian König 
Cc: Daniel Vetter 
Cc: Rob Clark 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 48 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 96 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 35 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 20 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  3 +-
 6 files changed, 122 insertions(+), 81 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
index c7df7fa3459f..00a4ab082459 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
@@ -59,18 +59,21 @@ void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
struct amdgpu_fpriv *fpriv = file->driver_priv;
	struct amdgpu_vm *vm = &fpriv->vm;
 
-   struct amdgpu_mem_stats stats;
+   struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST + 1] = { };
ktime_t usage[AMDGPU_HW_IP_NUM];
-   unsigned int hw_ip;
+   const char *pl_name[] = {
+   [TTM_PL_VRAM] = "vram",
+   [TTM_PL_TT] = "gtt",
+   [TTM_PL_SYSTEM] = "cpu",
+   };
+   unsigned int hw_ip, i;
int ret;
 
-   memset(&stats, 0, sizeof(stats));
-
ret = amdgpu_bo_reserve(vm->root.bo, false);
if (ret)
return;
 
-   amdgpu_vm_get_memory(vm, &stats);
+   amdgpu_vm_get_memory(vm, stats, ARRAY_SIZE(stats));
amdgpu_bo_unreserve(vm->root.bo);
 
	amdgpu_ctx_mgr_usage(&fpriv->ctx_mgr, usage);
@@ -82,24 +85,35 @@ void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 */
 
drm_printf(p, "pasid:\t%u\n", fpriv->vm.pasid);
-   drm_printf(p, "drm-memory-vram:\t%llu KiB\n", stats.vram/1024UL);
-   drm_printf(p, "drm-memory-gtt: \t%llu KiB\n", stats.gtt/1024UL);
-   drm_printf(p, "drm-memory-cpu: \t%llu KiB\n", stats.cpu/1024UL);
+
+   for (i = 0; i < TTM_PL_PRIV; i++)
+   drm_print_memory_stats(p,
+  &stats[i].drm,
+  DRM_GEM_OBJECT_RESIDENT |
+  DRM_GEM_OBJECT_PURGEABLE,
+  pl_name[i]);
+
+   /* Legacy amdgpu keys, alias to drm-resident-memory-: */
+   drm_printf(p, "drm-memory-vram:\t%llu KiB\n",
+  stats[TTM_PL_VRAM].total/1024UL);
+   drm_printf(p, "drm-memory-gtt: \t%llu KiB\n",
+  stats[TTM_PL_TT].total/1024UL);
+   drm_printf(p, "drm-memory-cpu: \t%llu KiB\n",
+  stats[TTM_PL_SYSTEM].total/1024UL);
+
+   /* Amdgpu specific memory accounting keys: */
drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
-  stats.visible_vram/1024UL);
+  stats[TTM_PL_VRAM].visible/1024UL);
drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
-  stats.evicted_vram/1024UL);
+  stats[TTM_PL_VRAM].evicted/1024UL);
drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
-  stats.evicted_visible_vram/1024UL);
+  stats[TTM_PL_VRAM].evicted_visible/1024UL);
  

[PATCH 1/2] Documentation/gpu: Document the situation with unqualified drm-memory-

2024-05-20 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Currently it is not well defined what drm-memory- means compared to the other
categories.

In practice the only driver which emits these keys is amdgpu, and it exposes in
them the current resident buffer object memory (including shared).

To prevent any confusion, document that drm-memory- is deprecated and an
alias for drm-resident-.

While at it, also clarify that the reserved sub-string 'memory' refers to the
memory region component, and clarify the intended semantics of the other
memory categories.
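
For illustration (example values of mine, not from any real system): where a
driver still emits the deprecated key alongside the new one during a
transition period, the two spellings refer to the same quantity, e.g.:

 drm-memory-vram:   44564 KiB
 drm-resident-vram: 44564 KiB

The documentation change below asks that ideally only one of the two is
present in the output.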

v2:
 * Also mark drm-memory- as deprecated.
 * Add some more text describing memory categories. (Alex)

v3:
 * Semantics of the amdgpu drm-memory is actually as drm-resident.

Signed-off-by: Tvrtko Ursulin 
Cc: Alex Deucher 
Cc: Christian König 
Cc: Rob Clark 
---
 Documentation/gpu/drm-usage-stats.rst | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst
index 6dc299343b48..45d9b76a5748 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -128,7 +128,9 @@ Memory
 
 Each possible memory type which can be used to store buffer objects by the
 GPU in question shall be given a stable and unique name to be returned as the
-string here.  The name "memory" is reserved to refer to normal system memory.
+string here.
+
+The region name "memory" is reserved to refer to normal system memory.
 
 Value shall reflect the amount of storage currently consumed by the buffer
 objects belong to this client, in the respective memory region.
@@ -136,6 +138,9 @@ objects belong to this client, in the respective memory region.
 Default unit shall be bytes with optional unit specifiers of 'KiB' or 'MiB'
 indicating kibi- or mebi-bytes.
 
+This key is deprecated and is an alias for drm-resident-<region>. Only one of
+the two should be present in the output.
+
- drm-shared-<region>: <uint> [KiB|MiB]

The total size of buffers that are shared with another file (e.g., have more
than a single handle).

- drm-total-<region>: <uint> [KiB|MiB]
 
-The total size of buffers that including shared and private memory.
+The total size of all created buffers including shared and private memory. The
+backing store for the buffers does not have to be currently instantiated to be
+counted under this category.
 
- drm-resident-<region>: <uint> [KiB|MiB]

-The total size of buffers that are resident in the specified region.
+The total size of buffers that are resident (have their backing store present or
+instantiated) in the specified region.
+
+This is an alias for drm-memory-<region> and only one of the two should be
+present in the output.
 
- drm-purgeable-<region>: <uint> [KiB|MiB]

The total size of buffers that are purgeable.

+For example drivers which implement a form of 'madvise'-like functionality can
+here count buffers which have instantiated backing store, but have been marked
+with an equivalent of MADV_DONTNEED.
+
- drm-active-<region>: <uint> [KiB|MiB]

The total size of buffers that are active on one or more engines.

+One practical example of this can be the presence of unsignaled fences in a GEM
+buffer reservation object. Therefore the active category is a subset of
+resident.
+
 Implementation Details
 ==
 
-- 
2.44.0



Re: [RFC v2 0/2] Discussion around eviction improvements

2024-05-17 Thread Tvrtko Ursulin



On 16/05/2024 20:21, Alex Deucher wrote:

On Thu, May 16, 2024 at 8:18 AM Tvrtko Ursulin  wrote:


From: Tvrtko Ursulin 

Reduced re-spin of my previous series after Christian corrected a few
misconceptions that I had. So lets see if what remains makes sense or is still
misguided.

To summarise, the series address the following two issues:

  * Migration rate limiting does not work, at least not for the common case
where userspace configures VRAM+GTT. It thinks it can stop migration 
attempts
by playing with bo->allowed_domains vs bo->preferred domains but, both from
the code, and from empirical experiments, I see that not working at all. 
When
both masks are identical fiddling with them achieves nothing. Even when they
are not identical allowed has a fallback GTT placement which means that when
over the migration budget ttm_bo_validate with bo->allowed_domains can cause
migration from GTT to VRAM.

  * Driver thinks it will be re-validating evicted buffers on the next 
submission
but it does not for the very common case of VRAM+GTT because it only checks
if current placement is *none* of the preferred placements.


For APUs at least, we should never migrate because GTT and VRAM are
both system memory so are effectively equal performance-wise.  Maybe


I was curious about this but thought there could be a reason why the VRAM
carve-out is a fixed small-ish size. Could it not be made 1:1 with RAM, or is
there some other solution?



this regressed when Christian reworked ttm to better handle migrating
buffers back to VRAM after suspend on dGPUs?


I will leave this to Christian to answer, but as far as this series is
concerned I'd say it is orthogonal to that.


Here we have two fixes not limited to APU use cases; it just so happens that
fixing the migration throttling improves things there too. And that even
despite the first patch triggering *more* migration attempts, because the
second patch then correctly curbs them.


The first patch should help with transient overcommit on discrete, allowing
things to get back into VRAM as soon as there is space.


The second patch tries to make migration throttling work as intended.

Volunteers for testing on discrete? :)



These two patches appear to have a positive result for a memory intensive game
like Assassin's Creed Valhalla. On an APU like Steam Deck the game has a working
set around 5 GiB, while the VRAM is configured to 1 GiB. Correctly respecting
the migration budget appears to keep buffer blits at bay and improves the
minimum frame rate, ie. makes things smoother.

 From the game's built-in benchmark, average of three runs each:

                            FPS
            migrated KiB    min    avg    max    min-1%  min-0.1%
   because      20784781  10.00  37.00  89.67    22.00     12.33
   patched       4227688  13.67  37.00  81.33    23.33     15.00


Hmm! s/because/before/ here obviously!

Regards,

Tvrtko


Disclaimers that I have is that more runs would be needed to be more confident
about the results. And more games. And APU versus discrete.

Cc: Christian König 
Cc: Friedrich Vock 

Tvrtko Ursulin (2):
   drm/amdgpu: Re-validate evicted buffers
   drm/amdgpu: Actually respect buffer migration budget

  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 112 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |  21 -
  2 files changed, 103 insertions(+), 30 deletions(-)

--
2.44.0



[RFC v2 0/2] Discussion around eviction improvements

2024-05-16 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Reduced re-spin of my previous series after Christian corrected a few
misconceptions that I had. So let's see if what remains makes sense or is still
misguided.

To summarise, the series addresses the following two issues:

 * Migration rate limiting does not work, at least not for the common case
   where userspace configures VRAM+GTT. It thinks it can stop migration attempts
   by playing with bo->allowed_domains vs bo->preferred_domains but, both from
   the code and from empirical experiments, I see that not working at all. When
   both masks are identical, fiddling with them achieves nothing. Even when they
   are not identical, allowed_domains has a fallback GTT placement, which means
   that when over the migration budget ttm_bo_validate() with bo->allowed_domains
   can cause migration from GTT to VRAM.

 * The driver thinks it will be re-validating evicted buffers on the next
   submission, but it does not for the very common case of VRAM+GTT because it
   only checks whether the current placement is *none* of the preferred placements.

These two patches appear to have a positive result for a memory intensive game
like Assassin's Creed Valhalla. On an APU like the Steam Deck the game has a
working set of around 5 GiB, while the VRAM is configured to 1 GiB. Correctly
respecting the migration budget appears to keep buffer blits at bay and improves
the minimum frame rate, i.e. makes things smoother.

From the game's built-in benchmark, average of three runs each:

                           FPS
           migrated KiB    min    avg    max    min-1%  min-0.1%
  before       20784781  10.00  37.00  89.67    22.00     12.33
  patched       4227688  13.67  37.00  81.33    23.33     15.00

A disclaimer I have is that more runs would be needed to be more confident
about the results. And more games. And APU versus discrete.

Cc: Christian König 
Cc: Friedrich Vock 

Tvrtko Ursulin (2):
  drm/amdgpu: Re-validate evicted buffers
  drm/amdgpu: Actually respect buffer migration budget

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 112 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c |  21 -
 2 files changed, 103 insertions(+), 30 deletions(-)

-- 
2.44.0



[RFC 2/2] drm/amdgpu: Actually respect buffer migration budget

2024-05-16 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Current code appears to live under the misconception that playing with buffer
allowed and preferred placements can always control the decision on
whether backing store migration will be attempted or not.

That is however not the case when userspace sets buffer placements of
VRAM+GTT, which is what radv does since commit 862b6a9a ("radv: Improve
spilling on discrete GPUs."), with the end result that the migration budget
is completely ignored.

Fix this by validating against a local singleton placement set to the
current backing store location. This way, when the migration budget has been
depleted, we can prevent ttm_bo_validate() from seeing anything other than the
current placement.

For the case of the implicit GTT allowed domain added in amdgpu_bo_create()
when userspace only sets VRAM, the behaviour should be the same. On the
first pass the re-validation will attempt to migrate away from the
fallback GTT domain, and if that does not succeed the buffer will remain in
the fallback placement.

Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 112 +++--
 1 file changed, 85 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index ec888fc6ead8..08e7631f3a2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -32,6 +32,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 #include "amdgpu_cs.h"
@@ -775,6 +776,56 @@ void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes,
	spin_unlock(&adev->mm_stats.lock);
 }
 
+static bool
+amdgpu_cs_bo_move_under_budget(struct amdgpu_cs_parser *p,
+  struct amdgpu_bo *abo)
+{
+   struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
+
+   /*
+* Don't move this buffer if we have depleted our allowance
+* to move it. Don't move anything if the threshold is zero.
+*/
+   if (p->bytes_moved >= p->bytes_moved_threshold)
+   return false;
+
+   if ((!abo->tbo.base.dma_buf ||
+list_empty(&abo->tbo.base.dma_buf->attachments)) &&
+   (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
+(abo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED)) &&
+   p->bytes_moved_vis >= p->bytes_moved_vis_threshold) {
+   /*
+* And don't move a CPU_ACCESS_REQUIRED BO to limited
+* visible VRAM if we've depleted our allowance to do
+* that.
+*/
+   return false;
+   }
+
+   return true;
+}
+
+static bool
+amdgpu_bo_fill_current_placement(struct amdgpu_bo *abo,
+struct ttm_placement *placement,
+struct ttm_place *place)
+{
	struct ttm_placement *bo_placement = &abo->placement;
+   int i;
+
+   for (i = 0; i < bo_placement->num_placement; i++) {
+   if (bo_placement->placement[i].mem_type ==
+   abo->tbo.resource->mem_type) {
+   *place = bo_placement->placement[i];
+   placement->num_placement = 1;
+   placement->placement = place;
+   return true;
+   }
+   }
+
+   return false;
+}
+
 static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
 {
struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
@@ -784,46 +835,53 @@ static int amdgpu_cs_bo_validate(void *param, struct amdgpu_bo *bo)
.no_wait_gpu = false,
.resv = bo->tbo.base.resv
};
-   uint32_t domain;
+   bool allow_move;
int r;
 
if (bo->tbo.pin_count)
return 0;
 
-   /* Don't move this buffer if we have depleted our allowance
-* to move it. Don't move anything if the threshold is zero.
-*/
-   if (p->bytes_moved < p->bytes_moved_threshold &&
-   (!bo->tbo.base.dma_buf ||
-   list_empty(&bo->tbo.base.dma_buf->attachments))) {
-   if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
-   (bo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED)) {
-   /* And don't move a CPU_ACCESS_REQUIRED BO to limited
-* visible VRAM if we've depleted our allowance to do
-* that.
-*/
-   if (p->bytes_moved_vis < p->bytes_moved_vis_threshold)
-   domain = bo->preferred_domains;
-   else
-   domain = bo->allowed_domains;
-   } else {
-   domain = bo->preferred_domains;
-   }
-  

[RFC 1/2] drm/amdgpu: Re-validate evicted buffers

2024-05-16 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Currently the driver appears to think that it will attempt to re-validate
the evicted buffers on the next submission if they are not in their
preferred placement.

That however appears not to be true for the very common case of buffers
with allowed placements of VRAM+GTT, simply because the check can only
detect whether the current placement is *none* of the preferred ones, happily
leaving VRAM+GTT buffers in the GTT placement "forever".

Fix it by extending the VRAM+GTT special case to the re-validation logic.

Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 6bddd43604bc..e53ff914b62e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1248,10 +1248,25 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va,
 * next command submission.
 */
if (amdgpu_vm_is_bo_always_valid(vm, bo)) {
-   uint32_t mem_type = bo->tbo.resource->mem_type;
+   unsigned current_domain =
+   amdgpu_mem_type_to_domain(bo->tbo.resource->mem_type);
+   bool move_to_evict = false;
 
-   if (!(bo->preferred_domains &
- amdgpu_mem_type_to_domain(mem_type)))
+   if (!(bo->preferred_domains & current_domain)) {
+   move_to_evict = true;
+   } else if ((bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK) ==
+  (AMDGPU_GEM_DOMAIN_VRAM | AMDGPU_GEM_DOMAIN_GTT) &&
+  current_domain != AMDGPU_GEM_DOMAIN_VRAM) {
+   /*
+* If userspace has provided a list of possible
+* placements equal to VRAM+GTT, we assume VRAM is *the*
+* preferred placement and so try to move it back there
+* on the next submission.
+*/
+   move_to_evict = true;
+   }
+
+   if (move_to_evict)
	amdgpu_vm_bo_evicted(&bo_va->base);
		else
	amdgpu_vm_bo_idle(&bo_va->base);
-- 
2.44.0



Re: [PATCH v4 8/8] drm/xe/client: Print runtime to fdinfo

2024-05-16 Thread Tvrtko Ursulin



On 15/05/2024 22:42, Lucas De Marchi wrote:

Print the accumulated runtime for client when printing fdinfo.
Each time a query is done it first does 2 things:

1) loop through all the exec queues for the current client and
accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
that, being read from the context image.

2) Read a "GPU timestamp" that can be used for considering "how much GPU
time has passed" and that has the same unit/refclock as the one
recording the runtime. RING_TIMESTAMP is used for that via MMIO.

Since for all current platforms RING_TIMESTAMP follows the same
refclock, just read it once, using any first engine available.

This is exported to userspace as 2 numbers in fdinfo:

drm-cycles-: 
drm-total-cycles-: 

Userspace is expected to collect at least 2 samples, which allows to
know the client engine busyness as per:

             RUNTIME1 - RUNTIME0
 busyness = ---------------------
                   T1 - T0

Since drm-cycles- always starts at 0, it's also possible to know
if an engine was ever used by a client.

It's expected that userspace will read any 2 samples every few seconds.
Given the update frequency of the counters involved and that
CTX_TIMESTAMP is 32-bits, the counter for each exec_queue can wrap
around (assuming 100% utilization) after ~200s. The wraparound is not
perceived by userspace since it's just accumulated for all the
exec_queues in a 64-bit counter) but the measurement will not be
accurate if the samples are too far apart.

This could be mitigated by adding a workqueue to accumulate the counters
every so often, but it's additional complexity for something that is
done already by userspace every few seconds in tools like gputop (from
igt), htop, nvtop, etc, with none of them really defaulting to 1 sample
per minute or more.
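
To make the sampling scheme concrete, here is a small userspace sketch of my
own (not from the patch): it takes the two counters from two fdinfo reads a
few seconds apart and derives busyness from the deltas; the fdinfo parsing is
replaced by hard-coded example values.

#include <stdio.h>

struct sample {
	unsigned long long cycles;	/* drm-cycles-<class> */
	unsigned long long total;	/* drm-total-cycles-<class> */
};

static double busyness(const struct sample *s0, const struct sample *s1)
{
	unsigned long long dt = s1->total - s0->total;

	if (!dt)
		return 0.0;

	return 100.0 * (double)(s1->cycles - s0->cycles) / (double)dt;
}

int main(void)
{
	/* Example values standing in for two fdinfo reads a few seconds apart. */
	struct sample s0 = { .cycles = 1000,  .total = 200000 };
	struct sample s1 = { .cycles = 51000, .total = 300000 };

	printf("engine busyness: %.1f%%\n", busyness(&s0, &s1));
	return 0;
}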

Signed-off-by: Lucas De Marchi 
---
  Documentation/gpu/drm-usage-stats.rst   |  21 +++-
  Documentation/gpu/xe/index.rst  |   1 +
  Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
  drivers/gpu/drm/xe/xe_drm_client.c  | 121 +++-
  4 files changed, 150 insertions(+), 3 deletions(-)
  create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst

diff --git a/Documentation/gpu/drm-usage-stats.rst 
b/Documentation/gpu/drm-usage-stats.rst
index 6dc299343b48..a80f95ca1b2f 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -112,6 +112,19 @@ larger value within a reasonable period. Upon observing a 
value lower than what
  was previously read, userspace is expected to stay with that larger previous
  value until a monotonic update is seen.
  
+- drm-total-cycles-<keystr>: <uint>
+
+Engine identifier string must be the same as the one specified in the
+drm-cycles-<keystr> tag and shall contain the total number of cycles for the given
+engine.
+
+This is a timestamp in GPU unspecified unit that matches the update rate
+of drm-cycles-<keystr>. For drivers that implement this interface, the engine
+utilization can be calculated entirely on the GPU clock domain, without
+considering the CPU sleep time between 2 samples.
+
+A driver may implement either this key or drm-maxfreq-<keystr>, but not both.
+
  - drm-maxfreq-<keystr>: <uint> [Hz|MHz|KHz]
  
  Engine identifier string must be the same as the one specified in the

@@ -121,6 +134,9 @@ percentage utilization of the engine, whereas drm-engine-<keystr> only reflects
  time active without considering what frequency the engine is operating as a
  percentage of its maximum frequency.
  
+A driver may implement either this key or drm-total-cycles-<keystr>, but not
+both.
+


For the spec part:

Acked-by: Tvrtko Ursulin 

Some minor comments and questions below.


  Memory
  ^^
  
@@ -168,5 +184,6 @@ be documented above and where possible, aligned with other drivers.

  Driver specific implementations
  ---
  
-:ref:`i915-usage-stats`

-:ref:`panfrost-usage-stats`
+* :ref:`i915-usage-stats`
+* :ref:`panfrost-usage-stats`
+* :ref:`xe-usage-stats`
diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index c224ecaee81e..3f07aa3b5432 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver.
 xe_firmware
 xe_tile
 xe_debugging
+   xe-drm-usage-stats.rst
diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst 
b/Documentation/gpu/xe/xe-drm-usage-stats.rst
new file mode 100644
index ..482d503ae68a
--- /dev/null
+++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+.. _xe-usage-stats:
+
+
+Xe DRM client usage stats implementation
+
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c
+   :doc: DRM Client usage stats
diff --git a/drivers/gpu/drm/xe/xe_drm_client.c 
b/drivers/gpu/drm/xe/xe_drm_client.

Re: [RFC 2/5] drm/amdgpu: Actually respect buffer migration budget

2024-05-15 Thread Tvrtko Ursulin



On 15/05/2024 15:31, Christian König wrote:

Am 15.05.24 um 12:59 schrieb Tvrtko Ursulin:


On 15/05/2024 08:20, Christian König wrote:

Am 08.05.24 um 20:09 schrieb Tvrtko Ursulin:

From: Tvrtko Ursulin 

Current code appears to live in a misconception that playing with 
buffer

allowed and preferred placements can control the decision on whether
backing store migration will be attempted or not.

Both from code inspection and from empirical experiments I see that not
being true, and that both allowed and preferred placement are typically
set to the same bitmask.


That's not correct for the use case handled here, but see below.


Which part is not correct, that bo->preferred_domains and
bo->allowed_domains are the same bitmask?


Sorry totally forgot to explain that.

This rate limit here was specially made for OpenGL applications which
overcommit VRAM. In that case preferred_domains will be VRAM only and
allowed_domains will be VRAM|GTT.


RADV always uses VRAM|GTT for both (which is correct).


Got it, thanks!

As such, when the code decides to throttle the migration for a 
client, it
is in fact not achieving anything. Buffers can still be either 
migrated or
not migrated based on the external (to this function and facility) 
logic.


Fix it by not changing the buffer object placements if the migration
budget has been spent.

FIXME:
Is it still required to call validate is the question..

Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

index 22708954ae68..d07a1dd7c880 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -784,6 +784,7 @@ static int amdgpu_cs_bo_validate(void *param, 
struct amdgpu_bo *bo)

  .no_wait_gpu = false,
  .resv = bo->tbo.base.resv
  };
+    bool migration_allowed = true;
  struct ttm_resource *old_res;
  uint32_t domain;
  int r;
@@ -805,19 +806,24 @@ static int amdgpu_cs_bo_validate(void *param, 
struct amdgpu_bo *bo)

   * visible VRAM if we've depleted our allowance to do
   * that.
   */
-    if (p->bytes_moved_vis < p->bytes_moved_vis_threshold)
+    if (p->bytes_moved_vis < p->bytes_moved_vis_threshold) {
  domain = bo->preferred_domains;
-    else
+    } else {
  domain = bo->allowed_domains;
+    migration_allowed = false;
+    }
  } else {
  domain = bo->preferred_domains;
  }
  } else {
  domain = bo->allowed_domains;
+    migration_allowed = false;
  }
  retry:
-    amdgpu_bo_placement_from_domain(bo, domain);
+    if (migration_allowed)
+    amdgpu_bo_placement_from_domain(bo, domain);


That's completely invalid. Calling amdgpu_bo_placement_from_domain() 
is a mandatory prerequisite for calling ttm_bo_validate();


E.g. the usual code flow is:

/* This initializes bo->placement */
amdgpu_bo_placement_from_domain()

/* Eventually modify bo->placement to fit special requirements */

/* Apply the placement to the BO */
ttm_bo_validate(&bo->tbo, &bo->placement, &ctx)

To sum it up bo->placement should be a variable on the stack instead, 
but we never bothered to clean that up.


I am not clear if you agree or not that the current method of trying 
to avoid migration doesn't really do anything?


I totally agree, but the approach you taken to fix it is just quite 
broken. You can't leave bo->placement uninitialized and expect that 
ttm_bo_validate() won't move the BO.


Yep, that much was clear, sorry that I did not explicitly acknowledge 
but just moved on to discussing how to fix it properly.


On-stack placements sound plausible to force migration avoidance by
putting a single current object placement in that list, if that is
what you have in mind? Or a specialized flag/version of
amdgpu_bo_placement_from_domain with a bool input like
"allow_placement_change"?


A very rough idea with no guarantee that it actually works:

Add a TTM_PL_FLAG_RATE_LIMITED with all the TTM code to actually figure 
out how many bytes have been moved and how many bytes the current 
operation can move etc...


Friedrich's patches actually looked like quite a step into the right 
direction for that already, so I would start from there.


Then always feed amdgpu_bo_placement_from_domain() with the 
allowed_domains in the CS path and VM validation.


Finally extend amdgpu_bo_placement_from_domain() to take a closer look 
at bo->preferred_domains, similar to how we do for the 
TTM_PL_FLAG_FALLBACK already and set the TTM_PL_FLAG_RATE_LIMITED flag 
as appropriate.
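
To make that concrete, a very rough sketch of what such marking could look
like next to the existing amdgpu placement code; TTM_PL_FLAG_RATE_LIMITED is
purely hypothetical here, its value is arbitrary, and TTM would still need to
be taught to honour it -- nothing below is tested:

/* Hypothetical flag, does not exist in include/drm/ttm/ttm_placement.h today. */
#define TTM_PL_FLAG_RATE_LIMITED	(1 << 7)

static void amdgpu_mark_rate_limited_places(struct amdgpu_bo *bo)
{
	unsigned int i;

	for (i = 0; i < bo->placement.num_placement; i++) {
		struct ttm_place *place = &bo->placements[i];
		u32 domain = amdgpu_mem_type_to_domain(place->mem_type);

		/*
		 * Keep feeding the allowed domains in as before, but tag the
		 * placements the BO merely tolerates, so TTM could refuse to
		 * migrate into them once the per-CS budget is exhausted.
		 */
		if (!(bo->preferred_domains & domain))
			place->flags |= TTM_PL_FLAG_RATE_LIMITED;
	}
}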


Two things which I kind of don't like with the placement flag idea is

Re: [RFC 2/5] drm/amdgpu: Actually respect buffer migration budget

2024-05-15 Thread Tvrtko Ursulin



On 15/05/2024 08:20, Christian König wrote:

Am 08.05.24 um 20:09 schrieb Tvrtko Ursulin:

From: Tvrtko Ursulin 

Current code appears to live in a misconception that playing with buffer
allowed and preferred placements can control the decision on whether
backing store migration will be attempted or not.

Both from code inspection and from empirical experiments I see that not
being true, and that both allowed and preferred placement are typically
set to the same bitmask.


That's not correct for the use case handled here, but see below.


Which part is not correct, that bo->preferred_domains and
bo->allowed_domains are the same bitmask?




As such, when the code decides to throttle the migration for a client, it
is in fact not achieving anything. Buffers can still be either 
migrated or

not migrated based on the external (to this function and facility) logic.

Fix it by not changing the buffer object placements if the migration
budget has been spent.

FIXME:
Is it still required to call validate is the question..

Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

index 22708954ae68..d07a1dd7c880 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -784,6 +784,7 @@ static int amdgpu_cs_bo_validate(void *param, 
struct amdgpu_bo *bo)

  .no_wait_gpu = false,
  .resv = bo->tbo.base.resv
  };
+    bool migration_allowed = true;
  struct ttm_resource *old_res;
  uint32_t domain;
  int r;
@@ -805,19 +806,24 @@ static int amdgpu_cs_bo_validate(void *param, 
struct amdgpu_bo *bo)

   * visible VRAM if we've depleted our allowance to do
   * that.
   */
-    if (p->bytes_moved_vis < p->bytes_moved_vis_threshold)
+    if (p->bytes_moved_vis < p->bytes_moved_vis_threshold) {
  domain = bo->preferred_domains;
-    else
+    } else {
  domain = bo->allowed_domains;
+    migration_allowed = false;
+    }
  } else {
  domain = bo->preferred_domains;
  }
  } else {
  domain = bo->allowed_domains;
+    migration_allowed = false;
  }
  retry:
-    amdgpu_bo_placement_from_domain(bo, domain);
+    if (migration_allowed)
+    amdgpu_bo_placement_from_domain(bo, domain);


That's completely invalid. Calling amdgpu_bo_placement_from_domain() is 
a mandatory prerequisite for calling ttm_bo_validate();


E.g. the usual code flow is:

/* This initializes bo->placement */
amdgpu_bo_placement_from_domain()

/* Eventually modify bo->placement to fit special requirements */

/* Apply the placement to the BO */
ttm_bo_validate(&bo->tbo, &bo->placement, &ctx)

To sum it up bo->placement should be a variable on the stack instead, 
but we never bothered to clean that up.


I am not clear if you agree or not that the current method of trying to 
avoid migration doesn't really do anything?


On-stack placements sound plausible to force migration avoidance by
putting a single current object placement in that list, if that is what
you have in mind? Or a specialized flag/version of
amdgpu_bo_placement_from_domain with a bool input like
"allow_placement_change"?


Regards,

Tvrtko



Regards,
Christian.


+
  r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
  if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {




Re: [RFC 1/5] drm/amdgpu: Fix migration rate limiting accounting

2024-05-15 Thread Tvrtko Ursulin




On 15/05/2024 08:14, Christian König wrote:

Am 08.05.24 um 20:09 schrieb Tvrtko Ursulin:

From: Tvrtko Ursulin 

The logic assumed any migration attempt worked and therefore would over-
account the amount of data migrated during buffer re-validation. As a
consequence client can be unfairly penalised by incorrectly considering
its migration budget spent.

Fix it by looking at the before and after buffer object backing store and
only account if there was a change.

FIXME:
I think this needs a better solution to account for migrations between
VRAM visible and non-visible portions.

Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 26 +-
  1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

index ec888fc6ead8..22708954ae68 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -784,12 +784,15 @@ static int amdgpu_cs_bo_validate(void *param, 
struct amdgpu_bo *bo)

  .no_wait_gpu = false,
  .resv = bo->tbo.base.resv
  };
+    struct ttm_resource *old_res;
  uint32_t domain;
  int r;
  if (bo->tbo.pin_count)
  return 0;
+    old_res = bo->tbo.resource;
+
  /* Don't move this buffer if we have depleted our allowance
   * to move it. Don't move anything if the threshold is zero.
   */
@@ -817,16 +820,29 @@ static int amdgpu_cs_bo_validate(void *param, 
struct amdgpu_bo *bo)

  amdgpu_bo_placement_from_domain(bo, domain);
  r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
-    p->bytes_moved += ctx.bytes_moved;
-    if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
-    amdgpu_res_cpu_visible(adev, bo->tbo.resource))
-    p->bytes_moved_vis += ctx.bytes_moved;
-
  if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {
  domain = bo->allowed_domains;
  goto retry;
  }
+    if (!r) {
+    struct ttm_resource *new_res = bo->tbo.resource;
+    bool moved = true;
+
+    if (old_res == new_res)
+    moved = false;
+    else if (old_res && new_res &&
+ old_res->mem_type == new_res->mem_type)
+    moved = false;


The old resource might already be destroyed after you return from 
validation. So this here won't work.


Apart from that even when a migration attempt fails the moved bytes 
should be accounted.


When the validation attempt doesn't cause any moves then the byte count
here would be zero.


So as far as I can see that is as fair as you can get.


Right, I think I suffered a bit of tunnel vision here and completely
ignored the _ctx_.bytes_moved part. Scratch this one too then.


Regards,

Tvrtko



Regards,
Christian.

PS: Looks like our mail servers are once more not very reliable.

If you get mails from me multiple times please just ignore it.


+
+    if (moved) {
+    p->bytes_moved += ctx.bytes_moved;
+    if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
+    amdgpu_res_cpu_visible(adev, bo->tbo.resource))
+    p->bytes_moved_vis += ctx.bytes_moved;
+    }
+    }
+
  return r;
  }




Re: [RFC 0/5] Discussion around eviction improvements

2024-05-14 Thread Tvrtko Ursulin



On 13/05/2024 14:49, Tvrtko Ursulin wrote:


On 09/05/2024 13:40, Tvrtko Ursulin wrote:


On 08/05/2024 19:09, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Last few days I was looking at the situation with VRAM over 
subscription, what
happens versus what perhaps should happen. Browsing through the 
driver and

running some simple experiments.

I ended up with this patch series which, as a disclaimer, may be 
completely
wrong but as I found some suspicious things, to me at least, I 
thought it was a

good point to stop and request some comments.

To perhaps summarise what are the main issues I think I found:

  * Migration rate limiting does not bother knowing if actual 
migration happened

    and so can over-account and unfairly penalise.

  * Migration rate limiting does not even work, at least not for the 
common case
    where userspace configures VRAM+GTT. It thinks it can stop 
migration attempts
    by playing with bo->allowed_domains vs bo->preferred domains but, 
both from
    the code, and from empirical experiments, I see that not working 
at all. Both

    masks are identical so fiddling with them achieves nothing.

  * Idea of the fallback placement only works when VRAM has free 
space. As soon
    as it does not, ttm_resource_compatible is happy to leave the 
buffers in the

    secondary placement forever.

  * Driver thinks it will be re-validating evicted buffers on the 
next submission
    but it does not for the very common case of VRAM+GTT because it 
only checks

    if current placement is *none* of the preferred placements.

All those problems are addressed in individual patches.

End result of this series appears to be driver which will try harder 
to move
buffers back into VRAM, but will be (more) correctly throttled in 
doing so by

the existing rate limiting logic.

I have run a quick benchmark of Cyberpunk 2077 and cannot say that I 
saw a
change but that could be a good thing too. At least I did not break 
anything,
perhaps.. On one occassion I did see the rate limiting logic get 
confused while
for a period of few minutes it went to a mode where it was constantly 
giving a
high migration budget. But that recovered itself when I switched 
clients and did
not come back so I don't know. If there is something wrong there I 
don't think

it would be caused by any patches in this series.


Since yesterday I also briefly tested with Far Cry New Dawn. One run 
each so possibly doesn't mean anything apart that there isn't a 
regression aka migration throttling is keeping things at bay even with 
increased requests to migrate things back to VRAM:


  before after
min/avg/max fps    36/44/54    37/45/55

Cyberpunk 2077 from yesterday was similarly close:

 26.96/29.59/30.40    29.70/30.00/30.32

I guess the real story is proper DGPU where misplaced buffers have a 
real cost.


I found one game which regresses spectacularly badly with this series - 
Assasin's Creed Valhalla. The built-in benchmark at least. The game 
appears to have a working set much larger than the other games I tested, 
around 5GiB total during the benchmark. And for some reason migration 
throttling totally fails to put it in check. I will be investigating 
this shortly.


I think that the conclusion is everything I attempted to add relating to 
TTM_PL_PREFERRED does not really work as I initially thought it did. 
Therefore please imagine this series as only containing patches 1, 2 and 5.


(And FWIW it was quite annoying to get to the bottom of since for some
reason the system exhibits some sort of latching behaviour, where on
some boots and/or for some minutes of runtime things were fine, and then it
would latch onto a mode where the TTM_PL_PREFERRED induced breakage
would show. And sometimes this breakage would appear straight away. Odd.)


I still need to test though whether the subset of patches manages to achieve
some positive improvement on its own. It is possible, as patch 5 marks
more buffers for re-validation, so once overcommit subsides they would
get promoted to the preferred placement straight away. And 1&2 are
notionally fixes for migration throttling, so at least in a broad sense they
should still be valid as discussion points.


Regards,

Tvrtko

Series is probably rough but should be good enough for dicsussion. I 
am curious

to hear if I identified at least something correctly as a real problem.

It would also be good to hear what are the suggested games to check 
and see

whether there is any improvement.

Cc: Christian König 
Cc: Friedrich Vock 

Tvrtko Ursulin (5):
   drm/amdgpu: Fix migration rate limiting accounting
   drm/amdgpu: Actually respect buffer migration budget
   drm/ttm: Add preferred placement flag
   drm/amdgpu: Use preferred placement for VRAM+GTT
   drm/amdgpu: Re-validate evicted buffers

  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
  drivers/gpu/drm/

Re: [RFC 0/5] Discussion around eviction improvements

2024-05-13 Thread Tvrtko Ursulin



On 09/05/2024 13:40, Tvrtko Ursulin wrote:


On 08/05/2024 19:09, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Last few days I was looking at the situation with VRAM over 
subscription, what
happens versus what perhaps should happen. Browsing through the driver 
and

running some simple experiments.

I ended up with this patch series which, as a disclaimer, may be 
completely
wrong but as I found some suspicious things, to me at least, I thought 
it was a

good point to stop and request some comments.

To perhaps summarise what are the main issues I think I found:

  * Migration rate limiting does not bother knowing if actual 
migration happened

    and so can over-account and unfairly penalise.

  * Migration rate limiting does not even work, at least not for the 
common case
    where userspace configures VRAM+GTT. It thinks it can stop 
migration attempts
    by playing with bo->allowed_domains vs bo->preferred domains but, 
both from
    the code, and from empirical experiments, I see that not working 
at all. Both

    masks are identical so fiddling with them achieves nothing.

  * Idea of the fallback placement only works when VRAM has free 
space. As soon
    as it does not, ttm_resource_compatible is happy to leave the 
buffers in the

    secondary placement forever.

  * Driver thinks it will be re-validating evicted buffers on the next 
submission
    but it does not for the very common case of VRAM+GTT because it 
only checks

    if current placement is *none* of the preferred placements.

All those problems are addressed in individual patches.

End result of this series appears to be driver which will try harder 
to move
buffers back into VRAM, but will be (more) correctly throttled in 
doing so by

the existing rate limiting logic.

I have run a quick benchmark of Cyberpunk 2077 and cannot say that I 
saw a
change but that could be a good thing too. At least I did not break 
anything,
perhaps.. On one occassion I did see the rate limiting logic get 
confused while
for a period of few minutes it went to a mode where it was constantly 
giving a
high migration budget. But that recovered itself when I switched 
clients and did
not come back so I don't know. If there is something wrong there I 
don't think

it would be caused by any patches in this series.


Since yesterday I also briefly tested with Far Cry New Dawn. One run 
each so possibly doesn't mean anything apart that there isn't a 
regression aka migration throttling is keeping things at bay even with 
increased requests to migrate things back to VRAM:


  before after
min/avg/max fps    36/44/54    37/45/55

Cyberpunk 2077 from yesterday was similarly close:

     26.96/29.59/30.40    29.70/30.00/30.32

I guess the real story is proper DGPU where misplaced buffers have a 
real cost.


I found one game which regresses spectacularly badly with this series -
Assassin's Creed Valhalla. The built-in benchmark at least. The game
appears to have a working set much larger than the other games I tested,
around 5 GiB total during the benchmark. And for some reason migration
throttling totally fails to put it in check. I will be investigating
this shortly.


Regards,

Tvrtko

Series is probably rough but should be good enough for dicsussion. I 
am curious

to hear if I identified at least something correctly as a real problem.

It would also be good to hear what are the suggested games to check 
and see

whether there is any improvement.

Cc: Christian König 
Cc: Friedrich Vock 

Tvrtko Ursulin (5):
   drm/amdgpu: Fix migration rate limiting accounting
   drm/amdgpu: Actually respect buffer migration budget
   drm/ttm: Add preferred placement flag
   drm/amdgpu: Use preferred placement for VRAM+GTT
   drm/amdgpu: Re-validate evicted buffers

  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++--
  drivers/gpu/drm/ttm/ttm_resource.c | 13 +---
  include/drm/ttm/ttm_placement.h    |  3 ++
  5 files changed, 65 insertions(+), 18 deletions(-)



Re: [PATCH 12/12] accel/ivpu: Share NPU busy time in sysfs

2024-05-13 Thread Tvrtko Ursulin



On 13/05/2024 11:22, Jacek Lawrynowicz wrote:

Hi,

On 10.05.2024 18:55, Jeffrey Hugo wrote:

On 5/8/2024 7:29 AM, Jacek Lawrynowicz wrote:

From: Tomasz Rusinowicz 

The driver tracks the time spent by NPU executing jobs
and shares it through sysfs `npu_busy_time_us` file.
It can be then used by user space applications to monitor device
utilization.

NPU is considered 'busy' starting with a first job submitted
to firmware and ending when there is no more jobs pending/executing.

Signed-off-by: Tomasz Rusinowicz 
Signed-off-by: Jacek Lawrynowicz 


This feels like something that would normally be handled by perf. Why not use 
that mechanism?


Yeah, probably, but we had several requests to provide an easy to use
interface for this metric that could be integrated in various user space
apps/tools that do not use ftrace.


Probably more like Perf/PMU aka performance counters? Those would be
scriptable via $kernel/tools/perf or directly via perf_event_open(2) and
read(2).
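
For illustration, a minimal userspace sketch of that route (my own, not an
existing interface: the "npu" PMU name and the config value 0 for a busy-time
event are made up; a real driver would document its events under
/sys/bus/event_source/devices/<pmu>/events/):

#include <inttypes.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	uint64_t busy;
	int type, fd;

	/* The dynamically assigned PMU type id is exported via sysfs. */
	FILE *f = fopen("/sys/bus/event_source/devices/npu/type", "r");

	if (!f || fscanf(f, "%d", &type) != 1)
		return 1;
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.config = 0; /* hypothetical "busy-time" event */

	/* Uncore-style counter: no task context, counted on CPU 0. */
	fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0)
		return 1;

	sleep(1);
	if (read(fd, &busy, sizeof(busy)) == sizeof(busy))
		printf("busy time: %" PRIu64 " (PMU units)\n", busy);

	close(fd);
	return 0;
}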


Note it is not easy to get right; in the i915 implementation (see
i915_pmu.c) we have a known issue with PCI hot unplug and use-after-free
which needs input from the perf core folks.


Regards,

Tvrtko


Re: [RFC 0/5] Discussion around eviction improvements

2024-05-09 Thread Tvrtko Ursulin



On 08/05/2024 19:09, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Last few days I was looking at the situation with VRAM over subscription, what
happens versus what perhaps should happen. Browsing through the driver and
running some simple experiments.

I ended up with this patch series which, as a disclaimer, may be completely
wrong but as I found some suspicious things, to me at least, I thought it was a
good point to stop and request some comments.

To perhaps summarise what are the main issues I think I found:

  * Migration rate limiting does not bother knowing if actual migration happened
and so can over-account and unfairly penalise.

  * Migration rate limiting does not even work, at least not for the common case
where userspace configures VRAM+GTT. It thinks it can stop migration 
attempts
by playing with bo->allowed_domains vs bo->preferred domains but, both from
the code, and from empirical experiments, I see that not working at all. 
Both
masks are identical so fiddling with them achieves nothing.

  * Idea of the fallback placement only works when VRAM has free space. As soon
as it does not, ttm_resource_compatible is happy to leave the buffers in the
secondary placement forever.

  * Driver thinks it will be re-validating evicted buffers on the next 
submission
but it does not for the very common case of VRAM+GTT because it only checks
if current placement is *none* of the preferred placements.

All those problems are addressed in individual patches.

End result of this series appears to be driver which will try harder to move
buffers back into VRAM, but will be (more) correctly throttled in doing so by
the existing rate limiting logic.

I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a
change but that could be a good thing too. At least I did not break anything,
perhaps.. On one occassion I did see the rate limiting logic get confused while
for a period of few minutes it went to a mode where it was constantly giving a
high migration budget. But that recovered itself when I switched clients and did
not come back so I don't know. If there is something wrong there I don't think
it would be caused by any patches in this series.


Since yesterday I also briefly tested with Far Cry New Dawn. One run
each, so it possibly doesn't mean anything apart from that there isn't a
regression, aka migration throttling is keeping things at bay even with
increased requests to migrate things back to VRAM:


                    before      after
min/avg/max fps   36/44/54   37/45/55

Cyberpunk 2077 from yesterday was similarly close:

26.96/29.59/30.40   29.70/30.00/30.32

I guess the real story is proper DGPU where misplaced buffers have a 
real cost.


Regards,

Tvrtko


Series is probably rough but should be good enough for dicsussion. I am curious
to hear if I identified at least something correctly as a real problem.

It would also be good to hear what are the suggested games to check and see
whether there is any improvement.

Cc: Christian König 
Cc: Friedrich Vock 

Tvrtko Ursulin (5):
   drm/amdgpu: Fix migration rate limiting accounting
   drm/amdgpu: Actually respect buffer migration budget
   drm/ttm: Add preferred placement flag
   drm/amdgpu: Use preferred placement for VRAM+GTT
   drm/amdgpu: Re-validate evicted buffers

  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++--
  drivers/gpu/drm/ttm/ttm_resource.c | 13 +---
  include/drm/ttm/ttm_placement.h|  3 ++
  5 files changed, 65 insertions(+), 18 deletions(-)



Re: [PATCH v2 6/6] drm/xe/client: Print runtime to fdinfo

2024-05-09 Thread Tvrtko Ursulin



On 08/05/2024 21:53, Lucas De Marchi wrote:

On Wed, May 08, 2024 at 09:23:17AM GMT, Tvrtko Ursulin wrote:


On 07/05/2024 22:35, Lucas De Marchi wrote:

On Fri, Apr 26, 2024 at 11:47:37AM GMT, Tvrtko Ursulin wrote:


On 24/04/2024 00:56, Lucas De Marchi wrote:

Print the accumulated runtime for client when printing fdinfo.
Each time a query is done it first does 2 things:

1) loop through all the exec queues for the current client and
   accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
   that, being read from the context image.

2) Read a "GPU timestamp" that can be used for considering "how 
much GPU

   time has passed" and that has the same unit/refclock as the one
   recording the runtime. RING_TIMESTAMP is used for that via MMIO.

Since for all current platforms RING_TIMESTAMP follows the same
refclock, just read it once, using any first engine.

This is exported to userspace as 2 numbers in fdinfo:

drm-cycles-: 
drm-total-cycles-: 

Userspace is expected to collect at least 2 samples, which allows to
know the client engine busyness as per:

             RUNTIME1 - RUNTIME0
 busyness = ---------------------
                   T1 - T0

Another thing to point out is that it's expected that userspace will
read any 2 samples every few seconds.  Given the update frequency 
of the

counters involved and that CTX_TIMESTAMP is 32-bits, the counter for
each exec_queue can wrap around (assuming 100% utilization) after 
~200s.
The wraparound is not perceived by userspace since it's just 
accumulated

for all the exec_queues in a 64-bit counter), but the measurement will
not be accurate if the samples are too far apart.

This could be mitigated by adding a workqueue to accumulate the 
counters

every so often, but it's additional complexity for something that is
done already by userspace every few seconds in tools like gputop (from
igt), htop, nvtop, etc with none of them really defaulting to 1 sample
per minute or more.

Signed-off-by: Lucas De Marchi 
---
 Documentation/gpu/drm-usage-stats.rst   |  16 ++-
 Documentation/gpu/xe/index.rst  |   1 +
 Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
 drivers/gpu/drm/xe/xe_drm_client.c  | 138 
+++-

 4 files changed, 162 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst

diff --git a/Documentation/gpu/drm-usage-stats.rst 
b/Documentation/gpu/drm-usage-stats.rst

index 6dc299343b48..421766289b78 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -112,6 +112,17 @@ larger value within a reasonable period. Upon 
observing a value lower than what
 was previously read, userspace is expected to stay with that 
larger previous

 value until a monotonic update is seen.
+- drm-total-cycles-: 
+
+Engine identifier string must be the same as the one specified in the
+drm-cycles- tag and shall contain the total number cycles 
for the given

+engine.
+
+This is a timestamp in GPU unspecified unit that matches the 
update rate
+of drm-cycles-. For drivers that implement this interface, 
the engine
+utilization can be calculated entirely on the GPU clock domain, 
without

+considering the CPU sleep time between 2 samples.


Two opens.

1)
Do we need to explicitly document that drm-total-cycles and
drm-maxfreq are mutually exclusive?


so userspace has a fallback mechanism to calculate utilization depending
on what keys are available?


No, documenting all three at once does not make sense. Or at least it is
not expected. Or do you envisage someone might legitimately emit all
three? I don't see what the semantics would be. When we have
cycles+maxfreq the latter is in Hz. And when we have cycles+total then
it is unitless. All three?


I don't follow what you mean here. *cycles* is actually a unit.

The engine spent 10 cycles running this context (drm-cycles). In the
same period there were 100 cycles available (drm-total-cycles). Current
frequency is X MHz. Max frequency is Y MHz. For me all of them make
sense if one wants to mix them together. For xe it doesn't make sense
because the counter backing drm-cycles and drm-total-cycles is unrelated
to the engine frequency.

I can add something in the doc that we do not expected to see all of them
together until we see a usecase. Each driver may implement a subset.


I still don't quite see how a combination of cycles, total cycles and
maxfreq makes sense together. It would require a driver where the cycle
period is equal to 1 / maxfreq, which also means total-cycles would be
equal to maxfreq times the elapsed time, making one of them redundant. So
both for drivers like xe, where the cycle period is unrelated to maxfreq (or
even the fatally misguided curfreq), it doesn't make sense, and for a driver
like the above it is not needed. What use case am I missing?


We need to document this properly so userspace knows how to do the right 
thing depending on what keys they discover.



2)
Should drm-total-cycles for now be documented as driver specific?

Re: [RFC 1/5] drm/amdgpu: Fix migration rate limiting accounting

2024-05-09 Thread Tvrtko Ursulin



On 08/05/2024 20:08, Friedrich Vock wrote:

On 08.05.24 20:09, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

The logic assumed any migration attempt worked and therefore would over-
account the amount of data migrated during buffer re-validation. As a
consequence client can be unfairly penalised by incorrectly considering
its migration budget spent.


If the migration failed but data was still moved (which I think could be
the case when we try evicting everything but it still doesn't work?),
shouldn't the eviction movements count towards the ratelimit too?


Possibly, which path would that be?

I mean there are definitely more migrations which *should not* be counted, 
which I think your mini-series approaches more accurately. What this 
patch achieves, in its current RFC form, is reducing the "false-positive" 
migration budget depletions.


So larger improvements aside, the point of the series was to illustrate 
that even the things which were said to be working do not seem to. See 
the cover letter for what I thought does not work either well or at all.

Fix it by looking at the before and after buffer object backing store and
only account if there was a change.

FIXME:
I think this needs a better solution to account for migrations between
VRAM visible and non-visible portions.


FWIW, I have some WIP patches (not posted on any MLs yet though) that
attempt to solve this issue (+actually enforcing ratelimits) by moving
the ratelimit accounting/enforcement to TTM entirely.

By moving the accounting to TTM we can count moved bytes when we move
them, and don't have to rely on comparing resources to determine whether
moving actually happened. This should address your FIXME as well.


Yep, I've seen them. They are not necessarily conflicting with this 
series, potentially the TTM placement flag aside. *If* something like 
this can be kept small and still manage to fix up a few simple things 
which do not appear to work at all at the moment.


For the larger re-work it is quite, well, large and it is not easy to be 
certain the end result would work as expected. IMO it would be best to 
sketch out a larger series which brings some practical and measurable 
change in behaviour before committing to merge things piecemeal.


For instance I have a niggling feeling that the runtime games the driver 
plays with placements and domains are not great, and wonder if things 
could be cleaner if simplified by letting TTM manage things more, more 
explicitly, and having the list of placements more static. Thinking 
about it, that seems a step too far for now though.


Regards,

Tvrtko



Regards,
Friedrich


Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 26 +-
  1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

index ec888fc6ead8..22708954ae68 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -784,12 +784,15 @@ static int amdgpu_cs_bo_validate(void *param, 
struct amdgpu_bo *bo)

  .no_wait_gpu = false,
  .resv = bo->tbo.base.resv
  };
+    struct ttm_resource *old_res;
  uint32_t domain;
  int r;

  if (bo->tbo.pin_count)
  return 0;

+    old_res = bo->tbo.resource;
+
  /* Don't move this buffer if we have depleted our allowance
   * to move it. Don't move anything if the threshold is zero.
   */
@@ -817,16 +820,29 @@ static int amdgpu_cs_bo_validate(void *param, 
struct amdgpu_bo *bo)

  amdgpu_bo_placement_from_domain(bo, domain);
   r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);

-    p->bytes_moved += ctx.bytes_moved;
-    if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
-    amdgpu_res_cpu_visible(adev, bo->tbo.resource))
-    p->bytes_moved_vis += ctx.bytes_moved;
-
  if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {
  domain = bo->allowed_domains;
  goto retry;
  }

+    if (!r) {
+    struct ttm_resource *new_res = bo->tbo.resource;
+    bool moved = true;
+
+    if (old_res == new_res)
+    moved = false;
+    else if (old_res && new_res &&
+ old_res->mem_type == new_res->mem_type)
+    moved = false;
+
+    if (moved) {
+    p->bytes_moved += ctx.bytes_moved;
+    if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
+    amdgpu_res_cpu_visible(adev, bo->tbo.resource))
+    p->bytes_moved_vis += ctx.bytes_moved;
+    }
+    }
+
  return r;
  }



[RFC 1/5] drm/amdgpu: Fix migration rate limiting accounting

2024-05-08 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

The logic assumed any migration attempt worked and therefore would over-
account the amount of data migrated during buffer re-validation. As a
consequence a client can be unfairly penalised by incorrectly considering
its migration budget spent.

Fix it by looking at the before and after buffer object backing store and
only account if there was a change.

FIXME:
I think this needs a better solution to account for migrations between
VRAM visible and non-visible portions.

Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index ec888fc6ead8..22708954ae68 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -784,12 +784,15 @@ static int amdgpu_cs_bo_validate(void *param, struct 
amdgpu_bo *bo)
.no_wait_gpu = false,
.resv = bo->tbo.base.resv
};
+   struct ttm_resource *old_res;
uint32_t domain;
int r;
 
if (bo->tbo.pin_count)
return 0;
 
+   old_res = bo->tbo.resource;
+
/* Don't move this buffer if we have depleted our allowance
 * to move it. Don't move anything if the threshold is zero.
 */
@@ -817,16 +820,29 @@ static int amdgpu_cs_bo_validate(void *param, struct 
amdgpu_bo *bo)
amdgpu_bo_placement_from_domain(bo, domain);
	r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
 
-   p->bytes_moved += ctx.bytes_moved;
-   if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
-   amdgpu_res_cpu_visible(adev, bo->tbo.resource))
-   p->bytes_moved_vis += ctx.bytes_moved;
-
if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {
domain = bo->allowed_domains;
goto retry;
}
 
+   if (!r) {
+   struct ttm_resource *new_res = bo->tbo.resource;
+   bool moved = true;
+
+   if (old_res == new_res)
+   moved = false;
+   else if (old_res && new_res &&
+old_res->mem_type == new_res->mem_type)
+   moved = false;
+
+   if (moved) {
+   p->bytes_moved += ctx.bytes_moved;
+   if (!amdgpu_gmc_vram_full_visible(&adev->gmc) &&
+   amdgpu_res_cpu_visible(adev, bo->tbo.resource))
+   p->bytes_moved_vis += ctx.bytes_moved;
+   }
+   }
+
return r;
 }
 
-- 
2.44.0



[RFC 2/5] drm/amdgpu: Actually respect buffer migration budget

2024-05-08 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Current code appears to operate under a misconception that playing with
buffer allowed and preferred placements can control the decision on whether
backing store migration will be attempted or not.

Both from code inspection and from empirical experiments I see that not
being true, and that both allowed and preferred placement are typically
set to the same bitmask.

As such, when the code decides to throttle the migration for a client, it
is in fact not achieving anything. Buffers can still be either migrated or
not migrated based on the external (to this function and facility) logic.

Fix it by not changing the buffer object placements if the migration
budget has been spent.

FIXME:
The question is whether it is still required to call validate..

Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 22708954ae68..d07a1dd7c880 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -784,6 +784,7 @@ static int amdgpu_cs_bo_validate(void *param, struct 
amdgpu_bo *bo)
.no_wait_gpu = false,
.resv = bo->tbo.base.resv
};
+   bool migration_allowed = true;
struct ttm_resource *old_res;
uint32_t domain;
int r;
@@ -805,19 +806,24 @@ static int amdgpu_cs_bo_validate(void *param, struct 
amdgpu_bo *bo)
 * visible VRAM if we've depleted our allowance to do
 * that.
 */
-   if (p->bytes_moved_vis < p->bytes_moved_vis_threshold)
+   if (p->bytes_moved_vis < p->bytes_moved_vis_threshold) {
domain = bo->preferred_domains;
-   else
+   } else {
domain = bo->allowed_domains;
+   migration_allowed = false;
+   }
} else {
domain = bo->preferred_domains;
}
} else {
domain = bo->allowed_domains;
+   migration_allowed = false;
}
 
 retry:
-   amdgpu_bo_placement_from_domain(bo, domain);
+   if (migration_allowed)
+   amdgpu_bo_placement_from_domain(bo, domain);
+
	r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
 
if (unlikely(r == -ENOMEM) && domain != bo->allowed_domains) {
-- 
2.44.0



[RFC 0/5] Discussion around eviction improvements

2024-05-08 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Over the last few days I was looking at the situation with VRAM over-subscription,
what happens versus what perhaps should happen, browsing through the driver and
running some simple experiments.

I ended up with this patch series which, as a disclaimer, may be completely
wrong but as I found some suspicious things, to me at least, I thought it was a
good point to stop and request some comments.

To perhaps summarise what are the main issues I think I found:

 * Migration rate limiting does not bother knowing if actual migration happened
   and so can over-account and unfairly penalise.

 * Migration rate limiting does not even work, at least not for the common case
   where userspace configures VRAM+GTT. It thinks it can stop migration attempts
   by playing with bo->allowed_domains vs bo->preferred_domains but, both from
   the code, and from empirical experiments, I see that not working at all. Both
   masks are identical so fiddling with them achieves nothing.

 * Idea of the fallback placement only works when VRAM has free space. As soon
   as it does not, ttm_resource_compatible is happy to leave the buffers in the
   secondary placement forever.

 * Driver thinks it will be re-validating evicted buffers on the next submission
   but it does not for the very common case of VRAM+GTT because it only checks
   if current placement is *none* of the preferred placements.

All those problems are addressed in individual patches.

End result of this series appears to be driver which will try harder to move
buffers back into VRAM, but will be (more) correctly throttled in doing so by
the existing rate limiting logic.

I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a
change but that could be a good thing too. At least I did not break anything,
perhaps.. On one occasion I did see the rate limiting logic get confused and
for a period of a few minutes it went into a mode where it was constantly giving a
high migration budget. But that recovered itself when I switched clients and did
not come back so I don't know. If there is something wrong there I don't think
it would be caused by any patches in this series.

The series is probably rough but should be good enough for discussion. I am curious
to hear if I identified at least something correctly as a real problem.

It would also be good to hear which games are suggested for checking whether
there is any improvement.

Cc: Christian König 
Cc: Friedrich Vock 

Tvrtko Ursulin (5):
  drm/amdgpu: Fix migration rate limiting accounting
  drm/amdgpu: Actually respect buffer migration budget
  drm/ttm: Add preferred placement flag
  drm/amdgpu: Use preferred placement for VRAM+GTT
  drm/amdgpu: Re-validate evicted buffers

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 38 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++--
 drivers/gpu/drm/ttm/ttm_resource.c | 13 +---
 include/drm/ttm/ttm_placement.h|  3 ++
 5 files changed, 65 insertions(+), 18 deletions(-)

-- 
2.44.0



[RFC 5/5] drm/amdgpu: Re-validate evicted buffers

2024-05-08 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Currently the driver appears to be thinking that it will be attempting to
re-validate the evicted buffers on the next submission if they are not in
their preferred placement.

That however appears not to be true for the very common case of buffers
with allowed placements of VRAM+GTT. Simply because the check can only
detect if the current placement is *none* of the preferred ones, happily
leaving VRAM+GTT buffers in the GTT placement "forever".

Fix it by extending the VRAM+GTT special case to the re-validation logic.

Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 6bddd43604bc..e53ff914b62e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1248,10 +1248,25 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, 
struct amdgpu_bo_va *bo_va,
 * next command submission.
 */
if (amdgpu_vm_is_bo_always_valid(vm, bo)) {
-   uint32_t mem_type = bo->tbo.resource->mem_type;
+   unsigned current_domain =
+   amdgpu_mem_type_to_domain(bo->tbo.resource->mem_type);
+   bool move_to_evict = false;
 
-   if (!(bo->preferred_domains &
- amdgpu_mem_type_to_domain(mem_type)))
+   if (!(bo->preferred_domains & current_domain)) {
+   move_to_evict = true;
+   } else if ((bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK) ==
+  (AMDGPU_GEM_DOMAIN_VRAM | AMDGPU_GEM_DOMAIN_GTT) &&
+  current_domain != AMDGPU_GEM_DOMAIN_VRAM) {
+   /*
+* If userspace has provided a list of possible
+* placements equal to VRAM+GTT, we assume VRAM is *the*
+* preferred placement and so try to move it back there
+* on the next submission.
+*/
+   move_to_evict = true;
+   }
+
+   if (move_to_evict)
			amdgpu_vm_bo_evicted(&bo_va->base);
		else
			amdgpu_vm_bo_idle(&bo_va->base);
-- 
2.44.0



[RFC 3/5] drm/ttm: Add preferred placement flag

2024-05-08 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Currently the fallback placement flag can achieve a hint that buffer
should be migrated back to the non-fallback placement, however that only
works while there is no memory pressure. As soon as we reach full VRAM
utilisation, or worse overcommit, the logic is happy to leave buffers in
the fallback placement. Consequence of this is that once buffers are
evicted they never get considered to be migrated back until the memory
pressure subsides, leaving a potentially active client not able to bring
its buffers back in.

Add a "preferred" placement flag which drivers can set when they want some
extra effort to be attempted for bringing a buffer back in.

QQQ:
Is the current "desired" flag unfortunately named perhaps? I ended up
understanding it as more like "would be nice if possible but absolutely
don't bother under memory pressure".

Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
 drivers/gpu/drm/ttm/ttm_resource.c | 13 +
 include/drm/ttm/ttm_placement.h|  3 +++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_resource.c 
b/drivers/gpu/drm/ttm/ttm_resource.c
index 4a66b851b67d..59f3d1bcc11f 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -305,6 +305,8 @@ bool ttm_resource_compatible(struct ttm_resource *res,
 struct ttm_placement *placement,
 bool evicting)
 {
+   const u32 incompatible_flag = evicting ? TTM_PL_FLAG_DESIRED :
+TTM_PL_FLAG_FALLBACK;
struct ttm_buffer_object *bo = res->bo;
struct ttm_device *bdev = bo->bdev;
unsigned i;
@@ -316,11 +318,14 @@ bool ttm_resource_compatible(struct ttm_resource *res,
		const struct ttm_place *place = &placement->placement[i];
struct ttm_resource_manager *man;
 
-   if (res->mem_type != place->mem_type)
-   continue;
+   if (res->mem_type != place->mem_type) {
+   if (place->flags & TTM_PL_FLAG_PREFERRED)
+   return false;
+   else
+   continue;
+   }
 
-   if (place->flags & (evicting ? TTM_PL_FLAG_DESIRED :
-   TTM_PL_FLAG_FALLBACK))
+   if (place->flags & incompatible_flag)
continue;
 
if (place->flags & TTM_PL_FLAG_CONTIGUOUS &&
diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index b510a4812609..8ea0865e9cc8 100644
--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -70,6 +70,9 @@
 /* Placement is only used during eviction */
 #define TTM_PL_FLAG_FALLBACK   (1 << 4)
 
+/* Preferred placement - extra effort is made to migrate back to it */
+#define TTM_PL_FLAG_PREFERRED  (1 << 5)
+
 /**
  * struct ttm_place
  *
-- 
2.44.0



[RFC 4/5] drm/amdgpu: Use preferred placement for VRAM+GTT

2024-05-08 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Now that TTM has the preferred placement flag, extend the current
workaround which assumes the GTT placement as fallback in the presence of
the additional VRAM placement.

By marking the VRAM placement as preferred we will make the buffer re-
validation phase actually attempt to migrate them back to VRAM.

Without it, TTM core logic is happy to leave them in GTT placement
"forever".

Signed-off-by: Tvrtko Ursulin 
Cc: Christian König 
Cc: Friedrich Vock 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 50b7e7c0ce50..9be767357e86 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -128,8 +128,8 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, 
u32 domain)
struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
	struct ttm_placement *placement = &abo->placement;
struct ttm_place *places = abo->placements;
+   int c = 0, vram_index = -1;
u64 flags = abo->flags;
-   u32 c = 0;
 
if (domain & AMDGPU_GEM_DOMAIN_VRAM) {
unsigned int visible_pfn = adev->gmc.visible_vram_size >> 
PAGE_SHIFT;
@@ -158,7 +158,7 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, 
u32 domain)
flags & AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS)
places[c].flags |= TTM_PL_FLAG_CONTIGUOUS;
 
-   c++;
+   vram_index = c++;
}
 
if (domain & AMDGPU_GEM_DOMAIN_DOORBELL) {
@@ -180,8 +180,10 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo 
*abo, u32 domain)
 * When GTT is just an alternative to VRAM make sure that we
 * only use it as fallback and still try to fill up VRAM first.
 */
-   if (domain & abo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM)
+   if (vram_index >= 0) {
places[c].flags |= TTM_PL_FLAG_FALLBACK;
+   places[vram_index].flags |= TTM_PL_FLAG_PREFERRED;
+   }
c++;
}
 
-- 
2.44.0



Re: [PATCH v3 5/6] drm/xe: Add helper to accumulate exec queue runtime

2024-05-08 Thread Tvrtko Ursulin



On 07/05/2024 23:45, Lucas De Marchi wrote:

From: Umesh Nerlige Ramappa 

Add a helper to accumulate per-client runtime of all its
exec queues. This is called every time a sched job is finished.

v2:
   - Use guc_exec_queue_free_job() and execlist_job_free() to accumulate
 runtime when job is finished since xe_sched_job_completed() is not a
 notification that job finished.
   - Stop trying to update runtime from xe_exec_queue_fini() - that is
 redundant and may happen after xef is closed, leading to a
 use-after-free
   - Do not special case the first timestamp read: the default LRC sets
 CTX_TIMESTAMP to zero, so even the first sample should be a valid
 one.
   - Handle the parallel submission case by multiplying the runtime by
 width.

Signed-off-by: Umesh Nerlige Ramappa 
Signed-off-by: Lucas De Marchi 
---
  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
  drivers/gpu/drm/xe/xe_exec_queue.c   | 35 
  drivers/gpu/drm/xe/xe_exec_queue.h   |  1 +
  drivers/gpu/drm/xe/xe_execlist.c |  1 +
  drivers/gpu/drm/xe/xe_guc_submit.c   |  2 ++
  5 files changed, 48 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
b/drivers/gpu/drm/xe/xe_device_types.h
index 906b98fb973b..de078bdf0ab9 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -560,6 +560,15 @@ struct xe_file {
struct mutex lock;
} exec_queue;
  
+	/**

+* @runtime: hw engine class runtime in ticks for this drm client
+*
+* Only stats from xe_exec_queue->lrc[0] are accumulated. For multi-lrc
+* case, since all jobs run in parallel on the engines, only the stats
+* from lrc[0] are sufficient.
+*/
+   u64 runtime[XE_ENGINE_CLASS_MAX];
+
/** @client: drm client */
struct xe_drm_client *client;
  };
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c 
b/drivers/gpu/drm/xe/xe_exec_queue.c
index 395de93579fa..86eb22e22c95 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -769,6 +769,41 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
q->lrc[0].fence_ctx.next_seqno - 1;
  }
  
+/**

+ * xe_exec_queue_update_runtime() - Update runtime for this exec queue from hw
+ * @q: The exec queue
+ *
+ * Update the timestamp saved by HW for this exec queue and save runtime
+ * calculated by using the delta from last update. On multi-lrc case, only the
+ * first is considered.
+ */
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
+{
+   struct xe_file *xef;
+   struct xe_lrc *lrc;
+   u32 old_ts, new_ts;
+
+   /*
+* Jobs that are run during driver load may use an exec_queue, but are
+* not associated with a user xe file, so avoid accumulating busyness
+* for kernel specific work.
+*/
+   if (!q->vm || !q->vm->xef)
+   return;
+
+   xef = q->vm->xef;
+
+   /*
+* Only sample the first LRC. For parallel submission, all of them are
+* scheduled together and we compensate that below by multiplying by
+* width
+*/
+   lrc = &q->lrc[0];
+
+   new_ts = xe_lrc_update_timestamp(lrc, &old_ts);
+   xef->runtime[q->class] += (new_ts - old_ts) * q->width;


I think in theory this could be introducing a systematic error depending 
on how the firmware handles things and the tick resolution. Or even 
regardless of the firmware, if the timestamps are saved on context exit 
by the GPU hw itself and parallel contexts do not exit 100% aligned. 
Undershoot would I think be fine, but systematic overshoot under 
constant 100% parallel load from multiple clients could constantly show 
>100% class utilisation. Probably not a concern in practice but worthy 
of a comment?


Regards,

Tvrtko


+}
+
  void xe_exec_queue_kill(struct xe_exec_queue *q)
  {
struct xe_exec_queue *eq = q, *next;
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h 
b/drivers/gpu/drm/xe/xe_exec_queue.h
index 48f6da53a292..e0f07d28ee1a 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue.h
@@ -75,5 +75,6 @@ struct dma_fence *xe_exec_queue_last_fence_get(struct 
xe_exec_queue *e,
   struct xe_vm *vm);
  void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct xe_vm *vm,
  struct dma_fence *fence);
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q);
  
  #endif

diff --git a/drivers/gpu/drm/xe/xe_execlist.c b/drivers/gpu/drm/xe/xe_execlist.c
index dece2785933c..a316431025c7 100644
--- a/drivers/gpu/drm/xe/xe_execlist.c
+++ b/drivers/gpu/drm/xe/xe_execlist.c
@@ -307,6 +307,7 @@ static void execlist_job_free(struct drm_sched_job *drm_job)
  {
struct xe_sched_job *job = to_xe_sched_job(drm_job);
  
+	xe_exec_queue_update_runtime(job->q);

xe_sched_job_put(job);
  }
  
diff --git 

Re: [PATCH v2 6/6] drm/xe/client: Print runtime to fdinfo

2024-05-08 Thread Tvrtko Ursulin



On 07/05/2024 22:35, Lucas De Marchi wrote:

On Fri, Apr 26, 2024 at 11:47:37AM GMT, Tvrtko Ursulin wrote:


On 24/04/2024 00:56, Lucas De Marchi wrote:

Print the accumulated runtime for client when printing fdinfo.
Each time a query is done it first does 2 things:

1) loop through all the exec queues for the current client and
   accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
   that, being read from the context image.

2) Read a "GPU timestamp" that can be used for considering "how much GPU
   time has passed" and that has the same unit/refclock as the one
   recording the runtime. RING_TIMESTAMP is used for that via MMIO.

Since for all current platforms RING_TIMESTAMP follows the same
refclock, just read it once, using any first engine.

This is exported to userspace as 2 numbers in fdinfo:

drm-cycles-<keystr>: <uint>
drm-total-cycles-<keystr>: <uint>

Userspace is expected to collect at least 2 samples, which allows to
know the client engine busyness as per:

                RUNTIME1 - RUNTIME0
    busyness = ---------------------
                      T1 - T0

Another thing to point out is that it's expected that userspace will
read any 2 samples every few seconds.  Given the update frequency of the
counters involved and that CTX_TIMESTAMP is 32-bits, the counter for
each exec_queue can wrap around (assuming 100% utilization) after ~200s.
The wraparound is not perceived by userspace since it's just accumulated
for all the exec_queues in a 64-bit counter, but the measurement will
not be accurate if the samples are too far apart.

This could be mitigated by adding a workqueue to accumulate the counters
every so often, but it's additional complexity for something that is
done already by userspace every few seconds in tools like gputop (from
igt), htop, nvtop, etc with none of them really defaulting to 1 sample
per minute or more.

Signed-off-by: Lucas De Marchi 
---
 Documentation/gpu/drm-usage-stats.rst   |  16 ++-
 Documentation/gpu/xe/index.rst  |   1 +
 Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
 drivers/gpu/drm/xe/xe_drm_client.c  | 138 +++-
 4 files changed, 162 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst

diff --git a/Documentation/gpu/drm-usage-stats.rst 
b/Documentation/gpu/drm-usage-stats.rst

index 6dc299343b48..421766289b78 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -112,6 +112,17 @@ larger value within a reasonable period. Upon 
observing a value lower than what
 was previously read, userspace is expected to stay with that larger 
previous

 value until a monotonic update is seen.
+- drm-total-cycles-: 
+
+Engine identifier string must be the same as the one specified in the
+drm-cycles- tag and shall contain the total number cycles 
for the given

+engine.
+
+This is a timestamp in GPU unspecified unit that matches the update 
rate
+of drm-cycles-. For drivers that implement this interface, 
the engine

+utilization can be calculated entirely on the GPU clock domain, without
+considering the CPU sleep time between 2 samples.


Two opens.

1)
Do we need to explicitly document that drm-total-cycles and drm-maxfreq 
are mutually exclusive?


so userspace has a fallback mechanism to calculate utilization depending
on what keys are available?


No, documenting all three at once does not make sense. Or at least it is 
not expected. Or do you envisage someone might legitimately emit all three? 
I don't see what the semantics would be. When we have cycles+maxfreq the 
latter is in Hz. And when we have cycles+total then it is unitless. All 
three?



2)
Should drm-total-cycles for now be documented as driver specific?


you mean to call it xe-total-cycles?


Yes but it is not an ask, just an open.

I have added some more people in the cc who were involved with driver 
fdinfo implementations in case they have an opinion.


I would say potentially yes, and promote it to common if more than one 
driver would use it.


For instance I see panfrost has the driver specific drm-curfreq 
(although it isn't documented fully in panfrost.rst). And I have to 
say it is somewhat questionable to expose the current frequency per 
fdinfo per engine, but not my call.


aren't all keys in Documentation/gpu/drm-usage-stats.rst optional, ones a
driver may or may not implement? When you say driver-specific I'd think
more of the ones not using drm- as prefix, e.g. amd-*.

I think drm-cycles + drm-total-cycles is just an alternative
implementation for engine utilization. Like drm-cycles + drm-maxfreq
already is an alternative to drm-engine and is not implemented by e.g.
amdgpu/i915.

I will submit a new version of the entire patch series to get the ball
rolling, but let's keep this open for now.

<...>


+static void show_runtime(struct drm_printer *p, struct drm_file *file)
+{
+    struct xe_file *xef = file->driver_priv;
+    struct xe_device *xe = x

Re: drm scheduler and wq flavours

2024-05-07 Thread Tvrtko Ursulin



On 07/05/2024 00:23, Matthew Brost wrote:

On Thu, May 02, 2024 at 03:33:50PM +0100, Tvrtko Ursulin wrote:


Hi all,

Continuing after the brief IRC discussion yesterday regarding work queues
being prone to deadlocks or not, I had a browse around the code base and
ended up a bit confused.

When drm_sched_init documents and allocates an *ordered* wq, if no custom
one was provided, could someone remind me whether the ordered property was
fundamental for something to work correctly? Like run_job vs free_job
ordering?



Before the work queue (kthread design), run_job & free_job were ordered.
It was decided to not break this existing behavior.


Simply for extra paranoia, or do you remember if a reason was identified?


I ask because it appears different drivers do different things and at the
moment it looks like we have all possible combos of ordered/unordered, bound
and unbound, shared or not shared with the timeout wq, or even unbound for
the timeout wq.

The drivers worth looking at in this respect are probably nouveau, panthor,
pvr and xe.

Nouveau also talks about a dependency between run_job and free_job and goes
to create two unordered wqs.

Then xe looks a bit funky with the workaround/hack for lockdep where it
creates 512 work queues and hands them over to user queues in round-robin
fashion. (Instead of the default 1:1.) Which I suspect is a problem which
should be applicable for any 1:1 driver given a thorough enough test suite.


I think lockdep ran out of chains or something when executing some wild
IGT with 1:1. Yes, any driver with a wild enough test would likely hit
this lockdep splat too. Using a pool is probably not a bad idea either.


I wonder what is different between that and having a single shared 
unbound queue and letting the kernel manage the concurrency? Both this..



So anyway.. ordered vs unordered - drm sched dictated or at driver's choice?



Default ordered, driver can override with unordered.


.. and this, go back to my original question - whether the default queue 
must be ordered or not, and under which circumstances drivers can choose 
unordered. I think in drm_sched_init, where the kerneldoc says it will 
create an ordered queue, it would be good to document the rules.
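
To make the distinction concrete, a minimal sketch of the two workqueue
flavours being compared; this is illustrative only, not taken from any of
the drivers mentioned, and the names are made up:

#include <linux/errno.h>
#include <linux/workqueue.h>

/* Hypothetical driver-side queues, for illustration only. */
static struct workqueue_struct *example_submit_wq;
static struct workqueue_struct *example_timeout_wq;

static int example_create_sched_wqs(void)
{
	/* Ordered: work items execute strictly one at a time, in queueing order. */
	example_submit_wq = alloc_ordered_workqueue("example-sched-submit", 0);
	if (!example_submit_wq)
		return -ENOMEM;

	/* Unbound: concurrency is managed by the core and work is not tied to
	 * the queueing CPU. */
	example_timeout_wq = alloc_workqueue("example-sched-timeout", WQ_UNBOUND, 0);
	if (!example_timeout_wq) {
		destroy_workqueue(example_submit_wq);
		return -ENOMEM;
	}

	return 0;
}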


Regards,

Tvrtko


Re: [PATCH] Documentation/gpu: Document the situation with unqualified drm-memory-

2024-05-03 Thread Tvrtko Ursulin



On 03/05/2024 16:58, Alex Deucher wrote:

On Fri, May 3, 2024 at 11:33 AM Daniel Vetter  wrote:


On Fri, May 03, 2024 at 01:58:38PM +0100, Tvrtko Ursulin wrote:


[And I forgot dri-devel.. doing well!]

On 03/05/2024 13:40, Tvrtko Ursulin wrote:


[Correcting Christian's email]

On 03/05/2024 13:36, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Currently it is not well defined what is drm-memory- compared to other
categories.

In practice the only driver which emits these keys is amdgpu and in them
exposes the total memory use (including shared).

Document that drm-memory- and drm-total-memory- are aliases to
prevent any
confusion in the future.

While at it also clarify that the reserved sub-string 'memory' refers to
the memory region component.

Signed-off-by: Tvrtko Ursulin 
Cc: Alex Deucher 
Cc: Christian König 


Mea culpa, I copied the mistake from
77d17c4cd0bf52eacfad88e63e8932eb45d643c5. :)

Regards,

Tvrtko


Cc: Rob Clark 
---
   Documentation/gpu/drm-usage-stats.rst | 10 +-
   1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/gpu/drm-usage-stats.rst
b/Documentation/gpu/drm-usage-stats.rst
index 6dc299343b48..ef5c0a0aa477 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -128,7 +128,9 @@ Memory
   Each possible memory type which can be used to store buffer
objects by the
   GPU in question shall be given a stable and unique name to be
returned as the
-string here.  The name "memory" is reserved to refer to normal
system memory.
+string here.
+
+The region name "memory" is reserved to refer to normal system memory.
   Value shall reflect the amount of storage currently consumed by
the buffer
   objects belong to this client, in the respective memory region.
@@ -136,6 +138,9 @@ objects belong to this client, in the respective
memory region.
   Default unit shall be bytes with optional unit specifiers of 'KiB'
or 'MiB'
   indicating kibi- or mebi-bytes.
+This is an alias for drm-total- and only one of the two
should be
+present.


This feels a bit awkward and seems to needlessly complicate fdinfo uapi.

- Could we just patch amdgpu to follow everyone else, and avoid the
   special case? If there's no tool that relies on the special amdgpu
   prefix then that would be a lot easier.

- If that's not on the table, could we make everyone (with a suitable
   helper or something) just print both variants, so that we again have
   consistent fdinfo output? Or does that break a different set of existing
   tools?

- Finally maybe could we get away with fixing amd by adding the common
   format there, deprecating the old, fixing the tools that would break and
   then maybe if we're lucky, remove the old one from amdgpu in a year or
   so?


I'm not really understanding what amdgpu is doing wrong.  It seems to
be following the documentation.  Is the idea that we would like to
deprecate drm-memory- in favor of drm-total-?
If that's the case, I think the 3rd option is probably the best.  We
have a lot of tools and customers using this.  It would have also been
nice to have "memory" in the string for the newer ones to avoid
conflicts with other things that might be a total or shared in the
future, but I guess that ship has sailed.  We should also note that
drm-memory- is deprecated.  While we are here, maybe we should
clarify the semantics of resident, purgeable, and active.  For
example, isn't resident just a duplicate of total?  If the memory was
not resident, it would be in a different region.


Amdgpu isn't doing anything wrong. It just appears that when the format was 
discussed no one noticed (me included) that the two keys are not clearly 
described. And it looks like there also wasn't a plan for how to handle the 
unclear duality in the future.


For me deprecating sounds fine, the 3rd option. I understand we would 
only make amdgpu emit both sets of keys and then remove drm-memory- in 
due time.


With regards to key naming, yeah, memory in the name would have been 
nice. We had a lot of discussion on this topic but the ship has indeed 
sailed. It is probably workable for anything new that might come along to 
add their own prefix. As long as it does not clash with the memory 
categories it should be fine.


In terms of resident semantics, think of it as VIRT vs RES in top(1). It 
is for drivers which allocate backing store lazily, on first use.


Purgeable is for drivers which have a form of MADV_DONTNEED ie. 
currently have backing store but userspace has indicated it can be 
dropped without preserving the content on memory pressure.


Active is when the reservation object says there is activity on the buffer.
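
As a rough sketch of how those categories nest (illustrative only, not
driver code; the boolean inputs stand in for whatever per-buffer state a
driver actually tracks):

#include <linux/types.h>

struct example_region_stats {
	u64 total;	/* all buffer objects, think VIRT in top(1) */
	u64 resident;	/* subset with backing store, think RES */
	u64 purgeable;	/* resident but marked as droppable under pressure */
	u64 active;	/* resident and busy per the reservation object */
};

static void example_account_bo(struct example_region_stats *s, u64 size,
			       bool has_backing_store, bool madv_dontneed,
			       bool resv_busy)
{
	s->total += size;

	if (!has_backing_store)
		return;	/* counts towards total only, like VIRT without RES */

	s->resident += size;
	if (madv_dontneed)
		s->purgeable += size;
	if (resv_busy)
		s->active += size;
}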

Regards,

Tvrtko



Alex



Uapi that's "either do $foo or on this one driver, do $bar" is just
guaranteed to fragement the ecosystem, so imo that should be the absolute
last resort.
-Sima


+
   - drm-shared-:  [KiB|MiB]
   The total size of buffers that are shared with another file (e.g.,
have more

Re: [PATCH] Documentation/gpu: Document the situation with unqualified drm-memory-

2024-05-03 Thread Tvrtko Ursulin



On 03/05/2024 14:39, Alex Deucher wrote:

On Fri, May 3, 2024 at 8:58 AM Tvrtko Ursulin  wrote:



[And I forgot dri-devel.. doing well!]

On 03/05/2024 13:40, Tvrtko Ursulin wrote:


[Correcting Christian's email]

On 03/05/2024 13:36, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Currently it is not well defined what is drm-memory- compared to other
categories.

In practice the only driver which emits these keys is amdgpu and in them
exposes the total memory use (including shared).

Document that drm-memory- and drm-total-memory- are aliases to prevent
any
confusion in the future.

While at it also clarify that the reserved sub-string 'memory' refers to
the memory region component.

Signed-off-by: Tvrtko Ursulin 
Cc: Alex Deucher 
Cc: Christian König 


Mea culpa, I copied the mistake from
77d17c4cd0bf52eacfad88e63e8932eb45d643c5. :)



I'm not following.  What is the mistake from that commit?


Just the spelling of Christian's last name in the email address, nothing 
in the code itself. I failed to spot that when copying the email for the 
git commit, and also failed to cc dri-devel, so I am having a bad day.


Regards,

Tvrtko




Regards,

Tvrtko


Cc: Rob Clark 
---
   Documentation/gpu/drm-usage-stats.rst | 10 +-
   1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/gpu/drm-usage-stats.rst
b/Documentation/gpu/drm-usage-stats.rst
index 6dc299343b48..ef5c0a0aa477 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -128,7 +128,9 @@ Memory
   Each possible memory type which can be used to store buffer objects
by the
   GPU in question shall be given a stable and unique name to be
returned as the
-string here.  The name "memory" is reserved to refer to normal system
memory.
+string here.
+
+The region name "memory" is reserved to refer to normal system memory.


Is this supposed to mean drm-memory-memory?  That was my impression,
but that seems sort of weird.  Maybe we should just drop that
sentence.

Alex


   Value shall reflect the amount of storage currently consumed by the
buffer
   objects belong to this client, in the respective memory region.
@@ -136,6 +138,9 @@ objects belong to this client, in the respective
memory region.
   Default unit shall be bytes with optional unit specifiers of 'KiB'
or 'MiB'
   indicating kibi- or mebi-bytes.
+This is an alias for drm-total- and only one of the two
should be
+present.
+
   - drm-shared-:  [KiB|MiB]
   The total size of buffers that are shared with another file (e.g.,
have more
@@ -145,6 +150,9 @@ than a single handle).
   The total size of buffers that including shared and private memory.
+This is an alias for drm-memory- and only one of the two
should be
+present.
+
   - drm-resident-:  [KiB|MiB]
   The total size of buffers that are resident in the specified region.


Re: [PATCH] Documentation/gpu: Document the situation with unqualified drm-memory-

2024-05-03 Thread Tvrtko Ursulin



[And I forgot dri-devel.. doing well!]

On 03/05/2024 13:40, Tvrtko Ursulin wrote:


[Correcting Christian's email]

On 03/05/2024 13:36, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Currently it is not well defined what is drm-memory- compared to other
categories.

In practice the only driver which emits these keys is amdgpu and in them
exposes the total memory use (including shared).

Document that drm-memory- and drm-total-memory- are aliases to prevent 
any

confusion in the future.

While at it also clarify that the reserved sub-string 'memory' refers to
the memory region component.

Signed-off-by: Tvrtko Ursulin 
Cc: Alex Deucher 
Cc: Christian König 


Mea culpa, I copied the mistake from 
77d17c4cd0bf52eacfad88e63e8932eb45d643c5. :)


Regards,

Tvrtko


Cc: Rob Clark 
---
  Documentation/gpu/drm-usage-stats.rst | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/gpu/drm-usage-stats.rst 
b/Documentation/gpu/drm-usage-stats.rst

index 6dc299343b48..ef5c0a0aa477 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -128,7 +128,9 @@ Memory
  Each possible memory type which can be used to store buffer objects 
by the
  GPU in question shall be given a stable and unique name to be 
returned as the
-string here.  The name "memory" is reserved to refer to normal system 
memory.

+string here.
+
+The region name "memory" is reserved to refer to normal system memory.
  Value shall reflect the amount of storage currently consumed by the 
buffer

  objects belong to this client, in the respective memory region.
@@ -136,6 +138,9 @@ objects belong to this client, in the respective 
memory region.
  Default unit shall be bytes with optional unit specifiers of 'KiB' 
or 'MiB'

  indicating kibi- or mebi-bytes.
+This is an alias for drm-total- and only one of the two 
should be

+present.
+
  - drm-shared-:  [KiB|MiB]
  The total size of buffers that are shared with another file (e.g., 
have more

@@ -145,6 +150,9 @@ than a single handle).
  The total size of buffers that including shared and private memory.
+This is an alias for drm-memory- and only one of the two 
should be

+present.
+
  - drm-resident-:  [KiB|MiB]
  The total size of buffers that are resident in the specified region.


drm scheduler and wq flavours

2024-05-02 Thread Tvrtko Ursulin



Hi all,

Continuing after the brief IRC discussion yesterday regarding work 
queues being prone to deadlocks or not, I had a browse around the code 
base and ended up a bit confused.


When drm_sched_init documents and allocates an *ordered* wq, if no 
custom one was provided, could someone remind me whether the ordered 
property was fundamental for something to work correctly? Like run_job vs 
free_job ordering?


I ask because it appears different drivers do different things and at 
the moment it looks like we have all possible combos of ordered/unordered, 
bound and unbound, shared or not shared with the timeout wq, or even 
unbound for the timeout wq.


The drivers worth looking at in this respect are probably nouveau, 
panthor, pvr and xe.


Nouveau also talks about a dependency between run_job and free_job and 
goes to create two unordered wqs.


Then xe looks a bit funky with the workaround/hack for lockdep where it 
creates 512 work queues and hands them over to user queues in 
round-robin fashion. (Instead of the default 1:1.) Which I suspect is a 
problem which should be applicable for any 1:1 driver given a thorough 
enough test suite.


So anyway.. ordered vs unordered - drm sched dictated or at driver's choice?

Regards,

Tvrtko


Re: [PATCH] drm/sysfs: Add drm class-wide attribute to get active device clients

2024-05-01 Thread Tvrtko Ursulin



Hi,

On 24/04/2024 15:48, Adrián Larumbe wrote:

Hi Tvrtko,

On 15.04.2024 13:50, Tvrtko Ursulin wrote:


On 05/04/2024 18:59, Rob Clark wrote:

On Wed, Apr 3, 2024 at 11:37 AM Adrián Larumbe
 wrote:


Up to this day, all fdinfo-based GPU profilers must traverse the entire
/proc directory structure to find open DRM clients with fdinfo file
descriptors. This is inefficient and time-consuming.

This patch adds a new device class attribute that will install a sysfs file
per DRM device, which can be queried by profilers to get a list of PIDs for
their open clients. This file isn't human-readable, and it's meant to be
queried only by GPU profilers like gputop and nvtop.

Cc: Boris Brezillon 
Cc: Tvrtko Ursulin 
Cc: Christopher Healy 
Signed-off-by: Adrián Larumbe 


It does seem like a good idea.. idk if there is some precedent to
prefer binary vs ascii in sysfs, but having a way to avoid walking
_all_ processes is a good idea.


I naturally second that it is a needed feature, but I do not think binary
format is justified. AFAIR it should be used for things like hw/fw
standardised tables or firmware images, not when exporting a simple list of
PIDs. It also precludes easy shell/script access and the benefit of avoiding
parsing a short list is I suspect completely dwarfed by needing to parse all
the related fdinfo etc.


I'd rather keep it as a binary file for the sake of easily parsing the number
list on the client side, in gputop or nvtop. For textual access, there's already
a debugfs file that presents the same information, so I thought it was best not
to duplicate that functionality and restrict sysfs to serving the very specific
use case of UM profilers having to access the DRM client list.

I should mention I did something controversial here, which is a semantically
binary attribute through the regular attribute interface. I guess if I keep it
as a binary attribute in the end, I should switch over to the binary attribute
API.

Another reason why I implemented it as a binary file is that we can only send
back at most a whole page. If a PID takes 4 bytes, that's usually 1024 clients
at most, which is probably enough for any UM profiler, but will decrease even
more if we turn it into an ASCII readable file.


I'm afraid I still think there is no reason for a binary file, even less 
so artificially limited to 1024 clients. Any consumer will have to parse 
text fdinfo so a binary list of pids is not adding any real cost.
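
For comparison, parsing a hypothetical newline-separated text attribute is
trivial for any consumer which already parses text fdinfo; the sysfs path
below is made up for the example:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/class/drm/renderD128/clients", "r");
	int pid;

	if (!f)
		return 1;

	while (fscanf(f, "%d", &pid) == 1)
		printf("open DRM client pid %d\n", pid);

	fclose(f);
	return 0;
}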



I did some research into sysfs binary attributes, and while some sources 
mention that
it's often used for dumping or loading of driver FW, none of them claim it 
cannot
be used for other purposes.


---
   drivers/gpu/drm/drm_internal.h   |  2 +-
   drivers/gpu/drm/drm_privacy_screen.c |  2 +-
   drivers/gpu/drm/drm_sysfs.c  | 89 ++--
   3 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/drm_internal.h b/drivers/gpu/drm/drm_internal.h
index 2215baef9a3e..9a399b03d11c 100644
--- a/drivers/gpu/drm/drm_internal.h
+++ b/drivers/gpu/drm/drm_internal.h
@@ -145,7 +145,7 @@ bool drm_master_internal_acquire(struct drm_device *dev);
   void drm_master_internal_release(struct drm_device *dev);

   /* drm_sysfs.c */
-extern struct class *drm_class;
+extern struct class drm_class;

   int drm_sysfs_init(void);
   void drm_sysfs_destroy(void);
diff --git a/drivers/gpu/drm/drm_privacy_screen.c 
b/drivers/gpu/drm/drm_privacy_screen.c
index 6cc39e30781f..2fbd24ba5818 100644
--- a/drivers/gpu/drm/drm_privacy_screen.c
+++ b/drivers/gpu/drm/drm_privacy_screen.c
@@ -401,7 +401,7 @@ struct drm_privacy_screen *drm_privacy_screen_register(
   mutex_init(&priv->lock);
   BLOCKING_INIT_NOTIFIER_HEAD(&priv->notifier_head);

-   priv->dev.class = drm_class;
+   priv->dev.class = &drm_class;
   priv->dev.type = &drm_privacy_screen_type;
  priv->dev.parent = parent;
  priv->dev.release = drm_privacy_screen_device_release;
diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index a953f69a34b6..56ca9e22c720 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -58,8 +58,6 @@ static struct device_type drm_sysfs_device_connector = {
  .name = "drm_connector",
   };

-struct class *drm_class;
-
   #ifdef CONFIG_ACPI
   static bool drm_connector_acpi_bus_match(struct device *dev)
   {
@@ -128,6 +126,62 @@ static const struct component_ops typec_connector_ops = {

   static CLASS_ATTR_STRING(version, S_IRUGO, "drm 1.1.0 20060810");

+static ssize_t clients_show(struct device *cd, struct device_attribute *attr, 
char *buf)
+{
+   struct drm_minor *minor = cd->driver_data;
+   struct drm_device *ddev = minor->dev;
+   struct drm_file *priv;
+   ssize_t offset = 0;
+   void *pid_buf;
+
+   if (minor->type != DRM_MINOR_RENDER)
+   return 0;


Why this?


I return nothing in case of a non-render node be

Re: [PATCH v4 8/8] drm/v3d: Add modparam for turning off Big/Super Pages

2024-04-29 Thread Tvrtko Ursulin



On 28/04/2024 13:40, Maíra Canal wrote:

Add a modparam for turning off Big/Super Pages to make sure that if a
user doesn't want Big/Super Pages enabled, they can disable them by setting
the modparam to false.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_drv.c   | 7 +++
  drivers/gpu/drm/v3d/v3d_gemfs.c | 5 +
  2 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index 28b7ddce7747..1a6e01235df6 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -36,6 +36,13 @@
  #define DRIVER_MINOR 0
  #define DRIVER_PATCHLEVEL 0
  
+/* Only expose the `super_pages` modparam if THP is enabled. */

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+bool super_pages = true;
+module_param_named(super_pages, super_pages, bool, 0400);
+MODULE_PARM_DESC(super_pages, "Enable/Disable Super Pages support.");
+#endif
+
  static int v3d_get_param_ioctl(struct drm_device *dev, void *data,
   struct drm_file *file_priv)
  {
diff --git a/drivers/gpu/drm/v3d/v3d_gemfs.c b/drivers/gpu/drm/v3d/v3d_gemfs.c
index 31cf5bd11e39..0ade02bb7209 100644
--- a/drivers/gpu/drm/v3d/v3d_gemfs.c
+++ b/drivers/gpu/drm/v3d/v3d_gemfs.c
@@ -11,6 +11,7 @@ void v3d_gemfs_init(struct v3d_dev *v3d)
char huge_opt[] = "huge=within_size";
struct file_system_type *type;
struct vfsmount *gemfs;
+   extern bool super_pages;
  
  	/*

 * By creating our own shmemfs mountpoint, we can pass in
@@ -20,6 +21,10 @@ void v3d_gemfs_init(struct v3d_dev *v3d)
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
goto err;
  
+	/* The user doesn't want to enable Super Pages */

+   if (!super_pages)
+   goto err;
+
type = get_fs_type("tmpfs");
if (!type)
    goto err;


Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


Re: [PATCH v2 3/6] drm/xe: Add helper to accumulate exec queue runtime

2024-04-29 Thread Tvrtko Ursulin



On 26/04/2024 19:59, Umesh Nerlige Ramappa wrote:

On Fri, Apr 26, 2024 at 11:49:32AM +0100, Tvrtko Ursulin wrote:


On 24/04/2024 00:56, Lucas De Marchi wrote:

From: Umesh Nerlige Ramappa 

Add a helper to accumulate per-client runtime of all its
exec queues. Currently that is done in 2 places:

1. when the exec_queue is destroyed
2. when the sched job is completed

Signed-off-by: Umesh Nerlige Ramappa 
Signed-off-by: Lucas De Marchi 
---
 drivers/gpu/drm/xe/xe_device_types.h |  9 +++
 drivers/gpu/drm/xe/xe_exec_queue.c   | 37 
 drivers/gpu/drm/xe/xe_exec_queue.h   |  1 +
 drivers/gpu/drm/xe/xe_sched_job.c    |  2 ++
 4 files changed, 49 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
b/drivers/gpu/drm/xe/xe_device_types.h

index 2e62450d86e1..33d3bf93a2f1 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -547,6 +547,15 @@ struct xe_file {
 struct mutex lock;
 } exec_queue;
+    /**
+ * @runtime: hw engine class runtime in ticks for this drm client
+ *
+ * Only stats from xe_exec_queue->lrc[0] are accumulated. For 
multi-lrc
+ * case, since all jobs run in parallel on the engines, only the 
stats

+ * from lrc[0] are sufficient.


Out of curiosity, doesn't this mean multi-lrc jobs will be incorrectly 
accounted for? (When capacity is considered.)


TBH, I am not sure what the user would like to see here for multi-lrc. 
If reporting the capacity, then we may need to use width as a 
multiplication factor for multi-lrc. How was this done in i915?


IMO the user has to see the real utilisation - so if there are two VCS and 
both are busy, 100% should be reported and not 50%. The latter would be 
misleading, either with or without cross-checking against physical utilisation.


In i915 with execlists this works correctly and with GuC you would 
probably know the answer better than me.


Regards,

Tvrtko



Regards,
Umesh




Regards,

Tvrtko


+ */
+    u64 runtime[XE_ENGINE_CLASS_MAX];
+
 /** @client: drm client */
 struct xe_drm_client *client;
 };
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c 
b/drivers/gpu/drm/xe/xe_exec_queue.c

index 395de93579fa..b7b6256cb96a 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -214,6 +214,8 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
 {
 int i;
+    xe_exec_queue_update_runtime(q);
+
 for (i = 0; i < q->width; ++i)
 xe_lrc_finish(q->lrc + i);
 if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & 
EXEC_QUEUE_FLAG_VM || !q->vm))

@@ -769,6 +771,41 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
 q->lrc[0].fence_ctx.next_seqno - 1;
 }
+/**
+ * xe_exec_queue_update_runtime() - Update runtime for this exec 
queue from hw

+ * @q: The exec queue
+ *
+ * Update the timestamp saved by HW for this exec queue and save 
runtime
+ * calculated by using the delta from last update. On multi-lrc 
case, only the

+ * first is considered.
+ */
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
+{
+    struct xe_file *xef;
+    struct xe_lrc *lrc;
+    u32 old_ts, new_ts;
+
+    /*
+ * Jobs that are run during driver load may use an exec_queue, 
but are
+ * not associated with a user xe file, so avoid accumulating 
busyness

+ * for kernel specific work.
+ */
+    if (!q->vm || !q->vm->xef)
+    return;
+
+    xef = q->vm->xef;
+    lrc = &q->lrc[0];
+
+    new_ts = xe_lrc_update_timestamp(lrc, &old_ts);
+
+    /*
+ * Special case the very first timestamp: we don't want the
+ * initial delta to be a huge value
+ */
+    if (old_ts)
+    xef->runtime[q->class] += new_ts - old_ts;
+}
+
 void xe_exec_queue_kill(struct xe_exec_queue *q)
 {
 struct xe_exec_queue *eq = q, *next;
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h 
b/drivers/gpu/drm/xe/xe_exec_queue.h

index 02ce8d204622..45b72daa2db3 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue.h
@@ -66,5 +66,6 @@ struct dma_fence 
*xe_exec_queue_last_fence_get(struct xe_exec_queue *e,

    struct xe_vm *vm);
 void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct 
xe_vm *vm,

   struct dma_fence *fence);
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_sched_job.c 
b/drivers/gpu/drm/xe/xe_sched_job.c

index cd8a2fba5438..6a081a4fa190 100644
--- a/drivers/gpu/drm/xe/xe_sched_job.c
+++ b/drivers/gpu/drm/xe/xe_sched_job.c
@@ -242,6 +242,8 @@ bool xe_sched_job_completed(struct xe_sched_job 
*job)

 {
 struct xe_lrc *lrc = job->q->lrc;
+    xe_exec_queue_update_runtime(job->q);
+
 /*
  * Can safely check just LRC[0] seqno as that is last seqno 
written when

  * parallel handshake is done.


Re: [PATCH] MAINTAINERS: Move the drm-intel repo location to fd.o GitLab

2024-04-26 Thread Tvrtko Ursulin




On 26/04/2024 16:47, Lucas De Marchi wrote:

On Wed, Apr 24, 2024 at 01:41:59PM GMT, Ryszard Knop wrote:

The drm-intel repo is moving from the classic fd.o git host to GitLab.
Update its location with a URL matching other fd.o GitLab kernel trees.

Signed-off-by: Ryszard Knop 


Acked-by: Lucas De Marchi 

Also Cc'ing maintainers


Thanks,

Acked-by: Tvrtko Ursulin 

Regards,

Tvrtko


---
MAINTAINERS | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index d6327dc12cb1..fbf7371a0bb0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10854,7 +10854,7 @@ W:
https://drm.pages.freedesktop.org/intel-docs/

Q:    http://patchwork.freedesktop.org/project/intel-gfx/
B:
https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html

C:    irc://irc.oftc.net/intel-gfx
-T:    git git://anongit.freedesktop.org/drm-intel
+T:    git https://gitlab.freedesktop.org/drm/i915/kernel.git
F:    Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon
F:    Documentation/gpu/i915.rst
F:    drivers/gpu/drm/ci/xfails/i915*
--
2.44.0



Re: [PATCH v2 3/6] drm/xe: Add helper to accumulate exec queue runtime

2024-04-26 Thread Tvrtko Ursulin



On 24/04/2024 00:56, Lucas De Marchi wrote:

From: Umesh Nerlige Ramappa 

Add a helper to accumulate per-client runtime of all its
exec queues. Currently that is done in 2 places:

1. when the exec_queue is destroyed
2. when the sched job is completed

Signed-off-by: Umesh Nerlige Ramappa 
Signed-off-by: Lucas De Marchi 
---
  drivers/gpu/drm/xe/xe_device_types.h |  9 +++
  drivers/gpu/drm/xe/xe_exec_queue.c   | 37 
  drivers/gpu/drm/xe/xe_exec_queue.h   |  1 +
  drivers/gpu/drm/xe/xe_sched_job.c|  2 ++
  4 files changed, 49 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
b/drivers/gpu/drm/xe/xe_device_types.h
index 2e62450d86e1..33d3bf93a2f1 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -547,6 +547,15 @@ struct xe_file {
struct mutex lock;
} exec_queue;
  
+	/**

+* @runtime: hw engine class runtime in ticks for this drm client
+*
+* Only stats from xe_exec_queue->lrc[0] are accumulated. For multi-lrc
+* case, since all jobs run in parallel on the engines, only the stats
+* from lrc[0] are sufficient.


Out of curiosity, doesn't this mean multi-lrc jobs will be incorrectly 
accounted for? (When capacity is considered.)


Regards,

Tvrtko


+*/
+   u64 runtime[XE_ENGINE_CLASS_MAX];
+
/** @client: drm client */
struct xe_drm_client *client;
  };
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c 
b/drivers/gpu/drm/xe/xe_exec_queue.c
index 395de93579fa..b7b6256cb96a 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -214,6 +214,8 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
  {
int i;
  
+	xe_exec_queue_update_runtime(q);

+
for (i = 0; i < q->width; ++i)
xe_lrc_finish(q->lrc + i);
if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & 
EXEC_QUEUE_FLAG_VM || !q->vm))
@@ -769,6 +771,41 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
q->lrc[0].fence_ctx.next_seqno - 1;
  }
  
+/**

+ * xe_exec_queue_update_runtime() - Update runtime for this exec queue from hw
+ * @q: The exec queue
+ *
+ * Update the timestamp saved by HW for this exec queue and save runtime
+ * calculated by using the delta from last update. On multi-lrc case, only the
+ * first is considered.
+ */
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
+{
+   struct xe_file *xef;
+   struct xe_lrc *lrc;
+   u32 old_ts, new_ts;
+
+   /*
+* Jobs that are run during driver load may use an exec_queue, but are
+* not associated with a user xe file, so avoid accumulating busyness
+* for kernel specific work.
+*/
+   if (!q->vm || !q->vm->xef)
+   return;
+
+   xef = q->vm->xef;
+   lrc = &q->lrc[0];
+
+   new_ts = xe_lrc_update_timestamp(lrc, &old_ts);
+
+   /*
+* Special case the very first timestamp: we don't want the
+* initial delta to be a huge value
+*/
+   if (old_ts)
+   xef->runtime[q->class] += new_ts - old_ts;
+}
+
  void xe_exec_queue_kill(struct xe_exec_queue *q)
  {
struct xe_exec_queue *eq = q, *next;
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h 
b/drivers/gpu/drm/xe/xe_exec_queue.h
index 02ce8d204622..45b72daa2db3 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue.h
@@ -66,5 +66,6 @@ struct dma_fence *xe_exec_queue_last_fence_get(struct 
xe_exec_queue *e,
   struct xe_vm *vm);
  void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct xe_vm *vm,
  struct dma_fence *fence);
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q);
  
  #endif

diff --git a/drivers/gpu/drm/xe/xe_sched_job.c 
b/drivers/gpu/drm/xe/xe_sched_job.c
index cd8a2fba5438..6a081a4fa190 100644
--- a/drivers/gpu/drm/xe/xe_sched_job.c
+++ b/drivers/gpu/drm/xe/xe_sched_job.c
@@ -242,6 +242,8 @@ bool xe_sched_job_completed(struct xe_sched_job *job)
  {
struct xe_lrc *lrc = job->q->lrc;
  
+	xe_exec_queue_update_runtime(job->q);

+
/*
 * Can safely check just LRC[0] seqno as that is last seqno written when
 * parallel handshake is done.


Re: [PATCH v2 6/6] drm/xe/client: Print runtime to fdinfo

2024-04-26 Thread Tvrtko Ursulin



On 24/04/2024 00:56, Lucas De Marchi wrote:

Print the accumulated runtime for client when printing fdinfo.
Each time a query is done it first does 2 things:

1) loop through all the exec queues for the current client and
accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
that, being read from the context image.

2) Read a "GPU timestamp" that can be used for considering "how much GPU
time has passed" and that has the same unit/refclock as the one
recording the runtime. RING_TIMESTAMP is used for that via MMIO.

Since for all current platforms RING_TIMESTAMP follows the same
refclock, just read it once, using any first engine.

This is exported to userspace as 2 numbers in fdinfo:

drm-cycles-: 
drm-total-cycles-: 

Userspace is expected to collect at least 2 samples, which allows to
know the client engine busyness as per:

               RUNTIME1 - RUNTIME0
   busyness = ---------------------
                     T1 - T0

Another thing to point out is that it's expected that userspace will
read any 2 samples every few seconds.  Given the update frequency of the
counters involved and that CTX_TIMESTAMP is 32-bits, the counter for
each exec_queue can wrap around (assuming 100% utilization) after ~200s.
The wraparound is not perceived by userspace since it's just accumulated
for all the exec_queues in a 64-bit counter), but the measurement will
not be accurate if the samples are too far apart.

This could be mitigated by adding a workqueue to accumulate the counters
every so often, but it's additional complexity for something that is
done already by userspace every few seconds in tools like gputop (from
igt), htop, nvtop, etc with none of them really defaulting to 1 sample
per minute or more.

Signed-off-by: Lucas De Marchi 
---
  Documentation/gpu/drm-usage-stats.rst   |  16 ++-
  Documentation/gpu/xe/index.rst  |   1 +
  Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
  drivers/gpu/drm/xe/xe_drm_client.c  | 138 +++-
  4 files changed, 162 insertions(+), 3 deletions(-)
  create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst

diff --git a/Documentation/gpu/drm-usage-stats.rst 
b/Documentation/gpu/drm-usage-stats.rst
index 6dc299343b48..421766289b78 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -112,6 +112,17 @@ larger value within a reasonable period. Upon observing a 
value lower than what
  was previously read, userspace is expected to stay with that larger previous
  value until a monotonic update is seen.
  
+- drm-total-cycles-: 

+
+Engine identifier string must be the same as the one specified in the
+drm-cycles- tag and shall contain the total number cycles for the given
+engine.
+
+This is a timestamp in GPU unspecified unit that matches the update rate
+of drm-cycles-. For drivers that implement this interface, the engine
+utilization can be calculated entirely on the GPU clock domain, without
+considering the CPU sleep time between 2 samples.


Two opens.

1)
Do we need to explicity document that drm-total-cycles and drm-maxfreq 
are mutually exclusive?


2)
Should drm-total-cycles for now be documented as driver specific?

I have added some more people to the cc who were involved with driver 
fdinfo implementations, in case they have an opinion.


I would say potentially yes, and promote it to common if more than one 
driver would use it.


For instance I see panfrost has the driver specific drm-curfreq 
(although isn't documenting it fully in panfrost.rst). And I have to say 
it is somewhat questionable to expose the current frequency per fdinfo 
per engine but not my call.
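
As an aside, for anyone reading along, the sampling scheme described in 
the commit message boils down to something like the below userspace 
sketch. Names are illustrative only and parsing of the two fdinfo keys 
into the struct is assumed to happen elsewhere:

#include <stdint.h>

struct fdinfo_sample {
	uint64_t cycles;	/* drm-cycles-<engine> */
	uint64_t total_cycles;	/* drm-total-cycles-<engine> */
};

/* Busyness in percent between two samples of the same fdinfo file. */
static double engine_busyness(const struct fdinfo_sample *s0,
			      const struct fdinfo_sample *s1)
{
	uint64_t d_cycles = s1->cycles - s0->cycles;
	uint64_t d_total = s1->total_cycles - s0->total_cycles;

	if (!d_total)
		return 0.0;

	return 100.0 * (double)d_cycles / (double)d_total;
}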



+
  - drm-maxfreq-:  [Hz|MHz|KHz]
  
  Engine identifier string must be the same as the one specified in the

@@ -168,5 +179,6 @@ be documented above and where possible, aligned with other 
drivers.
  Driver specific implementations
  ---
  
-:ref:`i915-usage-stats`

-:ref:`panfrost-usage-stats`
+* :ref:`i915-usage-stats`
+* :ref:`panfrost-usage-stats`
+* :ref:`xe-usage-stats`
diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index c224ecaee81e..3f07aa3b5432 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver.
 xe_firmware
 xe_tile
 xe_debugging
+   xe-drm-usage-stats.rst
diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst 
b/Documentation/gpu/xe/xe-drm-usage-stats.rst
new file mode 100644
index ..ccb48733cbe3
--- /dev/null
+++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+.. _xe-usage-stats:
+
+===
+Xe DRM client usage stats implemenation
+===
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c
+   :doc: DRM Client usage stats
diff --git 

Re: [PATCH v3 8/8] drm/v3d: Add modparam for turning off Big/Super Pages

2024-04-22 Thread Tvrtko Ursulin



On 21/04/2024 22:44, Maíra Canal wrote:

Add a modparam for turning off Big/Super Pages to make sure that if an
user doesn't want Big/Super Pages enabled, it can disabled it by setting
the modparam to false.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_drv.c   | 8 
  drivers/gpu/drm/v3d/v3d_gemfs.c | 5 +
  2 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index 3debf37e7d9b..bc8c8905112a 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -36,6 +36,14 @@
  #define DRIVER_MINOR 0
  #define DRIVER_PATCHLEVEL 0
  
+bool super_pages = true;

+
+/* Only expose the `super_pages` modparam if THP is enabled. */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE


I would have put bool super_pages in here so it can get compiled out.


+module_param_named(super_pages, super_pages, bool, 0400);
+MODULE_PARM_DESC(super_pages, "Enable/Disable Super Pages support.");
+#endif
+
  static int v3d_get_param_ioctl(struct drm_device *dev, void *data,
   struct drm_file *file_priv)
  {
diff --git a/drivers/gpu/drm/v3d/v3d_gemfs.c b/drivers/gpu/drm/v3d/v3d_gemfs.c
index 31cf5bd11e39..5fa08263cff2 100644
--- a/drivers/gpu/drm/v3d/v3d_gemfs.c
+++ b/drivers/gpu/drm/v3d/v3d_gemfs.c
@@ -11,6 +11,11 @@ void v3d_gemfs_init(struct v3d_dev *v3d)
char huge_opt[] = "huge=within_size";
struct file_system_type *type;
struct vfsmount *gemfs;
+   extern bool super_pages;
+
+   /* The user doesn't want to enable Super Pages */
+   if (!super_pages)
+   goto err;


And if this hunk is moved after the CONFIG_TRANSPARENT_HUGEPAGE check 
just below, I hope the compiler will be happy with that.
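
Roughly what I have in mind, as a sketch only, assuming the flag gets 
exposed via v3d_drv.h instead of the local extern in v3d_gemfs.c:

/* v3d_drv.h - sketch, real naming/placement may differ */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern bool super_pages;
#else
#define super_pages false	/* lets the compiler drop the checks entirely */
#endif

/* v3d_drv.c */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
bool super_pages = true;
module_param_named(super_pages, super_pages, bool, 0400);
MODULE_PARM_DESC(super_pages, "Enable/Disable Super Pages support.");
#endif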


Regards,

Tvrtko

  
  	/*

 * By creating our own shmemfs mountpoint, we can pass in


Re: [PATCH v3 7/8] drm/v3d: Use gemfs/THP in BO creation if available

2024-04-22 Thread Tvrtko Ursulin



On 21/04/2024 22:44, Maíra Canal wrote:

Although Big/Super Pages could appear naturally, it would be quite hard
to have 1MB or 64KB allocated contiguously naturally. Therefore, we can
force the creation of large pages allocated contiguously by using a
mountpoint with "huge=within_size" enabled.

As V3D has a mountpoint with "huge=within_size" (if user has THP enabled),
use this mountpoint for BO creation if available. This will allow us to create
large pages allocated contiguously and make use of Big/Super Pages.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_bo.c | 21 +++--
  1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_bo.c b/drivers/gpu/drm/v3d/v3d_bo.c
index 79e31c5299b1..16ac26c31c6b 100644
--- a/drivers/gpu/drm/v3d/v3d_bo.c
+++ b/drivers/gpu/drm/v3d/v3d_bo.c
@@ -94,6 +94,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
struct v3d_dev *v3d = to_v3d_dev(obj->dev);
struct v3d_bo *bo = to_v3d_bo(obj);
struct sg_table *sgt;
+   u64 align;
int ret;
  
  	/* So far we pin the BO in the MMU for its lifetime, so use

@@ -103,6 +104,15 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
if (IS_ERR(sgt))
return PTR_ERR(sgt);
  
+	if (!v3d->gemfs)

+   align = SZ_4K;
+   else if (obj->size >= SZ_1M)
+   align = SZ_1M;
+   else if (obj->size >= SZ_64K)
+   align = SZ_64K;
+   else
+   align = SZ_4K;


V3d has one GPU address space, right? I wonder if one day fragmentation 
could become an issue but it's a problem for another day. Patch looks 
fine to me.


Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


+
	spin_lock(&v3d->mm_lock);
/* Allocate the object's space in the GPU's page tables.
 * Inserting PTEs will happen later, but the offset is for the
@@ -110,7 +120,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
 */
	ret = drm_mm_insert_node_generic(&v3d->mm, &bo->node,
 obj->size >> V3D_MMU_PAGE_SHIFT,
-SZ_4K >> V3D_MMU_PAGE_SHIFT, 0, 0);
+align >> V3D_MMU_PAGE_SHIFT, 0, 0);
	spin_unlock(&v3d->mm_lock);
if (ret)
return ret;
@@ -130,10 +140,17 @@ struct v3d_bo *v3d_bo_create(struct drm_device *dev, 
struct drm_file *file_priv,
 size_t unaligned_size)
  {
struct drm_gem_shmem_object *shmem_obj;
+   struct v3d_dev *v3d = to_v3d_dev(dev);
struct v3d_bo *bo;
int ret;
  
-	shmem_obj = drm_gem_shmem_create(dev, unaligned_size);

+   /* Let the user opt out of allocating the BOs with THP */
+   if (v3d->gemfs)
+   shmem_obj = drm_gem_shmem_create_with_mnt(dev, unaligned_size,
+ v3d->gemfs);
+   else
+   shmem_obj = drm_gem_shmem_create(dev, unaligned_size);
+
if (IS_ERR(shmem_obj))
return ERR_CAST(shmem_obj);
	bo = to_v3d_bo(&shmem_obj->base);


Re: [PATCH v3 6/8] drm/v3d: Support Big/Super Pages when writing out PTEs

2024-04-22 Thread Tvrtko Ursulin



On 22/04/2024 10:57, Tvrtko Ursulin wrote:


On 21/04/2024 22:44, Maíra Canal wrote:

The V3D MMU also supports 64KB and 1MB pages, called big and super pages,
respectively. In order to set a 64KB page or 1MB page in the MMU, we need
to make sure that page table entries for all 4KB pages within a big/super
page must be correctly configured.

In order to create a big/super page, we need a contiguous memory region.
That's why we use a separate mountpoint with THP enabled. In order to
place the page table entries in the MMU, we iterate over the 16 4KB pages
(for big pages) or 256 4KB pages (for super pages) and insert the PTE.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_drv.h |  1 +
  drivers/gpu/drm/v3d/v3d_mmu.c | 52 ++-
  2 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.h 
b/drivers/gpu/drm/v3d/v3d_drv.h

index 17236ee23490..79d8a1a059aa 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -18,6 +18,7 @@ struct platform_device;
  struct reset_control;
  #define V3D_MMU_PAGE_SHIFT 12
+#define V3D_PAGE_FACTOR (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT)
  #define V3D_MAX_QUEUES (V3D_CPU + 1)
diff --git a/drivers/gpu/drm/v3d/v3d_mmu.c 
b/drivers/gpu/drm/v3d/v3d_mmu.c

index 14f3af40d6f6..2e0b31e373b2 100644
--- a/drivers/gpu/drm/v3d/v3d_mmu.c
+++ b/drivers/gpu/drm/v3d/v3d_mmu.c
@@ -25,9 +25,16 @@
   * superpage bit set.
   */
  #define V3D_PTE_SUPERPAGE BIT(31)
+#define V3D_PTE_BIGPAGE BIT(30)
  #define V3D_PTE_WRITEABLE BIT(29)
  #define V3D_PTE_VALID BIT(28)
+static bool v3d_mmu_is_aligned(u32 page, u32 page_address, size_t 
alignment)

+{
+    return IS_ALIGNED(page, alignment >> V3D_MMU_PAGE_SHIFT) &&
+    IS_ALIGNED(page_address, alignment >> V3D_MMU_PAGE_SHIFT);
+}
+
  static int v3d_mmu_flush_all(struct v3d_dev *v3d)
  {
  int ret;
@@ -87,19 +94,38 @@ void v3d_mmu_insert_ptes(struct v3d_bo *bo)
  struct drm_gem_shmem_object *shmem_obj = &bo->base;
  struct v3d_dev *v3d = to_v3d_dev(shmem_obj->base.dev);
  u32 page = bo->node.start;
-    u32 page_prot = V3D_PTE_WRITEABLE | V3D_PTE_VALID;
-    struct sg_dma_page_iter dma_iter;
-
-    for_each_sgtable_dma_page(shmem_obj->sgt, &dma_iter, 0) {
-    dma_addr_t dma_addr = sg_page_iter_dma_address(&dma_iter);
-    u32 page_address = dma_addr >> V3D_MMU_PAGE_SHIFT;
-    u32 pte = page_prot | page_address;
-    u32 i;
-
-    BUG_ON(page_address + (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT) >=
-   BIT(24));
-    for (i = 0; i < PAGE_SIZE >> V3D_MMU_PAGE_SHIFT; i++)
-    v3d->pt[page++] = pte + i;
+    struct scatterlist *sgl;
+    unsigned int count;
+
+    for_each_sgtable_dma_sg(shmem_obj->sgt, sgl, count) {
+    dma_addr_t dma_addr = sg_dma_address(sgl);
+    u32 pfn = dma_addr >> V3D_MMU_PAGE_SHIFT;
+    unsigned int len = sg_dma_len(sgl);
+
+    while (len > 0) {
+    u32 page_prot = V3D_PTE_WRITEABLE | V3D_PTE_VALID;
+    u32 page_address = page_prot | pfn;
+    unsigned int i, page_size;
+
+    BUG_ON(pfn + V3D_PAGE_FACTOR >= BIT(24));
+
+    if (len >= SZ_1M && v3d_mmu_is_aligned(page, 
page_address, SZ_1M)) {

+    page_size = SZ_1M;
+    page_address |= V3D_PTE_SUPERPAGE;
+    } else if (len >= SZ_64K && v3d_mmu_is_aligned(page, 
page_address, SZ_64K)) {

+    page_size = SZ_64K;
+    page_address |= V3D_PTE_BIGPAGE;
+    } else {
+    page_size = SZ_4K;
+    }
+
+    for (i = 0; i < page_size >> V3D_MMU_PAGE_SHIFT; i++) {
+    v3d->pt[page++] = page_address + i;
+    pfn++;
+    }
+
+    len -= page_size;
+    }
  }
  WARN_ON_ONCE(page - bo->node.start !=


It looks correct to me.

Reviewed-by: Tvrtko Ursulin 


Ooops muscle memory strikes again! I guess reviewing patches for 10+ 
years can do that.. :)


Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


Re: [PATCH v3 6/8] drm/v3d: Support Big/Super Pages when writing out PTEs

2024-04-22 Thread Tvrtko Ursulin



On 21/04/2024 22:44, Maíra Canal wrote:

The V3D MMU also supports 64KB and 1MB pages, called big and super pages,
respectively. In order to set a 64KB page or 1MB page in the MMU, we need
to make sure that page table entries for all 4KB pages within a big/super
page must be correctly configured.

In order to create a big/super page, we need a contiguous memory region.
That's why we use a separate mountpoint with THP enabled. In order to
place the page table entries in the MMU, we iterate over the 16 4KB pages
(for big pages) or 256 4KB pages (for super pages) and insert the PTE.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_drv.h |  1 +
  drivers/gpu/drm/v3d/v3d_mmu.c | 52 ++-
  2 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h
index 17236ee23490..79d8a1a059aa 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -18,6 +18,7 @@ struct platform_device;
  struct reset_control;
  
  #define V3D_MMU_PAGE_SHIFT 12

+#define V3D_PAGE_FACTOR (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT)
  
  #define V3D_MAX_QUEUES (V3D_CPU + 1)
  
diff --git a/drivers/gpu/drm/v3d/v3d_mmu.c b/drivers/gpu/drm/v3d/v3d_mmu.c

index 14f3af40d6f6..2e0b31e373b2 100644
--- a/drivers/gpu/drm/v3d/v3d_mmu.c
+++ b/drivers/gpu/drm/v3d/v3d_mmu.c
@@ -25,9 +25,16 @@
   * superpage bit set.
   */
  #define V3D_PTE_SUPERPAGE BIT(31)
+#define V3D_PTE_BIGPAGE BIT(30)
  #define V3D_PTE_WRITEABLE BIT(29)
  #define V3D_PTE_VALID BIT(28)
  
+static bool v3d_mmu_is_aligned(u32 page, u32 page_address, size_t alignment)

+{
+   return IS_ALIGNED(page, alignment >> V3D_MMU_PAGE_SHIFT) &&
+   IS_ALIGNED(page_address, alignment >> V3D_MMU_PAGE_SHIFT);
+}
+
  static int v3d_mmu_flush_all(struct v3d_dev *v3d)
  {
int ret;
@@ -87,19 +94,38 @@ void v3d_mmu_insert_ptes(struct v3d_bo *bo)
	struct drm_gem_shmem_object *shmem_obj = &bo->base;
struct v3d_dev *v3d = to_v3d_dev(shmem_obj->base.dev);
u32 page = bo->node.start;
-   u32 page_prot = V3D_PTE_WRITEABLE | V3D_PTE_VALID;
-   struct sg_dma_page_iter dma_iter;
-
-   for_each_sgtable_dma_page(shmem_obj->sgt, &dma_iter, 0) {
-   dma_addr_t dma_addr = sg_page_iter_dma_address(&dma_iter);
-   u32 page_address = dma_addr >> V3D_MMU_PAGE_SHIFT;
-   u32 pte = page_prot | page_address;
-   u32 i;
-
-   BUG_ON(page_address + (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT) >=
-  BIT(24));
-   for (i = 0; i < PAGE_SIZE >> V3D_MMU_PAGE_SHIFT; i++)
-   v3d->pt[page++] = pte + i;
+   struct scatterlist *sgl;
+   unsigned int count;
+
+   for_each_sgtable_dma_sg(shmem_obj->sgt, sgl, count) {
+   dma_addr_t dma_addr = sg_dma_address(sgl);
+   u32 pfn = dma_addr >> V3D_MMU_PAGE_SHIFT;
+   unsigned int len = sg_dma_len(sgl);
+
+   while (len > 0) {
+   u32 page_prot = V3D_PTE_WRITEABLE | V3D_PTE_VALID;
+   u32 page_address = page_prot | pfn;
+   unsigned int i, page_size;
+
+   BUG_ON(pfn + V3D_PAGE_FACTOR >= BIT(24));
+
+   if (len >= SZ_1M && v3d_mmu_is_aligned(page, 
page_address, SZ_1M)) {
+   page_size = SZ_1M;
+   page_address |= V3D_PTE_SUPERPAGE;
+   } else if (len >= SZ_64K && v3d_mmu_is_aligned(page, 
page_address, SZ_64K)) {
+   page_size = SZ_64K;
+   page_address |= V3D_PTE_BIGPAGE;
+   } else {
+   page_size = SZ_4K;
+   }
+
+   for (i = 0; i < page_size >> V3D_MMU_PAGE_SHIFT; i++) {
+   v3d->pt[page++] = page_address + i;
+   pfn++;
+   }
+
+   len -= page_size;
+       }
}
  
  	WARN_ON_ONCE(page - bo->node.start !=


It looks correct to me.

Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


Re: [PATCH v3 4/5] drm/v3d: Decouple stats calculation from printing

2024-04-22 Thread Tvrtko Ursulin



On 20/04/2024 22:32, Maíra Canal wrote:

Create a function to decouple the stats calculation from the printing.
This will be useful in the next step when we add a seqcount to protect
the stats.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_drv.c   | 18 ++
  drivers/gpu/drm/v3d/v3d_drv.h   |  4 
  drivers/gpu/drm/v3d/v3d_sysfs.c | 11 +++
  3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index 52e3ba9df46f..2ec359ed2def 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -142,6 +142,15 @@ v3d_postclose(struct drm_device *dev, struct drm_file 
*file)
kfree(v3d_priv);
  }
  
+void v3d_get_stats(const struct v3d_stats *stats, u64 timestamp,

+  u64 *active_runtime, u64 *jobs_completed)
+{
+   *active_runtime = stats->enabled_ns;
+   if (stats->start_ns)
+   *active_runtime += timestamp - stats->start_ns;
+   *jobs_completed = stats->jobs_completed;
+}
+
  static void v3d_show_fdinfo(struct drm_printer *p, struct drm_file *file)
  {
struct v3d_file_priv *file_priv = file->driver_priv;
@@ -150,20 +159,21 @@ static void v3d_show_fdinfo(struct drm_printer *p, struct 
drm_file *file)
  
  	for (queue = 0; queue < V3D_MAX_QUEUES; queue++) {

		struct v3d_stats *stats = &file_priv->stats[queue];
+   u64 active_runtime, jobs_completed;
+
+   v3d_get_stats(stats, timestamp, &active_runtime, &jobs_completed);
  
  		/* Note that, in case of a GPU reset, the time spent during an

 * attempt of executing the job is not computed in the runtime.
 */
drm_printf(p, "drm-engine-%s: \t%llu ns\n",
-  v3d_queue_to_string(queue),
-  stats->start_ns ? stats->enabled_ns + timestamp - 
stats->start_ns
-  : stats->enabled_ns);
+  v3d_queue_to_string(queue), active_runtime);
  
  		/* Note that we only count jobs that completed. Therefore, jobs

 * that were resubmitted due to a GPU reset are not computed.
 */
drm_printf(p, "v3d-jobs-%s: \t%llu jobs\n",
-  v3d_queue_to_string(queue), stats->jobs_completed);
+  v3d_queue_to_string(queue), jobs_completed);
}
  }
  
diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h

index 5a198924d568..ff06dc1cc078 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -510,6 +510,10 @@ struct drm_gem_object *v3d_prime_import_sg_table(struct 
drm_device *dev,
  /* v3d_debugfs.c */
  void v3d_debugfs_init(struct drm_minor *minor);
  
+/* v3d_drv.c */

+void v3d_get_stats(const struct v3d_stats *stats, u64 timestamp,
+  u64 *active_runtime, u64 *jobs_completed);
+
  /* v3d_fence.c */
  extern const struct dma_fence_ops v3d_fence_ops;
  struct dma_fence *v3d_fence_create(struct v3d_dev *v3d, enum v3d_queue queue);
diff --git a/drivers/gpu/drm/v3d/v3d_sysfs.c b/drivers/gpu/drm/v3d/v3d_sysfs.c
index 6a8e7acc8b82..d610e355964f 100644
--- a/drivers/gpu/drm/v3d/v3d_sysfs.c
+++ b/drivers/gpu/drm/v3d/v3d_sysfs.c
@@ -15,18 +15,15 @@ gpu_stats_show(struct device *dev, struct device_attribute 
*attr, char *buf)
struct v3d_dev *v3d = to_v3d_dev(drm);
enum v3d_queue queue;
u64 timestamp = local_clock();
-   u64 active_runtime;
ssize_t len = 0;
  
  	len += sysfs_emit(buf, "queue\ttimestamp\tjobs\truntime\n");
  
  	for (queue = 0; queue < V3D_MAX_QUEUES; queue++) {

		struct v3d_stats *stats = &v3d->queue[queue].stats;
+   u64 active_runtime, jobs_completed;
  
-		if (stats->start_ns)

-   active_runtime = timestamp - stats->start_ns;
-   else
-   active_runtime = 0;
+   v3d_get_stats(stats, timestamp, &active_runtime, &jobs_completed);
  
  		/* Each line will display the queue name, timestamp, the number

 * of jobs sent to that queue and the runtime, as can be seem 
here:
@@ -40,9 +37,7 @@ gpu_stats_show(struct device *dev, struct device_attribute 
*attr, char *buf)
 */
len += sysfs_emit_at(buf, len, "%s\t%llu\t%llu\t%llu\n",
 v3d_queue_to_string(queue),
-timestamp,
-stats->jobs_completed,
-stats->enabled_ns + active_runtime);
+    timestamp, jobs_completed, active_runtime);
}
  
  	return len;


Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


Re: [PATCH v2 4/4] drm/v3d: Fix race-condition between sysfs/fdinfo and interrupt handler

2024-04-17 Thread Tvrtko Ursulin



On 17/04/2024 01:53, Maíra Canal wrote:

In V3D, the conclusion of a job is indicated by a IRQ. When a job
finishes, then we update the local and the global GPU stats of that
queue. But, while the GPU stats are being updated, a user might be
reading the stats from sysfs or fdinfo.

For example, on `gpu_stats_show()`, we could think about a scenario where
`v3d->queue[queue].start_ns != 0`, then an interruption happens, we update


interrupt


the value of `v3d->queue[queue].start_ns` to 0, we come back to
`gpu_stats_show()` to calculate `active_runtime` and now,
`active_runtime = timestamp`.

In this simple example, the user would see a spike in the queue usage,
that didn't matches reality.


match


In order to address this issue properly, use a seqcount to protect read
and write sections of the code.

Fixes: 09a93cc4f7d1 ("drm/v3d: Implement show_fdinfo() callback for GPU usage 
stats")
Reported-by: Tvrtko Ursulin 
Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_drv.c   | 10 ++
  drivers/gpu/drm/v3d/v3d_drv.h   | 21 +
  drivers/gpu/drm/v3d/v3d_gem.c   |  7 +--
  drivers/gpu/drm/v3d/v3d_sched.c |  7 +++
  drivers/gpu/drm/v3d/v3d_sysfs.c | 11 +++
  5 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index 52e3ba9df46f..cf15fa142968 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -121,6 +121,7 @@ v3d_open(struct drm_device *dev, struct drm_file *file)
  1, NULL);
  
 		memset(&v3d_priv->stats[i], 0, sizeof(v3d_priv->stats[i]));

+   seqcount_init(&v3d_priv->stats[i].lock);
}
  
  	v3d_perfmon_open_file(v3d_priv);

@@ -150,20 +151,21 @@ static void v3d_show_fdinfo(struct drm_printer *p, struct 
drm_file *file)
  
  	for (queue = 0; queue < V3D_MAX_QUEUES; queue++) {

		struct v3d_stats *stats = &file_priv->stats[queue];
+   u64 active_runtime, jobs_completed;
+
+   v3d_get_stats(stats, timestamp, &active_runtime, &jobs_completed);
  
  		/* Note that, in case of a GPU reset, the time spent during an

 * attempt of executing the job is not computed in the runtime.
 */
drm_printf(p, "drm-engine-%s: \t%llu ns\n",
-  v3d_queue_to_string(queue),
-  stats->start_ns ? stats->enabled_ns + timestamp - 
stats->start_ns
-  : stats->enabled_ns);
+  v3d_queue_to_string(queue), active_runtime);
  
  		/* Note that we only count jobs that completed. Therefore, jobs

 * that were resubmitted due to a GPU reset are not computed.
 */
drm_printf(p, "v3d-jobs-%s: \t%llu jobs\n",
-  v3d_queue_to_string(queue), stats->jobs_completed);
+  v3d_queue_to_string(queue), jobs_completed);
}
  }
  
diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h

index 5a198924d568..5211df7c7317 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -40,8 +40,29 @@ struct v3d_stats {
u64 start_ns;
u64 enabled_ns;
u64 jobs_completed;
+
+   /*
+* This seqcount is used to protect the access to the GPU stats
+* variables. It must be used as, while we are reading the stats,
+* IRQs can happen and the stats can be updated.
+*/
+   seqcount_t lock;
  };
  
+static inline void v3d_get_stats(const struct v3d_stats *stats, u64 timestamp,

+u64 *active_runtime, u64 *jobs_completed)
+{
+   unsigned int seq;
+
+   do {
+   seq = read_seqcount_begin(&stats->lock);
+   *active_runtime = stats->enabled_ns;
+   if (stats->start_ns)
+   *active_runtime += timestamp - stats->start_ns;
+   *jobs_completed = stats->jobs_completed;
+   } while (read_seqcount_retry(&stats->lock, seq));
+}


Patch reads clean and obviously correct to me.

Reviewed-by: Tvrtko Ursulin 

The only possible discussion point I see is whether v3d_get_stats could 
have been introduced first to avoid mixing pure refactors with 
functionality, and whether it deserves to be in a header, or could be a 
function call in v3d_drv.c just as well. No strong opinion from me, 
since it is your driver, your preference.


Regards,

Tvrtko


+
  struct v3d_queue_state {
struct drm_gpu_scheduler sched;
  
diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c

index d14589d3ae6c..da8faf3b9011 100644
--- a/drivers/gpu/drm/v3d/v3d_gem.c
+++ b/drivers/gpu/drm/v3d/v3d_gem.c
@@ -247,8 +247,11 @@ v3d_gem_init(struct drm_device *dev)
int ret, i;
  
  	for (i = 0; i < V3D_MAX_QUEUES; i++) {

-

Re: Proposal to add CRIU support to DRM render nodes

2024-04-16 Thread Tvrtko Ursulin



On 01/04/2024 18:58, Felix Kuehling wrote:


On 2024-04-01 12:56, Tvrtko Ursulin wrote:


On 01/04/2024 17:37, Felix Kuehling wrote:

On 2024-04-01 11:09, Tvrtko Ursulin wrote:


On 28/03/2024 20:42, Felix Kuehling wrote:


On 2024-03-28 12:03, Tvrtko Ursulin wrote:


Hi Felix,

I had one more thought while browsing around the amdgpu CRIU 
plugin. It appears it relies on the KFD support being compiled in 
and /dev/kfd present, correct? AFAICT at least, it relies on that 
to figure out the amdgpu DRM node.


It would probably be good to consider designing things without 
that dependency, so that checkpointing an application which does 
not use /dev/kfd is possible, or so it works even if the kernel 
does not have the KFD support compiled in.


Yeah, if we want to support graphics apps that don't use KFD, we 
should definitely do that. Currently we get a lot of topology 
information from KFD, not even from the /dev/kfd device but from 
the sysfs nodes exposed by KFD. We'd need to get GPU device info 
from the render nodes instead. And if KFD is available, we may need 
to integrate both sources of information.





It could perhaps mean no more than adding some GPU discovery code 
into CRIU. Which shuold be flexible enough to account for things 
like re-assigned minor numbers due driver reload.


Do you mean adding GPU discovery to the core CRIU, or to the 
plugin. I was thinking this is still part of the plugin.


Yes I agree. I was only thinking about adding some DRM device 
discovery code in a more decoupled fashion from the current plugin, 
for both the reason discussed above (decoupling a bit from reliance 
on kfd sysfs), and then also if/when a new DRM driver might want to 
implement this the code could be move to some common plugin area.


I am not sure how feasible that would be though. The "gpu id" concept 
and its matching in the current kernel code and CRIU plugin - is that 
value tied to the physical GPU instance, or how does it work?


The concept of the GPU ID is that it's stable while the system is up, 
even when devices get added and removed dynamically. It was baked 
into the API early on, but I don't think we ever fully validated 
device hot plug. I think the closest we're getting is with our latest 
MI GPUs and dynamic partition mode change.


Doesn't it read the saved gpu id from the image file while doing 
restore and tries to open the render node to match it? Maybe I am 
misreading the code.. But if it does, does it imply that in practice 
it could be stable across reboots? Or that it is not possible to 
restore to a different instance of maybe the same GPU model installed 
in a system?


Ah, the idea is, that when you restore on a different system, you may 
get different GPU IDs. Or you may checkpoint an app running on GPU 1 but 
restore it on GPU 2 on the same system. That's why we need to translate 
GPU IDs in restored applications. User mode still uses the old GPU IDs, 
but the kernel mode driver translates them to the actual GPU IDs of the 
GPUs that the process was restored on.


I see.. I think. Normal flow is ppd->user_gpu_id set during client 
init, but for restored clients it gets overridden during restore so 
that any further ioctls do not instantly fail.


And then in amdgpu_plugin_restore_file, when it is opening the render 
node, it relies on the kfd topology to have filled in (more or less) the 
target_gpu_id corresponding to the render node gpu id of the target GPU 
- the one associated with the new kfd gpu_id?


I am digging into this because I am trying to see if some part of GPU 
discovery could somehow be decoupled, so I could offer to work on at 
least that until you start to tackle the main body of the feature. But 
it looks properly tangled up.


Do you have any suggestions as to what I could help with? Maybe 
developing some sort of drm device enumeration library, if you see a 
way that would be useful in decoupling the device discovery from kfd. 
We would need to define what sort of information you would need to be 
queryable from it.
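
Even something as small as the below userspace sketch could be a 
starting point for the discovery side. Illustrative only - a real 
library would also need to expose things like PCI identity, driver 
name and whatever else you currently pull out of the kfd topology:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* List /dev/dri/renderD* nodes as a trivial first discovery step. */
static void list_render_nodes(void)
{
	struct dirent *ent;
	DIR *dir = opendir("/dev/dri");

	if (!dir)
		return;

	while ((ent = readdir(dir)) != NULL) {
		if (!strncmp(ent->d_name, "renderD", 7))
			printf("/dev/dri/%s\n", ent->d_name);
	}

	closedir(dir);
}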


This also highlights another aspect on those spatially partitioned 
GPUs. GPU IDs identify device partitions, not devices. Similarly, 
each partition has its own render node, and the KFD topology info in 
sysfs points to the render-minor number corresponding to each GPU ID.


I am not familiar with this. This is not SR-IOV but some other kind of 
partitioning? Would you have any links where I could read more?


Right, the bare-metal driver can partition a PF spatially without SRIOV. 
SRIOV can also use spatial partitioning and expose each partition 
through its own VF, but that's not useful for bare metal. Spatial 
partitioning is new in MI300. There is some high-level info in this 
whitepaper: 
https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.


From the outside (userspace) this looks simply like multiple DRM render 
nodes or how exactly?


Regards,

Tvrtko




Re: [PATCH 1/5] drm/v3d: Don't increment `enabled_ns` twice

2024-04-16 Thread Tvrtko Ursulin



Hi,

On 15/04/2024 12:17, Chema Casanova wrote:

El 3/4/24 a las 22:24, Maíra Canal escribió:
The commit 509433d8146c ("drm/v3d: Expose the total GPU usage stats on 
sysfs")

introduced the calculation of global GPU stats. For the regards, it used
the already existing infrastructure provided by commit 09a93cc4f7d1 
("drm/v3d:

Implement show_fdinfo() callback for GPU usage stats"). While adding
global GPU stats calculation ability, the author forgot to delete the
existing one.

Currently, the value of `enabled_ns` is incremented twice by the end of
the job, when it should be added just once. Therefore, delete the
leftovers from commit 509433d8146c ("drm/v3d: Expose the total GPU usage
stats on sysfs").

Fixes: 509433d8146c ("drm/v3d: Expose the total GPU usage stats on 
sysfs")

Reported-by: Tvrtko Ursulin 
Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_irq.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_irq.c 
b/drivers/gpu/drm/v3d/v3d_irq.c

index 2e04f6cb661e..ce6b2fb341d1 100644
--- a/drivers/gpu/drm/v3d/v3d_irq.c
+++ b/drivers/gpu/drm/v3d/v3d_irq.c
@@ -105,7 +105,6 @@ v3d_irq(int irq, void *arg)
  struct v3d_file_priv *file = 
v3d->bin_job->base.file->driver_priv;

  u64 runtime = local_clock() - file->start_ns[V3D_BIN];
-    file->enabled_ns[V3D_BIN] += local_clock() - 
file->start_ns[V3D_BIN];

  file->jobs_sent[V3D_BIN]++;
  v3d->queue[V3D_BIN].jobs_sent++;
@@ -126,7 +125,6 @@ v3d_irq(int irq, void *arg)
  struct v3d_file_priv *file = 
v3d->render_job->base.file->driver_priv;

  u64 runtime = local_clock() - file->start_ns[V3D_RENDER];
-    file->enabled_ns[V3D_RENDER] += local_clock() - 
file->start_ns[V3D_RENDER];

  file->jobs_sent[V3D_RENDER]++;
  v3d->queue[V3D_RENDER].jobs_sent++;
@@ -147,7 +145,6 @@ v3d_irq(int irq, void *arg)
  struct v3d_file_priv *file = 
v3d->csd_job->base.file->driver_priv;

  u64 runtime = local_clock() - file->start_ns[V3D_CSD];
-    file->enabled_ns[V3D_CSD] += local_clock() - 
file->start_ns[V3D_CSD];

  file->jobs_sent[V3D_CSD]++;
  v3d->queue[V3D_CSD].jobs_sent++;
@@ -195,7 +192,6 @@ v3d_hub_irq(int irq, void *arg)
  struct v3d_file_priv *file = 
v3d->tfu_job->base.file->driver_priv;

  u64 runtime = local_clock() - file->start_ns[V3D_TFU];
-    file->enabled_ns[V3D_TFU] += local_clock() - 
file->start_ns[V3D_TFU];

  file->jobs_sent[V3D_TFU]++;
  v3d->queue[V3D_TFU].jobs_sent++;


Thanks for fixing this. I see that I included this error in my first 
refactoring of

the original patch.


Not sure if it would be worth creating a simple test like 
https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/commit/2f81ed3aed873c7cc2f6d0e1117fa4fb02033246 
for i915? Just a thought.


Regards,

Tvrtko


Re: [PATCH] drm/sysfs: Add drm class-wide attribute to get active device clients

2024-04-15 Thread Tvrtko Ursulin



On 05/04/2024 18:59, Rob Clark wrote:

On Wed, Apr 3, 2024 at 11:37 AM Adrián Larumbe
 wrote:


Up to this day, all fdinfo-based GPU profilers must traverse the entire
/proc directory structure to find open DRM clients with fdinfo file
descriptors. This is inefficient and time-consuming.

This patch adds a new device class attribute that will install a sysfs file
per DRM device, which can be queried by profilers to get a list of PIDs for
their open clients. This file isn't human-readable, and it's meant to be
queried only by GPU profilers like gputop and nvtop.

Cc: Boris Brezillon 
Cc: Tvrtko Ursulin 
Cc: Christopher Healy 
Signed-off-by: Adrián Larumbe 


It does seem like a good idea.. idk if there is some precedent to
prefer binary vs ascii in sysfs, but having a way to avoid walking
_all_ processes is a good idea.


I naturally second that it is a needed feature, but I do not think 
binary format is justified. AFAIR it should be used for things like 
hw/fw standardised tables or firmware images, not when exporting a 
simple list of PIDs. It also precludes easy shell/script access and the 
benefit of avoiding parsing a short list is I suspect completely dwarfed 
by needing to parse all the related fdinfo etc.



---
  drivers/gpu/drm/drm_internal.h   |  2 +-
  drivers/gpu/drm/drm_privacy_screen.c |  2 +-
  drivers/gpu/drm/drm_sysfs.c  | 89 ++--
  3 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/drm_internal.h b/drivers/gpu/drm/drm_internal.h
index 2215baef9a3e..9a399b03d11c 100644
--- a/drivers/gpu/drm/drm_internal.h
+++ b/drivers/gpu/drm/drm_internal.h
@@ -145,7 +145,7 @@ bool drm_master_internal_acquire(struct drm_device *dev);
  void drm_master_internal_release(struct drm_device *dev);

  /* drm_sysfs.c */
-extern struct class *drm_class;
+extern struct class drm_class;

  int drm_sysfs_init(void);
  void drm_sysfs_destroy(void);
diff --git a/drivers/gpu/drm/drm_privacy_screen.c 
b/drivers/gpu/drm/drm_privacy_screen.c
index 6cc39e30781f..2fbd24ba5818 100644
--- a/drivers/gpu/drm/drm_privacy_screen.c
+++ b/drivers/gpu/drm/drm_privacy_screen.c
@@ -401,7 +401,7 @@ struct drm_privacy_screen *drm_privacy_screen_register(
 mutex_init(&priv->lock);
 BLOCKING_INIT_NOTIFIER_HEAD(&priv->notifier_head);

-   priv->dev.class = drm_class;
+   priv->dev.class = &drm_class;
 priv->dev.type = _privacy_screen_type;
 priv->dev.parent = parent;
 priv->dev.release = drm_privacy_screen_device_release;
diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index a953f69a34b6..56ca9e22c720 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -58,8 +58,6 @@ static struct device_type drm_sysfs_device_connector = {
 .name = "drm_connector",
  };

-struct class *drm_class;
-
  #ifdef CONFIG_ACPI
  static bool drm_connector_acpi_bus_match(struct device *dev)
  {
@@ -128,6 +126,62 @@ static const struct component_ops typec_connector_ops = {

  static CLASS_ATTR_STRING(version, S_IRUGO, "drm 1.1.0 20060810");

+static ssize_t clients_show(struct device *cd, struct device_attribute *attr, 
char *buf)
+{
+   struct drm_minor *minor = cd->driver_data;
+   struct drm_device *ddev = minor->dev;
+   struct drm_file *priv;
+   ssize_t offset = 0;
+   void *pid_buf;
+
+   if (minor->type != DRM_MINOR_RENDER)
+   return 0;


Why this?


+
+   pid_buf = kvmalloc(PAGE_SIZE, GFP_KERNEL);


I don't quite get the kvmalloc for just one page (or why even a 
temporary buffer and not writing into buf directly?).



+   if (!pid_buf)
+   return 0;
+
+   mutex_lock(&ddev->filelist_mutex);
+   list_for_each_entry_reverse(priv, &ddev->filelist, lhead) {
+   struct pid *pid;
+
+   if (drm_WARN_ON(ddev, (PAGE_SIZE - offset) < sizeof(pid_t)))
+   break;


Feels bad.. I would suggest exploring implementing a read callback 
(instead of show) and handling arbitrary size output.



+
+   rcu_read_lock();
+   pid = rcu_dereference(priv->pid);
+   (*(pid_t *)(pid_buf + offset)) = pid_vnr(pid);
+   rcu_read_unlock();
+
+   offset += sizeof(pid_t);
+   }
+   mutex_unlock(&ddev->filelist_mutex);
+
+   if (offset < PAGE_SIZE)
+   (*(pid_t *)(pid_buf + offset)) = 0;


Either NULL terminated or PAGE_SIZE/sizeof(pid) entries and not NULL 
terminated feels weird. If I got that right.


For me everything points towards going for text output.
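
To illustrate, the text flavour could be as simple as the below sketch. 
Untested, and still bound by the usual one page sysfs limit, which is 
why I mentioned exploring a read callback above:

static ssize_t clients_show(struct device *cd, struct device_attribute *attr,
			    char *buf)
{
	struct drm_minor *minor = cd->driver_data;
	struct drm_device *ddev = minor->dev;
	struct drm_file *priv;
	ssize_t len = 0;

	mutex_lock(&ddev->filelist_mutex);
	list_for_each_entry_reverse(priv, &ddev->filelist, lhead) {
		struct pid *pid;

		rcu_read_lock();
		pid = rcu_dereference(priv->pid);
		/* One decimal PID per line, written straight into the sysfs buffer. */
		len += sysfs_emit_at(buf, len, "%d\n", pid_vnr(pid));
		rcu_read_unlock();
	}
	mutex_unlock(&ddev->filelist_mutex);

	return len;
}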


+
+   memcpy(buf, pid_buf, offset);
+
+   kvfree(pid_buf);
+
+   return offset;
+
+}
+static DEVICE_ATTR_RO(clients);


Shouldn't BIN_ATTR_RO be used for binary files in sysfs?

Regards,

Tvrtko

P.S. Or maybe it is time for drmfs? Where each client gets a directory 
and drivers can populate files. Such as per client logging s

Re: [PATCH v2 4/6] drm/gem: Create shmem GEM object in a given mountpoint

2024-04-15 Thread Tvrtko Ursulin



On 05/04/2024 19:29, Maíra Canal wrote:

Create a function `drm_gem_shmem_create_with_mnt()`, similar to
`drm_gem_shmem_create()`, that has a mountpoint as a argument. This
function will create a shmem GEM object in a given tmpfs mountpoint.

This function will be useful for drivers that have a special mountpoint
with flags enabled.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/drm_gem_shmem_helper.c | 30 ++
  include/drm/drm_gem_shmem_helper.h |  3 +++
  2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
b/drivers/gpu/drm/drm_gem_shmem_helper.c
index 13bcdbfd..10b7c4c769a3 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -49,7 +49,8 @@ static const struct drm_gem_object_funcs drm_gem_shmem_funcs 
= {
  };
  
  static struct drm_gem_shmem_object *

-__drm_gem_shmem_create(struct drm_device *dev, size_t size, bool private)
+__drm_gem_shmem_create(struct drm_device *dev, size_t size, bool private,
+  struct vfsmount *gemfs)
  {
struct drm_gem_shmem_object *shmem;
struct drm_gem_object *obj;
@@ -76,7 +77,7 @@ __drm_gem_shmem_create(struct drm_device *dev, size_t size, 
bool private)
drm_gem_private_object_init(dev, obj, size);
shmem->map_wc = false; /* dma-buf mappings use always 
writecombine */
} else {
-   ret = drm_gem_object_init(dev, obj, size);
+   ret = drm_gem_object_init_with_mnt(dev, obj, size, gemfs);
}
if (ret) {
drm_gem_private_object_fini(obj);
@@ -123,10 +124,31 @@ __drm_gem_shmem_create(struct drm_device *dev, size_t 
size, bool private)
   */
  struct drm_gem_shmem_object *drm_gem_shmem_create(struct drm_device *dev, 
size_t size)
  {
-   return __drm_gem_shmem_create(dev, size, false);
+   return __drm_gem_shmem_create(dev, size, false, NULL);
  }
  EXPORT_SYMBOL_GPL(drm_gem_shmem_create);
  
+/**

+ * drm_gem_shmem_create_with_mnt - Allocate an object with the given size in a
+ * given mountpoint
+ * @dev: DRM device
+ * @size: Size of the object to allocate
+ * @gemfs: tmpfs mount where the GEM object will be created
+ *
+ * This function creates a shmem GEM object in a given tmpfs mountpoint.
+ *
+ * Returns:
+ * A struct drm_gem_shmem_object * on success or an ERR_PTR()-encoded negative
+ * error code on failure.
+ */
+struct drm_gem_shmem_object *drm_gem_shmem_create_with_mnt(struct drm_device 
*dev,
+  size_t size,
+  struct vfsmount 
*gemfs)
+{
+   return __drm_gem_shmem_create(dev, size, false, gemfs);
+}
+EXPORT_SYMBOL_GPL(drm_gem_shmem_create_with_mnt);
+
  /**
   * drm_gem_shmem_free - Free resources associated with a shmem GEM object
   * @shmem: shmem GEM object to free
@@ -760,7 +782,7 @@ drm_gem_shmem_prime_import_sg_table(struct drm_device *dev,
size_t size = PAGE_ALIGN(attach->dmabuf->size);
struct drm_gem_shmem_object *shmem;
  
-	shmem = __drm_gem_shmem_create(dev, size, true);

+   shmem = __drm_gem_shmem_create(dev, size, true, NULL);
if (IS_ERR(shmem))
return ERR_CAST(shmem);
  
diff --git a/include/drm/drm_gem_shmem_helper.h b/include/drm/drm_gem_shmem_helper.h

index efbc9f27312b..d22e3fb53631 100644
--- a/include/drm/drm_gem_shmem_helper.h
+++ b/include/drm/drm_gem_shmem_helper.h
@@ -97,6 +97,9 @@ struct drm_gem_shmem_object {
container_of(obj, struct drm_gem_shmem_object, base)
  
  struct drm_gem_shmem_object *drm_gem_shmem_create(struct drm_device *dev, size_t size);

+struct drm_gem_shmem_object *drm_gem_shmem_create_with_mnt(struct drm_device 
*dev,
+  size_t size,
+  struct vfsmount 
*gemfs);
  void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem);
  
  void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem);


Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


Re: [PATCH v2 2/6] drm/gem: Create a drm_gem_object_init_with_mnt() function

2024-04-15 Thread Tvrtko Ursulin



On 05/04/2024 19:29, Maíra Canal wrote:

For some applications, such as applications that uses huge pages, we might
want to have a different mountpoint, for which we pass mount flags that
better match our usecase.

Therefore, create a new function `drm_gem_object_init_with_mnt()` that
allow us to define the tmpfs mountpoint where the GEM object will be
created. If this parameter is NULL, then we fallback to `shmem_file_setup()`.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/drm_gem.c | 34 ++
  include/drm/drm_gem.h |  3 +++
  2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index d4bbc5d109c8..74ebe68e3d61 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -114,22 +114,32 @@ drm_gem_init(struct drm_device *dev)
  }

  /**
- * drm_gem_object_init - initialize an allocated shmem-backed GEM object
+ * drm_gem_object_init_with_mnt - initialize an allocated shmem-backed GEM
+ * object in a given shmfs mountpoint
+ *
   * @dev: drm_device the object should be initialized for
   * @obj: drm_gem_object to initialize
   * @size: object size
+ * @gemfs: tmpfs mount where the GEM object will be created. If NULL, use
+ * the usual tmpfs mountpoint (`shm_mnt`).
   *
   * Initialize an already allocated GEM object of the specified size with
   * shmfs backing store.
   */
-int drm_gem_object_init(struct drm_device *dev,
-   struct drm_gem_object *obj, size_t size)
+int drm_gem_object_init_with_mnt(struct drm_device *dev,
+struct drm_gem_object *obj, size_t size,
+struct vfsmount *gemfs)
  {
struct file *filp;

drm_gem_private_object_init(dev, obj, size);

-   filp = shmem_file_setup("drm mm object", size, VM_NORESERVE);
+   if (gemfs)
+   filp = shmem_file_setup_with_mnt(gemfs, "drm mm object", size,
+VM_NORESERVE);
+   else
+   filp = shmem_file_setup("drm mm object", size, VM_NORESERVE);
+
if (IS_ERR(filp))
return PTR_ERR(filp);

@@ -137,6 +147,22 @@ int drm_gem_object_init(struct drm_device *dev,

return 0;
  }
+EXPORT_SYMBOL(drm_gem_object_init_with_mnt);
+
+/**
+ * drm_gem_object_init - initialize an allocated shmem-backed GEM object
+ * @dev: drm_device the object should be initialized for
+ * @obj: drm_gem_object to initialize
+ * @size: object size
+ *
+ * Initialize an already allocated GEM object of the specified size with
+ * shmfs backing store.
+ */
+int drm_gem_object_init(struct drm_device *dev, struct drm_gem_object *obj,
+   size_t size)
+{
+   return drm_gem_object_init_with_mnt(dev, obj, size, NULL);
+}
  EXPORT_SYMBOL(drm_gem_object_init);


I would be tempted to static inline this one, but see what other people 
think. (One wise kernel legend was once annoyed by trivial wrappers / 
function calls. But some others are then annoyed by static inlines.. so 
dunno.) For either flavour:
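
Purely for illustration, the static inline flavour in 
include/drm/drm_gem.h would amount to no more than:

static inline int drm_gem_object_init(struct drm_device *dev,
				      struct drm_gem_object *obj, size_t size)
{
	return drm_gem_object_init_with_mnt(dev, obj, size, NULL);
}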


Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko



  /**
diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
index bae4865b2101..2ebf6e10cc44 100644
--- a/include/drm/drm_gem.h
+++ b/include/drm/drm_gem.h
@@ -472,6 +472,9 @@ void drm_gem_object_release(struct drm_gem_object *obj);
  void drm_gem_object_free(struct kref *kref);
  int drm_gem_object_init(struct drm_device *dev,
struct drm_gem_object *obj, size_t size);
+int drm_gem_object_init_with_mnt(struct drm_device *dev,
+struct drm_gem_object *obj, size_t size,
+struct vfsmount *gemfs);
  void drm_gem_private_object_init(struct drm_device *dev,
 struct drm_gem_object *obj, size_t size);
  void drm_gem_private_object_fini(struct drm_gem_object *obj);
--
2.44.0




Re: [PATCH 5/5] drm/v3d: Fix race-condition between sysfs/fdinfo and interrupt handler

2024-04-15 Thread Tvrtko Ursulin



On 03/04/2024 21:24, Maíra Canal wrote:

In V3D, the conclusion of a job is indicated by a IRQ. When a job
finishes, then we update the local and the global GPU stats of that
queue. But, while the GPU stats are being updated, a user might be
reading the stats from sysfs or fdinfo.

For example, on `gpu_stats_show()`, we could think about a scenario where
`v3d->queue[queue].start_ns != 0`, then an interruption happens, we update
the value of `v3d->queue[queue].start_ns` to 0, we come back to 
`gpu_stats_show()`
to calculate `active_runtime` and now, `active_runtime = timestamp`.

In this simple example, the user would see a spike in the queue usage,
that didn't matches reality.

In order to address this issue properly, use rw-locks to protect read
and write sections of the code.

Fixes: 09a93cc4f7d1 ("drm/v3d: Implement show_fdinfo() callback for GPU usage 
stats")
Reported-by: Tvrtko Ursulin 
Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_drv.c   | 16 
  drivers/gpu/drm/v3d/v3d_drv.h   |  7 +++
  drivers/gpu/drm/v3d/v3d_gem.c   |  7 +--
  drivers/gpu/drm/v3d/v3d_sched.c |  9 +
  drivers/gpu/drm/v3d/v3d_sysfs.c | 16 
  5 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index cbb62be18aa5..60437718786c 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -119,7 +119,9 @@ v3d_open(struct drm_device *dev, struct drm_file *file)
		drm_sched_entity_init(&v3d_priv->sched_entity[i],
				      DRM_SCHED_PRIORITY_NORMAL, &sched,
  1, NULL);
+


Nitpick - if you want a blank line here probably add it in the patch 
which added the below memset.



 		memset(&v3d_priv->stats[i], 0, sizeof(v3d_priv->stats[i]));
+   rwlock_init(&v3d_priv->stats[i].rw_lock);
}

v3d_perfmon_open_file(v3d_priv);
@@ -149,20 +151,26 @@ static void v3d_show_fdinfo(struct drm_printer *p, struct 
drm_file *file)

for (queue = 0; queue < V3D_MAX_QUEUES; queue++) {
		struct v3d_stats *stats = &file_priv->stats[queue];
+   u64 active_time, jobs_sent;
+   unsigned long flags;
+
+   read_lock_irqsave(&stats->rw_lock, flags);


The context is never irq/bh here so you can use read_lock_irq.

However on the topic of lock type chosen, I think sort of established 
wisdom is that rwlocks are overkill for such short locked sections. More 
so, optimizing for multiple concurrent readers is not a huge use case 
for fdinfo reads. I would go for a plain spinlock, or potentially even 
read/write_seqcount. Just because the latter has no atomics in the irq 
handler. Readers might retry now and then, but unless v3d typically sees 
thousands of interrupts per second it should not be a problem.
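
For example, the seqcount flavour could look roughly like the below 
sketch. Assumes a seqcount_t member in struct v3d_stats initialised in 
v3d_open, and relies on the interrupt handler being the only writer for 
the write side serialisation:

/* Writer, called from the interrupt handler - no atomics, just barriers. */
static void v3d_stats_update(struct v3d_stats *stats, u64 now)
{
	write_seqcount_begin(&stats->lock);
	stats->enabled_ns += now - stats->start_ns;
	stats->jobs_sent++;
	stats->start_ns = 0;
	write_seqcount_end(&stats->lock);
}

/* Reader, from fdinfo/sysfs - retries if it raced with an update. */
static void v3d_get_stats(const struct v3d_stats *stats, u64 timestamp,
			  u64 *active_time, u64 *jobs_sent)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&stats->lock);
		*active_time = stats->enabled_ns;
		if (stats->start_ns)
			*active_time += timestamp - stats->start_ns;
		*jobs_sent = stats->jobs_sent;
	} while (read_seqcount_retry(&stats->lock, seq));
}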



+   active_time = stats->start_ns ? stats->enabled_ns + timestamp - 
stats->start_ns
+ : stats->enabled_ns;
+   jobs_sent = stats->jobs_sent;
+   read_unlock_irqrestore(&stats->rw_lock, flags);

/* Note that, in case of a GPU reset, the time spent during an
 * attempt of executing the job is not computed in the runtime.
 */
drm_printf(p, "drm-engine-%s: \t%llu ns\n",
-  v3d_queue_to_string(queue),
-  stats->start_ns ? stats->enabled_ns + timestamp - 
stats->start_ns
-  : stats->enabled_ns);
+  v3d_queue_to_string(queue), active_time);

/* Note that we only count jobs that completed. Therefore, jobs
 * that were resubmitted due to a GPU reset are not computed.
 */
drm_printf(p, "v3d-jobs-%s: \t%llu jobs\n",
-  v3d_queue_to_string(queue), stats->jobs_sent);
+  v3d_queue_to_string(queue), jobs_sent);
}
  }

diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h
index 0117593976ed..8fde2623f763 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -40,6 +40,13 @@ struct v3d_stats {
u64 start_ns;
u64 enabled_ns;
u64 jobs_sent;
+
+   /*
+* This lock is used to protect the access to the GPU stats variables.
+* It must be used as, while we are reading the stats, IRQs can happen
+* and the stats would be updated.
+*/
+   rwlock_t rw_lock;
  };

  struct v3d_queue_state {
diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
index d14589d3ae6c..439088724a51 100644
--- a/drivers/gpu/drm/v3d/v3d_gem.c
+++ b/drivers/gpu/drm/v3d/v3d_gem.c
@@ -247,8 +247,11 @@ v3d_gem_init(struct drm_device *de

Re: [PATCH v2 6/6] drm/v3d: Enable big and super pages

2024-04-15 Thread Tvrtko Ursulin



On 05/04/2024 19:29, Maíra Canal wrote:

The V3D MMU also supports 64KB and 1MB pages, called big and super pages,
respectively. In order to set a 64KB page or 1MB page in the MMU, we need
to make sure that page table entries for all 4KB pages within a big/super
page must be correctly configured.

In order to create a big/super page, we need a contiguous memory region.
That's why we use a separate mountpoint with THP enabled. In order to
place the page table entries in the MMU, we iterate over the 16 4KB pages
(for big pages) or 256 4KB pages (for super pages) and insert the PTE.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_bo.c| 21 +--
  drivers/gpu/drm/v3d/v3d_drv.c   |  8 ++
  drivers/gpu/drm/v3d/v3d_drv.h   |  2 ++
  drivers/gpu/drm/v3d/v3d_gemfs.c |  6 +
  drivers/gpu/drm/v3d/v3d_mmu.c   | 46 ++---
  5 files changed, 71 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_bo.c b/drivers/gpu/drm/v3d/v3d_bo.c
index 79e31c5299b1..cfe82232886a 100644
--- a/drivers/gpu/drm/v3d/v3d_bo.c
+++ b/drivers/gpu/drm/v3d/v3d_bo.c
@@ -94,6 +94,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
struct v3d_dev *v3d = to_v3d_dev(obj->dev);
struct v3d_bo *bo = to_v3d_bo(obj);
struct sg_table *sgt;
+   u64 align;
int ret;

/* So far we pin the BO in the MMU for its lifetime, so use
@@ -103,6 +104,15 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
if (IS_ERR(sgt))
return PTR_ERR(sgt);

+   if (!v3d->super_pages)
+   align = SZ_4K;
+   else if (obj->size >= SZ_1M)
+   align = SZ_1M;
+   else if (obj->size >= SZ_64K)
+   align = SZ_64K;
+   else
+   align = SZ_4K;
+
	spin_lock(&v3d->mm_lock);
/* Allocate the object's space in the GPU's page tables.
 * Inserting PTEs will happen later, but the offset is for the
@@ -110,7 +120,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
 */
	ret = drm_mm_insert_node_generic(&v3d->mm, &bo->node,
 obj->size >> V3D_MMU_PAGE_SHIFT,
-SZ_4K >> V3D_MMU_PAGE_SHIFT, 0, 0);
+align >> V3D_MMU_PAGE_SHIFT, 0, 0);
	spin_unlock(&v3d->mm_lock);
if (ret)
return ret;
@@ -130,10 +140,17 @@ struct v3d_bo *v3d_bo_create(struct drm_device *dev, 
struct drm_file *file_priv,
 size_t unaligned_size)
  {
struct drm_gem_shmem_object *shmem_obj;
+   struct v3d_dev *v3d = to_v3d_dev(dev);
struct v3d_bo *bo;
int ret;

-   shmem_obj = drm_gem_shmem_create(dev, unaligned_size);
+   /* Let the user opt out of allocating the BOs with THP */
+   if (v3d->super_pages)
+   shmem_obj = drm_gem_shmem_create_with_mnt(dev, unaligned_size,
+ v3d->gemfs);
+   else
+   shmem_obj = drm_gem_shmem_create(dev, unaligned_size);
+
if (IS_ERR(shmem_obj))
return ERR_CAST(shmem_obj);
	bo = to_v3d_bo(&shmem_obj->base);
diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index 3debf37e7d9b..3dbd29560be4 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -36,6 +36,12 @@
  #define DRIVER_MINOR 0
  #define DRIVER_PATCHLEVEL 0

+static bool super_pages = true;
+module_param_named(super_pages, super_pages, bool, 0400);
+MODULE_PARM_DESC(super_pages, "Enable/Disable Super Pages support. Note: \
+  To enable Super Pages, you need support to \
+  enable THP.");


Maybe not expose the modparam unless CONFIG_TRANSPARENT_HUGEPAGE is 
enabled? Then you wouldn't have to explain the dependency in the 
description.



+
  static int v3d_get_param_ioctl(struct drm_device *dev, void *data,
   struct drm_file *file_priv)
  {
@@ -308,6 +314,8 @@ static int v3d_platform_drm_probe(struct platform_device 
*pdev)
return -ENOMEM;
}

+   v3d->super_pages = super_pages;
+
ret = v3d_gem_init(drm);
if (ret)
goto dma_free;
diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h
index 17236ee23490..0a7aacf51164 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -18,6 +18,7 @@ struct platform_device;
  struct reset_control;

  #define V3D_MMU_PAGE_SHIFT 12
+#define V3D_PAGE_FACTOR (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT)

  #define V3D_MAX_QUEUES (V3D_CPU + 1)

@@ -121,6 +122,7 @@ struct v3d_dev {
 * tmpfs instance used for shmem backed objects
 */
struct vfsmount *gemfs;
+   bool super_pages;


You could probably get away with not having to add this new bool by 
basing the runtime checks on v3d->gemfs != NULL. In v3d_gemfs_init you 
would then simply leave v3d->gemfs unset when super pages are disabled, 
so the mount pointer doubles as the enable flag.
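
A rough sketch of the idea, with the helper name and v3d_gemfs_init()
signature assumed rather than taken from the patch:

static inline bool v3d_has_super_pages(struct v3d_dev *v3d)
{
	return v3d->gemfs != NULL;
}

void v3d_gemfs_init(struct v3d_dev *v3d)
{
	v3d->gemfs = NULL;

	if (!super_pages || !IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
		return;

	/* ... mount the dedicated huge=within_size tmpfs instance ... */
}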

Re: [PATCH 4/5] drm/v3d: Create function to update a set of GPU stats

2024-04-15 Thread Tvrtko Ursulin



On 03/04/2024 21:24, Maíra Canal wrote:

Given a set of GPU stats, that is, a `struct v3d_stats` related to a
queue in a given context, create a function that can update all this set of
GPU stats.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_sched.c | 20 
  1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index ea5f5a84b55b..754107b80f67 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -118,6 +118,16 @@ v3d_job_start_stats(struct v3d_job *job, enum v3d_queue 
queue)
global_stats->start_ns = now;
  }
  
+static void

+v3d_stats_update(struct v3d_stats *stats)
+{
+   u64 now = local_clock();
+
+   stats->enabled_ns += now - stats->start_ns;
+   stats->jobs_sent++;
+   stats->start_ns = 0;
+}
+
  void
  v3d_job_update_stats(struct v3d_job *job, enum v3d_queue queue)
  {
@@ -125,15 +135,9 @@ v3d_job_update_stats(struct v3d_job *job, enum v3d_queue 
queue)
struct v3d_file_priv *file = job->file->driver_priv;
struct v3d_stats *global_stats = &v3d->queue[queue].stats;
struct v3d_stats *local_stats = &file->stats[queue];
-   u64 now = local_clock();
-
-   local_stats->enabled_ns += now - local_stats->start_ns;
-   local_stats->jobs_sent++;
-   local_stats->start_ns = 0;
  
-	global_stats->enabled_ns += now - global_stats->start_ns;

-   global_stats->jobs_sent++;
-   global_stats->start_ns = 0;
+   v3d_stats_update(local_stats);
+   v3d_stats_update(global_stats);
  }
  
  static struct dma_fence *v3d_bin_job_run(struct drm_sched_job *sched_job)


Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


Re: [PATCH 3/5] drm/v3d: Create a struct to store the GPU stats

2024-04-15 Thread Tvrtko Ursulin
struct v3d_file_priv *file = job->file->driver_priv;
+   struct v3d_stats *global_stats = &v3d->queue[queue].stats;
+   struct v3d_stats *local_stats = &file->stats[queue];
u64 now = local_clock();
  
-	file->start_ns[queue] = now;

-   v3d->queue[queue].start_ns = now;
+   local_stats->start_ns = now;
+   global_stats->start_ns = now;
  }
  
  void

@@ -121,15 +123,17 @@ v3d_job_update_stats(struct v3d_job *job, enum v3d_queue 
queue)
  {
struct v3d_dev *v3d = job->v3d;
struct v3d_file_priv *file = job->file->driver_priv;
+   struct v3d_stats *global_stats = &v3d->queue[queue].stats;
+   struct v3d_stats *local_stats = &file->stats[queue];
u64 now = local_clock();
  
-	file->enabled_ns[queue] += now - file->start_ns[queue];

-   file->jobs_sent[queue]++;
-   file->start_ns[queue] = 0;
+   local_stats->enabled_ns += now - local_stats->start_ns;
+   local_stats->jobs_sent++;
+   local_stats->start_ns = 0;
  
-	v3d->queue[queue].enabled_ns += now - v3d->queue[queue].start_ns;

-   v3d->queue[queue].jobs_sent++;
-   v3d->queue[queue].start_ns = 0;
+   global_stats->enabled_ns += now - global_stats->start_ns;
+   global_stats->jobs_sent++;
+   global_stats->start_ns = 0;
  }
  
  static struct dma_fence *v3d_bin_job_run(struct drm_sched_job *sched_job)

diff --git a/drivers/gpu/drm/v3d/v3d_sysfs.c b/drivers/gpu/drm/v3d/v3d_sysfs.c
index d106845ba890..1eb5f3de6937 100644
--- a/drivers/gpu/drm/v3d/v3d_sysfs.c
+++ b/drivers/gpu/drm/v3d/v3d_sysfs.c
@@ -21,8 +21,10 @@ gpu_stats_show(struct device *dev, struct device_attribute 
*attr, char *buf)
len += sysfs_emit(buf, "queue\ttimestamp\tjobs\truntime\n");
  
  	for (queue = 0; queue < V3D_MAX_QUEUES; queue++) {

-   if (v3d->queue[queue].start_ns)
-   active_runtime = timestamp - v3d->queue[queue].start_ns;
+   struct v3d_stats *stats = &v3d->queue[queue].stats;
+
+   if (stats->start_ns)
+   active_runtime = timestamp - stats->start_ns;
else
active_runtime = 0;
  
@@ -39,8 +41,8 @@ gpu_stats_show(struct device *dev, struct device_attribute *attr, char *buf)

len += sysfs_emit_at(buf, len, "%s\t%llu\t%llu\t%llu\n",
 v3d_queue_to_string(queue),
 timestamp,
-v3d->queue[queue].jobs_sent,
-        v3d->queue[queue].enabled_ns + 
active_runtime);
+stats->jobs_sent,
+stats->enabled_ns + active_runtime);
}
  
  	return len;


Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


Re: [PATCH 1/5] drm/v3d: Don't increment `enabled_ns` twice

2024-04-15 Thread Tvrtko Ursulin



On 03/04/2024 21:24, Maíra Canal wrote:

The commit 509433d8146c ("drm/v3d: Expose the total GPU usage stats on sysfs")
introduced the calculation of global GPU stats. To do so, it used
the already existing infrastructure provided by commit 09a93cc4f7d1 ("drm/v3d:
Implement show_fdinfo() callback for GPU usage stats"). While adding
global GPU stats calculation ability, the author forgot to delete the
existing one.

Currently, the value of `enabled_ns` is incremented twice by the end of
the job, when it should be added just once. Therefore, delete the
leftovers from commit 509433d8146c ("drm/v3d: Expose the total GPU usage
stats on sysfs").

Fixes: 509433d8146c ("drm/v3d: Expose the total GPU usage stats on sysfs")
Reported-by: Tvrtko Ursulin 
Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_irq.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_irq.c b/drivers/gpu/drm/v3d/v3d_irq.c
index 2e04f6cb661e..ce6b2fb341d1 100644
--- a/drivers/gpu/drm/v3d/v3d_irq.c
+++ b/drivers/gpu/drm/v3d/v3d_irq.c
@@ -105,7 +105,6 @@ v3d_irq(int irq, void *arg)
struct v3d_file_priv *file = 
v3d->bin_job->base.file->driver_priv;
u64 runtime = local_clock() - file->start_ns[V3D_BIN];
  
-		file->enabled_ns[V3D_BIN] += local_clock() - file->start_ns[V3D_BIN];

file->jobs_sent[V3D_BIN]++;
v3d->queue[V3D_BIN].jobs_sent++;
  
@@ -126,7 +125,6 @@ v3d_irq(int irq, void *arg)

struct v3d_file_priv *file = 
v3d->render_job->base.file->driver_priv;
u64 runtime = local_clock() - file->start_ns[V3D_RENDER];
  
-		file->enabled_ns[V3D_RENDER] += local_clock() - file->start_ns[V3D_RENDER];

file->jobs_sent[V3D_RENDER]++;
v3d->queue[V3D_RENDER].jobs_sent++;
  
@@ -147,7 +145,6 @@ v3d_irq(int irq, void *arg)

struct v3d_file_priv *file = 
v3d->csd_job->base.file->driver_priv;
u64 runtime = local_clock() - file->start_ns[V3D_CSD];
  
-		file->enabled_ns[V3D_CSD] += local_clock() - file->start_ns[V3D_CSD];

file->jobs_sent[V3D_CSD]++;
v3d->queue[V3D_CSD].jobs_sent++;
  
@@ -195,7 +192,6 @@ v3d_hub_irq(int irq, void *arg)

struct v3d_file_priv *file = 
v3d->tfu_job->base.file->driver_priv;
u64 runtime = local_clock() - file->start_ns[V3D_TFU];
  
-		file->enabled_ns[V3D_TFU] += local_clock() - file->start_ns[V3D_TFU];

file->jobs_sent[V3D_TFU]++;
v3d->queue[V3D_TFU].jobs_sent++;
  


Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Tvrtko Ursulin



On 01/04/2024 17:37, Felix Kuehling wrote:

On 2024-04-01 11:09, Tvrtko Ursulin wrote:


On 28/03/2024 20:42, Felix Kuehling wrote:


On 2024-03-28 12:03, Tvrtko Ursulin wrote:


Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. 
It appears it relies on the KFD support being compiled in and 
/dev/kfd present, correct? AFAICT at least, it relies on that to 
figure out the amdgpu DRM node.


It would probably be good to consider designing things without that 
dependency. So that checkpointing an application which does not use 
/dev/kfd is possible. Or if the kernel does not even have the KFD 
support compiled in.


Yeah, if we want to support graphics apps that don't use KFD, we 
should definitely do that. Currently we get a lot of topology 
information from KFD, not even from the /dev/kfd device but from the 
sysfs nodes exposed by KFD. We'd need to get GPU device info from the 
render nodes instead. And if KFD is available, we may need to 
integrate both sources of information.





It could perhaps mean no more than adding some GPU discovery code 
into CRIU. Which should be flexible enough to account for things 
like re-assigned minor numbers due to driver reload.


Do you mean adding GPU discovery to the core CRIU, or to the plugin. 
I was thinking this is still part of the plugin.


Yes I agree. I was only thinking about adding some DRM device 
discovery code in a more decoupled fashion from the current plugin, 
for both the reason discussed above (decoupling a bit from reliance on 
kfd sysfs), and then also if/when a new DRM driver might want to 
implement this the code could be move to some common plugin area.


I am not sure how feasible that would be though. The "gpu id" concept 
and its matching in the current kernel code and CRIU plugin - is that 
value tied to the physical GPU instance or how it works?


The concept of the GPU ID is that it's stable while the system is up, 
even when devices get added and removed dynamically. It was baked into 
the API early on, but I don't think we ever fully validated device hot 
plug. I think the closest we're getting is with our latest MI GPUs and 
dynamic partition mode change.


Doesn't it read the saved gpu id from the image file while doing restore 
and try to open the render node to match it? Maybe I am misreading the 
code... But if it does, does it imply that in practice it could be stable 
across reboots? Or that it is not possible to restore to a different 
instance of maybe the same GPU model installed in a system?


This also highlights another aspect on those spatially partitioned GPUs. 
GPU IDs identify device partitions, not devices. Similarly, each 
partition has its own render node, and the KFD topology info in sysfs 
points to the render-minor number corresponding to each GPU ID.


I am not familiar with this. This is not SR-IOV but some other kind of 
partitioning? Would you have any links where I could read more?


Regards,

Tvrtko

Otherwise I am eagerly awaiting to hear more about the design 
specifics around dma-buf handling. And also seeing how to extend to 
other DRM related anonymous fds.


I've been pretty far under-water lately. I hope I'll find time to 
work on this more, but it's probably going to be at least a few weeks.


Got it.

Regards,

Tvrtko



Regards,
   Felix




Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:


On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render 
nodes in order to maintain CRIU support for ROCm application 
once they start relying on render nodes for more GPU memory 
management. In this email I'm providing some background why we 
are doing this, and outlining some of the problems we need to 
solve to checkpoint and restore render node state and shared 
memory (DMABuf) state. I have some thoughts on the API design, 
leaning on what we did for KFD, but would like to get feedback 
from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM 
API and improve interoperability between graphics and compute. 
This uses DMABufs for sharing buffer objects between KFD and 
multiple render node devices, as well as between processes. In 
the long run this also provides a path to moving all or most 
memory management from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual 
address management, that creates a problem for checkpointing 
and restoring ROCm applications with CRIU. Currently there is 
no support for checkpointing and restoring render node state, 
other than CPU virtual address mappings. Support will be needed 
for checkpointing GEM buffer objects and handles, their GPU virtual 
address mappings and memory sharing relationships between devices and 
processes.

Re: drm-misc migration to Gitlab server

2024-04-01 Thread Tvrtko Ursulin



On 12/03/2024 13:56, Maxime Ripard wrote:

Hi,

On Tue, Feb 20, 2024 at 09:49:25AM +0100, Maxime Ripard wrote:

## Changing the default location repo

Dim gets its repos list in the drm-rerere nightly.conf file. We will
need to change that file to match the gitlab repo, and drop the old cgit
URLs to avoid people pushing to the wrong place once the transition is
made.

I guess the next merge window is a good time to do so, it's usually a
quiet time for us and a small disruption would be easier to handle. I'll
be off-duty during that time too, so I'll have time to handle any
complication.

## Updating the documentation

The documentation currently mentions the old process to request a
drm-misc access. It will all go through Gitlab now, so it will change a
few things. We will also need to update and move the issue template to
the new repo to maintain consistency.

I would expect the transition (if everything goes smoothly) to occur in
the merge-window time frame (11/03 -> 24/03).


The transition just happened. The main drm-misc repo is now on gitlab,
with the old cgit repo being setup as a mirror.

If there's any issue accessing that gitlab repo, please let me know.


No issues accessing the repo, just a slight confusion about how to handle 
the workflow. More specifically, if I have a patch which wants to be 
merged to drm-misc-next, is the mailing list based workflow still the 
way to go, or should I create a merge request, or should I apply for 
commit access via some new method other than adding permissions to my 
legacy fdo ssh account?


Regards,

Tvrtko


Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Tvrtko Ursulin



On 28/03/2024 20:42, Felix Kuehling wrote:


On 2024-03-28 12:03, Tvrtko Ursulin wrote:


Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. 
It appears it relies on the KFD support being compiled in and /dev/kfd 
present, correct? AFAICT at least, it relies on that to figure out the 
amdgpu DRM node.


It would probably be good to consider designing things without that 
dependency. So that checkpointing an application which does not use 
/dev/kfd is possible. Or if the kernel does not even have the KFD 
support compiled in.


Yeah, if we want to support graphics apps that don't use KFD, we should 
definitely do that. Currently we get a lot of topology information from 
KFD, not even from the /dev/kfd device but from the sysfs nodes exposed 
by KFD. We'd need to get GPU device info from the render nodes instead. 
And if KFD is available, we may need to integrate both sources of 
information.





It could perhaps mean no more than adding some GPU discovery code into 
CRIU. Which should be flexible enough to account for things like 
re-assigned minor numbers due to driver reload.


Do you mean adding GPU discovery to the core CRIU, or to the plugin. I 
was thinking this is still part of the plugin.


Yes I agree. I was only thinking about adding some DRM device discovery 
code in a more decoupled fashion from the current plugin, for both the 
reason discussed above (decoupling a bit from reliance on kfd sysfs), 
and then also if/when a new DRM driver might want to implement this the 
code could be move to some common plugin area.


I am not sure how feasible that would be though. The "gpu id" concept 
and its matching in the current kernel code and CRIU plugin - is that 
value tied to the physical GPU instance or how it works?


Otherwise I am eagerly awaiting to hear more about the design 
specifics around dma-buf handling. And also seeing how to extend to 
other DRM related anonymous fds.


I've been pretty far under-water lately. I hope I'll find time to work 
on this more, but it's probably going to be at least a few weeks.


Got it.

Regards,

Tvrtko



Regards,
   Felix




Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:


On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render 
nodes in order to maintain CRIU support for ROCm application once 
they start relying on render nodes for more GPU memory 
management. In this email I'm providing some background why we 
are doing this, and outlining some of the problems we need to 
solve to checkpoint and restore render node state and shared 
memory (DMABuf) state. I have some thoughts on the API design, 
leaning on what we did for KFD, but would like to get feedback 
from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM 
API and improve interoperability between graphics and compute. 
This uses DMABufs for sharing buffer objects between KFD and 
multiple render node devices, as well as between processes. In 
the long run this also provides a path to moving all or most 
memory management from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and 
restoring ROCm applications with CRIU. Currently there is no 
support for checkpointing and restoring render node state, other 
than CPU virtual address mappings. Support will be needed for 
checkpointing GEM buffer objects and handles, their GPU virtual 
address mappings and memory sharing relationships between devices 
and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including 
scheduler contexts and BO lists. Most of this state is 
driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf 
APIs and may have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature on the overall, 
although I cannot answer on the last question here.


I forgot to finish this thought. I cannot answer / don't know of 
any concrete plans, but I think feature is pretty cool and if 
amdgpu gets it working I wouldn't be surprised if other drivers 
would get interested.


Thanks, that's good to hear!




Funnily enough, it has a tiny relation to an i915 feature I 
recently implemented on Mesa's request, which is to be able to 
"upload" the GPU context from the GPU hang error state and replay 
the hanging

Re: [PATCH] dma-buf: Do not build debugfs related code when !CONFIG_DEBUG_FS

2024-04-01 Thread Tvrtko Ursulin



On 01/04/2024 13:45, Christian König wrote:

Am 01.04.24 um 14:39 schrieb Tvrtko Ursulin:


On 29/03/2024 00:00, T.J. Mercier wrote:
On Thu, Mar 28, 2024 at 7:53 AM Tvrtko Ursulin  
wrote:


From: Tvrtko Ursulin 

There is no point in compiling in the list and mutex operations 
which are

only used from the dma-buf debugfs code, if debugfs is not compiled in.

Put the code in question behind some kconfig guards and so save 
some text

and maybe even a pointer per object at runtime when not enabled.

Signed-off-by: Tvrtko Ursulin 


Reviewed-by: T.J. Mercier 


Thanks!

How would patches to dma-buf be typically landed? Via what tree I 
mean? drm-misc-next?


That should go through drm-misc-next.

And feel free to add Reviewed-by: Christian König 
 as well.


Thanks!

Maarten if I got it right you are handling the next drm-misc-next pull - 
could you merge this one please?


Regards,

Tvrtko


Re: [PATCH] dma-buf: Do not build debugfs related code when !CONFIG_DEBUG_FS

2024-04-01 Thread Tvrtko Ursulin



On 29/03/2024 00:00, T.J. Mercier wrote:

On Thu, Mar 28, 2024 at 7:53 AM Tvrtko Ursulin  wrote:


From: Tvrtko Ursulin 

There is no point in compiling in the list and mutex operations which are
only used from the dma-buf debugfs code, if debugfs is not compiled in.

Put the code in question behind some kconfig guards and so save some text
and maybe even a pointer per object at runtime when not enabled.

Signed-off-by: Tvrtko Ursulin 


Reviewed-by: T.J. Mercier 


Thanks!

How would patches to dma-buf be typically landed? Via what tree I mean? 
drm-misc-next?


Regards,

Tvrtko


Re: Proposal to add CRIU support to DRM render nodes

2024-03-28 Thread Tvrtko Ursulin



Hi Felix,

I had one more thought while browsing around the amdgpu CRIU plugin. It 
appears it relies on the KFD support being compiled in and /dev/kfd 
present, correct? AFAICT at least, it relies on that to figure out the 
amdgpu DRM node.


It would probably be good to consider designing things without that 
dependency. So that checkpointing an application which does not use 
/dev/kfd is possible. Or if the kernel does not even have the KFD 
support compiled in.


It could perhaps mean no more than adding some GPU discovery code into 
CRIU. Which should be flexible enough to account for things like 
re-assigned minor numbers due to driver reload.
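
To illustrate, such a plugin could map a stable PCI bus id to whatever
render minor the device currently has via libdrm, roughly like this
(helper name and error handling are illustrative only):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <xf86drm.h>

/* Return the current render node path for a "0000:04:00.0" style bus id. */
static char *render_node_for_pci(const char *pci_busid)
{
	drmDevicePtr devices[64];
	char *path = NULL;
	int i, count;

	count = drmGetDevices2(0, devices, 64);
	if (count < 0)
		return NULL;

	for (i = 0; i < count; i++) {
		const drmDevicePtr dev = devices[i];
		char busid[32];

		if (dev->bustype != DRM_BUS_PCI ||
		    !(dev->available_nodes & (1 << DRM_NODE_RENDER)))
			continue;

		snprintf(busid, sizeof(busid), "%04x:%02x:%02x.%u",
			 dev->businfo.pci->domain, dev->businfo.pci->bus,
			 dev->businfo.pci->dev, dev->businfo.pci->func);

		if (!strcmp(busid, pci_busid)) {
			path = strdup(dev->nodes[DRM_NODE_RENDER]);
			break;
		}
	}

	drmFreeDevices(devices, count);
	return path;
}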


Otherwise I am eagerly awaiting to hear more about the design specifics 
around dma-buf handling. And also seeing how to extend to other DRM 
related anonymous fds.


Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:


On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes 
in order to maintain CRIU support for ROCm application once they 
start relying on render nodes for more GPU memory management. In 
this email I'm providing some background why we are doing this, and 
outlining some of the problems we need to solve to checkpoint and 
restore render node state and shared memory (DMABuf) state. I have 
some thoughts on the API design, leaning on what we did for KFD, 
but would like to get feedback from the DRI community regarding 
that API and to what extent there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM API 
and improve interoperability between graphics and compute. This 
uses DMABufs for sharing buffer objects between KFD and multiple 
render node devices, as well as between processes. In the long run 
this also provides a path to moving all or most memory management 
from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU 
virtual address mappings. Support will be needed for checkpointing 
GEM buffer objects and handles, their GPU virtual address mappings 
and memory sharing relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including scheduler 
contexts and BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf APIs 
and may have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature on the overall, although 
I cannot answer on the last question here.


I forgot to finish this thought. I cannot answer / don't know of any 
concrete plans, but I think feature is pretty cool and if amdgpu gets 
it working I wouldn't be surprised if other drivers would get 
interested.


Thanks, that's good to hear!




Funnily enough, it has a tiny relation to an i915 feature I recently 
implemented on Mesa's request, which is to be able to "upload" the 
GPU context from the GPU hang error state and replay the hanging 
request. It is kind of (at a stretch) a very special tiny subset of 
checkpoint and restore, so I am mentioning it just as a curiosity.


And there is also another partial conceptual intersection with the (at 
the moment not yet upstream) i915 online debugger. This part being 
in the area of discovering and enumerating GPU resources belonging to 
the client.


I don't see an immediate design or code sharing opportunities though 
but just mentioning.


I did spend some time reading your plugin and kernel implementation 
out of curiosity and have some comments and questions.


With that out of the way, some considerations for a possible DRM 
CRIU API (either generic of AMDGPU driver specific): The API goes 
through several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can 
allocate

    memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at 
this time)
 2. Resume (final fixups after the VMAs are sorted out, resume 
execution)


Btw is check-pointing guaranteeing all relevant activity is idled? 
For instance dma_resv objects are free of fences which would need to be 
restored for things to continue executing sensibly? Or how is that handled?

[PATCH] dma-buf: Do not build debugfs related code when !CONFIG_DEBUG_FS

2024-03-28 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

There is no point in compiling in the list and mutex operations which are
only used from the dma-buf debugfs code, if debugfs is not compiled in.

Put the code in questions behind some kconfig guards and so save some text
and maybe even a pointer per object at runtime when not enabled.

Signed-off-by: Tvrtko Ursulin 
Cc: Sumit Semwal 
Cc: "Christian König" 
Cc: linux-me...@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-ker...@vger.kernel.org
Cc: kernel-...@igalia.com
---
 drivers/dma-buf/dma-buf.c | 56 ---
 include/linux/dma-buf.h   |  2 ++
 2 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..8892bc701a66 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -35,12 +35,35 @@
 
 static inline int is_dma_buf_file(struct file *);
 
-struct dma_buf_list {
-   struct list_head head;
-   struct mutex lock;
-};
+#if IS_ENABLED(CONFIG_DEBUG_FS)
+static DEFINE_MUTEX(debugfs_list_mutex);
+static LIST_HEAD(debugfs_list);
 
-static struct dma_buf_list db_list;
+static void __dma_buf_debugfs_list_add(struct dma_buf *dmabuf)
+{
+   mutex_lock(&debugfs_list_mutex);
+   list_add(&dmabuf->list_node, &debugfs_list);
+   mutex_unlock(&debugfs_list_mutex);
+}
+
+static void __dma_buf_debugfs_list_del(struct dma_buf *dmabuf)
+{
+   if (!dmabuf)
+   return;
+
+   mutex_lock(&debugfs_list_mutex);
+   list_del(&dmabuf->list_node);
+   mutex_unlock(&debugfs_list_mutex);
+}
+#else
+static void __dma_buf_debugfs_list_add(struct dma_buf *dmabuf)
+{
+}
+
+static void __dma_buf_debugfs_list_del(struct file *file)
+{
+}
+#endif
 
 static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
@@ -89,17 +112,10 @@ static void dma_buf_release(struct dentry *dentry)
 
 static int dma_buf_file_release(struct inode *inode, struct file *file)
 {
-   struct dma_buf *dmabuf;
-
if (!is_dma_buf_file(file))
return -EINVAL;
 
-   dmabuf = file->private_data;
-   if (dmabuf) {
-   mutex_lock(&db_list.lock);
-   list_del(&dmabuf->list_node);
-   mutex_unlock(&db_list.lock);
-   }
+   __dma_buf_debugfs_list_del(file->private_data);
 
return 0;
 }
@@ -672,9 +688,7 @@ struct dma_buf *dma_buf_export(const struct 
dma_buf_export_info *exp_info)
file->f_path.dentry->d_fsdata = dmabuf;
dmabuf->file = file;
 
-   mutex_lock(&db_list.lock);
-   list_add(&dmabuf->list_node, &db_list.head);
-   mutex_unlock(&db_list.lock);
+   __dma_buf_debugfs_list_add(dmabuf);
 
return dmabuf;
 
@@ -1611,7 +1625,7 @@ static int dma_buf_debug_show(struct seq_file *s, void 
*unused)
size_t size = 0;
int ret;
 
-   ret = mutex_lock_interruptible(&db_list.lock);
+   ret = mutex_lock_interruptible(&debugfs_list_mutex);
 
if (ret)
return ret;
@@ -1620,7 +1634,7 @@ static int dma_buf_debug_show(struct seq_file *s, void 
*unused)
seq_printf(s, "%-8s\t%-8s\t%-8s\t%-8s\texp_name\t%-8s\tname\n",
   "size", "flags", "mode", "count", "ino");
 
-   list_for_each_entry(buf_obj, &db_list.head, list_node) {
+   list_for_each_entry(buf_obj, &debugfs_list, list_node) {
 
ret = dma_resv_lock_interruptible(buf_obj->resv, NULL);
if (ret)
@@ -1657,11 +1671,11 @@ static int dma_buf_debug_show(struct seq_file *s, void 
*unused)
 
seq_printf(s, "\nTotal %d objects, %zu bytes\n", count, size);
 
-   mutex_unlock(&db_list.lock);
+   mutex_unlock(&debugfs_list_mutex);
return 0;
 
 error_unlock:
-   mutex_unlock(&db_list.lock);
+   mutex_unlock(&debugfs_list_mutex);
return ret;
 }
 
@@ -1718,8 +1732,6 @@ static int __init dma_buf_init(void)
if (IS_ERR(dma_buf_mnt))
return PTR_ERR(dma_buf_mnt);
 
-   mutex_init(&db_list.lock);
-   INIT_LIST_HEAD(&db_list.head);
dma_buf_init_debugfs();
return 0;
 }
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 8ff4add71f88..36216d28d8bd 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -370,8 +370,10 @@ struct dma_buf {
 */
struct module *owner;
 
+#if IS_ENABLED(CONFIG_DEBUG_FS)
/** @list_node: node for dma_buf accounting and debugging. */
struct list_head list_node;
+#endif
 
/** @priv: exporter specific private data for this buffer object. */
void *priv;
-- 
2.44.0



Re: [PATCH] drm/i915/gem: Replace dev_priv with i915

2024-03-28 Thread Tvrtko Ursulin



On 28/03/2024 07:18, Andi Shyti wrote:

Anyone using 'dev_priv' instead of 'i915' in a cleaned-up area
should be fined and required to do community service for a few
days.

I thought I had cleaned up the 'gem/' directory in the past, but
still, old aficionados of the 'dev_priv' name keep sneaking it
in.

Signed-off-by: Andi Shyti 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
---
  drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c |  4 ++--
  drivers/gpu/drm/i915/gem/i915_gem_shmem.c  |  6 +++---
  drivers/gpu/drm/i915/gem/i915_gem_stolen.h |  8 
  drivers/gpu/drm/i915/gem/i915_gem_tiling.c | 18 +-
  drivers/gpu/drm/i915/gem/i915_gem_userptr.c|  6 +++---
  .../gpu/drm/i915/gem/selftests/huge_pages.c| 14 +++---
  6 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 3f20fe381199..42619fc05de4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -2456,7 +2456,7 @@ static int eb_submit(struct i915_execbuffer *eb)
   * The engine index is returned.
   */
  static unsigned int
-gen8_dispatch_bsd_engine(struct drm_i915_private *dev_priv,
+gen8_dispatch_bsd_engine(struct drm_i915_private *i915,
 struct drm_file *file)
  {
struct drm_i915_file_private *file_priv = file->driver_priv;
@@ -2464,7 +2464,7 @@ gen8_dispatch_bsd_engine(struct drm_i915_private 
*dev_priv,
/* Check whether the file_priv has already selected one ring. */
if ((int)file_priv->bsd_engine < 0)
file_priv->bsd_engine =
-   
get_random_u32_below(dev_priv->engine_uabi_class_count[I915_ENGINE_CLASS_VIDEO]);
+   
get_random_u32_below(i915->engine_uabi_class_count[I915_ENGINE_CLASS_VIDEO]);
  
  	return file_priv->bsd_engine;

  }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c 
b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 38b72d86560f..c5e1c718a6d2 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -654,7 +654,7 @@ i915_gem_object_create_shmem(struct drm_i915_private *i915,
  
  /* Allocate a new GEM object and fill it with the supplied data */

  struct drm_i915_gem_object *
-i915_gem_object_create_shmem_from_data(struct drm_i915_private *dev_priv,
+i915_gem_object_create_shmem_from_data(struct drm_i915_private *i915,
   const void *data, resource_size_t size)
  {
struct drm_i915_gem_object *obj;
@@ -663,8 +663,8 @@ i915_gem_object_create_shmem_from_data(struct 
drm_i915_private *dev_priv,
resource_size_t offset;
int err;
  
-	GEM_WARN_ON(IS_DGFX(dev_priv));

-   obj = i915_gem_object_create_shmem(dev_priv, round_up(size, PAGE_SIZE));
+   GEM_WARN_ON(IS_DGFX(i915));
+   obj = i915_gem_object_create_shmem(i915, round_up(size, PAGE_SIZE));
if (IS_ERR(obj))
return obj;
  
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_stolen.h b/drivers/gpu/drm/i915/gem/i915_gem_stolen.h

index 258381d1c054..dfe0db8bb1b9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_stolen.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_stolen.h
@@ -14,14 +14,14 @@ struct drm_i915_gem_object;
  
  #define i915_stolen_fb drm_mm_node
  
-int i915_gem_stolen_insert_node(struct drm_i915_private *dev_priv,

+int i915_gem_stolen_insert_node(struct drm_i915_private *i915,
struct drm_mm_node *node, u64 size,
unsigned alignment);
-int i915_gem_stolen_insert_node_in_range(struct drm_i915_private *dev_priv,
+int i915_gem_stolen_insert_node_in_range(struct drm_i915_private *i915,
 struct drm_mm_node *node, u64 size,
 unsigned alignment, u64 start,
 u64 end);
-void i915_gem_stolen_remove_node(struct drm_i915_private *dev_priv,
+void i915_gem_stolen_remove_node(struct drm_i915_private *i915,
 struct drm_mm_node *node);
  struct intel_memory_region *
  i915_gem_stolen_smem_setup(struct drm_i915_private *i915, u16 type,
@@ -31,7 +31,7 @@ i915_gem_stolen_lmem_setup(struct drm_i915_private *i915, u16 
type,
   u16 instance);
  
  struct drm_i915_gem_object *

-i915_gem_object_create_stolen(struct drm_i915_private *dev_priv,
+i915_gem_object_create_stolen(struct drm_i915_private *i915,
  resource_size_t size);
  
  bool i915_gem_object_is_stolen(const struct drm_i915_gem_object *obj);

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c 
b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
index a049ca0b7980..d9eb84c1d2f1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c

Re: [PATCH v6 0/3] Disable automatic load CCS load balancing

2024-03-20 Thread Tvrtko Ursulin



On 20/03/2024 15:06, Andi Shyti wrote:

Ping! Any thoughts here?


I only casually observed the discussion after I saw Matt suggested 
further simplifications. As I understood it, you will bring back the 
uabi engine games when adding the dynamic behaviour and that is fine by me.


Regards,

Tvrtko


On Wed, Mar 13, 2024 at 09:19:48PM +0100, Andi Shyti wrote:

Hi,

this series does basically two things:

1. Disables automatic load balancing as advised by the hardware
workaround.

2. Assigns all the CCS slices to one single user engine. The user
will then be able to query only one CCS engine

From v5 I have created a new file, gt/intel_gt_ccs_mode.c where
I added the intel_gt_apply_ccs_mode(). In the upcoming patches,
this file will contain the implementation for dynamic CCS mode
setting.

Thanks Tvrtko, Matt, John and Joonas for your reviews!

Andi

Changelog
=
v5 -> v6 (thanks Matt for the suggestions in v6)
  - Remove the refactoring and the for_each_available_engine()
macro and instead do not create the intel_engine_cs structure
at all.
  - In patch 1 just a trivial reordering of the bit definitions.

v4 -> v5
  - Use the workaround framework to do all the CCS balancing
settings in order to always apply the modes also when the
engine resets. Put everything in its own specific function to
be executed for the first CCS engine encountered. (Thanks
Matt)
  - Calculate the CCS ID for the CCS mode as the first available
CCS among all the engines (Thanks Matt)
  - create the intel_gt_ccs_mode.c function to host the CCS
configuration. We will have it ready for the next series.
  - Fix a selftest that was failing because could not set CCS2.
  - Add the for_each_available_engine() macro to exclude CCS1+ and
start using it in the hangcheck selftest.

v3 -> v4
  - Reword correctly the comment in the workaround
  - Fix a buffer overflow (Thanks Joonas)
  - Handle properly the fused engines when setting the CCS mode.

v2 -> v3
  - Simplified the algorithm for creating the list of the exported
uabi engines. (Patch 1) (Thanks, Tvrtko)
  - Consider the fused engines when creating the uabi engine list
(Patch 2) (Thanks, Matt)
  - Patch 4 now uses a the refactoring from patch 1, in a cleaner
outcome.

v1 -> v2
  - In Patch 1 use the correct workaround number (thanks Matt).
  - In Patch 2 do not add the extra CCS engines to the exposed
UABI engine list and adapt the engine counting accordingly
(thanks Tvrtko).
  - Reword the commit of Patch 2 (thanks John).

Andi Shyti (3):
   drm/i915/gt: Disable HW load balancing for CCS
   drm/i915/gt: Do not generate the command streamer for all the CCS
   drm/i915/gt: Enable only one CCS for compute workload

  drivers/gpu/drm/i915/Makefile   |  1 +
  drivers/gpu/drm/i915/gt/intel_engine_cs.c   | 20 ---
  drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.c | 39 +
  drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.h | 13 +++
  drivers/gpu/drm/i915/gt/intel_gt_regs.h |  6 
  drivers/gpu/drm/i915/gt/intel_workarounds.c | 30 ++--
  6 files changed, 103 insertions(+), 6 deletions(-)
  create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.c
  create mode 100644 drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.h

--
2.43.0


Re: [PATCH 2/5] drm/gem: Add a mountpoint parameter to drm_gem_object_init()

2024-03-18 Thread Tvrtko Ursulin



On 18/03/2024 15:05, Christian König wrote:

Am 18.03.24 um 15:24 schrieb Maíra Canal:

Not that the CC list wasn't big enough, but I'm adding MM folks
in the CC list.

On 3/18/24 11:04, Christian König wrote:

Am 18.03.24 um 14:28 schrieb Maíra Canal:

Hi Christian,

On 3/18/24 10:10, Christian König wrote:

Am 18.03.24 um 13:42 schrieb Maíra Canal:

Hi Christian,

On 3/12/24 10:48, Christian König wrote:

Am 12.03.24 um 14:09 schrieb Tvrtko Ursulin:


On 12/03/2024 10:37, Christian König wrote:

Am 12.03.24 um 11:31 schrieb Tvrtko Ursulin:


On 12/03/2024 10:23, Christian König wrote:

Am 12.03.24 um 10:30 schrieb Tvrtko Ursulin:


On 12/03/2024 08:59, Christian König wrote:

Am 12.03.24 um 09:51 schrieb Tvrtko Ursulin:


Hi Maira,

On 11/03/2024 10:05, Maíra Canal wrote:
For some applications, such as using huge pages, we might 
want to have a
different mountpoint, for which we pass in mount flags 
that better match

our usecase.

Therefore, add a new parameter to drm_gem_object_init() 
that allow us to
define the tmpfs mountpoint where the GEM object will be 
created. If
this parameter is NULL, then we fallback to 
shmem_file_setup().


One strategy for reducing churn, and so the number of 
drivers this patch touches, could be to add a lower level 
drm_gem_object_init() (which takes vfsmount, call it 
__drm_gem_object_init(), or drm__gem_object_init_mnt(), 
and make drm_gem_object_init() call that one with a NULL 
argument.


I would even go a step further into the other direction. 
The shmem backed GEM object is just some special handling 
as far as I can see.


So I would rather suggest to rename all drm_gem_* function 
which only deal with the shmem backed GEM object into 
drm_gem_shmem_*.


That makes sense although it would be very churny. I at 
least would be on the fence regarding the cost vs benefit.


Yeah, it should clearly not be part of this patch here.



Also the explanation why a different mount point helps with 
something isn't very satisfying.


Not satisfying as you think it is not detailed enough to say 
driver wants to use huge pages for performance? Or not 
satisying as you question why huge pages would help?


That huge pages are beneficial is clear to me, but I'm 
missing the connection why a different mount point helps with 
using huge pages.


Ah right, same as in i915, one needs to mount a tmpfs instance 
passing huge=within_size or huge=always option. Default is 
'never', see man 5 tmpfs.


Thanks for the explanation, I wasn't aware of that.

Mhm, shouldn't we always use huge pages? Is there a reason for 
a DRM device to not use huge pages with the shmem backend?


AFAIU, according to b901bb89324a ("drm/i915/gemfs: enable THP"), 
back then the understanding was within_size may overallocate, 
meaning there would be some space wastage, until the memory 
pressure makes the thp code split the trailing huge page. I 
haven't checked if that still applies.


Other than that I don't know if some drivers/platforms could 
have problems if they have some limitations or hardcoded 
assumptions when they iterate the sg list.


Yeah, that was the whole point behind my question. As far as I 
can see this isn't driver specific, but platform specific.


I might be wrong here, but I think we should then probably not 
have that handling in each individual driver, but rather 
centralized in the DRM code.


I don't see a point in enabling THP for all shmem drivers. A huge 
page

is only useful if the driver is going to use it. On V3D, for example,
I only need huge pages because I need the memory contiguously 
allocated

to implement Super Pages. Otherwise, if we don't have the Super Pages
support implemented in the driver, I would be creating memory 
pressure

without any performance gain.


Well that's the point I'm disagreeing with. THP doesn't seem to 
create much extra memory pressure for this use case.


As far as I can see the background for the option is that files in 
tmpfs usually have a varying size, so it usually isn't beneficial 
to allocate a huge page just to find that the shmem file is much 
smaller than what's needed.


But GEM objects have a fixed size. So we know up front whether we need 
4KiB or 1GiB and can therefore directly allocate huge pages if they 
are available and the object is large enough to back them with.


If the memory pressure is so high that we don't have huge pages 
available the shmem code falls back to standard pages anyway.


The matter is: how do we define the point where the memory pressure 
is high?


Well as driver developers/maintainers we simply don't do that. This 
is the job of the shmem code.



For example, notice that in this implementation of Super Pages
for the V3D driver, I only use a Super Page if the BO is bigger than 
2MB. I'm doing that because the Raspberry Pi only has 4GB of RAM 
available for the GPU. If I created huge pages for every BO 
allocation (and initially, I tried that), I would end up with hangs 
in some applications.


Yeah, that

Re: Proposal to add CRIU support to DRM render nodes

2024-03-15 Thread Tvrtko Ursulin



On 15/03/2024 02:33, Felix Kuehling wrote:


On 2024-03-12 5:45, Tvrtko Ursulin wrote:


On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes 
in order to maintain CRIU support for ROCm application once they 
start relying on render nodes for more GPU memory management. In 
this email I'm providing some background why we are doing this, and 
outlining some of the problems we need to solve to checkpoint and 
restore render node state and shared memory (DMABuf) state. I have 
some thoughts on the API design, leaning on what we did for KFD, but 
would like to get feedback from the DRI community regarding that API 
and to what extent there is interest in making that generic.


We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style VM API 
and improve interoperability between graphics and compute. This uses 
DMABufs for sharing buffer objects between KFD and multiple render 
node devices, as well as between processes. In the long run this 
also provides a path to moving all or most memory management from 
the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU 
virtual address mappings. Support will be needed for checkpointing 
GEM buffer objects and handles, their GPU virtual address mappings 
and memory sharing relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including scheduler 
contexts and BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf APIs 
and may have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature on the overall, although 
I cannot answer on the last question here.


I forgot to finish this thought. I cannot answer / don't know of any 
concrete plans, but I think feature is pretty cool and if amdgpu gets 
it working I wouldn't be surprised if other drivers would get interested.


Thanks, that's good to hear!




Funnily enough, it has a tiny relation to an i915 feature I recently 
implemented on Mesa's request, which is to be able to "upload" the 
GPU context from the GPU hang error state and replay the hanging 
request. It is kind of (at a stretch) a very special tiny subset of 
checkout and restore so I am not mentioning it as a curiosity.


And there is also another partial conceptual intersection with the (at 
the moment not yet upstream) i915 online debugger. This part being in 
the area of discovering and enumerating GPU resources belonging to the 
client.


I don't see an immediate design or code sharing opportunities though 
but just mentioning.


I did spend some time reading your plugin and kernel implementation 
out of curiosity and have some comments and questions.


With that out of the way, some considerations for a possible DRM 
CRIU API (either generic of AMDGPU driver specific): The API goes 
through several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can 
allocate

    memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at 
this time)
 2. Resume (final fixups after the VMAs are sorted out, resume 
execution)
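
For illustration only (names invented, not a proposal), the phases above
map naturally onto an op enum, much like the KFD uAPI already does if I
recall correctly:

enum drm_criu_op {
	DRM_CRIU_OP_PROCESS_INFO,	/* object counts/sizes, stop GPU execution */
	DRM_CRIU_OP_CHECKPOINT,		/* dump BO/queue/VM metadata */
	DRM_CRIU_OP_UNPAUSE,		/* resume execution after the checkpoint */
	DRM_CRIU_OP_RESTORE,		/* recreate objects, VMAs not final yet */
	DRM_CRIU_OP_RESUME,		/* final fixups once VMAs are sorted out */
};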


Btw is check-pointing guaranteeing all relevant activity is idled? 
For instance dma_resv objects are free of fences which would need to be 
restored for things to continue executing sensibly? Or how is that 
handled?


In our compute use cases, we suspend user mode queues. This can include 
CWSR (compute-wave-save-restore) where the state of in-flight waves is 
stored in memory and can be reloaded and resumed from memory later. We 
don't use any fences other than "eviction fences", that are signaled 
after the queues are suspended. And those fences are never handed to 
user mode. So we don't need to worry about any fence state in the 
checkpoint.


If we extended this to support the kernel mode command submission APIs, 
I would expect that we'd wait for all current submissions to complete, 
and stop new ones from being sent to the HW before taking the 
checkpoint. When we take the checkpoint in the CRIU plugin, the CPU 
threads are already frozen and cannot submit any more work. If we wai

Re: [PATCH 5/5] drm/v3d: Enable super pages

2024-03-12 Thread Tvrtko Ursulin



Hi Maira,

On 11/03/2024 10:06, Maíra Canal wrote:

The V3D MMU also supports 1MB pages, called super pages. In order to
set a 1MB page in the MMU, we need to make sure that the page table entries
for all 4KB pages within a super page are correctly configured.

Therefore, if the BO is larger than 2MB, we allocate it in a separate
mountpoint that uses THP. This will allow us to create a contiguous
memory region to create our super pages. In order to place the page
table entries in the MMU, we iterate over the 256 4KB pages and insert
the PTE.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/v3d_bo.c| 19 +--
  drivers/gpu/drm/v3d/v3d_drv.c   |  7 +++
  drivers/gpu/drm/v3d/v3d_drv.h   |  6 --
  drivers/gpu/drm/v3d/v3d_gemfs.c |  6 ++
  drivers/gpu/drm/v3d/v3d_mmu.c   | 24 ++--
  5 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_bo.c b/drivers/gpu/drm/v3d/v3d_bo.c
index a07ede668cc1..cb8e49a33be7 100644
--- a/drivers/gpu/drm/v3d/v3d_bo.c
+++ b/drivers/gpu/drm/v3d/v3d_bo.c
@@ -94,6 +94,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
struct v3d_dev *v3d = to_v3d_dev(obj->dev);
struct v3d_bo *bo = to_v3d_bo(obj);
struct sg_table *sgt;
+   u64 align;
int ret;

/* So far we pin the BO in the MMU for its lifetime, so use
@@ -103,6 +104,9 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
if (IS_ERR(sgt))
return PTR_ERR(sgt);

+   bo->huge_pages = (obj->size >= SZ_2M && v3d->super_pages);
+   align = bo->huge_pages ? SZ_1M : SZ_4K;
+
spin_lock(&v3d->mm_lock);
/* Allocate the object's space in the GPU's page tables.
 * Inserting PTEs will happen later, but the offset is for the
@@ -110,7 +114,7 @@ v3d_bo_create_finish(struct drm_gem_object *obj)
 */
ret = drm_mm_insert_node_generic(&v3d->mm, &bo->node,
 obj->size >> V3D_MMU_PAGE_SHIFT,
-GMP_GRANULARITY >> V3D_MMU_PAGE_SHIFT, 
0, 0);
+align >> V3D_MMU_PAGE_SHIFT, 0, 0);
spin_unlock(&v3d->mm_lock);
if (ret)
return ret;
@@ -130,10 +134,21 @@ struct v3d_bo *v3d_bo_create(struct drm_device *dev, 
struct drm_file *file_priv,
 size_t unaligned_size)
  {
struct drm_gem_shmem_object *shmem_obj;
+   struct v3d_dev *v3d = to_v3d_dev(dev);
struct v3d_bo *bo;
+   size_t size;
int ret;

-   shmem_obj = drm_gem_shmem_create(dev, unaligned_size);
+   size = PAGE_ALIGN(unaligned_size);
+
+   /* To avoid memory fragmentation, we only use THP if the BO is bigger
+* than two Super Pages (1MB).
+*/
+   if (size >= SZ_2M && v3d->super_pages)
+   shmem_obj = drm_gem_shmem_create_with_mnt(dev, size, 
v3d->gemfs);
+   else
+   shmem_obj = drm_gem_shmem_create(dev, size);
+
if (IS_ERR(shmem_obj))
return ERR_CAST(shmem_obj);
bo = to_v3d_bo(&shmem_obj->base);
diff --git a/drivers/gpu/drm/v3d/v3d_drv.c b/drivers/gpu/drm/v3d/v3d_drv.c
index 3debf37e7d9b..96f4d8227407 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.c
+++ b/drivers/gpu/drm/v3d/v3d_drv.c
@@ -36,6 +36,11 @@
  #define DRIVER_MINOR 0
  #define DRIVER_PATCHLEVEL 0

+static bool super_pages = true;
+module_param_named(super_pages, super_pages, bool, 0400);
+MODULE_PARM_DESC(super_pages, "Enable/Disable Super Pages support. Note: \
+  To enable Super Pages, you need support to 
THP.");
+
  static int v3d_get_param_ioctl(struct drm_device *dev, void *data,
   struct drm_file *file_priv)
  {
@@ -308,6 +313,8 @@ static int v3d_platform_drm_probe(struct platform_device 
*pdev)
return -ENOMEM;
}

+   v3d->super_pages = super_pages;
+
ret = v3d_gem_init(drm);
if (ret)
goto dma_free;
diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h
index d2ce8222771a..795087663739 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -17,9 +17,8 @@ struct clk;
  struct platform_device;
  struct reset_control;

-#define GMP_GRANULARITY (128 * 1024)
-
  #define V3D_MMU_PAGE_SHIFT 12
+#define V3D_PAGE_FACTOR (PAGE_SIZE >> V3D_MMU_PAGE_SHIFT)

  #define V3D_MAX_QUEUES (V3D_CPU + 1)

@@ -123,6 +122,7 @@ struct v3d_dev {
 * tmpfs instance used for shmem backed objects
 */
struct vfsmount *gemfs;
+   bool super_pages;


One not very important comment just in passing: Does v3d->super_pages == 
!!v3d->gemfs always hold at runtime? Thinking if you really need to add 
v3d->super_pages, or could just infer from v3d->gemfs, maybe via a 
wrapper or whatever pattern is used in v3d.




struct work_struct overflow_mem_work;

@@ -211,6 +211,8 @@ struct v3d_bo {
struct list_head 

Re: [PATCH 2/5] drm/gem: Add a mountpoint parameter to drm_gem_object_init()

2024-03-12 Thread Tvrtko Ursulin



On 12/03/2024 10:37, Christian König wrote:

Am 12.03.24 um 11:31 schrieb Tvrtko Ursulin:


On 12/03/2024 10:23, Christian König wrote:

Am 12.03.24 um 10:30 schrieb Tvrtko Ursulin:


On 12/03/2024 08:59, Christian König wrote:

Am 12.03.24 um 09:51 schrieb Tvrtko Ursulin:


Hi Maira,

On 11/03/2024 10:05, Maíra Canal wrote:
For some applications, such as using huge pages, we might want to 
have a
different mountpoint, for which we pass in mount flags that 
better match

our usecase.

Therefore, add a new parameter to drm_gem_object_init() that 
allow us to

define the tmpfs mountpoint where the GEM object will be created. If
this parameter is NULL, then we fallback to shmem_file_setup().


One strategy for reducing churn, and so the number of drivers this 
patch touches, could be to add a lower level drm_gem_object_init() 
(which takes vfsmount, call it __drm_gem_object_init(), or 
drm__gem_object_init_mnt(), and make drm_gem_object_init() call 
that one with a NULL argument.


I would even go a step further into the other direction. The shmem 
backed GEM object is just some special handling as far as I can see.


So I would rather suggest to rename all drm_gem_* function which 
only deal with the shmem backed GEM object into drm_gem_shmem_*.


That makes sense although it would be very churny. I at least would 
be on the fence regarding the cost vs benefit.


Yeah, it should clearly not be part of this patch here.



Also the explanation why a different mount point helps with 
something isn't very satisfying.


Not satisfying as you think it is not detailed enough to say driver 
wants to use huge pages for performance? Or not satisying as you 
question why huge pages would help?


That huge pages are beneficial is clear to me, but I'm missing the 
connection why a different mount point helps with using huge pages.


Ah right, same as in i915, one needs to mount a tmpfs instance passing 
huge=within_size or huge=always option. Default is 'never', see man 5 
tmpfs.


Thanks for the explanation, I wasn't aware of that.

Mhm, shouldn't we always use huge pages? Is there a reason for a DRM 
device to not use huge pages with the shmem backend?


AFAIU, according to b901bb89324a ("drm/i915/gemfs: enable THP"), back 
then the understanding was within_size may overallocate, meaning there 
would be some space wastage, until the memory pressure makes the thp 
code split the trailing huge page. I haven't checked if that still applies.


Other than that I don't know if some drivers/platforms could have 
problems if they have some limitations or hardcoded assumptions when 
they iterate the sg list.


Te Cc is plenty large so perhaps someone else will have additional 
information. :)


Regards,

Tvrtko



I mean it would make this patch here even smaller.

Regards,
Christian.




Regards,

Tvrtko




Re: [PATCH 2/5] drm/gem: Add a mountpoint parameter to drm_gem_object_init()

2024-03-12 Thread Tvrtko Ursulin



On 12/03/2024 10:23, Christian König wrote:

Am 12.03.24 um 10:30 schrieb Tvrtko Ursulin:


On 12/03/2024 08:59, Christian König wrote:

Am 12.03.24 um 09:51 schrieb Tvrtko Ursulin:


Hi Maira,

On 11/03/2024 10:05, Maíra Canal wrote:
For some applications, such as using huge pages, we might want to 
have a
different mountpoint, for which we pass in mount flags that better 
match

our usecase.

Therefore, add a new parameter to drm_gem_object_init() that allow 
us to

define the tmpfs mountpoint where the GEM object will be created. If
this parameter is NULL, then we fallback to shmem_file_setup().


One strategy for reducing churn, and so the number of drivers this 
patch touches, could be to add a lower level drm_gem_object_init() 
(which takes vfsmount, call it __drm_gem_object_init(), or 
drm__gem_object_init_mnt(), and make drm_gem_object_init() call that 
one with a NULL argument.
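
For illustration, the lower level variant could be little more than this
(sketch only, final name and exact signature up for debate):

int drm_gem_object_init_with_mnt(struct drm_device *dev,
				 struct drm_gem_object *obj, size_t size,
				 struct vfsmount *gemfs)
{
	struct file *filp;

	drm_gem_private_object_init(dev, obj, size);

	if (gemfs)
		filp = shmem_file_setup_with_mnt(gemfs, "drm mm object",
						 size, VM_NORESERVE);
	else
		filp = shmem_file_setup("drm mm object", size, VM_NORESERVE);

	if (IS_ERR(filp))
		return PTR_ERR(filp);

	obj->filp = filp;

	return 0;
}

int drm_gem_object_init(struct drm_device *dev, struct drm_gem_object *obj,
			size_t size)
{
	return drm_gem_object_init_with_mnt(dev, obj, size, NULL);
}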


I would even go a step further into the other direction. The shmem 
backed GEM object is just some special handling as far as I can see.


So I would rather suggest to rename all drm_gem_* function which only 
deal with the shmem backed GEM object into drm_gem_shmem_*.


That makes sense although it would be very churny. I at least would be 
on the fence regarding the cost vs benefit.


Yeah, it should clearly not be part of this patch here.



Also the explanation why a different mount point helps with something 
isn't very satisfying.


Not satisfying as you think it is not detailed enough to say driver 
wants to use huge pages for performance? Or not satisying as you 
question why huge pages would help?


That huge pages are beneficial is clear to me, but I'm missing the 
connection why a different mount point helps with using huge pages.


Ah right, same as in i915, one needs to mount a tmpfs instance passing 
huge=within_size or huge=always option. Default is 'never', see man 5 tmpfs.
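
If memory serves, the i915 gemfs setup boils down to roughly this
(simplified sketch; the caller checks IS_ERR() and falls back to the
default shmem mount on failure):

static struct vfsmount *gemfs_mount_with_thp(void)
{
	struct file_system_type *type = get_fs_type("tmpfs");

	if (!type)
		return ERR_PTR(-ENODEV);

	/* "huge=within_size" or "huge=always"; the default is "never". */
	return vfs_kern_mount(type, SB_KERNMOUNT, type->name,
			      "huge=within_size");
}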


Regards,

Tvrtko


Re: [PATCH 0/5] drm/i915: cleanup dead code

2024-03-12 Thread Tvrtko Ursulin



On 11/03/2024 19:27, Lucas De Marchi wrote:

On Mon, Mar 11, 2024 at 05:43:00PM +, Tvrtko Ursulin wrote:


On 06/03/2024 19:36, Lucas De Marchi wrote:

Remove platforms that never had their PCI IDs added to the driver and
are of course marked with requiring force_probe. Note that most of the
code for those platforms is actually used by subsequent ones, so it's
not a huge amount of code being removed.


I had PVC and xehpsdv back in October but could not collect all acks. :(

Last two patches from https://patchwork.freedesktop.org/series/124705/.


oh... I was actually surprised we still had xehpsdv while removing a
WA for PVC, which made me look into removing these platforms.

rebasing your series and comparing yours..my-v2, where my-v2 only has
patches 2 and 4, I have the diff below. I think it's small enough that I
can just take your commits and squash the delta. Is that OK with you?

my version is a little bit more aggressive, also doing some renames
s/xehpsdv/xehp/ and dropping some more code
(engine_mask_apply_copy_fuses(), unused registers, default ctx, fw
ranges).


Right, yeah I see I missed some case combos in the comments when 
grepping and more.


 diff --git a/Documentation/gpu/rfc/i915_vm_bind.h 
b/Documentation/gpu/rfc/i915_vm_bind.h

 index 8a8fcd4fceac..bc26dc126104 100644
 --- a/Documentation/gpu/rfc/i915_vm_bind.h
 +++ b/Documentation/gpu/rfc/i915_vm_bind.h
 @@ -93,12 +93,11 @@ struct drm_i915_gem_timeline_fence {
   * Multiple VA mappings can be created to the same section of the 
object

   * (aliasing).
   *
 - * The @start, @offset and @length must be 4K page aligned. 
However the DG2
 - * and XEHPSDV has 64K page size for device local memory and has 
compact page
 - * table. On those platforms, for binding device local-memory 
objects, the
 - * @start, @offset and @length must be 64K aligned. Also, UMDs 
should not mix
 - * the local memory 64K page and the system memory 4K page 
bindings in the same

 - * 2M range.
 + * The @start, @offset and @length must be 4K page aligned. 
However the DG2 has
 + * 64K page size for device local memory and has compact page 
table. On that
 + * platform, for binding device local-memory objects, the @start, 
@offset and
 + * @length must be 64K aligned. Also, UMDs should not mix the 
local memory 64K

 + * page and the system memory 4K page bindings in the same 2M range.
   *
   * Error code -EINVAL will be returned if @start, @offset and 
@length are not
   * properly aligned. In version 1 (See 
I915_PARAM_VM_BIND_VERSION), error code
 diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h 
b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h

 index 1495b6074492..d3300ae3053f 100644
 --- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
 +++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
 @@ -386,7 +386,7 @@ struct drm_i915_gem_object {
  * and kernel mode driver for caching policy control after GEN12.
  * In the meantime platform specific tables are created to 
translate
  * i915_cache_level into pat index, for more details check the 
macros

 - * defined i915/i915_pci.c, e.g. TGL_CACHELEVEL.
 + * defined i915/i915_pci.c, e.g. MTL_CACHELEVEL.


Why this?

  * For backward compatibility, this field contains values 
exactly match
  * the entries of enum i915_cache_level for pre-GEN12 platforms 
(See

  * LEGACY_CACHELEVEL), so that the PTE encode functions for these
 diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c 
b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c

 index fa46d2308b0e..1bd0e041e15c 100644
 --- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
 +++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
 @@ -500,11 +500,11 @@ gen8_ppgtt_insert_pte(struct i915_ppgtt *ppgtt,
  }
  static void
 -xehpsdv_ppgtt_insert_huge(struct i915_address_space *vm,
 -  struct i915_vma_resource *vma_res,
 -  struct sgt_dma *iter,
 -  unsigned int pat_index,
 -  u32 flags)
 +xehp_ppgtt_insert_huge(struct i915_address_space *vm,
 +   struct i915_vma_resource *vma_res,
 +   struct sgt_dma *iter,
 +   unsigned int pat_index,
 +   u32 flags)
  {
     const gen8_pte_t pte_encode = vm->pte_encode(0, pat_index, flags);
     unsigned int rem = sg_dma_len(iter->sg);
 @@ -741,8 +741,8 @@ static void gen8_ppgtt_insert(struct 
i915_address_space *vm,

     struct sgt_dma iter = sgt_dma(vma_res);
     if (vma_res->bi.page_sizes.sg > I915_GTT_PAGE_SIZE) {
 -    if (GRAPHICS_VER_FULL(vm->i915) >= IP_VER(12, 50))
 -    xehpsdv_ppgtt_insert_huge(vm, vma_res, &iter, pat_index, flags);

 +    if (GRAPHICS_VER_FULL(vm->i915) >= IP_VER(12, 55))
 +    xehp_ppgtt_insert_huge(vm, vma_

Re: Proposal to add CRIU support to DRM render nodes

2024-03-12 Thread Tvrtko Ursulin



On 11/03/2024 14:48, Tvrtko Ursulin wrote:


Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes in 
order to maintain CRIU support for ROCm application once they start 
relying on render nodes for more GPU memory management. In this email 
I'm providing some background why we are doing this, and outlining 
some of the problems we need to solve to checkpoint and restore render 
node state and shared memory (DMABuf) state. I have some thoughts on 
the API design, leaning on what we did for KFD, but would like to get 
feedback from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address mappings 
in ROCm applications to implement the CUDA11-style VM API and improve 
interoperability between graphics and compute. This uses DMABufs for 
sharing buffer objects between KFD and multiple render node devices, 
as well as between processes. In the long run this also provides a 
path to moving all or most memory management from the KFD ioctl API to 
libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU virtual 
address mappings. Support will be needed for checkpointing GEM buffer 
objects and handles, their GPU virtual address mappings and memory 
sharing relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is desired, 
more state would need to be captured, including scheduler contexts and 
BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design process 
public as this potentially touches DRM GEM and DMABuf APIs and may 
have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


This sounds like a very interesting feature overall, although I 
cannot answer the last question here.


I forgot to finish this thought. I cannot answer / don't know of any 
concrete plans, but I think the feature is pretty cool and if amdgpu gets it 
working I wouldn't be surprised if other drivers would get interested.


Funnily enough, it has a tiny relation to an i915 feature I recently 
implemented on Mesa's request, which is to be able to "upload" the GPU 
context from the GPU hang error state and replay the hanging request. It 
is kind of (at a stretch) a very special tiny subset of checkpoint and 
restore so I am not mentioning it as a curiosity.


And there is also another partial conceptual intersect with the (at the 
moment not yet upstream) i915 online debugger. This part being in the 
area of discovering and enumerating GPU resources belonging to the client.


I don't see any immediate design or code sharing opportunities though, but 
I am just mentioning it.


I did spend some time reading your plugin and kernel implementation out 
of curiosity and have some comments and questions.


With that out of the way, some considerations for a possible DRM CRIU 
API (either generic of AMDGPU driver specific): The API goes through 
several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can allocate
    memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this 
time)

 2. Resume (final fixups after the VMAs are sorted out, resume execution)


Btw is check-pointing guaranteeing all relevant activity is idled? For 
instance dma_resv objects are free of fences which would need to be 
restored for things to continue executing sensibly? Or how is that handled?


For some more background about our implementation in KFD, you can 
refer to this whitepaper: 
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md


Potential objections to a KFD-style CRIU API in DRM render nodes, I'll 
address each of them in more detail below:


  * Opaque information in the checkpoint data that user mode can't
    interpret or do anything with
  * A second API for creating objects (e.g. BOs) that is separate from
    the regular BO creation API
  * Kernel mode would need to be involved in restoring BO sharing
    relationships rather than replaying BO creation, export and import
    from user mode

# Opaque information in the checkpoint

This comes out of ABI compatibility considerations. Adding any new 
objects or attributes to the driver/HW state that needs to be 
checkpointed could potentially break the ABI of the CRIU 
checkpoint/restore ioctl if the pl

Re: [PATCH 2/5] drm/gem: Add a mountpoint parameter to drm_gem_object_init()

2024-03-12 Thread Tvrtko Ursulin



On 12/03/2024 08:59, Christian König wrote:

Am 12.03.24 um 09:51 schrieb Tvrtko Ursulin:


Hi Maira,

On 11/03/2024 10:05, Maíra Canal wrote:

For some applications, such as using huge pages, we might want to have a
different mountpoint, for which we pass in mount flags that better match
our usecase.

Therefore, add a new parameter to drm_gem_object_init() that allows us to
define the tmpfs mountpoint where the GEM object will be created. If
this parameter is NULL, then we fallback to shmem_file_setup().


One strategy for reducing churn, and so the number of drivers this 
patch touches, could be to add a lower level drm_gem_object_init() 
(which takes vfsmount, call it __drm_gem_object_init(), or 
drm__gem_object_init_mnt(), and make drm_gem_object_init() call that 
one with a NULL argument.


I would even go a step further into the other direction. The shmem 
backed GEM object is just some special handling as far as I can see.


So I would rather suggest to rename all drm_gem_* function which only 
deal with the shmem backed GEM object into drm_gem_shmem_*.


That makes sense although it would be very churny. I at least would be 
on the fence regarding the cost vs benefit.


Also the explanation why a different mount point helps with something 
isn't very satisfying.


Not satisfying as you think it is not detailed enough to say driver 
wants to use huge pages for performance? Or not satisfying as you 
question why huge pages would help?


Regards,

Tvrtko


Re: [PATCH 3/5] drm/v3d: Introduce gemfs

2024-03-12 Thread Tvrtko Ursulin



Hi,

On 11/03/2024 10:06, Maíra Canal wrote:

Create a separate "tmpfs" kernel mount for V3D. This will allow us to
move away from the shmemfs `shm_mnt` and gives the flexibility to do
things like set our own mount options. Here, the interest is to use
"huge=", which should allow us to enable the use of THP for our
shmem-backed objects.

Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/v3d/Makefile|  3 ++-
  drivers/gpu/drm/v3d/v3d_drv.h   |  9 +++
  drivers/gpu/drm/v3d/v3d_gem.c   |  3 +++
  drivers/gpu/drm/v3d/v3d_gemfs.c | 46 +
  4 files changed, 60 insertions(+), 1 deletion(-)
  create mode 100644 drivers/gpu/drm/v3d/v3d_gemfs.c

diff --git a/drivers/gpu/drm/v3d/Makefile b/drivers/gpu/drm/v3d/Makefile
index b7d673f1153b..fcf710926057 100644
--- a/drivers/gpu/drm/v3d/Makefile
+++ b/drivers/gpu/drm/v3d/Makefile
@@ -13,7 +13,8 @@ v3d-y := \
v3d_trace_points.o \
v3d_sched.o \
v3d_sysfs.o \
-   v3d_submit.o
+   v3d_submit.o \
+   v3d_gemfs.o

  v3d-$(CONFIG_DEBUG_FS) += v3d_debugfs.o

diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h
index 1950c723dde1..d2ce8222771a 100644
--- a/drivers/gpu/drm/v3d/v3d_drv.h
+++ b/drivers/gpu/drm/v3d/v3d_drv.h
@@ -119,6 +119,11 @@ struct v3d_dev {
struct drm_mm mm;
spinlock_t mm_lock;

+   /*
+* tmpfs instance used for shmem backed objects
+*/
+   struct vfsmount *gemfs;
+
struct work_struct overflow_mem_work;

struct v3d_bin_job *bin_job;
@@ -519,6 +524,10 @@ void v3d_reset(struct v3d_dev *v3d);
  void v3d_invalidate_caches(struct v3d_dev *v3d);
  void v3d_clean_caches(struct v3d_dev *v3d);

+/* v3d_gemfs.c */
+void v3d_gemfs_init(struct v3d_dev *v3d);
+void v3d_gemfs_fini(struct v3d_dev *v3d);
+
  /* v3d_submit.c */
  void v3d_job_cleanup(struct v3d_job *job);
  void v3d_job_put(struct v3d_job *job);
diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
index 66f4b78a6b2e..faefbe497e8d 100644
--- a/drivers/gpu/drm/v3d/v3d_gem.c
+++ b/drivers/gpu/drm/v3d/v3d_gem.c
@@ -287,6 +287,8 @@ v3d_gem_init(struct drm_device *dev)
v3d_init_hw_state(v3d);
v3d_mmu_set_page_table(v3d);

+   v3d_gemfs_init(v3d);
+
ret = v3d_sched_init(v3d);
if (ret) {
drm_mm_takedown(&v3d->mm);
@@ -304,6 +306,7 @@ v3d_gem_destroy(struct drm_device *dev)
struct v3d_dev *v3d = to_v3d_dev(dev);

v3d_sched_fini(v3d);
+   v3d_gemfs_fini(v3d);

/* Waiting for jobs to finish would need to be done before
 * unregistering V3D.
diff --git a/drivers/gpu/drm/v3d/v3d_gemfs.c b/drivers/gpu/drm/v3d/v3d_gemfs.c
new file mode 100644
index ..8518b7da6f73
--- /dev/null
+++ b/drivers/gpu/drm/v3d/v3d_gemfs.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0+
+/* Copyright (C) 2024 Raspberry Pi */
+
+#include 
+#include 
+
+#include "v3d_drv.h"
+
+void v3d_gemfs_init(struct v3d_dev *v3d)
+{
+   char huge_opt[] = "huge=always";


Using 'always' and not 'within_size' is deliberate? It can waste memory 
but indeed could be best for performance. I am just asking and perhaps I 
missed some prior discussion on this.


Regards,

Tvrtko


+   struct file_system_type *type;
+   struct vfsmount *gemfs;
+
+   /*
+* By creating our own shmemfs mountpoint, we can pass in
+* mount flags that better match our usecase. However, we
+* only do so on platforms which benefit from it.
+*/
+   if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+   goto err;
+
+   type = get_fs_type("tmpfs");
+   if (!type)
+   goto err;
+
+   gemfs = vfs_kern_mount(type, SB_KERNMOUNT, type->name, huge_opt);
+   if (IS_ERR(gemfs))
+   goto err;
+
+   v3d->gemfs = gemfs;
+   drm_info(&v3d->drm, "Using Transparent Hugepages\n");
+
+   return;
+
+err:
+   v3d->gemfs = NULL;
+   drm_notice(&v3d->drm,
+  "Transparent Hugepage support is recommended for optimal 
performance on this platform!\n");
+}
+
+void v3d_gemfs_fini(struct v3d_dev *v3d)
+{
+   if (v3d->gemfs)
+   kern_unmount(v3d->gemfs);
+}
--
2.43.0




Re: [PATCH 2/5] drm/gem: Add a mountpoint parameter to drm_gem_object_init()

2024-03-12 Thread Tvrtko Ursulin



Hi Maira,

On 11/03/2024 10:05, Maíra Canal wrote:

For some applications, such as using huge pages, we might want to have a
different mountpoint, for which we pass in mount flags that better match
our usecase.

Therefore, add a new parameter to drm_gem_object_init() that allows us to
define the tmpfs mountpoint where the GEM object will be created. If
this parameter is NULL, then we fallback to shmem_file_setup().


One strategy for reducing churn, and so the number of drivers this patch 
touches, could be to add a lower level drm_gem_object_init() (which 
takes vfsmount, call it __drm_gem_object_init(), or 
drm__gem_object_init_mnt(), and make drm_gem_object_init() call that one 
with a NULL argument.
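
For illustration, that split could look roughly like the below (in the 
context of drm_gem.c; naming and exact shape are just a suggestion, not an 
actual patch):

int __drm_gem_object_init(struct drm_device *dev, struct drm_gem_object *obj,
			  size_t size, struct vfsmount *gemfs)
{
	struct file *filp;

	drm_gem_private_object_init(dev, obj, size);

	/* NULL keeps the historic behaviour of allocating from shm_mnt. */
	if (gemfs)
		filp = shmem_file_setup_with_mnt(gemfs, "drm mm object",
						 size, VM_NORESERVE);
	else
		filp = shmem_file_setup("drm mm object", size, VM_NORESERVE);

	if (IS_ERR(filp))
		return PTR_ERR(filp);

	obj->filp = filp;

	return 0;
}

int drm_gem_object_init(struct drm_device *dev,
			struct drm_gem_object *obj, size_t size)
{
	/* Existing callers stay untouched. */
	return __drm_gem_object_init(dev, obj, size, NULL);
}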


Regards,

Tvrtko



Cc: Russell King 
Cc: Lucas Stach 
Cc: Christian Gmeiner 
Cc: Inki Dae 
Cc: Seung-Woo Kim 
Cc: Kyungmin Park 
Cc: Krzysztof Kozlowski 
Cc: Alim Akhtar 
Cc: Patrik Jakobsson 
Cc: Sui Jingfeng 
Cc: Chun-Kuang Hu 
Cc: Philipp Zabel 
Cc: Matthias Brugger 
Cc: AngeloGioacchino Del Regno 
Cc: Rob Clark 
Cc: Abhinav Kumar 
Cc: Dmitry Baryshkov 
Cc: Sean Paul 
Cc: Marijn Suijten 
Cc: Karol Herbst 
Cc: Lyude Paul 
Cc: Danilo Krummrich 
Cc: Tomi Valkeinen 
Cc: Gerd Hoffmann 
Cc: Sandy Huang 
Cc: "Heiko Stübner" 
Cc: Andy Yan 
Cc: Thierry Reding 
Cc: Mikko Perttunen 
Cc: Jonathan Hunter 
Cc: Christian König 
Cc: Huang Rui 
Cc: Oleksandr Andrushchenko 
Cc: Karolina Stolarek 
Cc: Andi Shyti 
Signed-off-by: Maíra Canal 
---
  drivers/gpu/drm/armada/armada_gem.c   |  2 +-
  drivers/gpu/drm/drm_gem.c | 12 ++--
  drivers/gpu/drm/drm_gem_dma_helper.c  |  2 +-
  drivers/gpu/drm/drm_gem_shmem_helper.c|  2 +-
  drivers/gpu/drm/drm_gem_vram_helper.c |  2 +-
  drivers/gpu/drm/etnaviv/etnaviv_gem.c |  2 +-
  drivers/gpu/drm/exynos/exynos_drm_gem.c   |  2 +-
  drivers/gpu/drm/gma500/gem.c  |  2 +-
  drivers/gpu/drm/loongson/lsdc_ttm.c   |  2 +-
  drivers/gpu/drm/mediatek/mtk_drm_gem.c|  2 +-
  drivers/gpu/drm/msm/msm_gem.c |  2 +-
  drivers/gpu/drm/nouveau/nouveau_gem.c |  2 +-
  drivers/gpu/drm/nouveau/nouveau_prime.c   |  2 +-
  drivers/gpu/drm/omapdrm/omap_gem.c|  2 +-
  drivers/gpu/drm/qxl/qxl_object.c  |  2 +-
  drivers/gpu/drm/rockchip/rockchip_drm_gem.c   |  2 +-
  drivers/gpu/drm/tegra/gem.c   |  2 +-
  drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c |  2 +-
  drivers/gpu/drm/xen/xen_drm_front_gem.c   |  2 +-
  include/drm/drm_gem.h |  3 ++-
  20 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/armada/armada_gem.c 
b/drivers/gpu/drm/armada/armada_gem.c
index 26d10065d534..36a25e667341 100644
--- a/drivers/gpu/drm/armada/armada_gem.c
+++ b/drivers/gpu/drm/armada/armada_gem.c
@@ -226,7 +226,7 @@ static struct armada_gem_object 
*armada_gem_alloc_object(struct drm_device *dev,

obj->obj.funcs = _gem_object_funcs;

-   if (drm_gem_object_init(dev, &obj->obj, size)) {
+   if (drm_gem_object_init(dev, &obj->obj, size, NULL)) {
kfree(obj);
return NULL;
}
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 44a948b80ee1..ddd8777fcda5 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -118,18 +118,26 @@ drm_gem_init(struct drm_device *dev)
   * @dev: drm_device the object should be initialized for
   * @obj: drm_gem_object to initialize
   * @size: object size
+ * @gemfs: tmpfs mount where the GEM object will be created. If NULL, use
+ * the usual tmpfs mountpoint (`shm_mnt`).
   *
   * Initialize an already allocated GEM object of the specified size with
   * shmfs backing store.
   */
  int drm_gem_object_init(struct drm_device *dev,
-   struct drm_gem_object *obj, size_t size)
+   struct drm_gem_object *obj, size_t size,
+   struct vfsmount *gemfs)
  {
struct file *filp;

drm_gem_private_object_init(dev, obj, size);

-   filp = shmem_file_setup("drm mm object", size, VM_NORESERVE);
+   if (gemfs)
+   filp = shmem_file_setup_with_mnt(gemfs, "drm mm object", size,
+VM_NORESERVE);
+   else
+   filp = shmem_file_setup("drm mm object", size, VM_NORESERVE);
+
if (IS_ERR(filp))
return PTR_ERR(filp);

diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c 
b/drivers/gpu/drm/drm_gem_dma_helper.c
index 870b90b78bc4..9ada5ac85dd6 100644
--- a/drivers/gpu/drm/drm_gem_dma_helper.c
+++ b/drivers/gpu/drm/drm_gem_dma_helper.c
@@ -95,7 +95,7 @@ __drm_gem_dma_create(struct drm_device *drm, size_t size, 
bool private)
/* Always use writecombine for dma-buf mappings */
dma_obj->map_noncoherent = false;
} else {
-   ret = drm_gem_object_init(drm, gem_obj, size);
+   ret = 

Re: [PATCH 0/5] drm/i915: cleanup dead code

2024-03-11 Thread Tvrtko Ursulin



On 06/03/2024 19:36, Lucas De Marchi wrote:

Remove platforms that never had their PCI IDs added to the driver and
are of course marked with requiring force_probe. Note that most of the
code for those platforms is actually used by subsequent ones, so it's
not a huge amount of code being removed.


I had PVC and xehpsdv back in October but could not collect all acks. :(

Last two patches from https://patchwork.freedesktop.org/series/124705/.

Regards,

Tvrtko


drivers/gpu/drm/xe/compat-i915-headers/i915_drv.h is also changed on the
xe side, but that should be ok: the defines are there only for compat
reasons while building the display side (and none of these platforms
have display, so it's build-issue only).

First patch is what motivated the others and was submitted alone
@ 20240306144723.1826977-1-lucas.demar...@intel.com .
While looking at this WA I was wondering why we still had some of that
code around.

Build-tested only for now.

Lucas De Marchi (5):
   drm/i915: Drop WA 16015675438
   drm/i915: Drop dead code for xehpsdv
   drm/i915: Update IP_VER(12, 50)
   drm/i915: Drop dead code for pvc
   drm/i915: Remove special handling for !RCS_MASK()

  Documentation/gpu/rfc/i915_vm_bind.h  |  11 +-
  .../gpu/drm/i915/gem/i915_gem_object_types.h  |   2 +-
  .../gpu/drm/i915/gem/selftests/huge_pages.c   |   4 +-
  .../i915/gem/selftests/i915_gem_client_blt.c  |   8 +-
  drivers/gpu/drm/i915/gt/gen8_engine_cs.c  |   5 +-
  drivers/gpu/drm/i915/gt/gen8_ppgtt.c  |  40 ++--
  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  43 +---
  .../drm/i915/gt/intel_execlists_submission.c  |  10 +-
  drivers/gpu/drm/i915/gt/intel_gsc.c   |  15 --
  drivers/gpu/drm/i915/gt/intel_gt.c|   4 +-
  drivers/gpu/drm/i915/gt/intel_gt_mcr.c|  52 +
  drivers/gpu/drm/i915/gt/intel_gt_mcr.h|   2 +-
  drivers/gpu/drm/i915/gt/intel_gt_regs.h   |  59 --
  drivers/gpu/drm/i915/gt/intel_gt_sysfs_pm.c   |  21 +-
  drivers/gpu/drm/i915/gt/intel_gtt.c   |   2 +-
  drivers/gpu/drm/i915/gt/intel_lrc.c   |  51 +
  drivers/gpu/drm/i915/gt/intel_migrate.c   |  22 +-
  drivers/gpu/drm/i915/gt/intel_mocs.c  |  52 +
  drivers/gpu/drm/i915/gt/intel_rps.c   |   6 +-
  drivers/gpu/drm/i915/gt/intel_sseu.c  |  13 +-
  drivers/gpu/drm/i915/gt/intel_workarounds.c   | 193 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc.c|   6 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|   4 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c |   2 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   1 -
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   2 +-
  drivers/gpu/drm/i915/gt/uc/intel_uc.c |   4 -
  drivers/gpu/drm/i915/i915_debugfs.c   |  12 --
  drivers/gpu/drm/i915/i915_drv.h   |  13 --
  drivers/gpu/drm/i915/i915_getparam.c  |   4 +-
  drivers/gpu/drm/i915/i915_gpu_error.c |   5 +-
  drivers/gpu/drm/i915/i915_hwmon.c |   6 -
  drivers/gpu/drm/i915/i915_pci.c   |  61 +-
  drivers/gpu/drm/i915/i915_perf.c  |  19 +-
  drivers/gpu/drm/i915/i915_query.c |   2 +-
  drivers/gpu/drm/i915/i915_reg.h   |   4 +-
  drivers/gpu/drm/i915/intel_clock_gating.c |  26 +--
  drivers/gpu/drm/i915/intel_device_info.c  |   2 -
  drivers/gpu/drm/i915/intel_device_info.h  |   2 -
  drivers/gpu/drm/i915/intel_step.c |  80 +---
  drivers/gpu/drm/i915/intel_uncore.c   | 159 +--
  drivers/gpu/drm/i915/selftests/intel_uncore.c |   3 -
  .../gpu/drm/xe/compat-i915-headers/i915_drv.h |   6 -
  43 files changed, 110 insertions(+), 928 deletions(-)



Re: Proposal to add CRIU support to DRM render nodes

2024-03-11 Thread Tvrtko Ursulin



Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes in 
order to maintain CRIU support for ROCm application once they start 
relying on render nodes for more GPU memory management. In this email 
I'm providing some background why we are doing this, and outlining some 
of the problems we need to solve to checkpoint and restore render node 
state and shared memory (DMABuf) state. I have some thoughts on the API 
design, leaning on what we did for KFD, but would like to get feedback 
from the DRI community regarding that API and to what extent there is 
interest in making that generic.


We are working on using DRM render nodes for virtual address mappings in 
ROCm applications to implement the CUDA11-style VM API and improve 
interoperability between graphics and compute. This uses DMABufs for 
sharing buffer objects between KFD and multiple render node devices, as 
well as between processes. In the long run this also provides a path to 
moving all or most memory management from the KFD ioctl API to libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring ROCm 
applications with CRIU. Currently there is no support for checkpointing 
and restoring render node state, other than CPU virtual address 
mappings. Support will be needed for checkpointing GEM buffer objects 
and handles, their GPU virtual address mappings and memory sharing 
relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is desired, 
more state would need to be captured, including scheduler contexts and 
BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design process 
public as this potentially touches DRM GEM and DMABuf APIs and may have 
implications for other drivers in the future.


One basic question before going into any API details: Is there a desire 
to have CRIU support for other DRM drivers?


This sounds like a very interesting feature overall, although I 
cannot answer the last question here.


Funnily enough, it has a tiny relation to an i915 feature I recently 
implemented on Mesa's request, which is to be able to "upload" the GPU 
context from the GPU hang error state and replay the hanging request. It 
is kind of (at a stretch) a very special tiny subset of checkpoint and 
restore so I am not mentioning it as a curiosity.


And there is also another partial conceptual intersect with the (at the 
moment not yet upstream) i915 online debugger. This part being in the 
area of discovering and enumerating GPU resources belonging to the client.


I don't see any immediate design or code sharing opportunities though, but 
I am just mentioning it.


I did spend some time reading your plugin and kernel implementation out 
of curiosity and have some comments and questions.


With that out of the way, some considerations for a possible DRM CRIU 
API (either generic of AMDGPU driver specific): The API goes through 
several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can allocate
memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this time)
 2. Resume (final fixups after the VMAs are sorted out, resume execution)
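
Purely as an illustration of the shape, a phase based render node ioctl
mirroring the five phases listed above could look roughly like the sketch
below. Every name in it is invented here for illustration and is neither
the existing KFD ABI nor a concrete proposal:

enum example_drm_criu_op {
	EXAMPLE_CRIU_OP_PROCESS_INFO,	/* object counts/sizes, stop GPU execution */
	EXAMPLE_CRIU_OP_CHECKPOINT,	/* store BO/queue metadata */
	EXAMPLE_CRIU_OP_UNPAUSE,	/* resume execution after checkpoint */
	EXAMPLE_CRIU_OP_RESTORE,	/* recreate objects, VMAs not final yet */
	EXAMPLE_CRIU_OP_RESUME,		/* final fixups once VMAs are sorted, run */
};

struct example_drm_criu_args {
	__u64 bos;		/* user pointer to an array of BO descriptors */
	__u64 priv_data;	/* opaque driver specific blob, see below */
	__u64 priv_data_size;
	__u32 num_bos;
	__u32 op;		/* one of example_drm_criu_op */
};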


Btw is check-pointing guaranteeing all relevant activity is idled? For 
instance dma_resv objects are free of fences which would need to be 
restored for things to continue executing sensibly? Or how is that handled?


For some more background about our implementation in KFD, you can refer 
to this whitepaper: 
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md


Potential objections to a KFD-style CRIU API in DRM render nodes, I'll 
address each of them in more detail below:


  * Opaque information in the checkpoint data that user mode can't
interpret or do anything with
  * A second API for creating objects (e.g. BOs) that is separate from
the regular BO creation API
  * Kernel mode would need to be involved in restoring BO sharing
relationships rather than replaying BO creation, export and import
from user mode

# Opaque information in the checkpoint

This comes out of ABI compatibility considerations. Adding any new 
objects or attributes to the driver/HW state that needs to be 
checkpointed could potentially break the ABI of the CRIU 
checkpoint/restore ioctl if the plugin needs to parse that information. 
Therefore, much of the information in our KFD CRIU ioctl API is opaque. 
It is written by kernel mode in the checkpoint, it is consumed by kernel 
mode when restoring the checkpoint, but user mode doesn't care about the 
contents or binary 

Re: [PATCH v3 1/1] drm/panfrost: Replace fdinfo's profiling debugfs knob with sysfs

2024-03-06 Thread Tvrtko Ursulin



On 06/03/2024 01:56, Adrián Larumbe wrote:

Debugfs isn't always available in production builds that try to squeeze
every single byte out of the kernel image, but we still need a way to
toggle the timestamp and cycle counter registers so that jobs can be
profiled for fdinfo's drm engine and cycle calculations.

Drop the debugfs knob and replace it with a sysfs file that accomplishes
the same functionality, and document its ABI in a separate file.

Signed-off-by: Adrián Larumbe 
---
  .../testing/sysfs-driver-panfrost-profiling   | 10 +
  Documentation/gpu/panfrost.rst|  9 
  drivers/gpu/drm/panfrost/Makefile |  2 -
  drivers/gpu/drm/panfrost/panfrost_debugfs.c   | 21 --
  drivers/gpu/drm/panfrost/panfrost_debugfs.h   | 14 ---
  drivers/gpu/drm/panfrost/panfrost_device.h|  2 +-
  drivers/gpu/drm/panfrost/panfrost_drv.c   | 41 ---
  drivers/gpu/drm/panfrost/panfrost_job.c   |  2 +-
  8 files changed, 57 insertions(+), 44 deletions(-)
  create mode 100644 Documentation/ABI/testing/sysfs-driver-panfrost-profiling
  delete mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.c
  delete mode 100644 drivers/gpu/drm/panfrost/panfrost_debugfs.h

diff --git a/Documentation/ABI/testing/sysfs-driver-panfrost-profiling 
b/Documentation/ABI/testing/sysfs-driver-panfrost-profiling
new file mode 100644
index ..1d8bb0978920
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-panfrost-profiling
@@ -0,0 +1,10 @@
+What:  /sys/bus/platform/drivers/panfrost/.../profiling
+Date:  February 2024
+KernelVersion: 6.8.0
+Contact:   Adrian Larumbe 
+Description:
+   Get/set drm fdinfo's engine and cycles profiling status.
+   Valid values are:
+   0: Don't enable fdinfo job profiling sources.
+   1: Enable fdinfo job profiling sources, this enables both the 
GPU's
+  timestamp and cycle counter registers.
\ No newline at end of file
diff --git a/Documentation/gpu/panfrost.rst b/Documentation/gpu/panfrost.rst
index b80e41f4b2c5..51ba375fd80d 100644
--- a/Documentation/gpu/panfrost.rst
+++ b/Documentation/gpu/panfrost.rst
@@ -38,3 +38,12 @@ the currently possible format options:
  
  Possible `drm-engine-` key names are: `fragment`, and  `vertex-tiler`.

  `drm-curfreq-` values convey the current operating frequency for that engine.
+
+Users must bear in mind that engine and cycle sampling are disabled by default,
+because of power saving concerns. `fdinfo` users and benchmark applications 
which
+query the fdinfo file must make sure to toggle the job profiling status of the
+driver by writing into the appropriate sysfs node::
+
+echo <N> > /sys/bus/platform/drivers/panfrost/[a-f0-9]*.gpu/profiling


A late thought - how it would work to not output the inactive fdinfo 
keys when this knob is not enabled?


Generic userspace like gputop already handles that and wouldn't show the 
stat. Which may be more user friendly than showing stats permanently at 
zero. It may be moot once you add the auto-toggle to gputop (or so) but 
perhaps worth considering.
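
Something along these lines, as a rough sketch only (function shape and the
engine time parameters are illustrative, not the actual panfrost code):

static void panfrost_show_profiling_fdinfo(struct panfrost_device *pfdev,
					   struct drm_printer *p,
					   u64 frag_ns, u64 vtiler_ns)
{
	/* Emit nothing while sampling is off, rather than permanent zeros. */
	if (!atomic_read(&pfdev->profile_mode))
		return;

	drm_printf(p, "drm-engine-fragment:\t%llu ns\n", frag_ns);
	drm_printf(p, "drm-engine-vertex-tiler:\t%llu ns\n", vtiler_ns);
}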


Regards,

Tvrtko


+
+Where `N` is either `0` or `1`, depending on the desired enablement status.
diff --git a/drivers/gpu/drm/panfrost/Makefile 
b/drivers/gpu/drm/panfrost/Makefile
index 2c01c1e7523e..7da2b3f02ed9 100644
--- a/drivers/gpu/drm/panfrost/Makefile
+++ b/drivers/gpu/drm/panfrost/Makefile
@@ -12,6 +12,4 @@ panfrost-y := \
panfrost_perfcnt.o \
panfrost_dump.o
  
-panfrost-$(CONFIG_DEBUG_FS) += panfrost_debugfs.o

-
  obj-$(CONFIG_DRM_PANFROST) += panfrost.o
diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.c 
b/drivers/gpu/drm/panfrost/panfrost_debugfs.c
deleted file mode 100644
index 72d4286a6bf7..
--- a/drivers/gpu/drm/panfrost/panfrost_debugfs.c
+++ /dev/null
@@ -1,21 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/* Copyright 2023 Collabora ltd. */
-/* Copyright 2023 Amazon.com, Inc. or its affiliates. */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include "panfrost_device.h"
-#include "panfrost_gpu.h"
-#include "panfrost_debugfs.h"
-
-void panfrost_debugfs_init(struct drm_minor *minor)
-{
-   struct drm_device *dev = minor->dev;
-   struct panfrost_device *pfdev = 
platform_get_drvdata(to_platform_device(dev->dev));
-
-   debugfs_create_atomic_t("profile", 0600, minor->debugfs_root, 
&pfdev->profile_mode);
-}
diff --git a/drivers/gpu/drm/panfrost/panfrost_debugfs.h 
b/drivers/gpu/drm/panfrost/panfrost_debugfs.h
deleted file mode 100644
index c5af5f35877f..
--- a/drivers/gpu/drm/panfrost/panfrost_debugfs.h
+++ /dev/null
@@ -1,14 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Copyright 2023 Collabora ltd.
- * Copyright 2023 Amazon.com, Inc. or its affiliates.
- */
-
-#ifndef PANFROST_DEBUGFS_H
-#define PANFROST_DEBUGFS_H
-
-#ifdef CONFIG_DEBUG_FS
-void panfrost_debugfs_init(struct drm_minor *minor);
-#endif
-
-#endif  /* 

[PATCH] MAINTAINERS: Update email address for Tvrtko Ursulin

2024-02-28 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

I will lose access to my @.*intel.com e-mail addresses soon so let me
adjust the maintainers entry and update the mailmap too.

While at it consolidate a few other of my old emails to point to the
main one.

Signed-off-by: Tvrtko Ursulin 
Cc: Daniel Vetter 
Cc: Dave Airlie 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
---
 .mailmap| 5 +
 MAINTAINERS | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index b99a238ee3bd..d67e351bce8e 100644
--- a/.mailmap
+++ b/.mailmap
@@ -608,6 +608,11 @@ TripleX Chung  
 TripleX Chung  
 Tsuneo Yoshioka 
 Tudor Ambarus  
+Tvrtko Ursulin  
+Tvrtko Ursulin  
+Tvrtko Ursulin  
+Tvrtko Ursulin  
+Tvrtko Ursulin  
 Tycho Andersen  
 Tzung-Bi Shih  
 Uwe Kleine-König 
diff --git a/MAINTAINERS b/MAINTAINERS
index 19f6f8014f94..b940bfe2a692 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10734,7 +10734,7 @@ INTEL DRM I915 DRIVER (Meteor Lake, DG2 and older 
excluding Poulsbo, Moorestown
 M: Jani Nikula 
 M: Joonas Lahtinen 
 M: Rodrigo Vivi 
-M: Tvrtko Ursulin 
+M: Tvrtko Ursulin 
 L: intel-...@lists.freedesktop.org
 S: Supported
 W: https://drm.pages.freedesktop.org/intel-docs/
-- 
2.40.1



[PULL] drm-intel-gt-next

2024-02-28 Thread Tvrtko Ursulin
Hi Dave, Sima,

Last drm-intel-gt-next pull request for 6.9.

There are only two small fixes in there so it could also wait for the
-next-fixes round if that would be preferred. One fix is for a kerneldoc
warning and the other for a very unlikely userptr object creation failure
where cleanup would oops.

Regards,

Tvrtko

drm-intel-gt-next-2024-02-28:
Driver Changes:

Fixes:

- Add some boring kerneldoc (Tvrtko Ursulin)
- Check before removing mm notifier (Nirmoy
The following changes since commit eb927f01dfb6309c8a184593c2c0618c4000c481:

  drm/i915/gt: Restart the heartbeat timer when forcing a pulse (2024-02-14 
17:17:35 -0800)

are available in the Git repository at:

  git://anongit.freedesktop.org/drm/drm-intel tags/drm-intel-gt-next-2024-02-28

for you to fetch changes up to db7bbd13f08774cde0332c705f042e327fe21e73:

  drm/i915: Check before removing mm notifier (2024-02-28 13:11:32 +)


Driver Changes:

Fixes:

- Add some boring kerneldoc (Tvrtko Ursulin)
- Check before removing mm notifier (Nirmoy


Nirmoy Das (1):
  drm/i915: Check before removing mm notifier

Tvrtko Ursulin (1):
  drm/i915: Add some boring kerneldoc

 drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++
 include/uapi/drm/i915_drm.h | 4 
 2 files changed, 7 insertions(+)


Re: [PATCH] drm/i915: check before removing mm notifier

2024-02-28 Thread Tvrtko Ursulin



On 27/02/2024 09:26, Nirmoy Das wrote:

Hi Tvrtko,

On 2/27/2024 10:04 AM, Tvrtko Ursulin wrote:


On 21/02/2024 11:52, Nirmoy Das wrote:

Merged it to drm-intel-gt-next with s/check/Check


Shouldn't this have had:

Fixes: ed29c2691188 ("drm/i915: Fix userptr so we do not have to worry 
about obj->mm.lock, v7.")

Cc:  # v5.13+

?


Yes. Sorry, I missed that. Can we still add the tags?


I've added them and force pushed the branch since commit was still at 
the top.


FYI + Jani, Joonas and Rodrigo

Regards,

Tvrtko




Thanks,

Nirmoy


Regards,

Tvrtko


On 2/19/2024 1:50 PM, Nirmoy Das wrote:

Error in mmu_interval_notifier_insert() can leave a NULL
notifier.mm pointer. Catch that and return early.

Cc: Andi Shyti 
Cc: Shawn Lee 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c

index 0e21ce9d3e5a..61abfb505766 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -349,6 +349,9 @@ i915_gem_userptr_release(struct 
drm_i915_gem_object *obj)

  {
  GEM_WARN_ON(obj->userptr.page_ref);
+    if (!obj->userptr.notifier.mm)
+    return;
+
mmu_interval_notifier_remove(&obj->userptr.notifier);
  obj->userptr.notifier.mm = NULL;
  }


Re: [PATCH v2] drm/i915/guc: Use context hints for GT freq

2024-02-28 Thread Tvrtko Ursulin



On 27/02/2024 23:51, Vinay Belgaumkar wrote:

Allow user to provide a low latency context hint. When set, KMD
sends a hint to GuC which results in special handling for this
context. SLPC will ramp the GT frequency aggressively every time
it switches to this context. The down freq threshold will also be
lower so GuC will ramp down the GT freq for this context more slowly.
We also disable waitboost for this context as that will interfere with
the strategy.

We need to enable the use of SLPC Compute strategy during init, but
it will apply only to contexts that set this bit during context
creation.

Userland can check whether this feature is supported using a new param-
I915_PARAM_HAS_CONTEXT_FREQ_HINTS. This flag is true for all guc submission
enabled platforms as they use SLPC for frequency management.

The Mesa usage model for this flag is here -
https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint
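
For illustration, userspace consumption of this would look roughly like the
below sketch. It is written against the uapi proposed in this patch, so the
two new names are not in any released header yet; error handling is trimmed
and libdrm include paths are assumed:

#include <stdint.h>
#include <xf86drm.h>
#include <i915_drm.h>

static int i915_set_low_latency(int fd, uint32_t ctx_id)
{
	int has_hints = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_CONTEXT_FREQ_HINTS,
		.value = &has_hints,
	};
	struct drm_i915_gem_context_param arg = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_LOW_LATENCY,
	};

	/* Only GuC/SLPC platforms report the hint as supported. */
	if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) || !has_hints)
		return -1;

	/* SLPC will then ramp GT frequency aggressively for this context. */
	return drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &arg);
}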

v2: Rename flags as per review suggestions (Rodrigo, Tvrtko).
Also, use flag bits in intel_context as it allows finer control for
toggling per engine if needed (Tvrtko).

Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Sushma Venkatesh Reddy 
Signed-off-by: Vinay Belgaumkar 
---
  drivers/gpu/drm/i915/gem/i915_gem_context.c   | 15 +++--
  .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
  drivers/gpu/drm/i915/gt/intel_context_types.h |  1 +
  drivers/gpu/drm/i915/gt/intel_rps.c   |  5 +
  .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h | 21 +++
  drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c   | 17 +++
  drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h   |  1 +
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 ++
  drivers/gpu/drm/i915/i915_getparam.c  | 12 +++
  include/uapi/drm/i915_drm.h   | 15 +
  10 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..0799cb0b2803 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -879,6 +879,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private 
*fpriv,
   struct i915_gem_proto_context *pc,
   struct drm_i915_gem_context_param *args)
  {
+   struct drm_i915_private *i915 = fpriv->i915;
int ret = 0;
  
  	switch (args->param) {

@@ -904,6 +905,13 @@ static int set_proto_ctx_param(struct 
drm_i915_file_private *fpriv,
pc->user_flags &= ~BIT(UCONTEXT_BANNABLE);
break;
  
+	case I915_CONTEXT_PARAM_LOW_LATENCY:

+   if (intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+   pc->user_flags |= BIT(UCONTEXT_LOW_LATENCY);
+   else
+   ret = -EINVAL;
+   break;
+
case I915_CONTEXT_PARAM_RECOVERABLE:
if (args->size)
ret = -EINVAL;
@@ -992,6 +1000,9 @@ static int intel_context_set_gem(struct intel_context *ce,
if (sseu.slice_mask && !WARN_ON(ce->engine->class != RENDER_CLASS))
ret = intel_context_reconfigure_sseu(ce, sseu);
  
+	if (test_bit(UCONTEXT_LOW_LATENCY, &ctx->user_flags))

+   set_bit(CONTEXT_LOW_LATENCY, &ce->flags);


Does not need to be atomic so can use __set_bit as higher up in the 
function.



+
return ret;
  }
  
@@ -1630,6 +1641,8 @@ i915_gem_create_context(struct drm_i915_private *i915,

if (vm)
ctx->vm = vm;
  
+	ctx->user_flags = pc->user_flags;

+


Given how most ctx->something assignments are at the bottom of the 
function I would stick a comment here saying something along the lines of "assign 
early for intel_context_set_gem called when creating engines".



mutex_init(>engines_mutex);
if (pc->num_user_engines >= 0) {
i915_gem_context_set_user_engines(ctx);
@@ -1652,8 +1665,6 @@ i915_gem_create_context(struct drm_i915_private *i915,
 * is no remap info, it will be a NOP. */
ctx->remap_slice = ALL_L3_SLICES(i915);
  
-	ctx->user_flags = pc->user_flags;

-
for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp); i++)
ctx->hang_timestamp[i] = jiffies - CONTEXT_FAST_HANG_JIFFIES;
  
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h

index 03bc7f9d191b..b6d97da63d1f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -338,6 +338,7 @@ struct i915_gem_context {
  #define UCONTEXT_BANNABLE 2
  #define UCONTEXT_RECOVERABLE  3
  #define UCONTEXT_PERSISTENCE  4
+#define UCONTEXT_LOW_LATENCY   5
  
  	/**

 * @flags: small set of booleans
diff --git a/drivers/gpu/drm/i915/g

Re: [PATCH] drm/i915: check before removing mm notifier

2024-02-27 Thread Tvrtko Ursulin



On 21/02/2024 11:52, Nirmoy Das wrote:

Merged it to drm-intel-gt-next with s/check/Check


Shouldn't this have had:

Fixes: ed29c2691188 ("drm/i915: Fix userptr so we do not have to worry about 
obj->mm.lock, v7.")
Cc:  # v5.13+

?

Regards,

Tvrtko
 

On 2/19/2024 1:50 PM, Nirmoy Das wrote:

Error in mmu_interval_notifier_insert() can leave a NULL
notifier.mm pointer. Catch that and return early.

Cc: Andi Shyti 
Cc: Shawn Lee 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c

index 0e21ce9d3e5a..61abfb505766 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -349,6 +349,9 @@ i915_gem_userptr_release(struct 
drm_i915_gem_object *obj)

  {
  GEM_WARN_ON(obj->userptr.page_ref);
+    if (!obj->userptr.notifier.mm)
+    return;
+
  mmu_interval_notifier_remove(&obj->userptr.notifier);
  obj->userptr.notifier.mm = NULL;
  }


Re: [PATCH 2/2] drm/i915: Support replaying GPU hangs with captured context image

2024-02-26 Thread Tvrtko Ursulin




On 22/02/2024 21:07, Rodrigo Vivi wrote:

On Wed, Feb 21, 2024 at 02:22:45PM +, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

When debugging GPU hangs Mesa developers are finding it useful to replay
the captured error state against the simulator. But due to various simulator
limitations which prevent replicating all hangs, one step further is being
able to replay against a real GPU.

This is almost doable today with the missing part being able to upload the
captured context image into the driver state prior to executing the
uploaded hanging batch and all the buffers.

To enable this last part we add a new context parameter called
I915_CONTEXT_PARAM_CONTEXT_IMAGE. It follows the existing SSEU
configuration pattern of being able to select which context to apply
against, paired with the actual image and its size.

Since this is adding a new concept of debug only uapi, we hide it behind
a new kconfig option and also require activation with a module parameter.
Together with a warning banner printed at driver load, all those combined
should be sufficient to guard against inadvertently enabling the feature.

In terms of implementation we allow the legacy context set param to be
used since that removes the need to record the per context data in the
proto context, while still allowing flexibility of specifying context
images for any context.

Mesa MR using the uapi can be seen at:
   https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27594

v2:
  * Fix whitespace alignment as per checkpatch.
  * Added warning on userspace misuse.
  * Rebase for extracting ce->default_state shadowing.

Signed-off-by: Tvrtko Ursulin 
Cc: Lionel Landwerlin 
Cc: Carlos Santa 
Cc: Rodrigo Vivi 
Reviewed-by: Rodrigo Vivi  # v1


still valid for v2. Thanks for splitting the patch.


Great, thanks!

Now we need to hear from Lionel if he is still keen to have this. In 
which case some acks or tested by would be good.


Regards,

Tvrtko


---
  drivers/gpu/drm/i915/Kconfig.debug|  17 +++
  drivers/gpu/drm/i915/gem/i915_gem_context.c   | 113 ++
  drivers/gpu/drm/i915/gt/intel_context.c   |   2 +
  drivers/gpu/drm/i915/gt/intel_context.h   |  22 
  drivers/gpu/drm/i915/gt/intel_context_types.h |   1 +
  drivers/gpu/drm/i915/gt/intel_lrc.c   |   3 +-
  .../gpu/drm/i915/gt/intel_ring_submission.c   |   3 +-
  drivers/gpu/drm/i915/i915_params.c|   5 +
  drivers/gpu/drm/i915/i915_params.h|   3 +-
  include/uapi/drm/i915_drm.h   |  27 +
  10 files changed, 193 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.debug 
b/drivers/gpu/drm/i915/Kconfig.debug
index 5b7162076850..32e9f70e91ed 100644
--- a/drivers/gpu/drm/i915/Kconfig.debug
+++ b/drivers/gpu/drm/i915/Kconfig.debug
@@ -16,6 +16,23 @@ config DRM_I915_WERROR
  
  	  If in doubt, say "N".
  
+config DRM_I915_REPLAY_GPU_HANGS_API

+   bool "Enable GPU hang replay userspace API"
+   depends on DRM_I915
+   depends on EXPERT
+   default n
+   help
+ Choose this option if you want to enable special and unstable
+ userspace API used for replaying GPU hangs on a running system.
+
+ This API is intended to be used by userspace graphics stack developers
+ and provides no stability guarantees.
+
+ The API needs to be activated at boot time using the
+ enable_debug_only_api module parameter.
+
+ If in doubt, say "N".
+
  config DRM_I915_DEBUG
bool "Enable additional driver debugging"
depends on DRM_I915
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..481aacbc1772 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -78,6 +78,7 @@
  #include "gt/intel_engine_user.h"
  #include "gt/intel_gpu_commands.h"
  #include "gt/intel_ring.h"
+#include "gt/shmem_utils.h"
  
  #include "pxp/intel_pxp.h"
  
@@ -949,6 +950,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,

case I915_CONTEXT_PARAM_NO_ZEROMAP:
case I915_CONTEXT_PARAM_BAN_PERIOD:
case I915_CONTEXT_PARAM_RINGSIZE:
+   case I915_CONTEXT_PARAM_CONTEXT_IMAGE:
default:
ret = -EINVAL;
break;
@@ -2092,6 +2094,95 @@ static int get_protected(struct i915_gem_context *ctx,
return 0;
  }
  
+static int set_context_image(struct i915_gem_context *ctx,

+struct drm_i915_gem_context_param *args)
+{
+   struct i915_gem_context_param_context_image user;
+   struct intel_context *ce;
+   struct file *shmem_state;
+   unsigned long lookup;
+   void *state;
+   int ret = 0;
+
+   if (!IS_ENABLED(CONFIG_DRM_I915_REPLAY_GPU_HANGS_API))
+   return -EINVAL;
+
+   if (!ctx-

Re: [PATCH] drm/i915/guc: Add Compute context hint

2024-02-26 Thread Tvrtko Ursulin



On 26/02/2024 08:47, Tvrtko Ursulin wrote:


On 23/02/2024 19:25, Rodrigo Vivi wrote:

On Fri, Feb 23, 2024 at 10:31:41AM -0800, Belgaumkar, Vinay wrote:


On 2/23/2024 12:51 AM, Tvrtko Ursulin wrote:


On 22/02/2024 23:31, Belgaumkar, Vinay wrote:


On 2/22/2024 7:32 AM, Tvrtko Ursulin wrote:


On 21/02/2024 21:28, Rodrigo Vivi wrote:

On Wed, Feb 21, 2024 at 09:42:34AM +, Tvrtko Ursulin wrote:


On 21/02/2024 00:14, Vinay Belgaumkar wrote:

Allow user to provide a context hint. When this is set, KMD will
send a hint to GuC which results in special handling for this
context. SLPC will ramp the GT frequency aggressively every time
it switches to this context. The down freq threshold will also be
lower so GuC will ramp down the GT freq for this
context more slowly.
We also disable waitboost for this context as that
will interfere with
the strategy.

We need to enable the use of Compute strategy during SLPC init, 
but

it will apply only to contexts that set this bit during context
creation.

Userland can check whether this feature is supported
using a new param-
I915_PARAM_HAS_COMPUTE_CONTEXT. This flag is true
for all guc submission
enabled platforms since they use SLPC for freq management.

The Mesa usage model for this flag is here -
https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint


This allows for setting it for the whole application,
correct? Upsides,
downsides? Are there any plans for per context?


Currently there's no extension on a high level API
(Vulkan/OpenGL/OpenCL/etc)
that would allow the application to hint for
power/freq/latency. So Mesa cannot
decide when to hint. So their solution was to use .drirc and
make per-application
decision.

I would prefer a high level extension for a more granular
and informative
decision. We need to work with that goal, but for now I don't see 
any

cons on this approach.


In principle yeah it doesn't harm to have the option. I am just
not sure how useful this intermediate step is with its lack
of intra-process granularity.


Cc: Rodrigo Vivi 
Signed-off-by: Vinay Belgaumkar 
---
    drivers/gpu/drm/i915/gem/i915_gem_context.c   |  8 +++
    .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
    drivers/gpu/drm/i915/gt/intel_rps.c   |  8 +++
    .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h |
21 +++
    drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c   |
17 +++
    drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h   |  1 +
    .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  7 +++
    drivers/gpu/drm/i915/i915_getparam.c  | 11 ++
    include/uapi/drm/i915_drm.h   | 15 
+

    9 files changed, 89 insertions(+)

diff --git
a/drivers/gpu/drm/i915/gem/i915_gem_context.c
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..ceab7dbe9b47 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -879,6 +879,7 @@ static int
set_proto_ctx_param(struct drm_i915_file_private
*fpriv,
   struct i915_gem_proto_context *pc,
   struct drm_i915_gem_context_param *args)
    {
+    struct drm_i915_private *i915 = fpriv->i915;
    int ret = 0;
    switch (args->param) {
@@ -904,6 +905,13 @@ static int
set_proto_ctx_param(struct drm_i915_file_private
*fpriv,
    pc->user_flags &= ~BIT(UCONTEXT_BANNABLE);
    break;
+    case I915_CONTEXT_PARAM_IS_COMPUTE:
+    if (!intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+    ret = -EINVAL;
+    else
+    pc->user_flags |= BIT(UCONTEXT_COMPUTE);
+    break;
+
    case I915_CONTEXT_PARAM_RECOVERABLE:
    if (args->size)
    ret = -EINVAL;
diff --git
a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 03bc7f9d191b..db86d6f6245f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -338,6 +338,7 @@ struct i915_gem_context {
    #define UCONTEXT_BANNABLE    2
    #define UCONTEXT_RECOVERABLE    3
    #define UCONTEXT_PERSISTENCE    4
+#define UCONTEXT_COMPUTE    5


What is the GuC behaviour when
SLPC_CTX_FREQ_REQ_IS_COMPUTE is set for
non-compute engines? Wondering if per intel_context is
what we want instead.
(Which could then be the i915_context_param_engines extension to 
mark

individual contexts as compute strategy.)


Perhaps we should rename this? This is a freq-decision-strategy 
inside
GuC that is there mostly targeting compute workloads that needs 
lower

latency with short burst execution. But the engine itself
doesn't matter.
It can be applied to any engine.


I have no idea if it makes sense for other engines, such as
video, and what would be pros and cons in terms of PnP. But in
the case we end up allowing it on any engine, then at least
userspace name shouldn't be

Re: [PATCH] drm/i915/guc: Add Compute context hint

2024-02-26 Thread Tvrtko Ursulin



On 23/02/2024 19:25, Rodrigo Vivi wrote:

On Fri, Feb 23, 2024 at 10:31:41AM -0800, Belgaumkar, Vinay wrote:


On 2/23/2024 12:51 AM, Tvrtko Ursulin wrote:


On 22/02/2024 23:31, Belgaumkar, Vinay wrote:


On 2/22/2024 7:32 AM, Tvrtko Ursulin wrote:


On 21/02/2024 21:28, Rodrigo Vivi wrote:

On Wed, Feb 21, 2024 at 09:42:34AM +, Tvrtko Ursulin wrote:


On 21/02/2024 00:14, Vinay Belgaumkar wrote:

Allow user to provide a context hint. When this is set, KMD will
send a hint to GuC which results in special handling for this
context. SLPC will ramp the GT frequency aggressively every time
it switches to this context. The down freq threshold will also be
lower so GuC will ramp down the GT freq for this
context more slowly.
We also disable waitboost for this context as that
will interfere with
the strategy.

We need to enable the use of Compute strategy during SLPC init, but
it will apply only to contexts that set this bit during context
creation.

Userland can check whether this feature is supported
using a new param-
I915_PARAM_HAS_COMPUTE_CONTEXT. This flag is true
for all guc submission
enabled platforms since they use SLPC for freq management.

The Mesa usage model for this flag is here -
https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint


This allows for setting it for the whole application,
correct? Upsides,
downsides? Are there any plans for per context?


Currently there's no extension on a high level API
(Vulkan/OpenGL/OpenCL/etc)
that would allow the application to hint for
power/freq/latency. So Mesa cannot
decide when to hint. So their solution was to use .drirc and
make per-application
decision.

I would prefer a high level extension for a more granular
and informative
decision. We need to work with that goal, but for now I don't see any
cons on this approach.


In principle yeah it doesn't harm to have the option. I am just
not sure how useful this intermediate step is with its lack
of intra-process granularity.


Cc: Rodrigo Vivi 
Signed-off-by: Vinay Belgaumkar 
---
    drivers/gpu/drm/i915/gem/i915_gem_context.c   |  8 +++
    .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
    drivers/gpu/drm/i915/gt/intel_rps.c   |  8 +++
    .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h |
21 +++
    drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c   |
17 +++
    drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h   |  1 +
    .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  7 +++
    drivers/gpu/drm/i915/i915_getparam.c  | 11 ++
    include/uapi/drm/i915_drm.h   | 15 +
    9 files changed, 89 insertions(+)

diff --git
a/drivers/gpu/drm/i915/gem/i915_gem_context.c
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..ceab7dbe9b47 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -879,6 +879,7 @@ static int
set_proto_ctx_param(struct drm_i915_file_private
*fpriv,
   struct i915_gem_proto_context *pc,
   struct drm_i915_gem_context_param *args)
    {
+    struct drm_i915_private *i915 = fpriv->i915;
    int ret = 0;
    switch (args->param) {
@@ -904,6 +905,13 @@ static int
set_proto_ctx_param(struct drm_i915_file_private
*fpriv,
    pc->user_flags &= ~BIT(UCONTEXT_BANNABLE);
    break;
+    case I915_CONTEXT_PARAM_IS_COMPUTE:
+    if (!intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+    ret = -EINVAL;
+    else
+    pc->user_flags |= BIT(UCONTEXT_COMPUTE);
+    break;
+
    case I915_CONTEXT_PARAM_RECOVERABLE:
    if (args->size)
    ret = -EINVAL;
diff --git
a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 03bc7f9d191b..db86d6f6245f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -338,6 +338,7 @@ struct i915_gem_context {
    #define UCONTEXT_BANNABLE    2
    #define UCONTEXT_RECOVERABLE    3
    #define UCONTEXT_PERSISTENCE    4
+#define UCONTEXT_COMPUTE    5


What is the GuC behaviour when
SLPC_CTX_FREQ_REQ_IS_COMPUTE is set for
non-compute engines? Wondering if per intel_context is
what we want instead.
(Which could then be the i915_context_param_engines extension to mark
individual contexts as compute strategy.)


Perhaps we should rename this? This is a freq-decision-strategy inside
GuC that is there mostly targeting compute workloads that need lower
latency with short burst execution. But the engine itself doesn't matter.
It can be applied to any engine.


I have no idea if it makes sense for other engines, such as
video, and what the pros and cons would be in terms of PnP. But in
the case we end up allowing it on any engine, then at least the
userspace name shouldn't be compute. :)

Yes, one of the suggestions from Daniele was t

Re: [PATCH] drm/i915/guc: Add Compute context hint

2024-02-23 Thread Tvrtko Ursulin



On 22/02/2024 23:31, Belgaumkar, Vinay wrote:


On 2/22/2024 7:32 AM, Tvrtko Ursulin wrote:


On 21/02/2024 21:28, Rodrigo Vivi wrote:

On Wed, Feb 21, 2024 at 09:42:34AM +, Tvrtko Ursulin wrote:


On 21/02/2024 00:14, Vinay Belgaumkar wrote:

Allow user to provide a context hint. When this is set, KMD will
send a hint to GuC which results in special handling for this
context. SLPC will ramp the GT frequency aggressively every time
it switches to this context. The down freq threshold will also be
lower so GuC will ramp down the GT freq for this context more slowly.
We also disable waitboost for this context as that will interfere with
the strategy.

We need to enable the use of Compute strategy during SLPC init, but
it will apply only to contexts that set this bit during context
creation.

Userland can check whether this feature is supported using a new param-
I915_PARAM_HAS_COMPUTE_CONTEXT. This flag is true for all guc submission
enabled platforms since they use SLPC for freq management.

The Mesa usage model for this flag is here -
https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint
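For illustration, a sketch of how userspace could opt a context in at create
time (drmIoctl and the context-create extension chaining are existing uapi;
only the new I915_CONTEXT_PARAM_IS_COMPUTE value comes from this patch, and
the helper name is made up):

#include <stdint.h>
#include <xf86drm.h>
#include <i915_drm.h>

/* Hypothetical helper: create a GEM context with the compute hint set. */
static uint32_t create_compute_ctx(int fd)
{
	struct drm_i915_gem_context_create_ext_setparam p = {
		.base  = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
		.param = {
			.param = I915_CONTEXT_PARAM_IS_COMPUTE, /* proposed by this patch */
			.value = 1,
		},
	};
	struct drm_i915_gem_context_create_ext create = {
		.flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
		.extensions = (uintptr_t)&p,
	};

	if (drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create))
		return 0; /* e.g. rejected when GuC submission is not in use */

	return create.ctx_id;
}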


This allows for setting it for the whole application, correct? Upsides,
downsides? Are there any plans for per context?


Currently there's no extension on a high level API (Vulkan/OpenGL/OpenCL/etc)
that would allow the application to hint for power/freq/latency. So Mesa cannot
decide when to hint. So their solution was to use .drirc and make
per-application decision.

I would prefer a high level extension for a more granular and informative
decision. We need to work with that goal, but for now I don't see any
cons on this approach.


In principle yeah it doesn't harm to have the option. I am just not
sure how useful this intermediate step is with its lack of
intra-process granularity.



Cc: Rodrigo Vivi 
Signed-off-by: Vinay Belgaumkar 
---
   drivers/gpu/drm/i915/gem/i915_gem_context.c   |  8 +++
   .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
   drivers/gpu/drm/i915/gt/intel_rps.c   |  8 +++
   .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h | 21 
+++

   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c   | 17 +++
   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h   |  1 +
   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  7 +++
   drivers/gpu/drm/i915/i915_getparam.c  | 11 ++
   include/uapi/drm/i915_drm.h   | 15 +
   9 files changed, 89 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c

index dcbfe32fd30c..ceab7dbe9b47 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -879,6 +879,7 @@ static int set_proto_ctx_param(struct 
drm_i915_file_private *fpriv,

  struct i915_gem_proto_context *pc,
  struct drm_i915_gem_context_param *args)
   {
+    struct drm_i915_private *i915 = fpriv->i915;
   int ret = 0;
   switch (args->param) {
@@ -904,6 +905,13 @@ static int set_proto_ctx_param(struct 
drm_i915_file_private *fpriv,

   pc->user_flags &= ~BIT(UCONTEXT_BANNABLE);
   break;
+    case I915_CONTEXT_PARAM_IS_COMPUTE:
+    if (!intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+    ret = -EINVAL;
+    else
+    pc->user_flags |= BIT(UCONTEXT_COMPUTE);
+    break;
+
   case I915_CONTEXT_PARAM_RECOVERABLE:
   if (args->size)
   ret = -EINVAL;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h 
b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h

index 03bc7f9d191b..db86d6f6245f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -338,6 +338,7 @@ struct i915_gem_context {
   #define UCONTEXT_BANNABLE    2
   #define UCONTEXT_RECOVERABLE    3
   #define UCONTEXT_PERSISTENCE    4
+#define UCONTEXT_COMPUTE    5


What is the GuC behaviour when SLPC_CTX_FREQ_REQ_IS_COMPUTE is set for
non-compute engines? Wondering if per intel_context is what we want 
instead.

(Which could then be the i915_context_param_engines extension to mark
individual contexts as compute strategy.)


Perhaps we should rename this? This is a freq-decision-strategy inside
GuC that is there mostly targeting compute workloads that need lower
latency with short burst execution. But the engine itself doesn't matter.
It can be applied to any engine.


I have no idea if it makes sense for other engines, such as video, and
what the pros and cons would be in terms of PnP. But in the case we end up
allowing it on any engine, then at least the userspace name shouldn't be
compute. :)
Yes, one of the suggestions from Daniele was to have something along the 
lines of UCONTEXT_HIFREQ or something along those lines so we don't 
confuse it with the Compute Engine.


Okay, but additional qu

Re: [PATCH] drm/i915/guc: Add Compute context hint

2024-02-22 Thread Tvrtko Ursulin



On 21/02/2024 21:28, Rodrigo Vivi wrote:

On Wed, Feb 21, 2024 at 09:42:34AM +, Tvrtko Ursulin wrote:


On 21/02/2024 00:14, Vinay Belgaumkar wrote:

Allow user to provide a context hint. When this is set, KMD will
send a hint to GuC which results in special handling for this
context. SLPC will ramp the GT frequency aggressively every time
it switches to this context. The down freq threshold will also be
lower so GuC will ramp down the GT freq for this context more slowly.
We also disable waitboost for this context as that will interfere with
the strategy.

We need to enable the use of Compute strategy during SLPC init, but
it will apply only to contexts that set this bit during context
creation.

Userland can check whether this feature is supported using a new param-
I915_PARAM_HAS_COMPUTE_CONTEXT. This flag is true for all guc submission
enabled platforms since they use SLPC for freq management.

The Mesa usage model for this flag is here -
https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint


This allows for setting it for the whole application, correct? Upsides,
downsides? Are there any plans for per context?


Currently there's no extension on a high level API (Vulkan/OpenGL/OpenCL/etc)
that would allow the application to hint for power/freq/latency. So Mesa cannot
decide when to hint. So their solution was to use .drirc and make 
per-application
decision.

I would prefer a high level extension for a more granular and informative
decision. We need to work with that goal, but for now I don't see any
cons on this approach.


In principle yeah it doesn't harm to have the option. I am just not sure
how useful this intermediate step is with its lack of intra-process
granularity.



Cc: Rodrigo Vivi 
Signed-off-by: Vinay Belgaumkar 
---
   drivers/gpu/drm/i915/gem/i915_gem_context.c   |  8 +++
   .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
   drivers/gpu/drm/i915/gt/intel_rps.c   |  8 +++
   .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h | 21 +++
   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c   | 17 +++
   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h   |  1 +
   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  7 +++
   drivers/gpu/drm/i915/i915_getparam.c  | 11 ++
   include/uapi/drm/i915_drm.h   | 15 +
   9 files changed, 89 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..ceab7dbe9b47 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -879,6 +879,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private 
*fpriv,
   struct i915_gem_proto_context *pc,
   struct drm_i915_gem_context_param *args)
   {
+   struct drm_i915_private *i915 = fpriv->i915;
int ret = 0;
switch (args->param) {
@@ -904,6 +905,13 @@ static int set_proto_ctx_param(struct 
drm_i915_file_private *fpriv,
pc->user_flags &= ~BIT(UCONTEXT_BANNABLE);
break;
+   case I915_CONTEXT_PARAM_IS_COMPUTE:
+   if (!intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+   ret = -EINVAL;
+   else
+   pc->user_flags |= BIT(UCONTEXT_COMPUTE);
+   break;
+
case I915_CONTEXT_PARAM_RECOVERABLE:
if (args->size)
ret = -EINVAL;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h 
b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 03bc7f9d191b..db86d6f6245f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -338,6 +338,7 @@ struct i915_gem_context {
   #define UCONTEXT_BANNABLE2
   #define UCONTEXT_RECOVERABLE 3
   #define UCONTEXT_PERSISTENCE 4
+#define UCONTEXT_COMPUTE   5


What is the GuC behaviour when SLPC_CTX_FREQ_REQ_IS_COMPUTE is set for
non-compute engines? Wondering if per intel_context is what we want instead.
(Which could then be the i915_context_param_engines extension to mark
individual contexts as compute strategy.)


Perhaps we should rename this? This is a freq-decision-strategy inside
GuC that is there mostly targeting compute workloads that need lower
latency with short burst execution. But the engine itself doesn't matter.
It can be applied to any engine.


I have no idea if it makes sense for other engines, such as video, and
what the pros and cons would be in terms of PnP. But in the case we end up
allowing it on any engine, then at least the userspace name shouldn't be
compute. :)


Or if we decide to call it compute and only apply to compute engines, 
then I would strongly suggest making the uapi per intel_context i.e. the 
set engines extension instead of the GEM context param. Othe

Re: [PATCH 0/1] Always record job cycle and timestamp information

2024-02-21 Thread Tvrtko Ursulin



On 21/02/2024 09:40, Adrián Larumbe wrote:

Hi,

I just wanted to make sure we're on the same page on this matter. So in
Panfrost, and I guess in almost every other single driver out there, HW perf
counters and their uapi interface are orthogonal to fdinfo's reporting on drm
engine utilisation.

At the moment it seems like HW perfcounters and the way they're exposed to UM
are very idiosyncratic and any attempt to unify their interface into a common
set of ioctl's sounds like a gargantuan task I wouldn't like to be faced with.


I share the same feeling on this sub-topic.


As for fdinfo, I guess there's more room for coming up with common helpers that
could handle the toggling of HW support for drm engine calculations, but I'd at
least have to see how things are being done in let's say, Freedreno or Intel.


For Intel we don't need this ability, well at least for pre-GuC 
platforms. Stat collection is super cheap and permanently enabled there.


But let me copy Umesh because something at the back of my mind is 
telling me that perhaps there was something expensive about collecting 
these stats with the GuC backend? If so maybe a toggle would be 
beneficial there.



Right now there's a pressing need to get rid of the debugfs knob for fdinfo's
drm engine profiling sources in Panfrost, after which I could perhaps draw up an
RFC for how to generalise this onto other drivers.


There is a knob currently meaning fdinfo does not work by default? If 
that is so, I would have at least expected someone had submitted a patch 
for gputop to handle this toggle. It being kind of a common reference 
implementation I don't think it is great if it does not work out of the box.


The toggle as an idea sounds a bit annoying, but if there is no other 
realistic way maybe it is not too bad. As long as it is documented in 
the drm-usage-stats.rst, doesn't live in debugfs, and has some common 
plumbing implemented both on the kernel side and for the aforementioned 
gputop / igt_drm_fdinfo / igt_drm_clients. Where and how exactly TBD.


Regards,

Tvrtko



On 16.02.2024 17:43, Tvrtko Ursulin wrote:


On 16/02/2024 16:57, Daniel Vetter wrote:

On Wed, Feb 14, 2024 at 01:52:05PM +, Steven Price wrote:

Hi Adrián,

On 14/02/2024 12:14, Adrián Larumbe wrote:

A driver user expressed interest in being able to access engine usage stats
through fdinfo when debugfs is not built into their kernel. In the current
implementation, this wasn't possible, because it was assumed even for
inflight jobs enabling the cycle counter and timestamp registers would
incur in additional power consumption, so both were kept disabled until
toggled through debugfs.

A second read of the TRM made me think otherwise, but this is something
that would be best clarified by someone from ARM's side.


I'm afraid I can't give a definitive answer. This will probably vary
depending on implementation. The command register enables/disables
"propagation" of the cycle/timestamp values. This propagation will cost
some power (gates are getting toggled) but whether that power is
completely in the noise of the GPU as a whole I can't say.

The out-of-tree kbase driver only enables the counters for jobs
explicitly marked (BASE_JD_REQ_PERMON) or due to an explicit connection
from a profiler.

I'd be happier moving the debugfs file to sysfs rather than assuming
that the power consumption is small enough for all platforms.

Ideally we'd have some sort of kernel interface for a profiler to inform
the kernel what it is interested in, but I can't immediately see how to
make that useful across different drivers. kbase's profiling support is
great with our profiling tools, but there's a very strong connection
between the two.


Yeah I'm not sure whether a magic (worse probably per-driver massively
different) file in sysfs is needed to enable gpu perf monitoring stats in
fdinfo.

I get that we do have a bit a gap because the linux perf pmu stuff is
global, and you want per-process, and there's kinda no per-process support
for perf stats for devices. But that's probably the direction we want to
go, not so much fdinfo. At least for hardware performance counters and
things like that.

Iirc the i915 pmu support had some integration for per-process support,
you might want to chat with Tvrtko for kernel side and Lionel for more
userspace side. At least if I'm not making a complete mess and my memory
is vaguely related to reality. Adding them both.


Yeah there are two separate things, i915 PMU and i915 Perf/OA.

If my memory serves me right I indeed did have a per-process support for i915
PMU implemented as an RFC (or at least a branch somewhere) some years back.
IIRC it only exposed the per engine GPU utilisation and I did not find it very
useful versus the complexity. (I think it at least required maintaining a map
of drm clients per task.)

Our more useful profiling is using a custom Perf/OA interface (Observation
Architecture) which is possibly similar to kbase mentioned 

[PATCH 1/2] drm/i915: Shadow default engine context image in the context

2024-02-21 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

To enable adding override of the default engine context image let us start
shadowing the per engine state in the context.

Signed-off-by: Tvrtko Ursulin 
Cc: Lionel Landwerlin 
Cc: Carlos Santa 
Cc: Rodrigo Vivi 
---
 drivers/gpu/drm/i915/gt/intel_context_types.h   | 2 ++
 drivers/gpu/drm/i915/gt/intel_lrc.c | 7 ---
 drivers/gpu/drm/i915/gt/intel_ring_submission.c | 7 ---
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h 
b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 7eccbd70d89f..b179178680a5 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -99,6 +99,8 @@ struct intel_context {
struct i915_address_space *vm;
struct i915_gem_context __rcu *gem_context;
 
+   struct file *default_state;
+
/*
 * @signal_lock protects the list of requests that need signaling,
 * @signals. While there are any requests that need signaling,
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 7c367ba8d9dc..d4eb822d20ae 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1060,9 +1060,8 @@ void lrc_init_state(struct intel_context *ce,
 
set_redzone(state, engine);
 
-   if (engine->default_state) {
-   shmem_read(engine->default_state, 0,
-  state, engine->context_size);
+   if (ce->default_state) {
+   shmem_read(ce->default_state, 0, state, engine->context_size);
__set_bit(CONTEXT_VALID_BIT, &ce->flags);
inhibit = false;
}
@@ -1174,6 +1173,8 @@ int lrc_alloc(struct intel_context *ce, struct 
intel_engine_cs *engine)
 
GEM_BUG_ON(ce->state);
 
+   ce->default_state = engine->default_state;
+
vma = __lrc_alloc_state(ce, engine);
if (IS_ERR(vma))
return PTR_ERR(vma);
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c 
b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
index 92085ffd23de..8625e88e785f 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
@@ -474,8 +474,7 @@ static int ring_context_init_default_state(struct 
intel_context *ce,
if (IS_ERR(vaddr))
return PTR_ERR(vaddr);
 
-   shmem_read(ce->engine->default_state, 0,
-  vaddr, ce->engine->context_size);
+   shmem_read(ce->default_state, 0, vaddr, ce->engine->context_size);
 
i915_gem_object_flush_map(obj);
__i915_gem_object_release_map(obj);
@@ -491,7 +490,7 @@ static int ring_context_pre_pin(struct intel_context *ce,
struct i915_address_space *vm;
int err = 0;
 
-   if (ce->engine->default_state &&
+   if (ce->default_state &&
    !test_bit(CONTEXT_VALID_BIT, &ce->flags)) {
err = ring_context_init_default_state(ce, ww);
if (err)
@@ -570,6 +569,8 @@ static int ring_context_alloc(struct intel_context *ce)
 {
struct intel_engine_cs *engine = ce->engine;
 
+   ce->default_state = engine->default_state;
+
/* One ringbuffer to rule them all */
GEM_BUG_ON(!engine->legacy.ring);
ce->ring = engine->legacy.ring;
-- 
2.40.1



[PATCH 2/2] drm/i915: Support replaying GPU hangs with captured context image

2024-02-21 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

When debugging GPU hangs Mesa developers are finding it useful to replay
the captured error state against the simulator. But due to various simulator
limitations which prevent replicating all hangs, one step further is being
able to replay against a real GPU.

This is almost doable today with the missing part being able to upload the
captured context image into the driver state prior to executing the
uploaded hanging batch and all the buffers.

To enable this last part we add a new context parameter called
I915_CONTEXT_PARAM_CONTEXT_IMAGE. It follows the existing SSEU
configuration pattern of being able to select which context to apply
against, paired with the actual image and its size.
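Purely for illustration, the intended flow from a replay tool would look
roughly like below; the engine/image/size field names are assumptions based
on the SSEU analogy (only the param name, the mbz rule and the SETPARAM
plumbing come from this patch), and fd/ctx_id/state_blob/state_size are
placeholders:

	struct i915_gem_context_param_context_image img = {
		.engine = { .engine_class = I915_ENGINE_CLASS_RENDER,
			    .engine_instance = 0 },
		.image  = (uintptr_t)state_blob, /* image taken from the error dump */
		.size   = state_size,
		/* .mbz must stay zero or the kernel rejects the call */
	};
	struct drm_i915_gem_context_param arg = {
		.ctx_id = ctx_id,
		.param  = I915_CONTEXT_PARAM_CONTEXT_IMAGE,
		.size   = sizeof(img),
		.value  = (uintptr_t)&img,
	};

	drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &arg);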

Since this is adding a new concept of debug only uapi, we hide it behind
a new kconfig option and also require activation with a module parameter.
Together with a warning banner printed at driver load, all those combined
should be sufficient to guard against inadvertently enabling the feature.

In terms of implementation we allow the legacy context set param to be
used since that removes the need to record the per context data in the
proto context, while still allowing flexibility of specifying context
images for any context.

Mesa MR using the uapi can be seen at:
  https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27594

v2:
 * Fix whitespace alignment as per checkpatch.
 * Added warning on userspace misuse.
 * Rebase for extracting ce->default_state shadowing.

Signed-off-by: Tvrtko Ursulin 
Cc: Lionel Landwerlin 
Cc: Carlos Santa 
Cc: Rodrigo Vivi 
Reviewed-by: Rodrigo Vivi  # v1
---
 drivers/gpu/drm/i915/Kconfig.debug|  17 +++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 113 ++
 drivers/gpu/drm/i915/gt/intel_context.c   |   2 +
 drivers/gpu/drm/i915/gt/intel_context.h   |  22 
 drivers/gpu/drm/i915/gt/intel_context_types.h |   1 +
 drivers/gpu/drm/i915/gt/intel_lrc.c   |   3 +-
 .../gpu/drm/i915/gt/intel_ring_submission.c   |   3 +-
 drivers/gpu/drm/i915/i915_params.c|   5 +
 drivers/gpu/drm/i915/i915_params.h|   3 +-
 include/uapi/drm/i915_drm.h   |  27 +
 10 files changed, 193 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.debug 
b/drivers/gpu/drm/i915/Kconfig.debug
index 5b7162076850..32e9f70e91ed 100644
--- a/drivers/gpu/drm/i915/Kconfig.debug
+++ b/drivers/gpu/drm/i915/Kconfig.debug
@@ -16,6 +16,23 @@ config DRM_I915_WERROR
 
  If in doubt, say "N".
 
+config DRM_I915_REPLAY_GPU_HANGS_API
+   bool "Enable GPU hang replay userspace API"
+   depends on DRM_I915
+   depends on EXPERT
+   default n
+   help
+ Choose this option if you want to enable special and unstable
+ userspace API used for replaying GPU hangs on a running system.
+
+ This API is intended to be used by userspace graphics stack developers
+ and provides no stability guarantees.
+
+ The API needs to be activated at boot time using the
+ enable_debug_only_api module parameter.
+
+ If in doubt, say "N".
+
 config DRM_I915_DEBUG
bool "Enable additional driver debugging"
depends on DRM_I915
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..481aacbc1772 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -78,6 +78,7 @@
 #include "gt/intel_engine_user.h"
 #include "gt/intel_gpu_commands.h"
 #include "gt/intel_ring.h"
+#include "gt/shmem_utils.h"
 
 #include "pxp/intel_pxp.h"
 
@@ -949,6 +950,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private 
*fpriv,
case I915_CONTEXT_PARAM_NO_ZEROMAP:
case I915_CONTEXT_PARAM_BAN_PERIOD:
case I915_CONTEXT_PARAM_RINGSIZE:
+   case I915_CONTEXT_PARAM_CONTEXT_IMAGE:
default:
ret = -EINVAL;
break;
@@ -2092,6 +2094,95 @@ static int get_protected(struct i915_gem_context *ctx,
return 0;
 }
 
+static int set_context_image(struct i915_gem_context *ctx,
+struct drm_i915_gem_context_param *args)
+{
+   struct i915_gem_context_param_context_image user;
+   struct intel_context *ce;
+   struct file *shmem_state;
+   unsigned long lookup;
+   void *state;
+   int ret = 0;
+
+   if (!IS_ENABLED(CONFIG_DRM_I915_REPLAY_GPU_HANGS_API))
+   return -EINVAL;
+
+   if (!ctx->i915->params.enable_debug_only_api)
+   return -EINVAL;
+
+   if (args->size < sizeof(user))
+   return -EINVAL;
+
+   if (copy_from_user(&user, u64_to_user_ptr(args->value), sizeof(user)))
+   return -EFAULT;
+
+   if (user.mbz)
+   return -EINVAL;
+
+   if

[PATCH v2 0/2] GPU hang replay

2024-02-21 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Please see 2/2 for explanation and rationale.

v2:
 * Extracted shadowing of default state into a leading patch.

Tvrtko Ursulin (2):
  drm/i915: Shadow default engine context image in the context
  drm/i915: Support replaying GPU hangs with captured context image

 drivers/gpu/drm/i915/Kconfig.debug|  17 +++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 113 ++
 drivers/gpu/drm/i915/gt/intel_context.c   |   2 +
 drivers/gpu/drm/i915/gt/intel_context.h   |  22 
 drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
 drivers/gpu/drm/i915/gt/intel_lrc.c   |   8 +-
 .../gpu/drm/i915/gt/intel_ring_submission.c   |   8 +-
 drivers/gpu/drm/i915/i915_params.c|   5 +
 drivers/gpu/drm/i915/i915_params.h|   3 +-
 include/uapi/drm/i915_drm.h   |  27 +
 10 files changed, 201 insertions(+), 7 deletions(-)

-- 
2.40.1



Re: [PATCH v2 2/2] drm/i915/gt: Enable only one CCS for compute workload

2024-02-21 Thread Tvrtko Ursulin




On 21/02/2024 12:08, Tvrtko Ursulin wrote:


On 21/02/2024 11:19, Andi Shyti wrote:

Hi Tvrtko,

On Wed, Feb 21, 2024 at 08:19:34AM +, Tvrtko Ursulin wrote:

On 21/02/2024 00:14, Andi Shyti wrote:

On Tue, Feb 20, 2024 at 02:48:31PM +, Tvrtko Ursulin wrote:

On 20/02/2024 14:35, Andi Shyti wrote:

Enable only one CCS engine by default with all the compute sices


slices


Thanks!

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c 
b/drivers/gpu/drm/i915/gt/intel_engine_user.c

index 833987015b8b..7041acc77810 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -243,6 +243,15 @@ void intel_engines_driver_register(struct 
drm_i915_private *i915)

    if (engine->uabi_class == I915_NO_UABI_CLASS)
    continue;
+    /*
+ * Do not list and do not count CCS engines other than 
the first

+ */
+    if (engine->uabi_class == I915_ENGINE_CLASS_COMPUTE &&
+    engine->uabi_instance > 0) {
+    i915->engine_uabi_class_count[engine->uabi_class]--;
+    continue;
+    }


It's a bit ugly to decrement after increment, instead of somehow
restructuring the loop to satisfy both cases more elegantly.


yes, agree, indeed I had a hard time here to accept this change
myself.

But moving the check above where the counter was incremented it
would have been much uglier.

This check looks ugly everywhere you place it :-)


One idea would be to introduce a separate local counter array for
name_instance, so as not to use i915->engine_uabi_class_count[]. The first
one increments for every engine, the second only for the exposed ones. That
way it feels like it wouldn't be too ugly.


Ah... you mean that whenever we change the CCS mode, we update
the indexes of the exposed engines from the list of the real engines.
Will try.

My approach was to regenerate the list every time the CCS mode was
changed, but your suggestion looks a bit simpler.


No, I meant just for this first stage of permanently single engine. For 
avoiding the decrement after increment. Something like this, but not 
compile tested even:


diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c 
b/drivers/gpu/drm/i915/gt/intel_engine_user.c

index 833987015b8b..4c33f30612c4 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -203,7 +203,8 @@ static void engine_rename(struct intel_engine_cs 
*engine, const char *name, u16


  void intel_engines_driver_register(struct drm_i915_private *i915)
  {
-   u16 name_instance, other_instance = 0;
+   u16 class_instance[I915_LAST_UABI_ENGINE_CLASS + 2] = { };
+   u16 uabi_class, other_instance = 0;
     struct legacy_ring ring = {};
     struct list_head *it, *next;
     struct rb_node **p, *prev;
@@ -222,15 +223,14 @@ void intel_engines_driver_register(struct 
drm_i915_private *i915)


     GEM_BUG_ON(engine->class >= ARRAY_SIZE(uabi_classes));
     engine->uabi_class = uabi_classes[engine->class];
+
     if (engine->uabi_class == I915_NO_UABI_CLASS) {
-   name_instance = other_instance++;
-   } else {
-   GEM_BUG_ON(engine->uabi_class >=
-  ARRAY_SIZE(i915->engine_uabi_class_count));
-   name_instance =
-   i915->engine_uabi_class_count[engine->uabi_class]++;

-   }
-   engine->uabi_instance = name_instance;
+   uabi_class = I915_LAST_UABI_ENGINE_CLASS + 1;
+   else
+   uabi_class = engine->uabi_class;
+
+   GEM_BUG_ON(uabi_class >= ARRAY_SIZE(class_instance));
+   engine->uabi_instance = class_instance[uabi_class]++;

     /*
  * Replace the internal name with the final user and 
log facing
@@ -238,11 +238,15 @@ void intel_engines_driver_register(struct 
drm_i915_private *i915)

  */
     engine_rename(engine,
   intel_engine_class_repr(engine->class),
- name_instance);
+ engine->uabi_instance);

-   if (engine->uabi_class == I915_NO_UABI_CLASS)
+   if (uabi_class == I915_NO_UABI_CLASS)
     continue;


Here you just add the ccs skip condition.

Anyway.. I rushed it a bit so see what you think.

Regards,

Tvrtko



+   GEM_BUG_ON(uabi_class >=
+  ARRAY_SIZE(i915->engine_uabi_class_count));
+   i915->engine_uabi_class_count[uabi_class]++;
+
     rb_link_node(&engine->uabi_node, prev, p);
     rb_insert_color(&engine->uabi_node, &i915->uabi_engines);



In any case, I'm working on a patch that is splitting this
function in two parts

Re: [PATCH v2 2/2] drm/i915/gt: Enable only one CCS for compute workload

2024-02-21 Thread Tvrtko Ursulin



On 21/02/2024 11:19, Andi Shyti wrote:

Hi Tvrtko,

On Wed, Feb 21, 2024 at 08:19:34AM +, Tvrtko Ursulin wrote:

On 21/02/2024 00:14, Andi Shyti wrote:

On Tue, Feb 20, 2024 at 02:48:31PM +, Tvrtko Ursulin wrote:

On 20/02/2024 14:35, Andi Shyti wrote:

Enable only one CCS engine by default with all the compute sices


slices


Thanks!


diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c 
b/drivers/gpu/drm/i915/gt/intel_engine_user.c
index 833987015b8b..7041acc77810 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -243,6 +243,15 @@ void intel_engines_driver_register(struct drm_i915_private 
*i915)
if (engine->uabi_class == I915_NO_UABI_CLASS)
continue;
+   /*
+* Do not list and do not count CCS engines other than the first
+*/
+   if (engine->uabi_class == I915_ENGINE_CLASS_COMPUTE &&
+   engine->uabi_instance > 0) {
+   i915->engine_uabi_class_count[engine->uabi_class]--;
+   continue;
+   }


It's a bit ugly to decrement after increment, instead of somehow
restructuring the loop to satisfy both cases more elegantly.


yes, agree, indeed I had a hard time here to accept this change
myself.

But moving the check above where the counter was incremented it
would have been much uglier.

This check looks ugly everywhere you place it :-)


One idea would be to introduce a separate local counter array for
name_instance, so as not to use i915->engine_uabi_class_count[]. The first
one increments for every engine, the second only for the exposed ones. That
way it feels like it wouldn't be too ugly.


Ah... you mean that whenever we change the CCS mode, we update
the indexes of the exposed engines from the list of the real engines.
Will try.

My approach was to regenerate the list every time the CCS mode was
changed, but your suggestion looks a bit simpler.


No, I meant just for this first stage of permanently single engine. For 
avoiding the decrement after increment. Something like this, but not compile 
tested even:

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c 
b/drivers/gpu/drm/i915/gt/intel_engine_user.c
index 833987015b8b..4c33f30612c4 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -203,7 +203,8 @@ static void engine_rename(struct intel_engine_cs *engine, 
const char *name, u16
 
 void intel_engines_driver_register(struct drm_i915_private *i915)

 {
-   u16 name_instance, other_instance = 0;
+   u16 class_instance[I915_LAST_UABI_ENGINE_CLASS + 2] = { };
+   u16 uabi_class, other_instance = 0;
struct legacy_ring ring = {};
struct list_head *it, *next;
struct rb_node **p, *prev;
@@ -222,15 +223,14 @@ void intel_engines_driver_register(struct 
drm_i915_private *i915)
 
GEM_BUG_ON(engine->class >= ARRAY_SIZE(uabi_classes));

engine->uabi_class = uabi_classes[engine->class];
+
if (engine->uabi_class == I915_NO_UABI_CLASS) {
-   name_instance = other_instance++;
-   } else {
-   GEM_BUG_ON(engine->uabi_class >=
-  ARRAY_SIZE(i915->engine_uabi_class_count));
-   name_instance =
-   i915->engine_uabi_class_count[engine->uabi_class]++;
-   }
-   engine->uabi_instance = name_instance;
+   uabi_class = I915_LAST_UABI_ENGINE_CLASS + 1;
+   else
+   uabi_class = engine->uabi_class;
+
+   GEM_BUG_ON(uabi_class >= ARRAY_SIZE(class_instance));
+   engine->uabi_instance = class_instance[uabi_class]++;
 
/*

 * Replace the internal name with the final user and log facing
@@ -238,11 +238,15 @@ void intel_engines_driver_register(struct 
drm_i915_private *i915)
 */
engine_rename(engine,
  intel_engine_class_repr(engine->class),
- name_instance);
+ engine->uabi_instance);
 
-   if (engine->uabi_class == I915_NO_UABI_CLASS)

+   if (uabi_class == I915_NO_UABI_CLASS)
continue;
 
+   GEM_BUG_ON(uabi_class >=

+  ARRAY_SIZE(i915->engine_uabi_class_count));
+   i915->engine_uabi_class_count[uabi_class]++;
+
rb_link_node(&engine->uabi_node, prev, p);
rb_insert_color(&engine->uabi_node, &i915->uabi_engines);



In any case, I'm working on a patch that is splitting this
function in two parts and there is some refactoring happening
here (for the first initialization and the dynamic update).

Please 

Re: [PATCH] drm/i915/guc: Add Compute context hint

2024-02-21 Thread Tvrtko Ursulin



On 21/02/2024 00:14, Vinay Belgaumkar wrote:

Allow user to provide a context hint. When this is set, KMD will
send a hint to GuC which results in special handling for this
context. SLPC will ramp the GT frequency aggressively every time
it switches to this context. The down freq threshold will also be
lower so GuC will ramp down the GT freq for this context more slowly.
We also disable waitboost for this context as that will interfere with
the strategy.

We need to enable the use of Compute strategy during SLPC init, but
it will apply only to contexts that set this bit during context
creation.

Userland can check whether this feature is supported using a new param-
I915_PARAM_HAS_COMPUTE_CONTEXT. This flag is true for all guc submission
enabled platforms since they use SLPC for freq management.

The Mesa usage model for this flag is here -
https://gitlab.freedesktop.org/sushmave/mesa/-/commits/compute_hint


This allows for setting it for the whole application, correct? Upsides, 
downsides? Are there any plans for per context?



Cc: Rodrigo Vivi 
Signed-off-by: Vinay Belgaumkar 
---
  drivers/gpu/drm/i915/gem/i915_gem_context.c   |  8 +++
  .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
  drivers/gpu/drm/i915/gt/intel_rps.c   |  8 +++
  .../drm/i915/gt/uc/abi/guc_actions_slpc_abi.h | 21 +++
  drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c   | 17 +++
  drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.h   |  1 +
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  7 +++
  drivers/gpu/drm/i915/i915_getparam.c  | 11 ++
  include/uapi/drm/i915_drm.h   | 15 +
  9 files changed, 89 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..ceab7dbe9b47 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -879,6 +879,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private 
*fpriv,
   struct i915_gem_proto_context *pc,
   struct drm_i915_gem_context_param *args)
  {
+   struct drm_i915_private *i915 = fpriv->i915;
int ret = 0;
  
  	switch (args->param) {

@@ -904,6 +905,13 @@ static int set_proto_ctx_param(struct 
drm_i915_file_private *fpriv,
pc->user_flags &= ~BIT(UCONTEXT_BANNABLE);
break;
  
+	case I915_CONTEXT_PARAM_IS_COMPUTE:

+   if (!intel_uc_uses_guc_submission(&to_gt(i915)->uc))
+   ret = -EINVAL;
+   else
+   pc->user_flags |= BIT(UCONTEXT_COMPUTE);
+   break;
+
case I915_CONTEXT_PARAM_RECOVERABLE:
if (args->size)
ret = -EINVAL;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h 
b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index 03bc7f9d191b..db86d6f6245f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -338,6 +338,7 @@ struct i915_gem_context {
  #define UCONTEXT_BANNABLE 2
  #define UCONTEXT_RECOVERABLE  3
  #define UCONTEXT_PERSISTENCE  4
+#define UCONTEXT_COMPUTE   5


What is the GuC behaviour when SLPC_CTX_FREQ_REQ_IS_COMPUTE is set for 
non-compute engines? Wondering if per intel_context is what we want 
instead. (Which could then be the i915_context_param_engines extension 
to mark individual contexts as compute strategy.)
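Something along these lines, purely as a sketch of the idea (the extension
name, number and struct below are invented; only i915_context_param_engines
and the i915_user_extension chaining exist today):

#define I915_CONTEXT_ENGINES_EXT_COMPUTE_HINT 3 /* hypothetical */

struct i915_context_engines_compute_hint {
	struct i915_user_extension base;
	__u16 engine_index;	/* slot in the engines[] map to mark */
	__u16 mbz16;
	__u64 flags;		/* e.g. bit 0: use the GuC compute strategy */
};

That way only the chosen slot (say the CCS one) of a context would get the
strategy, while its video or copy slots would be left alone.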


  
  	/**

 * @flags: small set of booleans
diff --git a/drivers/gpu/drm/i915/gt/intel_rps.c 
b/drivers/gpu/drm/i915/gt/intel_rps.c
index 4feef874e6d6..1ed40cd61b70 100644
--- a/drivers/gpu/drm/i915/gt/intel_rps.c
+++ b/drivers/gpu/drm/i915/gt/intel_rps.c
@@ -24,6 +24,7 @@
  #include "intel_pcode.h"
  #include "intel_rps.h"
  #include "vlv_sideband.h"
+#include "../gem/i915_gem_context.h"
  #include "../../../platform/x86/intel_ips.h"
  
  #define BUSY_MAX_EI	20u /* ms */

@@ -1018,6 +1019,13 @@ void intel_rps_boost(struct i915_request *rq)
	struct intel_rps *rps = &READ_ONCE(rq->engine)->gt->rps;
  
  		if (rps_uses_slpc(rps)) {

+   const struct i915_gem_context *ctx;
+
+   ctx = i915_request_gem_context(rq);
+   if (ctx &&
+   test_bit(UCONTEXT_COMPUTE, &ctx->user_flags))
+   return;
+


I think request and intel_context do not own a strong reference to the GEM
context. So at minimum you need a local one obtained under an RCU lock
with kref_get_unless_zero, as some other places do.
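I.e. something like the pattern below (sketch only, mirroring what other
lookup sites in i915 do):

	struct i915_gem_context *ctx;

	rcu_read_lock();
	ctx = rcu_dereference(rq->context->gem_context);
	if (ctx && !kref_get_unless_zero(&ctx->ref))
		ctx = NULL;
	rcu_read_unlock();

	if (ctx) {
		bool skip = test_bit(UCONTEXT_COMPUTE, &ctx->user_flags);

		i915_gem_context_put(ctx);
		if (skip)
			return;
	}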


However.. it may be simpler to just store the flag in
intel_context->flags. If you carry it over at the time the GEM context is
assigned to the intel_context, not only do you simplify the runtime rules,
but you also get the ability to not set the compute flag for video etc.
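Rough sketch of that alternative (the CONTEXT_COMPUTE_HINT bit is invented
here; intel_context_set_gem() is where the GEM context gets assigned):

	/* in intel_context_set_gem(): */
	if (test_bit(UCONTEXT_COMPUTE, &ctx->user_flags) &&
	    ce->engine->uabi_class == I915_ENGINE_CLASS_COMPUTE)
		__set_bit(CONTEXT_COMPUTE_HINT, &ce->flags); /* hypothetical bit */

	/* intel_rps_boost() then only needs the intel_context: */
	if (test_bit(CONTEXT_COMPUTE_HINT, &rq->context->flags))
		return;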


It may even make 

Re: [RFC] drm/i915: Support replaying GPU hangs with captured context image

2024-02-21 Thread Tvrtko Ursulin



On 20/02/2024 22:50, Rodrigo Vivi wrote:

On Tue, Feb 13, 2024 at 01:14:34PM +, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

When debugging GPU hangs Mesa developers are finding it useful to replay
the captured error state against the simulator. But due to various simulator
limitations which prevent replicating all hangs, one step further is being
able to replay against a real GPU.

This is almost doable today with the missing part being able to upload the
captured context image into the driver state prior to executing the
uploaded hanging batch and all the buffers.

To enable this last part we add a new context parameter called
I915_CONTEXT_PARAM_CONTEXT_IMAGE. It follows the existing SSEU
configuration pattern of being able to select which context to apply
against, paired with the actual image and its size.

Since this is adding a new concept of debug only uapi, we hide it behind
a new kconfig option and also require activation with a module parameter.
Together with a warning banner printed at driver load, all those combined
should be sufficient to guard against inadvertently enabling the feature.

In terms of implementation the only trivial change is shadowing of the
default state from engine to context. We also allow the legacy context
set param to be used since that removes the need to record the per context
data in the proto context, while still allowing flexibility of specifying
context images for any context.

Mesa MR using the uapi can be seen at:
   https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27594


I just wonder if it would be better to split the default_state in a separate
patch but from what I could see it looks correct.


It definitely makes sense to split it. I was just a bit lazy while 
testing the waters. After all this is a very novel idea of debug only 
uapi outside debugfs so I wasn't too sure how it will be received. Stay 
tuned for v2.


Regards,

Tvrtko



Also, I have to say that this approach is nice, clean and well protected.
And much simpler then I imagined when I saw the idea around.

Feel free to use:
Reviewed-by: Rodrigo Vivi 



Signed-off-by: Tvrtko Ursulin 
Cc: Lionel Landwerlin 
Cc: Carlos Santa 
---
  drivers/gpu/drm/i915/Kconfig.debug|  17 +++
  drivers/gpu/drm/i915/gem/i915_gem_context.c   | 106 ++
  drivers/gpu/drm/i915/gt/intel_context.c   |   2 +
  drivers/gpu/drm/i915/gt/intel_context.h   |  22 
  drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
  drivers/gpu/drm/i915/gt/intel_lrc.c   |   8 +-
  .../gpu/drm/i915/gt/intel_ring_submission.c   |   8 +-
  drivers/gpu/drm/i915/i915_params.c|   5 +
  drivers/gpu/drm/i915/i915_params.h|   3 +-
  include/uapi/drm/i915_drm.h   |  27 +
  10 files changed, 194 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/Kconfig.debug 
b/drivers/gpu/drm/i915/Kconfig.debug
index 5b7162076850..32e9f70e91ed 100644
--- a/drivers/gpu/drm/i915/Kconfig.debug
+++ b/drivers/gpu/drm/i915/Kconfig.debug
@@ -16,6 +16,23 @@ config DRM_I915_WERROR
  
  	  If in doubt, say "N".
  
+config DRM_I915_REPLAY_GPU_HANGS_API

+   bool "Enable GPU hang replay userspace API"
+   depends on DRM_I915
+   depends on EXPERT
+   default n
+   help
+ Choose this option if you want to enable special and unstable
+ userspace API used for replaying GPU hangs on a running system.
+
+ This API is intended to be used by userspace graphics stack developers
+ and provides no stability guarantees.
+
+ The API needs to be activated at boot time using the
+ enable_debug_only_api module parameter.
+
+ If in doubt, say "N".
+
  config DRM_I915_DEBUG
bool "Enable additional driver debugging"
depends on DRM_I915
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index dcbfe32fd30c..1cfd624bd978 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -78,6 +78,7 @@
  #include "gt/intel_engine_user.h"
  #include "gt/intel_gpu_commands.h"
  #include "gt/intel_ring.h"
+#include "gt/shmem_utils.h"
  
  #include "pxp/intel_pxp.h"
  
@@ -949,6 +950,7 @@ static int set_proto_ctx_param(struct drm_i915_file_private *fpriv,

case I915_CONTEXT_PARAM_NO_ZEROMAP:
case I915_CONTEXT_PARAM_BAN_PERIOD:
case I915_CONTEXT_PARAM_RINGSIZE:
+   case I915_CONTEXT_PARAM_CONTEXT_IMAGE:
default:
ret = -EINVAL;
break;
@@ -2092,6 +2094,88 @@ static int get_protected(struct i915_gem_context *ctx,
return 0;
  }
  
+static int set_context_image(struct i915_gem_context *ctx,

+struct drm_i915_gem_context_param *args)
+{
+   struct i915_gem_context_param_context_image user;
+ 

Re: [PATCH v2 2/2] drm/i915/gt: Enable only one CCS for compute workload

2024-02-21 Thread Tvrtko Ursulin



On 21/02/2024 00:14, Andi Shyti wrote:

Hi Tvrtko,

On Tue, Feb 20, 2024 at 02:48:31PM +, Tvrtko Ursulin wrote:

On 20/02/2024 14:35, Andi Shyti wrote:

Enable only one CCS engine by default with all the compute sices


slices


Thanks!


diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c 
b/drivers/gpu/drm/i915/gt/intel_engine_user.c
index 833987015b8b..7041acc77810 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -243,6 +243,15 @@ void intel_engines_driver_register(struct drm_i915_private 
*i915)
if (engine->uabi_class == I915_NO_UABI_CLASS)
continue;
+   /*
+* Do not list and do not count CCS engines other than the first
+*/
+   if (engine->uabi_class == I915_ENGINE_CLASS_COMPUTE &&
+   engine->uabi_instance > 0) {
+   i915->engine_uabi_class_count[engine->uabi_class]--;
+   continue;
+   }


It's a bit ugly to decrement after increment, instead of somehow
restructuring the loop to satisfy both cases more elegantly.


yes, agree, indeed I had a hard time here to accept this change
myself.

But moving the check above where the counter was incremented it
would have been much uglier.

This check looks ugly everywhere you place it :-)


One idea would be to introduce a separate local counter array for
name_instance, so as not to use i915->engine_uabi_class_count[]. The first
one increments for every engine, the second only for the exposed ones. That
way it feels like it wouldn't be too ugly.



In any case, I'm working on a patch that is splitting this
function in two parts and there is some refactoring happening
here (for the first initialization and the dynamic update).

Please let me know if it's OK with you or you want me to fix it
in this run.


And I wonder if
internally (in dmesg when engine name is logged) we don't end up with ccs0
ccs0 ccs0 ccs0.. for all instances.


I don't see this. Even in sysfs we see only one ccs. Where is it?


When you run this patch on something with two or more ccs-es, the 
"renamed ccs... to ccs.." debug logs do not all log the new name as ccs0?


Regards,

Tvrtko




+
rb_link_node(&engine->uabi_node, prev, p);
rb_insert_color(&engine->uabi_node, &i915->uabi_engines);


[...]


diff --git a/drivers/gpu/drm/i915/i915_query.c 
b/drivers/gpu/drm/i915/i915_query.c
index 3baa2f54a86e..d5a5143971f5 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -124,6 +124,7 @@ static int query_geometry_subslices(struct drm_i915_private 
*i915,
return fill_topology_info(sseu, query_item, 
sseu->geometry_subslice_mask);
   }
+


Zap please.


yes... yes... I noticed it after sending the patch :-)

Thanks,
Andi


Re: [PATCH v2 2/2] drm/i915/gt: Enable only one CCS for compute workload

2024-02-20 Thread Tvrtko Ursulin



On 20/02/2024 14:35, Andi Shyti wrote:

Enable only one CCS engine by default with all the compute sices


slices


allocated to it.

While generating the list of UABI engines to be exposed to the
user, exclude any additional CCS engines beyond the first
instance.

This change can be tested with igt i915_query.

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
Signed-off-by: Andi Shyti 
Cc: Chris Wilson 
Cc: Joonas Lahtinen 
Cc: Matt Roper 
Cc:  # v6.2+
---
  drivers/gpu/drm/i915/gt/intel_engine_user.c |  9 +
  drivers/gpu/drm/i915/gt/intel_gt.c  | 11 +++
  drivers/gpu/drm/i915/gt/intel_gt_regs.h |  2 ++
  drivers/gpu/drm/i915/i915_query.c   |  1 +
  4 files changed, 23 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c 
b/drivers/gpu/drm/i915/gt/intel_engine_user.c
index 833987015b8b..7041acc77810 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -243,6 +243,15 @@ void intel_engines_driver_register(struct drm_i915_private 
*i915)
if (engine->uabi_class == I915_NO_UABI_CLASS)
continue;
  
+		/*

+* Do not list and do not count CCS engines other than the first
+*/
+   if (engine->uabi_class == I915_ENGINE_CLASS_COMPUTE &&
+   engine->uabi_instance > 0) {
+   i915->engine_uabi_class_count[engine->uabi_class]--;
+   continue;
+   }


It's a bit ugly to decrement after increment, instead of somehow 
restructuring the loop to satisfy both cases more elegantly. And I 
wonder if internally (in dmesg when engine name is logged) we don't end 
up with ccs0 ccs0 ccs0 ccs0.. for all instances.



+
rb_link_node(&engine->uabi_node, prev, p);
rb_insert_color(&engine->uabi_node, &i915->uabi_engines);
  
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c

index a425db5ed3a2..e19df4ef47f6 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -168,6 +168,14 @@ static void init_unused_rings(struct intel_gt *gt)
}
  }
  
+static void intel_gt_apply_ccs_mode(struct intel_gt *gt)

+{
+   if (!IS_DG2(gt->i915))
+   return;
+
+   intel_uncore_write(gt->uncore, XEHP_CCS_MODE, 0);
+}
+
  int intel_gt_init_hw(struct intel_gt *gt)
  {
struct drm_i915_private *i915 = gt->i915;
@@ -195,6 +203,9 @@ int intel_gt_init_hw(struct intel_gt *gt)
  
  	intel_gt_init_swizzling(gt);
  
+	/* Configure CCS mode */

+   intel_gt_apply_ccs_mode(gt);
+
/*
 * At least 830 can leave some of the unused rings
 * "active" (ie. head != tail) after resume which
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index cf709f6c05ae..c148113770ea 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1605,6 +1605,8 @@
  #define   GEN12_VOLTAGE_MASK  REG_GENMASK(10, 0)
  #define   GEN12_CAGF_MASK REG_GENMASK(19, 11)
  
+#define XEHP_CCS_MODE  _MMIO(0x14804)

+
  #define GEN11_GT_INTR_DW(x)   _MMIO(0x190018 + ((x) * 4))
  #define   GEN11_CSME  (31)
  #define   GEN12_HECI_2(30)
diff --git a/drivers/gpu/drm/i915/i915_query.c 
b/drivers/gpu/drm/i915/i915_query.c
index 3baa2f54a86e..d5a5143971f5 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -124,6 +124,7 @@ static int query_geometry_subslices(struct drm_i915_private 
*i915,
return fill_topology_info(sseu, query_item, 
sseu->geometry_subslice_mask);
  }
  
+


Zap please.


  static int
  query_engine_info(struct drm_i915_private *i915,
  struct drm_i915_query_item *query_item)


Regards,

Tvrtko


Re: [PATCH 2/2] drm/i915/gt: Set default CCS mode '1'

2024-02-20 Thread Tvrtko Ursulin



On 20/02/2024 14:20, Andi Shyti wrote:

Since CCS automatic load balancing is disabled, we will impose a
fixed balancing policy that involves setting all the CCS engines
to work together on the same load.


Erm *all* CCS engines work together..


Simultaneously, the user will see only 1 CCS rather than the
actual number. As of now, this change affects only DG2.


... *one* CCS engine.



Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
Signed-off-by: Andi Shyti 
Cc: Chris Wilson 
Cc: Joonas Lahtinen 
Cc: Matt Roper 
Cc:  # v6.2+
---
  drivers/gpu/drm/i915/gt/intel_gt.c  | 11 +++
  drivers/gpu/drm/i915/gt/intel_gt_regs.h |  2 ++
  drivers/gpu/drm/i915/i915_drv.h | 17 +
  drivers/gpu/drm/i915/i915_query.c   |  5 +++--
  4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c 
b/drivers/gpu/drm/i915/gt/intel_gt.c
index a425db5ed3a2..e19df4ef47f6 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -168,6 +168,14 @@ static void init_unused_rings(struct intel_gt *gt)
}
  }
  
+static void intel_gt_apply_ccs_mode(struct intel_gt *gt)

+{
+   if (!IS_DG2(gt->i915))
+   return;
+
+   intel_uncore_write(gt->uncore, XEHP_CCS_MODE, 0);
+}
+
  int intel_gt_init_hw(struct intel_gt *gt)
  {
struct drm_i915_private *i915 = gt->i915;
@@ -195,6 +203,9 @@ int intel_gt_init_hw(struct intel_gt *gt)
  
  	intel_gt_init_swizzling(gt);
  
+	/* Configure CCS mode */

+   intel_gt_apply_ccs_mode(gt);
+
/*
 * At least 830 can leave some of the unused rings
 * "active" (ie. head != tail) after resume which
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index cf709f6c05ae..c148113770ea 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1605,6 +1605,8 @@
  #define   GEN12_VOLTAGE_MASK  REG_GENMASK(10, 0)
  #define   GEN12_CAGF_MASK REG_GENMASK(19, 11)
  
+#define XEHP_CCS_MODE  _MMIO(0x14804)

+
  #define GEN11_GT_INTR_DW(x)   _MMIO(0x190018 + ((x) * 4))
  #define   GEN11_CSME  (31)
  #define   GEN12_HECI_2(30)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e81b3b2858ac..0853ffd3cb8d 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -396,6 +396,23 @@ static inline struct intel_gt *to_gt(const struct 
drm_i915_private *i915)
 (engine__); \
 (engine__) = rb_to_uabi_engine(rb_next(&(engine__)->uabi_node)))
  
+/*

+ * Exclude unavailable engines.
+ *
+ * Only the first CCS engine is utilized due to the disabling of CCS auto load
+ * balancing. As a result, all CCS engines operate collectively, functioning
+ * essentially as a single CCS engine, hence the count of active CCS engines is
+ * considered '1'.
+ * Currently, this applies to platforms with more than one CCS engine,
+ * specifically DG2.
+ */
+#define for_each_available_uabi_engine(engine__, i915__) \
+   for_each_uabi_engine(engine__, i915__) \
+   if ((IS_DG2(i915__)) && \
+   ((engine__)->uabi_class == I915_ENGINE_CLASS_COMPUTE) && \
+   ((engine__)->uabi_instance)) { } \
+   else
+


I thought the plan was to simply not register the engine. Like that it 
would be a simpler patch.



  #define INTEL_INFO(i915)  ((i915)->__info)
  #define RUNTIME_INFO(i915)(&(i915)->__runtime)
  #define DRIVER_CAPS(i915) (&(i915)->caps)
diff --git a/drivers/gpu/drm/i915/i915_query.c 
b/drivers/gpu/drm/i915/i915_query.c
index fa3e937ed3f5..2d41bda626a6 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -124,6 +124,7 @@ static int query_geometry_subslices(struct drm_i915_private 
*i915,
return fill_topology_info(sseu, query_item, 
sseu->geometry_subslice_mask);
  }
  
+


!


  static int
  query_engine_info(struct drm_i915_private *i915,
  struct drm_i915_query_item *query_item)
@@ -140,7 +141,7 @@ query_engine_info(struct drm_i915_private *i915,
if (query_item->flags)
return -EINVAL;
  
-	for_each_uabi_engine(engine, i915)

+   for_each_available_uabi_engine(engine, i915)
num_uabi_engines++;
  
  	len = struct_size(query_ptr, engines, num_uabi_engines);

@@ -155,7 +156,7 @@ query_engine_info(struct drm_i915_private *i915,
  
  	info_ptr = &query_ptr->engines[0];
  
-	for_each_uabi_engine(engine, i915) {

+   for_each_available_uabi_engine(engine, i915) {
info.engine.engine_class = engine->uabi_class;
info.engine.engine_instance = engine->uabi_instance;
info.flags = I915_ENGINE_INFO_HAS_LOGICAL_INSTANCE;


I thought you agreed that this still 

Re: [PATCH] drm/i915: Fix possible null pointer dereference after drm_dbg_printer conversion

2024-02-20 Thread Tvrtko Ursulin



On 20/02/2024 10:36, Maxime Ripard wrote:

On Tue, Feb 20, 2024 at 09:16:43AM +, Tvrtko Ursulin wrote:


On 19/02/2024 20:02, Rodrigo Vivi wrote:

On Mon, Feb 19, 2024 at 01:14:23PM +, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Request can be NULL if no guilty request was identified so simply use
engine->i915 instead.

Signed-off-by: Tvrtko Ursulin 
Fixes: d50892a9554c ("drm/i915: switch from drm_debug_printer() to device specific 
drm_dbg_printer()")
Reported-by: Dan Carpenter 
Cc: Jani Nikula 
Cc: Luca Coelho 
Cc: Maxime Ripard 
Cc: Jani Nikula 


Reviewed-by: Rodrigo Vivi 


Thanks Rodrigo!

Given how d50892a9554c landed via drm-misc-next, Maxime or Thomas - could
you take this via drm-misc-next-fixes or if there will be another
drm-misc-next pull request?


There will be a drm-misc-next PR on thursday


Could you pull this one into which branch is needed so it appears in 
that pull request?


Regards,

Tvrtko


Re: [PATCH 2/2] drm/i915/gt: Set default CCS mode '1'

2024-02-20 Thread Tvrtko Ursulin



On 20/02/2024 10:11, Andi Shyti wrote:

Hi Tvrtko,

On Mon, Feb 19, 2024 at 12:51:44PM +, Tvrtko Ursulin wrote:

On 19/02/2024 11:16, Tvrtko Ursulin wrote:

On 15/02/2024 13:59, Andi Shyti wrote:


...


+/*
+ * Exclude unavailable engines.
+ *
+ * Only the first CCS engine is utilized due to the disabling of
CCS auto load
+ * balancing. As a result, all CCS engines operate collectively,
functioning
+ * essentially as a single CCS engine, hence the count of active
CCS engines is
+ * considered '1'.
+ * Currently, this applies to platforms with more than one CCS engine,
+ * specifically DG2.
+ */
+#define for_each_available_uabi_engine(engine__, i915__) \
+    for_each_uabi_engine(engine__, i915__) \
+    if ((IS_DG2(i915__)) && \
+    ((engine__)->uabi_class == I915_ENGINE_CLASS_COMPUTE) && \
+    ((engine__)->uabi_instance)) { } \
+    else
+


If you don't want userspace to see some engines, just don't add them to
the uabi list in intel_engines_driver_register or thereabouts?


It will be dynamic. In next series I am preparing the user will
be able to increase the number of CCS engines he wants to use.


Oh tricky and new. Does it need to be at runtime or could be boot time?

If you are aiming to get the static single-CCS-only change into the 6.9
release, and you feel you are running out of time, you could always do a
simple solution for now: the one I mentioned of simply not registering on
the uabi list. Then you can refine more leisurely for the next release.


Regards,

Tvrtko




Similar as we do for gsc which uses I915_NO_UABI_CLASS, although for ccs
you can choose a different approach, whatever is more elegant.

That is also needed for i915->engine_uabi_class_count to be right, so
userspace stats which rely on it are correct.


Oh yes. Will update it.


I later realized it is more than that - everything that uses
intel_engine_lookup_user to look up a class instance passed in from userspace
relies on the engine not being on the user list, otherwise userspace could
bypass the fact that the engine query does not list it. Like PMU, Perf/OA,
context engine map and SSEU context query.


Correct, will look into that, thank you!

Andi


Re: [PATCH] drm/i915: Fix possible null pointer dereference after drm_dbg_printer conversion

2024-02-20 Thread Tvrtko Ursulin



On 19/02/2024 20:02, Rodrigo Vivi wrote:

On Mon, Feb 19, 2024 at 01:14:23PM +, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Request can be NULL if no guilty request was identified so simply use
engine->i915 instead.

Signed-off-by: Tvrtko Ursulin 
Fixes: d50892a9554c ("drm/i915: switch from drm_debug_printer() to device specific 
drm_dbg_printer()")
Reported-by: Dan Carpenter 
Cc: Jani Nikula 
Cc: Luca Coelho 
Cc: Maxime Ripard 
Cc: Jani Nikula 


Reviewed-by: Rodrigo Vivi 


Thanks Rodrigo!

Given how d50892a9554c landed via drm-misc-next, Maxime or Thomas - 
could you take this via drm-misc-next-fixes or if there will be another 
drm-misc-next pull request?


Regards,

Tvrtko




---
  drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c 
b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 5f8d86e25993..8d4bb95f8424 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -96,8 +96,8 @@ static void heartbeat_commit(struct i915_request *rq,
  static void show_heartbeat(const struct i915_request *rq,
   struct intel_engine_cs *engine)
  {
-   struct drm_printer p = drm_dbg_printer(&rq->engine->i915->drm, DRM_UT_DRIVER,
-  "heartbeat");
+   struct drm_printer p =
+   drm_dbg_printer(&engine->i915->drm, DRM_UT_DRIVER, "heartbeat");
  
  	if (!rq) {

intel_engine_dump(engine, &p,
--
2.40.1



[PATCH] drm/i915: Add some boring kerneldoc

2024-02-19 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Tooling appears very strict so let's pacify it by adding some comments,
even if fields are completely self-explanatory.

Signed-off-by: Tvrtko Ursulin 
Fixes: b11236486749 ("drm/i915: Add GuC submission interface version query")
Reported-by: Stephen Rothwell 
Cc: Jose Souza 
---
 include/uapi/drm/i915_drm.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index bd87386a8243..2ee338860b7e 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -3572,9 +3572,13 @@ struct drm_i915_query_memory_regions {
  * struct drm_i915_query_guc_submission_version - query GuC submission 
interface version
  */
 struct drm_i915_query_guc_submission_version {
+   /** @branch: Firmware branch version. */
__u32 branch;
+   /** @major: Firmware major version. */
__u32 major;
+   /** @minor: Firmware minor version. */
__u32 minor;
+   /** @patch: Firmware patch version. */
__u32 patch;
 };
 
-- 
2.40.1



[PATCH] drm/i915: Fix possible null pointer dereference after drm_dbg_printer conversion

2024-02-19 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Request can be NULL if no guilty request was identified so simply use
engine->i915 instead.

Signed-off-by: Tvrtko Ursulin 
Fixes: d50892a9554c ("drm/i915: switch from drm_debug_printer() to device 
specific drm_dbg_printer()")
Reported-by: Dan Carpenter 
Cc: Jani Nikula 
Cc: Luca Coelho 
Cc: Maxime Ripard 
Cc: Jani Nikula 
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c 
b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 5f8d86e25993..8d4bb95f8424 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -96,8 +96,8 @@ static void heartbeat_commit(struct i915_request *rq,
 static void show_heartbeat(const struct i915_request *rq,
   struct intel_engine_cs *engine)
 {
-   struct drm_printer p = drm_dbg_printer(&rq->engine->i915->drm, DRM_UT_DRIVER,
-  "heartbeat");
+   struct drm_printer p =
+   drm_dbg_printer(&engine->i915->drm, DRM_UT_DRIVER, "heartbeat");
 
if (!rq) {
intel_engine_dump(engine, &p,
-- 
2.40.1


