Re: [PATCH v2] drm/i915: Increase FLR timeout from 3s to 9s

2024-05-29 Thread Nirmoy Das



On 5/24/2024 1:58 AM, Andi Shyti wrote:

According to the guidelines, an FLR reset takes 3 seconds.
Let's give it a bit more slack, because this time can
vary depending on the platform and on the firmware.

Signed-off-by: Andi Shyti 

Reviewed-by: Nirmoy Das 

---
Hi,

In this second version I removed patch 2 that was ignoring the
FLR reset timeouts, until we develop a proper patch.

This first patch is basically the same as v1. Thanks Nirmoy for
your review.

Andi

  drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++-
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 729409a4bada..2eba289d88ad 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -2614,11 +2614,18 @@ void intel_uncore_prune_engine_fw_domains(struct intel_uncore *uncore,
 static void driver_initiated_flr(struct intel_uncore *uncore)
 {
 	struct drm_i915_private *i915 = uncore->i915;
-	const unsigned int flr_timeout_ms = 3000; /* specs recommend a 3s wait */
+	unsigned int flr_timeout_ms;
 	int ret;
 
 	drm_dbg(&i915->drm, "Triggering Driver-FLR\n");
 
+	/*
+	 * The specification recommends a 3 seconds FLR reset timeout. To be
+	 * cautious, we will extend this to 9 seconds, three times the specified
+	 * timeout.
+	 */
+	flr_timeout_ms = 9000;
+
 	/*
 	 * Make sure any pending FLR requests have cleared by waiting for the
 	 * FLR trigger bit to go to zero. Also clear GU_DEBUG's DRIVERFLR_STATUS
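
For readers unfamiliar with the pattern, the wait being tuned here is a bounded poll on a register bit. The following is only a user-space sketch of that idea, not the i915 implementation: `struct fake_reg`, `fake_read()`, `wait_for_bits()` and the `FAKE_DRIVERFLR` bit position are all hypothetical names made up for illustration, and time is simulated rather than measured. The return value -110 mirrors the `-110` (timeout) seen in the CI logs quoted later in this archive.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a hardware register; not an i915 structure. */
struct fake_reg {
	uint32_t value;
	unsigned int clears_after_ms; /* simulated time at which the bit drops */
};

#define FAKE_DRIVERFLR (1u << 31) /* illustrative bit position only */

/* Simulated read: the trigger bit stays set until the simulated deadline. */
static uint32_t fake_read(const struct fake_reg *reg, unsigned int now_ms)
{
	if (now_ms >= reg->clears_after_ms)
		return reg->value & ~FAKE_DRIVERFLR;
	return reg->value | FAKE_DRIVERFLR;
}

/*
 * Poll once per simulated millisecond until (reg & mask) == expected or
 * timeout_ms elapses. Returns 0 on success and -110 on timeout.
 */
static int wait_for_bits(const struct fake_reg *reg, uint32_t mask,
			 uint32_t expected, unsigned int timeout_ms)
{
	unsigned int now_ms;

	assert(reg);
	for (now_ms = 0; now_ms <= timeout_ms; now_ms++) {
		if ((fake_read(reg, now_ms) & mask) == expected)
			return 0;
	}
	return -110;
}
```

With a simulated device that needs 5 seconds to clear the bit, a 3000 ms budget times out while a 9000 ms budget succeeds, which is the situation the patch is betting on.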


Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors

2024-05-22 Thread Nirmoy Das

Hi Andi,

On 5/21/2024 12:56 PM, Andi Shyti wrote:

Hi Nirmoy,

On Fri, May 17, 2024 at 10:13:37PM +0200, Nirmoy Das wrote:

Hi Andi,

On 5/17/2024 9:34 PM, Andi Shyti wrote:

 Hi Nirmoy,

 On Fri, May 17, 2024 at 04:00:02PM +0200, Nirmoy Das wrote:

 On 5/17/2024 1:25 PM, Andi Shyti wrote:

 If we timeout while waiting for an FLR reset, there is nothing we
 can do and i915 doesn't have any control on it. In any case the
 system is still perfectly usable

 If an FLR reset fails then we will have a dead GPU; I don't think the GPU is
 usable without a cold reboot.

 The fact is that the GPU keeps going and even though the timeout has
 expired, the system moves to the next phase.

The current test might look like it has passed, but if you look into the
subsequent tests you can see a dead GPU:

<7>[  369.168121] pci 0000:00:02.0: [drm:intel_uncore_fini_mmio [i915]] Triggering Driver-FLR
<3>[  372.170189] pci 0000:00:02.0: [drm] *ERROR* Driver-FLR-teardown wait completion failed! -110
<7>[  372.437630] [IGT] i915_selftest: finished subtest requests, SUCCESS
<7>[  372.438356] [IGT] i915_selftest: starting dynamic subtest migrate
<5>[  373.110580] Setting dangerous option live_selftests - tainting kernel
<3>[  373.183499] i915 0000:00:02.0: Unable to change power state from D0 to D0, device inaccessible
<3>[  373.246921] i915 0000:00:02.0: [drm] *ERROR* Unrecognized display IP version 1023.255; disabling display.
<7>[  373.247130] i915 0000:00:02.0: [drm:intel_step_init [i915]] Using future steppings
<7>[  373.247716] i915 0000:00:02.0: [drm:intel_step_init [i915]] Using future steppings
<7>[  373.248263] i915 0000:00:02.0: [drm:intel_step_init [i915]] Using future display steppings
<7>[  373.251843] i915 0000:00:02.0: [drm:intel_gt_common_init_early [i915]] WOPCM: 2048K
<7>[  373.252505] i915 0000:00:02.0: [drm:intel_uc_init_early [i915]] GT0: enable_guc=3 (guc:yes submission:yes huc:no slpc:yes)
<7>[  373.253140] i915 0000:00:02.0: [drm:intel_gt_probe_all [i915]] GT0: Setting up Primary GT
<7>[  373.253556] i915 0000:00:02.0: [drm:intel_gt_probe_all [i915]] GT1: Setting up Standalone Media GT
<7>[  373.253941] i915 0000:00:02.0: [drm:intel_gt_common_init_early [i915]] WOPCM: 2048K
<7>[  373.254365] i915 0000:00:02.0: [drm:intel_uc_init_early [i915]] GT1: enable_guc=3 (guc:yes submission:yes huc:yes slpc:yes)
<3>[  375.256235] i915 0000:00:02.0: [drm] *ERROR* Device is non-operational; MMIO access returns 0xFFFFFFFF!
<3>[  375.259089] i915 0000:00:02.0: Device initialization failed (-5)
<3>[  375.260521] i915 0000:00:02.0: probe with driver i915 failed with error -5
<7>[  375.392209] [IGT] i915_selftest: finished subtest migrate, FAIL

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14724/bat-arls-3/dmesg0.txt

Are we sure this is dependent on the FLR reset?


Yes, while the FLR is in progress, reads from the device will return either all 0s or all Fs.



  There are cases
when the FLR reset doesn't make any difference and in any case
this error is completely ignored by the driver.


This happens very late, with no recovery possible; the only hope is that a
module reload works.





Perhaps we can change it to a warning?


I think it should be an error. CI will still complain even about a warning.





 This is a serious issue and should be reported as an error. I think we
 need to create a HW ticket to understand why the FLR reset fails.

 Maybe it takes longer and longer to reset. We've been sending
 several patches in recent years to fix the timings.

The HW spec says 3 seconds, but we can try increasing it a bit to see whether it helps.

We could go, then, with just patch 1 and see if it improves.


Does it help? If it helps, then we can go ahead with the increased timeout.



  Also
because, the FLR reset might also depend on the firmware.


Possible. In that case, should we wait for a firmware fix?


Regards,

Nirmoy



Thanks, Nirmoy,
Andi


Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors

2024-05-17 Thread Nirmoy Das

Hi Andi,

On 5/17/2024 9:34 PM, Andi Shyti wrote:

Hi Nirmoy,

On Fri, May 17, 2024 at 04:00:02PM +0200, Nirmoy Das wrote:

On 5/17/2024 1:25 PM, Andi Shyti wrote:

If we timeout while waiting for an FLR reset, there is nothing we
can do and i915 doesn't have any control on it. In any case the
system is still perfectly usable

If an FLR reset fails then we will have a dead GPU; I don't think the GPU is
usable without a cold reboot.

The fact is that the GPU keeps going and even though the timeout has
expired, the system moves to the next phase.

The current test might look like it has passed, but if you look into
the subsequent tests you can see a dead GPU:


<7>[  369.168121] pci 0000:00:02.0: [drm:intel_uncore_fini_mmio [i915]] Triggering Driver-FLR
*<3>[  372.170189] pci 0000:00:02.0: [drm] *ERROR* Driver-FLR-teardown wait completion failed! -110*
*<7>[  372.437630] [IGT] i915_selftest: finished subtest requests, SUCCESS*
<7>[  372.438356] [IGT] i915_selftest: starting dynamic subtest migrate
<5>[  373.110580] Setting dangerous option live_selftests - tainting kernel
<3>[  373.183499] i915 0000:00:02.0: Unable to change power state from D0 to D0, device inaccessible
<3>[  373.246921] i915 0000:00:02.0: [drm] *ERROR* Unrecognized display IP version 1023.255; disabling display.
<7>[  373.247130] i915 0000:00:02.0: [drm:intel_step_init [i915]] Using future steppings
<7>[  373.247716] i915 0000:00:02.0: [drm:intel_step_init [i915]] Using future steppings
<7>[  373.248263] i915 0000:00:02.0: [drm:intel_step_init [i915]] Using future display steppings
<7>[  373.251843] i915 0000:00:02.0: [drm:intel_gt_common_init_early [i915]] WOPCM: 2048K
<7>[  373.252505] i915 0000:00:02.0: [drm:intel_uc_init_early [i915]] GT0: enable_guc=3 (guc:yes submission:yes huc:no slpc:yes)
<7>[  373.253140] i915 0000:00:02.0: [drm:intel_gt_probe_all [i915]] GT0: Setting up Primary GT
<7>[  373.253556] i915 0000:00:02.0: [drm:intel_gt_probe_all [i915]] GT1: Setting up Standalone Media GT
<7>[  373.253941] i915 0000:00:02.0: [drm:intel_gt_common_init_early [i915]] WOPCM: 2048K
<7>[  373.254365] i915 0000:00:02.0: [drm:intel_uc_init_early [i915]] GT1: enable_guc=3 (guc:yes submission:yes huc:yes slpc:yes)
*<3>[  375.256235] i915 0000:00:02.0: [drm] *ERROR* Device is non-operational; MMIO access returns 0xFFFFFFFF!*
<3>[  375.259089] i915 0000:00:02.0: Device initialization failed (-5)
<3>[  375.260521] i915 0000:00:02.0: probe with driver i915 failed with error -5
<7>[  375.392209] [IGT] i915_selftest: finished subtest migrate, FAIL

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14724/bat-arls-3/dmesg0.txt




This is a serious issue and should be reported as an error. I think we need
to create a HW ticket to understand why the FLR reset fails.

Maybe it takes longer and longer to reset. We've been sending
several patches in recent years to fix the timings.


The HW spec says 3 seconds, but we can try increasing it a bit to see whether it helps.


Regards,

Nirmoy



Andi

Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors

2024-05-17 Thread Nirmoy Das

Hi Andi,

On 5/17/2024 1:25 PM, Andi Shyti wrote:

If we timeout while waiting for an FLR reset, there is nothing we
can do and i915 doesn't have any control on it. In any case the
system is still perfectly usable


If an FLR reset fails then we will have a dead GPU; I don't think the GPU
is usable without a cold reboot.

This is a serious issue and should be reported as an error. I think we
need to create a HW ticket to understand why the FLR reset fails.


Regards,

Nirmoy




  and the function returns void.

We don't need to be alarmed, therefore, print the timeout
expiration as a debug message instead of an error.

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/10955
Signed-off-by: Andi Shyti 
---
  drivers/gpu/drm/i915/intel_uncore.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 2eba289d88ad..a3fa2ed91aae 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -2637,7 +2637,7 @@ static void driver_initiated_flr(struct intel_uncore *uncore)
 	 */
 	ret = intel_wait_for_register_fw(uncore, GU_CNTL, DRIVERFLR, 0, flr_timeout_ms);
 	if (ret) {
-		drm_err(&i915->drm,
+		drm_dbg(&i915->drm,
 			"Failed to wait for Driver-FLR bit to clear! %d\n",
 			ret);
 		return;
@@ -2652,7 +2652,7 @@ static void driver_initiated_flr(struct intel_uncore *uncore)
 					 DRIVERFLR, 0,
 					 flr_timeout_ms);
 	if (ret) {
-		drm_err(&i915->drm, "Driver-FLR-teardown wait completion failed! %d\n", ret);
+		drm_dbg(&i915->drm, "Driver-FLR-teardown wait completion failed! %d\n", ret);
 		return;
 	}
 
@@ -2661,7 +2661,7 @@ static void driver_initiated_flr(struct intel_uncore *uncore)
 					 DRIVERFLR_STATUS, DRIVERFLR_STATUS,
 					 flr_timeout_ms);
 	if (ret) {
-		drm_err(&i915->drm, "Driver-FLR-reinit wait completion failed! %d\n", ret);
+		drm_dbg(&i915->drm, "Driver-FLR-reinit wait completion failed! %d\n", ret);
 		return;
 	}
  


Re: [PATCH] drm/i915/selftests: Set always_coherent to false when reading from CPU

2024-05-17 Thread Nirmoy Das



On 5/17/2024 1:53 PM, Jani Nikula wrote:

On Fri, 17 May 2024, Nirmoy Das  wrote:

Hi Jani,

On 5/17/2024 9:39 AM, Jani Nikula wrote:

On Thu, 16 May 2024, Nirmoy Das  wrote:

The previous commit 'commit 8d4ba9fc1c6c ("drm/i915/selftests: Pick

"previous commit" is a fairly vague reference once this gets
committed. It's not going to be "previous" in any meaningful sense.

Please just start with:

Commit 8d4ba9fc1c6c ("drm/i915/selftests: Pick correct caching mode.")
was not complete...

Will do that.



And probably add:

Fixes: 8d4ba9fc1c6c ("drm/i915/selftests: Pick correct caching mode.")

Do we need a Fixes tag for a selftest? I always assumed it is not required,
as this code is for debug/CI.

Maybe not for stuff that's already in stable, but we do run CI on
drm-next and -rc kernels, and if this causes issues there, why not have
them fixed?


I am not sure whether a commit with a Fixes tag flows from drm-intel-next to
drm-next/-rc, but I see no issue with adding Fixes without Cc-ing stable.


Pushed it to drm-intel-next with the above modifications. b4-shazam picked
up the Fixes tag as well, which was nice.



Thanks,

Nirmoy



BR,
Jani.



Thanks,

Nirmoy


BR,
Jani.


correct caching mode.")' was not complete as for non LLC  sharing platforms
cpu read can happen from LLC which probably doesn't have the latest
changes made by GPU.

Cc: Andi Shyti 
Cc: Janusz Krzysztofik 
Cc: Jonathan Cavitt 
Signed-off-by: Nirmoy Das 
---
   drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
index 65a931ea80e9..3527b8f446fe 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
@@ -196,7 +196,7 @@ static int verify_access(struct drm_i915_private *i915,
 	if (err)
 		goto out_file;
 
-	mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, true);
+	mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, false);
 	vaddr = i915_gem_object_pin_map_unlocked(native_obj, mode);
 	if (IS_ERR(vaddr)) {
 		err = PTR_ERR(vaddr);


Re: [PATCH] drm/i915/selftests: Set always_coherent to false when reading from CPU

2024-05-17 Thread Nirmoy Das

Hi Jani,

On 5/17/2024 9:39 AM, Jani Nikula wrote:

On Thu, 16 May 2024, Nirmoy Das  wrote:

The previous commit 'commit 8d4ba9fc1c6c ("drm/i915/selftests: Pick

"previous commit" is a fairly vague reference once this gets
committed. It's not going to be "previous" in any meaningful sense.

Please just start with:

Commit 8d4ba9fc1c6c ("drm/i915/selftests: Pick correct caching mode.")
was not complete...


Will do that.




And probably add:

Fixes: 8d4ba9fc1c6c ("drm/i915/selftests: Pick correct caching mode.")


Do we need a Fixes tag for a selftest? I always assumed it is not required,
as this code is for debug/CI.



Thanks,

Nirmoy



BR,
Jani.


correct caching mode.")' was not complete as for non LLC  sharing platforms
cpu read can happen from LLC which probably doesn't have the latest
changes made by GPU.

Cc: Andi Shyti 
Cc: Janusz Krzysztofik 
Cc: Jonathan Cavitt 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
index 65a931ea80e9..3527b8f446fe 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
@@ -196,7 +196,7 @@ static int verify_access(struct drm_i915_private *i915,
 	if (err)
 		goto out_file;
 
-	mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, true);
+	mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, false);
 	vaddr = i915_gem_object_pin_map_unlocked(native_obj, mode);
 	if (IS_ERR(vaddr)) {
 		err = PTR_ERR(vaddr);


[PATCH] drm/i915/selftests: Set always_coherent to false when reading from CPU

2024-05-16 Thread Nirmoy Das
The previous commit 'commit 8d4ba9fc1c6c ("drm/i915/selftests: Pick
correct caching mode.")' was not complete as for non LLC  sharing platforms
cpu read can happen from LLC which probably doesn't have the latest
changes made by GPU.

Cc: Andi Shyti 
Cc: Janusz Krzysztofik 
Cc: Jonathan Cavitt 
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
index 65a931ea80e9..3527b8f446fe 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
@@ -196,7 +196,7 @@ static int verify_access(struct drm_i915_private *i915,
 	if (err)
 		goto out_file;
 
-	mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, true);
+	mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, false);
 	vaddr = i915_gem_object_pin_map_unlocked(native_obj, mode);
 	if (IS_ERR(vaddr)) {
 		err = PTR_ERR(vaddr);
-- 
2.42.0
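
To make the effect of the one-word change easier to see, here is a deliberately simplified model of the mapping decision. It is a sketch under stated assumptions, not the driver's helper: `enum map_type`, `struct platform` and `coherent_map_type()` are hypothetical stand-ins, and the real function takes a GT and an object and may apply additional rules (for example for local memory or workarounds) that are ignored here. The assumption being modelled is that a write-back (cached) CPU mapping is chosen when the platform shares the LLC or when the caller asks for `always_coherent`, and a write-combined mapping otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's write-back / write-combined map types. */
enum map_type { MAP_WB, MAP_WC };

/* Hypothetical platform description; only the LLC-sharing property is modelled. */
struct platform {
	bool has_llc;
};

/* Simplified decision: cached mapping only if LLC is shared or the caller insists. */
static enum map_type coherent_map_type(const struct platform *p,
				       bool always_coherent)
{
	assert(p);
	if (p->has_llc || always_coherent)
		return MAP_WB;
	return MAP_WC;
}
```

Under this model, passing `true` on a platform without a shared LLC still yields a cached mapping, so a CPU read could be served from a cache that has not seen the GPU's writes; passing `false` selects the write-combined mapping there, while platforms with a shared LLC are unaffected.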



[PATCH] drm/i915: Use for_each_child instead of manual for-loop

2024-05-14 Thread Nirmoy Das
Simplify child iteration by using the for_each_child macro
instead of a manual for loop. There is no functional
change.

Cc: John Harrison 
Cc: Tvrtko Ursulin 
Signed-off-by: Nirmoy Das 
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 64 ++-
 1 file changed, 33 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0eaa1064242c..7e88d90e935b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1800,14 +1800,37 @@ __unwind_incomplete_requests(struct intel_context *ce)
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
-static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t stalled)
+static void guc_reset_context_state(struct intel_context *ce, intel_engine_mask_t stalled)
 {
-	bool guilty;
 	struct i915_request *rq;
-	unsigned long flags;
+	bool guilty = false;
 	u32 head;
-	int i, number_children = ce->parallel.number_children;
-	struct intel_context *parent = ce;
+
+	if (!intel_context_is_pinned(ce))
+		return;
+
+	rq = intel_context_get_active_request(ce);
+	if (!rq) {
+		head = ce->ring->tail;
+		goto out_replay;
+	}
+
+	if (i915_request_started(rq))
+		guilty = stalled & ce->engine->mask;
+
+	GEM_BUG_ON(i915_active_is_idle(&ce->active));
+	head = intel_ring_wrap(ce->ring, rq->head);
+
+	__i915_request_reset(rq, guilty);
+	i915_request_put(rq);
+out_replay:
+	guc_reset_state(ce, head, guilty);
+}
+
+static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t stalled)
+{
+	struct intel_context *child;
+	unsigned long flags;
 
 	GEM_BUG_ON(intel_context_is_child(ce));
 
@@ -1826,34 +1849,13 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
 	 * For each context in the relationship find the hanging request
 	 * resetting each context / request as needed
 	 */
-	for (i = 0; i < number_children + 1; ++i) {
-		if (!intel_context_is_pinned(ce))
-			goto next_context;
-
-		guilty = false;
-		rq = intel_context_get_active_request(ce);
-		if (!rq) {
-			head = ce->ring->tail;
-			goto out_replay;
-		}
-
-		if (i915_request_started(rq))
-			guilty = stalled & ce->engine->mask;
-
-		GEM_BUG_ON(i915_active_is_idle(&ce->active));
-		head = intel_ring_wrap(ce->ring, rq->head);
-
-		__i915_request_reset(rq, guilty);
-		i915_request_put(rq);
-out_replay:
-		guc_reset_state(ce, head, guilty);
-next_context:
-		if (i != number_children)
-			ce = list_next_entry(ce, parallel.child_link);
+	guc_reset_context_state(ce, stalled);
+	for_each_child(ce, child) {
+		guc_reset_context_state(child, stalled);
 	}
 
-	__unwind_incomplete_requests(parent);
-	intel_context_put(parent);
+	__unwind_incomplete_requests(ce);
+	intel_context_put(ce);
 }
 
 void wake_up_all_tlb_invalidate(struct intel_guc *guc)
-- 
2.42.0
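
The shape of this refactor can be shown outside the driver: pull the per-context work into one helper, call it on the parent, then walk the children with an iteration macro instead of an index loop that advances the cursor by hand. The sketch below is a stand-alone model only; `struct ctx`, `for_each_child_sketch()` and `reset_one()` are hypothetical names, and the children are kept on a plain singly linked list rather than the kernel's list type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context: a parent heads a NULL-terminated list of children. */
struct ctx {
	int resets;
	struct ctx *next_child;
};

/* Modelled on the idea of for_each_child(); not the i915 definition. */
#define for_each_child_sketch(parent, child) \
	for ((child) = (parent)->next_child; (child); (child) = (child)->next_child)

/* Per-context work, factored out so parent and children share one code path. */
static void reset_one(struct ctx *ce)
{
	ce->resets++;
}

static void reset_parent_and_children(struct ctx *parent)
{
	struct ctx *child;

	assert(parent);
	reset_one(parent);
	for_each_child_sketch(parent, child)
		reset_one(child);
}
```

Because the loop no longer reassigns the parent pointer while iterating, there is no need for a saved copy of it afterwards, which is why the patch can drop the separate `parent` variable.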



Re: [PATCH] drm/i915: Correct error handler

2024-05-13 Thread Nirmoy Das



On 5/11/2024 5:48 PM, Jiasheng Jiang wrote:

Replace "slab_priorities" with "slab_dependencies" in the error handler to
avoid a memory leak.


Nice catch. I would make the subject more like:

drm/i915: Fix memory leak by correcting cache object name in error handler



Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")


Also need Cc: <stable@vger.kernel.org> # v5.2+

With those:

Reviewed-by: Nirmoy Das 


Nirmoy


Signed-off-by: Jiasheng Jiang 
---
  drivers/gpu/drm/i915/i915_scheduler.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 762127dd56c5..70a854557e6e 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -506,6 +506,6 @@ int __init i915_scheduler_module_init(void)
 	return 0;
 
 err_priorities:
-	kmem_cache_destroy(slab_priorities);
+	kmem_cache_destroy(slab_dependencies);
 	return -ENOMEM;
 }
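
The bug class fixed here is easy to reproduce in miniature: an error label is named after the step that failed, so the cleanup under it must release what was created before that step, not the object whose creation just failed. The following user-space sketch uses `malloc()`/`free()` in place of slab caches; `cache_a`, `cache_b`, `module_init_sketch()` and the `fail_b` switch are hypothetical and exist only to exercise the error path. The value -12 corresponds to -ENOMEM on Linux.

```c
#include <assert.h>
#include <stdlib.h>

static void *cache_a; /* created first */
static void *cache_b; /* created second */

/* fail_b != 0 simulates a failure while creating the second cache. */
static int module_init_sketch(int fail_b)
{
	cache_a = malloc(16);
	if (!cache_a)
		return -12;

	cache_b = fail_b ? NULL : malloc(16);
	if (!cache_b)
		goto err_b;

	return 0;

err_b:
	/* Release what was successfully created (cache_a), not what failed. */
	free(cache_a);
	cache_a = NULL;
	return -12;
}
```

Destroying the failed object instead would be a no-op on a null pointer and would leave the first allocation behind, which is the leak the patch removes.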


Re: [PATCH] Revert "drm/i915: Remove extra multi-gt pm-references"

2024-05-07 Thread Nirmoy Das



On 5/7/2024 7:10 PM, Rodrigo Vivi wrote:

On Tue, May 07, 2024 at 10:54:11AM +0200, Janusz Krzysztofik wrote:

On Tuesday, 7 May 2024 09:30:15 GMT+2 Nirmoy Das wrote:

Hi Janusz,


Just realized we need a Fixes tag for this.

Fixes: 1f33dc0c1189 ("drm/i915: Remove extra multi-gt pm-references")

Whoever is going to push this patch, please feel free to add this tag.

dim b4-shazam gets that automagically, now it was sent in reply ;)

Nice!


I just pushed the patch. Thanks for the patch and reviews.



Thanks,

Nirmoy




Thanks,
Janusz



Regards,

Nirmoy

On 5/6/2024 8:02 PM, Janusz Krzysztofik wrote:

This reverts commit 1f33dc0c1189efb9ae19c6fc22b64dd3e26261fb.

There was a patch supposed to fix an issue of illegal attempts to free a
still active i915 VMA object when parking a GT believed to be idle,
reported by CI on 2-GT Meteor Lake.  As a solution, an extra wakeref for
a Primary GT was acquired from i915_gem_do_execbuffer() -- see commit
f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform").

However, that fix proved insufficient -- the issue was still reported by
CI.  That wakeref was released on exit from i915_gem_do_execbuffer(), then
potentially before completion of the request and deactivation of its
associated VMAs.  Moreover, CI reports indicated that single-GT platforms
also suffered sporadically from the same race.

Since that issue was fixed by another commit f3c71b2ded5c ("drm/i915/vma:
Fix UAF on destroy against retire race"), the changes introduced by that
insufficient fix were dropped as no longer useful.  However, that series
resulted in another VMA UAF scenario now being triggered in CI.

<4> [260.290809] ------------[ cut here ]------------
<4> [260.290988] list_del corruption. prev->next should be ffff888118c5d990, but was ffff888118c5a510. (prev=ffff888118c5a510)
<4> [260.291004] WARNING: CPU: 2 PID: 1143 at lib/list_debug.c:62 __list_del_entry_valid_or_report+0xb7/0xe0
..
<4> [260.291055] CPU: 2 PID: 1143 Comm: kms_plane Not tainted 6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1
<4> [260.291058] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024
<4> [260.291060] RIP: 0010:__list_del_entry_valid_or_report+0xb7/0xe0
...
<4> [260.291087] Call Trace:
<4> [260.291089]  <TASK>
<4> [260.291124]  i915_vma_reopen+0x43/0x80 [i915]
<4> [260.291298]  eb_lookup_vmas+0x9cb/0xcc0 [i915]
<4> [260.291579]  i915_gem_do_execbuffer+0xc9a/0x26d0 [i915]
<4> [260.291883]  i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915]
...
<4> [260.292301]  </TASK>
...
<4> [260.292506] ---[ end trace 0000000000000000 ]---
<4> [260.292782] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6ca3: 0000 [#1] PREEMPT SMP NOPTI
<4> [260.303575] CPU: 2 PID: 1143 Comm: kms_plane Tainted: G        W          6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1
<4> [260.313851] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024
<4> [260.326359] RIP: 0010:eb_validate_vmas+0x114/0xd80 [i915]
...
<4> [260.428756] Call Trace:
<4> [260.431192]  <TASK>
<4> [639.283393]  i915_gem_do_execbuffer+0xd05/0x26d0 [i915]
<4> [639.305245]  i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915]
...
<4> [639.411134]  </TASK>
...
<4> [639.449979] ---[ end trace 0000000000000000 ]---

We defer actually closing, unbinding and destroying a VMA until next idle
point, or until the object is freed in the meantime.  By postponing the
unbind, we allow for the VMA to be reopened by the client, avoiding the
work required to rebind the VMA.

Starting from commit b0647a5e79b1 ("drm/i915: Avoid live-lock with
i915_vma_parked()"), we assume that as long as a GT is held idle, no VMA
would be reopened while we destroy them.  That assumption is no longer
true in multi-GT configurations, where a VMA we reopen may be handled by a
GT different from the one that we already keep active via its engine while
we set up an execbuf request.

Restoring the extra GT0 PM wakeref removed from i915_gem_do_execbuffer()
processing path seems to fix this issue.

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10608
Signed-off-by: Janusz Krzysztofik 
Cc: Rodrigo Vivi 
Cc: Nirmoy Das 
---
   drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 18 ++
   1 file changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 42619fc05de48..090724fa766c9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -255,6 +255,7 @@ struct i915_execbuffer {
 	struct intel_context *context; /* logical state for the request */
 	struct i915_gem_context *gem_context; /** caller's context */

Re: [PATCH] Revert "drm/i915: Remove extra multi-gt pm-references"

2024-05-07 Thread Nirmoy Das

Hi Janusz,


Just realized we need a Fixes tag for this.

Fixes: 1f33dc0c1189 ("drm/i915: Remove extra multi-gt pm-references")


Regards,

Nirmoy

On 5/6/2024 8:02 PM, Janusz Krzysztofik wrote:

This reverts commit 1f33dc0c1189efb9ae19c6fc22b64dd3e26261fb.

There was a patch supposed to fix an issue of illegal attempts to free a
still active i915 VMA object when parking a GT believed to be idle,
reported by CI on 2-GT Meteor Lake.  As a solution, an extra wakeref for
a Primary GT was acquired from i915_gem_do_execbuffer() -- see commit
f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform").

However, that fix proved insufficient -- the issue was still reported by
CI.  That wakeref was released on exit from i915_gem_do_execbuffer(), then
potentially before completion of the request and deactivation of its
associated VMAs.  Moreover, CI reports indicated that single-GT platforms
also suffered sporadically from the same race.

Since that issue was fixed by another commit f3c71b2ded5c ("drm/i915/vma:
Fix UAF on destroy against retire race"), the changes introduced by that
insufficient fix were dropped as no longer useful.  However, that series
resulted in another VMA UAF scenario now being triggered in CI.

<4> [260.290809] ------------[ cut here ]------------
<4> [260.290988] list_del corruption. prev->next should be ffff888118c5d990, but was ffff888118c5a510. (prev=ffff888118c5a510)
<4> [260.291004] WARNING: CPU: 2 PID: 1143 at lib/list_debug.c:62 __list_del_entry_valid_or_report+0xb7/0xe0
..
<4> [260.291055] CPU: 2 PID: 1143 Comm: kms_plane Not tainted 6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1
<4> [260.291058] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024
<4> [260.291060] RIP: 0010:__list_del_entry_valid_or_report+0xb7/0xe0
...
<4> [260.291087] Call Trace:
<4> [260.291089]  <TASK>
<4> [260.291124]  i915_vma_reopen+0x43/0x80 [i915]
<4> [260.291298]  eb_lookup_vmas+0x9cb/0xcc0 [i915]
<4> [260.291579]  i915_gem_do_execbuffer+0xc9a/0x26d0 [i915]
<4> [260.291883]  i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915]
...
<4> [260.292301]  </TASK>
...
<4> [260.292506] ---[ end trace 0000000000000000 ]---
<4> [260.292782] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6ca3: 0000 [#1] PREEMPT SMP NOPTI
<4> [260.303575] CPU: 2 PID: 1143 Comm: kms_plane Tainted: G        W          6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1
<4> [260.313851] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024
<4> [260.326359] RIP: 0010:eb_validate_vmas+0x114/0xd80 [i915]
...
<4> [260.428756] Call Trace:
<4> [260.431192]  <TASK>
<4> [639.283393]  i915_gem_do_execbuffer+0xd05/0x26d0 [i915]
<4> [639.305245]  i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915]
...
<4> [639.411134]  </TASK>
...
<4> [639.449979] ---[ end trace 0000000000000000 ]---

We defer actually closing, unbinding and destroying a VMA until next idle
point, or until the object is freed in the meantime.  By postponing the
unbind, we allow for the VMA to be reopened by the client, avoiding the
work required to rebind the VMA.

Starting from commit b0647a5e79b1 ("drm/i915: Avoid live-lock with
i915_vma_parked()"), we assume that as long as a GT is held idle, no VMA
would be reopened while we destroy them.  That assumption is no longer
true in multi-GT configurations, where a VMA we reopen may be handled by a
GT different from the one that we already keep active via its engine while
we set up an execbuf request.

Restoring the extra GT0 PM wakeref removed from i915_gem_do_execbuffer()
processing path seems to fix this issue.

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10608
Signed-off-by: Janusz Krzysztofik 
Cc: Rodrigo Vivi 
Cc: Nirmoy Das 
---
  drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 18 ++
  1 file changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 42619fc05de48..090724fa766c9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -255,6 +255,7 @@ struct i915_execbuffer {
 	struct intel_context *context; /* logical state for the request */
 	struct i915_gem_context *gem_context; /** caller's context */
 	intel_wakeref_t wakeref;
+	intel_wakeref_t wakeref_gt0;
 
 	/** our requests to build */
 	struct i915_request *requests[MAX_ENGINE_INSTANCE + 1];
@@ -2685,6 +2686,7 @@ static int
 eb_select_engine(struct i915_execbuffer *eb)
 {
 	struct intel_context *ce, *child;
+	struct intel_gt *gt;
 	unsigned int idx;
 	int err;
 
@@ -2708,10 +2710,17 @@ eb_sel

Re: [PATCH] Revert "drm/i915: Remove extra multi-gt pm-references"

2024-05-06 Thread Nirmoy Das


On 5/6/2024 8:02 PM, Janusz Krzysztofik wrote:

This reverts commit 1f33dc0c1189efb9ae19c6fc22b64dd3e26261fb.

There was a patch supposed to fix an issue of illegal attempts to free a
still active i915 VMA object when parking a GT believed to be idle,
reported by CI on 2-GT Meteor Lake.  As a solution, an extra wakeref for
a Primary GT was acquired from i915_gem_do_execbuffer() -- see commit
f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform").

However, that fix proved insufficient -- the issue was still reported by
CI.  That wakeref was released on exit from i915_gem_do_execbuffer(), then
potentially before completion of the request and deactivation of its
associated VMAs.  Moreover, CI reports indicated that single-GT platforms
also suffered sporadically from the same race.

Since that issue was fixed by another commit f3c71b2ded5c ("drm/i915/vma:
Fix UAF on destroy against retire race"), the changes introduced by that
insufficient fix were dropped as no longer useful.  However, that series
resulted in another VMA UAF scenario now being triggered in CI.

<4> [260.290809] ------------[ cut here ]------------
<4> [260.290988] list_del corruption. prev->next should be ffff888118c5d990, but was ffff888118c5a510. (prev=ffff888118c5a510)
<4> [260.291004] WARNING: CPU: 2 PID: 1143 at lib/list_debug.c:62 __list_del_entry_valid_or_report+0xb7/0xe0
..
<4> [260.291055] CPU: 2 PID: 1143 Comm: kms_plane Not tainted 6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1
<4> [260.291058] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024
<4> [260.291060] RIP: 0010:__list_del_entry_valid_or_report+0xb7/0xe0
...
<4> [260.291087] Call Trace:
<4> [260.291089]  <TASK>
<4> [260.291124]  i915_vma_reopen+0x43/0x80 [i915]
<4> [260.291298]  eb_lookup_vmas+0x9cb/0xcc0 [i915]
<4> [260.291579]  i915_gem_do_execbuffer+0xc9a/0x26d0 [i915]
<4> [260.291883]  i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915]
...
<4> [260.292301]  </TASK>
...
<4> [260.292506] ---[ end trace 0000000000000000 ]---
<4> [260.292782] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6ca3: 0000 [#1] PREEMPT SMP NOPTI
<4> [260.303575] CPU: 2 PID: 1143 Comm: kms_plane Tainted: G        W          6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1
<4> [260.313851] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024
<4> [260.326359] RIP: 0010:eb_validate_vmas+0x114/0xd80 [i915]
...
<4> [260.428756] Call Trace:
<4> [260.431192]  <TASK>
<4> [639.283393]  i915_gem_do_execbuffer+0xd05/0x26d0 [i915]
<4> [639.305245]  i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915]
...
<4> [639.411134]  </TASK>
...
<4> [639.449979] ---[ end trace 0000000000000000 ]---

We defer actually closing, unbinding and destroying a VMA until next idle
point, or until the object is freed in the meantime.  By postponing the
unbind, we allow for the VMA to be reopened by the client, avoiding the
work required to rebind the VMA.

Starting from commit b0647a5e79b1 ("drm/i915: Avoid live-lock with
i915_vma_parked()"), we assume that as long as a GT is held idle, no VMA
would be reopened while we destroy them.  That assumption is no longer
true in multi-GT configurations, where a VMA we reopen may be handled by a
GT different from the one that we already keep active via its engine while
we set up an execbuf request.

Restoring the extra GT0 PM wakeref removed from i915_gem_do_execbuffer()
processing path seems to fix this issue.
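
For readers following the reasoning above, a minimal toy model in plain C of
why the extra GT0 wakeref matters.  This is not i915 code: toy_gt,
toy_gt_get(), toy_gt_put(), toy_gt_may_park() and toy_lookup_is_safe() are
hypothetical names, and "parking" is reduced to a wakeref count reaching
zero.  It only illustrates the assumption that closed VMAs are safe from
destruction as long as the GT that owns them is held awake.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a GT: just a wakeref count and a park counter. */
struct toy_gt {
	int wakeref;
	int parked_count;
};

void toy_gt_get(struct toy_gt *gt)
{
	gt->wakeref++;
}

/* Returns true if dropping this reference parked the GT. */
bool toy_gt_put(struct toy_gt *gt)
{
	if (--gt->wakeref == 0) {
		gt->parked_count++;
		return true;
	}
	return false;
}

/* In this model a GT may park (and destroy closed VMAs) only when idle. */
bool toy_gt_may_park(const struct toy_gt *gt)
{
	return gt->wakeref == 0;
}

/*
 * Model of the VMA lookup window: the engine's GT is always held awake;
 * GT0 is held as well only when hold_gt0 is set.  The lookup is "safe"
 * if GT0, which owns the VMA in this model, cannot park meanwhile.
 */
bool toy_lookup_is_safe(struct toy_gt *gt0, struct toy_gt *engine_gt,
			bool hold_gt0)
{
	bool extra = hold_gt0 && engine_gt != gt0;
	bool safe;

	toy_gt_get(engine_gt);
	if (extra)
		toy_gt_get(gt0);

	safe = !toy_gt_may_park(gt0);

	if (extra)
		toy_gt_put(gt0);
	toy_gt_put(engine_gt);

	return safe;
}
```

With a single GT the engine wakeref already covers the VMA; with the engine
on a second GT the lookup is only safe in this model when GT0 is held too.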

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10608
Signed-off-by: Janusz Krzysztofik
Cc: Rodrigo Vivi
Cc: Nirmoy Das


Reviewed-by: Nirmoy Das 


---
  drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 18 ++
  1 file changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 42619fc05de48..090724fa766c9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -255,6 +255,7 @@ struct i915_execbuffer {
struct intel_context *context; /* logical state for the request */
struct i915_gem_context *gem_context; /** caller's context */
intel_wakeref_t wakeref;
+   intel_wakeref_t wakeref_gt0;
  
  	/** our requests to build */

struct i915_request *requests[MAX_ENGINE_INSTANCE + 1];
@@ -2685,6 +2686,7 @@ static int
  eb_select_engine(struct i915_execbuffer *eb)
  {
struct intel_context *ce, *child;
+   struct intel_gt *gt;
unsigned int idx;
int err;
  
@@ -2708,10 +2710,17 @@ eb_select_engine(struct i915_execbuffer *eb)

}
}
eb->num_batches = ce->parallel.number_children + 1;
+   gt = ce->eng

Re: [PATCH v3] drm/i915/vma: Fix UAF on reopen vs destroy race

2024-05-06 Thread Nirmoy Das
aring after I reverted that workaround.  However, its
effectiveness is limited to MTL topology.

perhaps the safer path for this case indeed. something that could be really
limited to a single platform would be better.



I agree with Rodrigo here. It would be safe to revert the mentioned patch
now and think about a more robust solution later on, as the issue is
affecting current users.


Regards,

Nirmoy



But I confess that I don't have other better suggestions.
If we need to go with this patch as a quick solution, it is apparently
better than leaving the bug there as is.

+Thomas. Any good thoughts or advice there?

Thanks,
Rodrigo.


Thanks,
Janusz


Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10608
Signed-off-by: Janusz Krzysztofik 
Cc: Chris Wilson 
Cc: Tvrtko Ursulin 
Cc: sta...@vger.kernel.org # v6.0+
---
  .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 10 --
  drivers/gpu/drm/i915/i915_vma.c   | 32 +++
  drivers/gpu/drm/i915/i915_vma.h   |  2 +-
  drivers/gpu/drm/i915/i915_vma_types.h |  3 ++
  4 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 42619fc05de48..97e014f94002e 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -847,9 +847,12 @@ static int __eb_add_lut(struct i915_execbuffer *eb,
if (unlikely(!lut))
return -ENOMEM;
  
+	if (!i915_vma_open(vma)) {

+   err = -EEXIST;  /* let eb_vma_lookup() retry */
+   goto err_lut_free;
+   }
+
i915_vma_get(vma);
-   if (!atomic_fetch_inc(>open_count))
-   i915_vma_reopen(vma);
lut->handle = handle;
lut->ctx = ctx;
  
@@ -880,8 +883,9 @@ static int __eb_add_lut(struct i915_execbuffer *eb,

return 0;
  
  err:

-   i915_vma_close(vma);
i915_vma_put(vma);
+   i915_vma_close(vma);
+err_lut_free:
i915_lut_handle_free(lut);
return err;
  }
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index d2f064d2525cc..4435c76f28c8c 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -1735,14 +1735,33 @@ static void __i915_vma_remove_closed(struct i915_vma 
*vma)
list_del_init(>closed_link);
  }
  
-void i915_vma_reopen(struct i915_vma *vma)

+static struct i915_vma *i915_vma_reopen(struct i915_vma *vma)
+{
+   if (atomic_read(>flags) & I915_VMA_PARKED)
+   return NULL;
+
+   __i915_vma_remove_closed(vma);
+   return vma;
+}
+
+struct i915_vma *i915_vma_open(struct i915_vma *vma)
  {
struct intel_gt *gt = vma->vm->gt;
  
+	if (atomic_inc_not_zero(>open_count))

+   return vma;
+
spin_lock_irq(>closed_lock);
-   if (i915_vma_is_closed(vma))
-   __i915_vma_remove_closed(vma);
+   if (!atomic_inc_not_zero(>open_count)) {
+   if (i915_vma_is_closed(vma))
+   vma = i915_vma_reopen(vma);
+
+   if (vma)
+   atomic_inc(>open_count);
+   }
spin_unlock_irq(>closed_lock);
+
+   return vma;
  }
  
  static void force_unbind(struct i915_vma *vma)

@@ -1770,7 +1789,8 @@ static void release_references(struct i915_vma *vma, 
struct intel_gt *gt,
spin_unlock(>vma.lock);
  
  	spin_lock_irq(>closed_lock);

-   __i915_vma_remove_closed(vma);
+   if (!(atomic_read(>flags) & I915_VMA_PARKED))
+   __i915_vma_remove_closed(vma);
spin_unlock_irq(>closed_lock);
  
  	if (vm_ddestroy)

@@ -1854,22 +1874,22 @@ void i915_vma_parked(struct intel_gt *gt)
}
  
  		list_move(>closed_link, );

+   atomic_or(I915_VMA_PARKED, >flags);
}
spin_unlock_irq(>closed_lock);
  
-	/* As the GT is held idle, no vma can be reopened as we destroy them */

list_for_each_entry_safe(vma, next, , closed_link) {
struct drm_i915_gem_object *obj = vma->obj;
struct i915_address_space *vm = vma->vm;
  
  		if (i915_gem_object_trylock(obj, NULL)) {

-   INIT_LIST_HEAD(>closed_link);
i915_vma_destroy(vma);
i915_gem_object_unlock(obj);
} else {
/* back you go.. */
spin_lock_irq(>closed_lock);
list_add(>closed_link, >closed_vma);
+   atomic_andnot(I915_VMA_PARKED, >flags);
spin_unlock_irq(>closed_lock);
}
  
diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h

index e356dfb883d34..331d19672c764 100644
--- a/drivers/gpu/drm/i915/i915_vma.h
+++ b/drivers/gpu/drm/i915/i915_vma.h
@@ -268,7 +268,7 @

Re: [PATCH] drm/i915/gt: Disarm breadcrumbs if engines are already idle

2024-04-26 Thread Nirmoy Das



On 4/23/2024 6:23 PM, Janusz Krzysztofik wrote:

From: Chris Wilson 

The breadcrumbs use a GT wakeref for guarding the interrupt, but are
disarmed during release of the engine wakeref. This leaves a hole where
we may attach a breadcrumb just as the engine is parking (after it has
parked its breadcrumbs), execute the irq worker with some signalers still
attached, but never be woken again.

That issue manifests itself in CI with IGT runner timeouts while tests
are waiting indefinitely for release of all GT wakerefs.

<6> [209.151778] i915: Running live_engine_pm_selftests/live_engine_busy_stats
<7> [209.231628] i915 :00:02.0: [drm:intel_power_well_disable [i915]] 
disabling PW_5
<7> [209.231816] i915 :00:02.0: [drm:intel_power_well_disable [i915]] 
disabling PW_4
<7> [209.231944] i915 :00:02.0: [drm:intel_power_well_disable [i915]] 
disabling PW_3
<7> [209.232056] i915 :00:02.0: [drm:intel_power_well_disable [i915]] 
disabling PW_2
<7> [209.232166] i915 :00:02.0: [drm:intel_power_well_disable [i915]] 
disabling DC_off
<7> [209.232270] i915 :00:02.0: [drm:skl_enable_dc6 [i915]] Enabling DC6
<7> [209.232368] i915 :00:02.0: [drm:gen9_set_dc_state.part.0 [i915]] 
Setting DC state from 00 to 02
<4> [299.356116] [IGT] Inactivity timeout exceeded. Killing the current test 
with SIGQUIT.
...
<6> [299.356526] sysrq: Show State
...
<6> [299.373964] task:i915_selftest   state:D stack:11784 pid:5578  tgid:5578  
ppid:873flags:0x4002
<6> [299.373967] Call Trace:
<6> [299.373968]  
<6> [299.373970]  __schedule+0x3bb/0xda0
<6> [299.373974]  schedule+0x41/0x110
<6> [299.373976]  intel_wakeref_wait_for_idle+0x82/0x100 [i915]
<6> [299.374083]  ? __pfx_var_wake_function+0x10/0x10
<6> [299.374087]  live_engine_busy_stats+0x9b/0x500 [i915]
<6> [299.374173]  __i915_subtests+0xbe/0x240 [i915]
<6> [299.374277]  ? __pfx___intel_gt_live_setup+0x10/0x10 [i915]
<6> [299.374369]  ? __pfx___intel_gt_live_teardown+0x10/0x10 [i915]
<6> [299.374456]  intel_engine_live_selftests+0x1c/0x30 [i915]
<6> [299.374547]  __run_selftests+0xbb/0x190 [i915]
<6> [299.374635]  i915_live_selftests+0x4b/0x90 [i915]
<6> [299.374717]  i915_pci_probe+0x10d/0x210 [i915]

At the end of the interrupt worker, if there are no more engines awake,
disarm the breadcrumb and go to sleep.
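
A minimal sketch of the two checks at the end of the interrupt worker, as
described above.  This is a simplified model, not the driver code:
toy_breadcrumbs and toy_signal_irq_work() are hypothetical names, the
signaler list is reduced to a counter, and no locking or atomics are
modelled.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the breadcrumbs state. */
struct toy_breadcrumbs {
	bool irq_armed;   /* is the breadcrumb interrupt enabled? */
	int signalers;    /* contexts still waiting to be signalled */
	int active;       /* engines still awake */
};

/* Tail of the worker: lazily arm, then confirm we still want the irq. */
void toy_signal_irq_work(struct toy_breadcrumbs *b)
{
	/* Lazy irq enabling after HW submission */
	if (!b->irq_armed && b->signalers > 0)
		b->irq_armed = true;

	/* No engine is awake any more: disarm so the wakeref can be dropped */
	if (b->irq_armed && b->active == 0)
		b->irq_armed = false;
}
```

In this model the interrupt never stays armed once no engine is awake, even
if a signaler was attached just as the engine parked.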

Fixes: 9d5612ca165a ("drm/i915/gt: Defer enabling the breadcrumb interrupt to after 
submission")
Closes: https://gitlab.freedesktop.org/drm/intel/issues/10026
Signed-off-by: Chris Wilson 
Cc: Andrzej Hajda 
Cc:  # v5.12+
Signed-off-by: Janusz Krzysztofik 



Acked-by: Nirmoy Das 

I will let others/Andrzej r-b this as I am not very familiar with the code.


Thanks,

Nirmoy


---
  drivers/gpu/drm/i915/gt/intel_breadcrumbs.c | 15 +++
  1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c 
b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
index d650beb8ed22f..20b9b04ec1e0b 100644
--- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
@@ -263,8 +263,13 @@ static void signal_irq_work(struct irq_work *work)
i915_request_put(rq);
}
  
+	/* Lazy irq enabling after HW submission */

if (!READ_ONCE(b->irq_armed) && !list_empty(>signalers))
intel_breadcrumbs_arm_irq(b);
+
+   /* And confirm that we still want irqs enabled before we yield */
+   if (READ_ONCE(b->irq_armed) && !atomic_read(>active))
+   intel_breadcrumbs_disarm_irq(b);
  }
  
  struct intel_breadcrumbs *

@@ -315,13 +320,7 @@ void __intel_breadcrumbs_park(struct intel_breadcrumbs *b)
return;
  
  	/* Kick the work once more to drain the signalers, and disarm the irq */

-   irq_work_sync(>irq_work);
-   while (READ_ONCE(b->irq_armed) && !atomic_read(>active)) {
-   local_irq_disable();
-   signal_irq_work(>irq_work);
-   local_irq_enable();
-   cond_resched();
-   }
+   irq_work_queue(>irq_work);
  }
  
  void intel_breadcrumbs_free(struct kref *kref)

@@ -404,7 +403,7 @@ static void insert_breadcrumb(struct i915_request *rq)
 * the request as it may have completed and raised the interrupt as
 * we were attaching it into the lists.
 */
-   if (!b->irq_armed || __i915_request_is_complete(rq))
+   if (!READ_ONCE(b->irq_armed) || __i915_request_is_complete(rq))
irq_work_queue(>irq_work);
  }
  


Re: [PATCH v2 2/2] drm/i915: Fix gt reset with GuC submission is disabled

2024-04-23 Thread Nirmoy Das

Hi Andi,

On 4/23/2024 11:32 AM, Andi Shyti wrote:

Hi Nirmoy,

On Mon, Apr 22, 2024 at 10:19:51PM +0200, Nirmoy Das wrote:

Currently intel_gt_reset() kills the GuC and then resets requested
engines. This is problematic because there is a dedicated CSB FIFO
which only the GuC can access. If that FIFO fills up, the hardware
will block on the next context switch until there is space, which means
the system is effectively hung. If an engine is reset whilst actively
executing a context, a CSB entry will be sent to say that the context
has gone idle. Thus if a reset happens on a very busy system, then
killing the GuC before killing the engines will lead to a deadlock
because the CSB FIFO fills up.

is this a fix?


I went quite far back in the commit logs, and it appears to me that 
we've always been using the current reset flow.


I believe we don't perform a GT reset immediately after sending a number 
of requests, which is what the current failed test is doing.


So, I don't think there will be any visible impact on the user with the 
current flow.





To address this issue, the GuC should be killed only after resetting
the requested engines and before calling intel_gt_init_hw().

v2: Improve commit message(John)

Cc: John Harrison 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++--
  1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index b1393863ca9b..6161f7a3ff70 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -879,8 +879,17 @@ static intel_engine_mask_t reset_prepare(struct intel_gt 
*gt)
intel_engine_mask_t awake = 0;
enum intel_engine_id id;
  
-	/* For GuC mode, ensure submission is disabled before stopping ring */

-   intel_uc_reset_prepare(>uc);
+   /**
+* For GuC mode with submission enabled, ensure submission
+* is disabled before stopping ring.

nit: "stopping *the* ring"

Will fix it while merging if I don't have to resend this again.



+*
+* For GuC mode with submission disabled, ensure that GuC is not
+* sanitized, do that after engine reset. reset_prepare()
+* is followed by engine reset which in this mode requires GuC to
+* process any CSB FIFO entries generated by the resets.
+*/
+   if (intel_uc_uses_guc_submission(>uc))
+   intel_uc_reset_prepare(>uc);
  
  	for_each_engine(engine, gt, id) {

if (intel_engine_pm_get_if_awake(engine))
@@ -1227,6 +1236,9 @@ void intel_gt_reset(struct intel_gt *gt,
  
  	intel_overlay_reset(gt->i915);
  
+	/* sanitize uC after engine reset */

+   if (!intel_uc_uses_guc_submission(>uc))
+   intel_uc_reset_prepare(>uc);

Reviewed-by: Andi Shyti 


Thanks,

Nirmoy


Thanks,
Andi


Re: [PATCH] drm/i915/gt: Refactor uabi engine class/instance list creation

2024-04-23 Thread Nirmoy Das

Hi Andi,

On 4/17/2024 12:49 AM, Andi Shyti wrote:

For the upcoming changes we need a cleaner way to build the list
of uabi engines.

Suggested-by: Tvrtko Ursulin
Signed-off-by: Andi Shyti
---
Hi,

just sending this patch to unburden the coming series from this
single patch inherited from a previously sent series.

Andi

  drivers/gpu/drm/i915/gt/intel_engine_user.c | 29 -
  1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c 
b/drivers/gpu/drm/i915/gt/intel_engine_user.c
index 833987015b8b..11cc06c0c785 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -203,7 +203,7 @@ static void engine_rename(struct intel_engine_cs *engine, 
const char *name, u16
  
  void intel_engines_driver_register(struct drm_i915_private *i915)

  {
-   u16 name_instance, other_instance = 0;
+   u16 class_instance[I915_LAST_UABI_ENGINE_CLASS + 2] = { };

+2 is confusing here. I think we need a better macro.

struct legacy_ring ring = {};
struct list_head *it, *next;
struct rb_node **p, *prev;
@@ -214,6 +214,8 @@ void intel_engines_driver_register(struct drm_i915_private 
*i915)
prev = NULL;
p = >uabi_engines.rb_node;
list_for_each_safe(it, next, ) {
+   u16 uabi_class;
+
struct intel_engine_cs *engine =
container_of(it, typeof(*engine), uabi_list);
  
@@ -222,15 +224,14 @@ void intel_engines_driver_register(struct drm_i915_private *i915)
  
  		GEM_BUG_ON(engine->class >= ARRAY_SIZE(uabi_classes));

engine->uabi_class = uabi_classes[engine->class];
-   if (engine->uabi_class == I915_NO_UABI_CLASS) {
-   name_instance = other_instance++;
-   } else {
-   GEM_BUG_ON(engine->uabi_class >=
-  ARRAY_SIZE(i915->engine_uabi_class_count));
-   name_instance =
-   
i915->engine_uabi_class_count[engine->uabi_class]++;
-   }
-   engine->uabi_instance = name_instance;
+
+   if (engine->uabi_class == I915_NO_UABI_CLASS)
+   uabi_class = I915_LAST_UABI_ENGINE_CLASS + 1;
+   else
+   uabi_class = engine->uabi_class;
+
+   GEM_BUG_ON(uabi_class >= ARRAY_SIZE(class_instance));
+   engine->uabi_instance = class_instance[uabi_class]++;
  
  		/*

 * Replace the internal name with the final user and log facing
@@ -238,11 +239,15 @@ void intel_engines_driver_register(struct 
drm_i915_private *i915)
 */
engine_rename(engine,
  intel_engine_class_repr(engine->class),
- name_instance);
+ engine->uabi_instance);
  
-		if (engine->uabi_class == I915_NO_UABI_CLASS)

+   if (uabi_class > I915_LAST_UABI_ENGINE_CLASS)
continue;
  
+		GEM_BUG_ON(uabi_class >=

+  ARRAY_SIZE(i915->engine_uabi_class_count));
+   i915->engine_uabi_class_count[uabi_class]++;
Shouldn't this be
i915->engine_uabi_class_count[uabi_class] = class_instance[uabi_class]; ?


What I see is that this patch mainly adds this class_instance array
and the rest looks the same.
Maybe it makes sense to include the other upcoming patches so that it is
easier to understand why we need this patch.
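
To make the "+2" point concrete, a small standalone sketch of per-class
instance numbering with one extra slot for engines that have no uabi class.
All names and values here (TOY_LAST_UABI_CLASS, TOY_NO_UABI_CLASS,
TOY_NUM_SLOTS, toy_slot(), toy_assign_instance()) are made up for
illustration and are not the i915 definitions; the only idea shown is that
naming the array size makes the extra slot explicit.

```c
/* Hypothetical values, chosen only for this example. */
#define TOY_LAST_UABI_CLASS	3
#define TOY_NO_UABI_CLASS	0xffff
/* One slot per class 0..LAST, plus one shared slot for "no uabi class". */
#define TOY_NUM_SLOTS		(TOY_LAST_UABI_CLASS + 2)

/* Map a class to its counter slot; classless engines share the last slot. */
unsigned int toy_slot(unsigned short uabi_class)
{
	if (uabi_class == TOY_NO_UABI_CLASS)
		return TOY_LAST_UABI_CLASS + 1;
	return uabi_class;
}

/* Hand out the next instance number for this class and bump its counter. */
unsigned short toy_assign_instance(unsigned short counters[TOY_NUM_SLOTS],
				   unsigned short uabi_class)
{
	return counters[toy_slot(uabi_class)]++;
}
```

After all engines are registered, counters[class] holds the number of
instances of that class, which is what the question above is getting at.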


Regards,
Nirmoy


+
rb_link_node(>uabi_node, prev, p);
rb_insert_color(>uabi_node, >uabi_engines);
  

[PATCH v2 1/2] drm/i915: Refactor confusing __intel_gt_reset()

2024-04-22 Thread Nirmoy Das
__intel_gt_reset() is really for resetting engines, though
the name might suggest something else. So add a helper function
to remove the confusion, with no functional changes.
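
The shape of the refactor can be sketched outside the driver as below.  This
is a hypothetical standalone model, not the patch itself: toy_reset_gt,
toy_reset_engine, TOY_ALL_ENGINES and the toy_* functions are invented names,
and the "reset" merely records which mask was requested.

```c
typedef unsigned int toy_engine_mask_t;

#define TOY_ALL_ENGINES (~0u)

/* Hypothetical GT that just remembers the last reset mask it was given. */
struct toy_reset_gt {
	toy_engine_mask_t last_reset_mask;
};

struct toy_reset_engine {
	struct toy_reset_gt *gt;
	toy_engine_mask_t mask;
};

/* Low-level primitive: resets the engines in the mask (name says little). */
int toy_gt_reset_mask(struct toy_reset_gt *gt, toy_engine_mask_t mask)
{
	gt->last_reset_mask = mask;
	return 0;
}

/* Thin wrappers whose names say what is being reset. */
int toy_gt_reset_all_engines(struct toy_reset_gt *gt)
{
	return toy_gt_reset_mask(gt, TOY_ALL_ENGINES);
}

int toy_gt_reset_engine(struct toy_reset_engine *engine)
{
	return toy_gt_reset_mask(engine->gt, engine->mask);
}
```

Callers then use the named wrappers and the mask-based primitive can stay
private to one file.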

v2: Move intel_gt_reset_all_engines() next to
intel_gt_reset_engine() to make diff simple(John)

Cc: John Harrison 
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +-
 .../drm/i915/gt/intel_execlists_submission.c  |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt.c|  2 +-
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |  2 +-
 drivers/gpu/drm/i915/gt/intel_reset.c | 35 +++
 drivers/gpu/drm/i915/gt/intel_reset.h |  3 +-
 drivers/gpu/drm/i915/gt/selftest_reset.c  |  2 +-
 drivers/gpu/drm/i915/i915_driver.c|  2 +-
 8 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 8c44af1c3451..5c8e9ee3b008 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -678,7 +678,7 @@ void intel_engines_release(struct intel_gt *gt)
 */
GEM_BUG_ON(intel_gt_pm_is_awake(gt));
if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
-   __intel_gt_reset(gt, ALL_ENGINES);
+   intel_gt_reset_all_engines(gt);
 
/* Decouple the backend; but keep the layout for late GPU resets */
for_each_engine(engine, gt, id) {
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c 
b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 355aab5b38ba..21829439e686 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -2898,7 +2898,7 @@ static void enable_error_interrupt(struct intel_engine_cs 
*engine)
drm_err(>i915->drm,
"engine '%s' resumed still in error: %08x\n",
engine->name, status);
-   __intel_gt_reset(engine->gt, engine->mask);
+   intel_gt_reset_engine(engine);
}
 
/*
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c 
b/drivers/gpu/drm/i915/gt/intel_gt.c
index 580b5141ce1e..626b166e67ef 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -832,7 +832,7 @@ void intel_gt_driver_unregister(struct intel_gt *gt)
 
/* Scrub all HW state upon release */
with_intel_runtime_pm(gt->uncore->rpm, wakeref)
-   __intel_gt_reset(gt, ALL_ENGINES);
+   intel_gt_reset_all_engines(gt);
 }
 
 void intel_gt_driver_release(struct intel_gt *gt)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 220ac4f92edf..c08fdb65cc69 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -159,7 +159,7 @@ static bool reset_engines(struct intel_gt *gt)
if (INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
return false;
 
-   return __intel_gt_reset(gt, ALL_ENGINES) == 0;
+   return intel_gt_reset_all_engines(gt) == 0;
 }
 
 static void gt_sanitize(struct intel_gt *gt, bool force)
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index c8e9aa41fdea..b1393863ca9b 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -764,7 +764,7 @@ wa_14015076503_end(struct intel_gt *gt, intel_engine_mask_t 
engine_mask)
 HECI_H_GS1_ER_PREP, 0);
 }
 
-int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask)
+static int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t 
engine_mask)
 {
const int retries = engine_mask == ALL_ENGINES ? RESET_MAX_RETRIES : 1;
reset_func reset;
@@ -978,7 +978,7 @@ static void __intel_gt_set_wedged(struct intel_gt *gt)
 
/* Even if the GPU reset fails, it should still stop the engines */
if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
-   __intel_gt_reset(gt, ALL_ENGINES);
+   intel_gt_reset_all_engines(gt);
 
for_each_engine(engine, gt, id)
engine->submit_request = nop_submit_request;
@@ -1089,7 +1089,7 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt)
/* We must reset pending GPU events before restoring our submission */
ok = !HAS_EXECLISTS(gt->i915); /* XXX better agnosticism desired */
if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
-   ok = __intel_gt_reset(gt, ALL_ENGINES) == 0;
+   ok = intel_gt_reset_all_engines(gt) == 0;
if (!ok) {
/*
 * Warn CI about the unrecoverable wedged condition.
@@ -1133,10 +1133,10 @@ static int do_reset(struct intel_gt *gt, 
intel_engine_mask_t stalled_mask)
 {
int err, i;
 

[PATCH v2 2/2] drm/i915: Fix gt reset with GuC submission is disabled

2024-04-22 Thread Nirmoy Das
Currently intel_gt_reset() kills the GuC and then resets requested
engines. This is problematic because there is a dedicated CSB FIFO
which only the GuC can access. If that FIFO fills up, the hardware
will block on the next context switch until there is space, which means
the system is effectively hung. If an engine is reset whilst actively
executing a context, a CSB entry will be sent to say that the context
has gone idle. Thus if a reset happens on a very busy system, then
killing the GuC before killing the engines will lead to a deadlock
because the CSB FIFO fills up.

To address this issue, the GuC should be killed only after resetting
the requested engines and before calling intel_gt_init_hw().

v2: Improve commit message(John)

Cc: John Harrison 
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index b1393863ca9b..6161f7a3ff70 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -879,8 +879,17 @@ static intel_engine_mask_t reset_prepare(struct intel_gt 
*gt)
intel_engine_mask_t awake = 0;
enum intel_engine_id id;
 
-   /* For GuC mode, ensure submission is disabled before stopping ring */
-   intel_uc_reset_prepare(>uc);
+   /**
+* For GuC mode with submission enabled, ensure submission
+* is disabled before stopping ring.
+*
+* For GuC mode with submission disabled, ensure that GuC is not
+* sanitized, do that after engine reset. reset_prepare()
+* is followed by engine reset which in this mode requires GuC to
+* process any CSB FIFO entries generated by the resets.
+*/
+   if (intel_uc_uses_guc_submission(>uc))
+   intel_uc_reset_prepare(>uc);
 
for_each_engine(engine, gt, id) {
if (intel_engine_pm_get_if_awake(engine))
@@ -1227,6 +1236,9 @@ void intel_gt_reset(struct intel_gt *gt,
 
intel_overlay_reset(gt->i915);
 
+   /* sanitize uC after engine reset */
+   if (!intel_uc_uses_guc_submission(>uc))
+   intel_uc_reset_prepare(>uc);
/*
 * Next we need to restore the context, but we don't use those
 * yet either...
-- 
2.42.0



Re: [PATCH 3/3] drm/i915: Fix gt reset with GuC submission disabled

2024-04-19 Thread Nirmoy Das

Hi John,

On 4/19/2024 1:38 AM, John Harrison wrote:

On 4/18/2024 10:10, Nirmoy Das wrote:

Currently intel_gt_reset() happens as follows:

reset_prepare() ---> Sends GDRST to GuC, GuC is in GS_MIA_IN_RESET
do_reset()
   intel_gt_reset_all_engines()
 *_engine_reset_prepare() -->RESET_CTL expects running GuC
Not technically correct. There is no direct connection between 
RESET_CTL and GuC.



 *_reset_engines()
intel_gt_init_hw() --> GuC comes out of GS_MIA_IN_RESET with FW loaded.

Fix the issue by sanitizing the GuC only after resetting requested
engines and before intel_gt_init_hw().

You never actually state what the issue is.

The problem is that there is a dedicated CSB FIFO going to GuC (and 
nothing else has access to it). If that FIFO fills up, the hardware 
will block on the next context switch until there is space. If no-one 
(i.e. GuC) is draining it, that means the system is effectively hung. 
If an engine is reset whilst actively executing a context, a CSB entry 
will be sent to say that the context has gone idle. Thus if you reset 
a very busy system and start with killing GuC before killing the 
engines and only then re-enabling GuC, you run the risk of generating 
more CSB entries than will fit in the FIFO and deadlocking. Whereas, 
if the system is idle then you can reset the engines as much as you 
like while GuC is dead and it won't be a problem.
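
The argument above can be illustrated with a tiny standalone model.  It is
only a sketch under stated assumptions: toy_reset_deadlocks() is a
hypothetical function, the FIFO depth is an arbitrary parameter rather than
a hardware value, and each busy engine is assumed to generate exactly one
CSB entry when reset.

```c
#include <stdbool.h>

/*
 * Returns true if resetting busy_engines engines would block in this model.
 * While the consumer (GuC) is alive, every entry is drained immediately.
 * While it is dead, entries pile up until the FIFO has no room left.
 */
bool toy_reset_deadlocks(unsigned int fifo_depth, unsigned int busy_engines,
			 bool guc_alive)
{
	unsigned int used = 0;
	unsigned int i;

	for (i = 0; i < busy_engines; i++) {
		if (guc_alive)
			continue;	/* entry consumed right away */

		if (used == fifo_depth)
			return true;	/* no space: hardware blocks */

		used++;
	}

	return false;
}
```

An idle system (no busy engines) never blocks in this model regardless of
GuC state, matching the last sentence above.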



I wasn't sure if I could talk about internal details, so I kept it
minimal. I will borrow the above explanation and resend :)






Note that intel_uc_reset_finish() and intel_uc_reset() are no-ops when
GuC submission is disabled.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++--
  1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c

index 6504e8ba9c58..bd166f5aca4b 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -907,8 +907,17 @@ static intel_engine_mask_t reset_prepare(struct 
intel_gt *gt)

  intel_engine_mask_t awake = 0;
  enum intel_engine_id id;
  -    /* For GuC mode, ensure submission is disabled before stopping 
ring */

-    intel_uc_reset_prepare(>uc);
+    /**
+ * For GuC mode with submission enabled, ensure submission
+ * is disabled before stopping ring.
+ *
+ * For GuC mode with submission disabled, ensure that GuC is not
+ * sanitized, do that at the end in reset_finish(). reset_prepare()
+ * is followed by engine reset which in this mode requires GuC to
+ * be functional to process engine reset events.

-> to process any CSB FIFO entries generated by the resets.


I will add this.


Thanks,

Nirmoy



John.


+ */
+    if (intel_uc_uses_guc_submission(>uc))
+    intel_uc_reset_prepare(>uc);
    for_each_engine(engine, gt, id) {
  if (intel_engine_pm_get_if_awake(engine))
@@ -1255,6 +1264,9 @@ void intel_gt_reset(struct intel_gt *gt,
    intel_overlay_reset(gt->i915);
  +    /* sanitize uC after engine reset */
+    if (!intel_uc_uses_guc_submission(>uc))
+    intel_uc_reset_prepare(>uc);
  /*
   * Next we need to restore the context, but we don't use those
   * yet either...




Re: [PATCH 2/3] drm/i915 Rename intel_engine_reset to intel_gt_engine_recover

2024-04-19 Thread Nirmoy Das

Hi John,

On 4/19/2024 1:27 AM, John Harrison wrote:

On 4/18/2024 10:10, Nirmoy Das wrote:

intel_engine_reset() not only resets an engine but also
tries to recover it, so give it a proper name without
any functional changes.

Not seeing what the difference is. If this was a super low level 
function (with an __ prefix for example) then one might expect it to 
literally just poke the reset register and leave the engine in a dead 
state. But as a high level function, I think it is reasonable to 
expect a reset function to 'recover' the entity being reset.


Also, many of the callers are tests that are explicitly testing reset. 
So now the tests all talk about attempting resets, resets failing, 
etc. but around a call to 'recover' instead of 'reset', which seems 
confusing.



Didn't think about it, I will drop it.


Thanks,

Nirmoy



John.



Signed-off-by: Nirmoy Das 
---
  .../drm/i915/gem/selftests/i915_gem_context.c |  2 +-
  .../drm/i915/gt/intel_execlists_submission.c  |  2 +-
  drivers/gpu/drm/i915/gt/intel_reset.c |  4 ++--
  drivers/gpu/drm/i915/gt/intel_reset.h |  4 ++--
  drivers/gpu/drm/i915/gt/selftest_hangcheck.c  | 20 +--
  drivers/gpu/drm/i915/gt/selftest_mocs.c   |  4 ++--
  drivers/gpu/drm/i915/gt/selftest_reset.c  |  2 +-
  .../gpu/drm/i915/gt/selftest_workarounds.c    |  6 +++---
  8 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c

index 89d4dc8b60c6..4f4cde55f621 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
@@ -1171,7 +1171,7 @@ __sseu_finish(const char *name,
  int ret = 0;
    if (flags & TEST_RESET) {
-    ret = intel_engine_reset(ce->engine, "sseu");
+    ret = intel_gt_engine_recover(ce->engine, "sseu");
  if (ret)
  goto out;
  }
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c 
b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c

index 21829439e686..9485a622a704 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -2404,7 +2404,7 @@ static void execlists_reset(struct 
intel_engine_cs *engine, const char *msg)
    ring_set_paused(engine, 1); /* Freeze the current request in 
place */

  execlists_capture(engine);
-    intel_engine_reset(engine, msg);
+    intel_gt_engine_recover(engine, msg);
tasklet_enable(>sched_engine->tasklet);
  clear_and_wake_up_bit(bit, lock);
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c

index b825daace58e..6504e8ba9c58 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1348,7 +1348,7 @@ int __intel_engine_reset_bh(struct 
intel_engine_cs *engine, const char *msg)

  }
    /**
- * intel_engine_reset - reset GPU engine to recover from a hang
+ * intel_gt_engine_recover - reset GPU engine to recover from a hang
   * @engine: engine to reset
   * @msg: reason for GPU reset; or NULL for no drm_notice()
   *
@@ -1360,7 +1360,7 @@ int __intel_engine_reset_bh(struct 
intel_engine_cs *engine, const char *msg)

   *  - reset engine (which will force the engine to idle)
   *  - re-init/configure engine
   */
-int intel_engine_reset(struct intel_engine_cs *engine, const char *msg)
+int intel_gt_engine_recover(struct intel_engine_cs *engine, const 
char *msg)

  {
  int err;
  diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h 
b/drivers/gpu/drm/i915/gt/intel_reset.h

index c00de353075c..be984357bf27 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.h
+++ b/drivers/gpu/drm/i915/gt/intel_reset.h
@@ -31,8 +31,8 @@ void intel_gt_handle_error(struct intel_gt *gt,
  void intel_gt_reset(struct intel_gt *gt,
  intel_engine_mask_t stalled_mask,
  const char *reason);
-int intel_engine_reset(struct intel_engine_cs *engine,
-   const char *reason);
+int intel_gt_engine_recover(struct intel_engine_cs *engine,
+    const char *reason);
  int __intel_engine_reset_bh(struct intel_engine_cs *engine,
  const char *reason);
  diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c 
b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c

index 9ce8ff1c04fe..9bfda3f2bd24 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -495,9 +495,9 @@ static int igt_reset_nop_engine(void *arg)
    i915_request_add(rq);
  }
-    err = intel_engine_reset(engine, NULL);
+    err = intel_gt_engine_recover(engine, NULL);
  if (err) {
-    pr_err("intel_engine_reset(%s) failed, err:%d\n",
+    pr_err("intel_gt_engine_recover(%s) failed, err:%d\n",

Re: [PATCH 1/3] drm/i915: Refactor confusing __intel_gt_reset()

2024-04-19 Thread Nirmoy Das

Hi John.

On 4/19/2024 1:27 AM, John Harrison wrote:

On 4/18/2024 10:10, Nirmoy Das wrote:

__intel_gt_reset() is really for resetting engines though
the name might suggest something else. So add two helper functions
to remove confusions with no functional changes.
Technically you only added one and just moved the other :). It already 
existed, it just wasn't being used everywhere that it could be!


I did have one more helper function, but I removed it in favor of
intel_gt_reset_engine() and didn't update the commit message :/


Thanks for catching it. I will fix it.





Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +-
  .../drm/i915/gt/intel_execlists_submission.c  |  2 +-
  drivers/gpu/drm/i915/gt/intel_gt.c    |  2 +-
  drivers/gpu/drm/i915/gt/intel_gt_pm.c |  2 +-
  drivers/gpu/drm/i915/gt/intel_reset.c | 43 ++-
  drivers/gpu/drm/i915/gt/intel_reset.h |  3 +-
  drivers/gpu/drm/i915/gt/selftest_reset.c  |  2 +-
  drivers/gpu/drm/i915/i915_driver.c    |  2 +-
  8 files changed, 41 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c

index 8c44af1c3451..5c8e9ee3b008 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -678,7 +678,7 @@ void intel_engines_release(struct intel_gt *gt)
   */
  GEM_BUG_ON(intel_gt_pm_is_awake(gt));
  if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
-    __intel_gt_reset(gt, ALL_ENGINES);
+    intel_gt_reset_all_engines(gt);
    /* Decouple the backend; but keep the layout for late GPU 
resets */

  for_each_engine(engine, gt, id) {
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c 
b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c

index 355aab5b38ba..21829439e686 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -2898,7 +2898,7 @@ static void enable_error_interrupt(struct 
intel_engine_cs *engine)

  drm_err(&engine->i915->drm,
  "engine '%s' resumed still in error: %08x\n",
  engine->name, status);
-    __intel_gt_reset(engine->gt, engine->mask);
+    intel_gt_reset_engine(engine);
  }
    /*
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c 
b/drivers/gpu/drm/i915/gt/intel_gt.c

index 580b5141ce1e..626b166e67ef 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -832,7 +832,7 @@ void intel_gt_driver_unregister(struct intel_gt *gt)
    /* Scrub all HW state upon release */
  with_intel_runtime_pm(gt->uncore->rpm, wakeref)
-    __intel_gt_reset(gt, ALL_ENGINES);
+    intel_gt_reset_all_engines(gt);
  }
    void intel_gt_driver_release(struct intel_gt *gt)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c

index 220ac4f92edf..c08fdb65cc69 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -159,7 +159,7 @@ static bool reset_engines(struct intel_gt *gt)
  if (INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
  return false;
  -    return __intel_gt_reset(gt, ALL_ENGINES) == 0;
+    return intel_gt_reset_all_engines(gt) == 0;
  }
    static void gt_sanitize(struct intel_gt *gt, bool force)
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c

index c8e9aa41fdea..b825daace58e 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -764,7 +764,7 @@ wa_14015076503_end(struct intel_gt *gt, 
intel_engine_mask_t engine_mask)

   HECI_H_GS1_ER_PREP, 0);
  }
  -int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t 
engine_mask)
+static int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t 
engine_mask)

  {
  const int retries = engine_mask == ALL_ENGINES ? 
RESET_MAX_RETRIES : 1;

  reset_func reset;
@@ -795,6 +795,34 @@ int __intel_gt_reset(struct intel_gt *gt, 
intel_engine_mask_t engine_mask)

  return ret;
  }
  +/**
+ * intel_gt_reset_all_engines() - Reset all engines in the given gt.
+ * @gt: the GT to reset all engines for.
+ *
+ * This function resets all engines within the given gt.
+ *
+ * Returns:
+ * Zero on success, negative error code on failure.
+ */
+int intel_gt_reset_all_engines(struct intel_gt *gt)
+{
+    return __intel_gt_reset(gt, ALL_ENGINES);
+}
+
+/**
+ * intel_gt_reset_engine() - Reset a specific engine within a gt.
+ * @engine: engine to be reset.
+ *
+ * This function resets the specified engine within a gt.
+ *
+ * Returns:
+ * Zero on success, negative error code on failure.
+ */
+int intel_gt_reset_engine(struct intel_engine_cs *engine)
+{
+    return __intel_gt_reset(engine->gt, engine->mask);
+}
+
You could have just dropped the 

[PATCH 3/3] drm/i915: Fix gt reset with GuC submission disabled

2024-04-18 Thread Nirmoy Das
Currently intel_gt_reset() happens as follows:

reset_prepare() ---> Sends GDRST to GuC, GuC is in GS_MIA_IN_RESET
do_reset()
  intel_gt_reset_all_engines()
*_engine_reset_prepare() -->RESET_CTL expects running GuC
*_reset_engines()
intel_gt_init_hw() --> GuC comes out of GS_MIA_IN_RESET with FW loaded.

Fix the issue by sanitizing the GuC only after resetting requested
engines and before intel_gt_init_hw().

Note that intel_uc_reset_finish() and intel_uc_reset() are no-ops when
GuC submission is disabled.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index 6504e8ba9c58..bd166f5aca4b 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -907,8 +907,17 @@ static intel_engine_mask_t reset_prepare(struct intel_gt 
*gt)
intel_engine_mask_t awake = 0;
enum intel_engine_id id;
 
-   /* For GuC mode, ensure submission is disabled before stopping ring */
-   intel_uc_reset_prepare(&gt->uc);
+   /**
+* For GuC mode with submission enabled, ensure submission
+* is disabled before stopping ring.
+*
+* For GuC mode with submission disabled, ensure that GuC is not
+* sanitized, do that at the end in reset_finish(). reset_prepare()
+* is followed by engine reset which in this mode requires GuC to
+* be functional to process engine reset events.
+*/
+   if (intel_uc_uses_guc_submission(&gt->uc))
+   intel_uc_reset_prepare(&gt->uc);
 
for_each_engine(engine, gt, id) {
if (intel_engine_pm_get_if_awake(engine))
@@ -1255,6 +1264,9 @@ void intel_gt_reset(struct intel_gt *gt,
 
intel_overlay_reset(gt->i915);
 
+   /* sanitize uC after engine reset */
+   if (!intel_uc_uses_guc_submission(&gt->uc))
+   intel_uc_reset_prepare(&gt->uc);
/*
 * Next we need to restore the context, but we don't use those
 * yet either...
-- 
2.42.0
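
The ordering problem described in the commit message above can be illustrated with a small userspace toy model. This is only a sketch: every toy_* name is hypothetical and none of it is i915 code. It assumes, as the commit message states, that the engine reset step needs a running GuC when GuC submission is disabled.

```c
#include <stdbool.h>

/* Toy model only: all names are illustrative, not i915 API. */
struct toy_gt {
	bool guc_submission;	/* true when GuC submission is in use */
	bool guc_running;	/* false once the GuC has been sanitized */
	bool engines_reset;	/* set when the engine reset step succeeds */
};

static void toy_sanitize_guc(struct toy_gt *gt)
{
	gt->guc_running = false;
}

/* Assumption: without GuC submission, engine reset needs a live GuC. */
static int toy_reset_engines(struct toy_gt *gt)
{
	if (!gt->guc_submission && !gt->guc_running)
		return -1;
	gt->engines_reset = true;
	return 0;
}

static int toy_gt_reset(struct toy_gt *gt, bool patched)
{
	int err;

	/* reset_prepare() step: the unpatched flow always sanitizes here. */
	if (!patched || gt->guc_submission)
		toy_sanitize_guc(gt);

	err = toy_reset_engines(gt);

	/* Patched flow: sanitize only after the engines have been reset. */
	if (patched && !gt->guc_submission)
		toy_sanitize_guc(gt);

	/* Stand-in for intel_gt_init_hw(): firmware is loaded again. */
	gt->guc_running = true;
	return err;
}
```

With guc_submission set to false, toy_gt_reset(&gt, false) fails at the engine reset step, while toy_gt_reset(&gt, true) succeeds because the sanitize is deferred.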



[PATCH 2/3] drm/i915 Rename intel_engine_reset to intel_gt_engine_recover

2024-04-18 Thread Nirmoy Das
intel_engine_reset() not only resets an engine but also
tries to recover it, so give it a proper name. There are
no functional changes.

Signed-off-by: Nirmoy Das 
---
 .../drm/i915/gem/selftests/i915_gem_context.c |  2 +-
 .../drm/i915/gt/intel_execlists_submission.c  |  2 +-
 drivers/gpu/drm/i915/gt/intel_reset.c |  4 ++--
 drivers/gpu/drm/i915/gt/intel_reset.h |  4 ++--
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  | 20 +--
 drivers/gpu/drm/i915/gt/selftest_mocs.c   |  4 ++--
 drivers/gpu/drm/i915/gt/selftest_reset.c  |  2 +-
 .../gpu/drm/i915/gt/selftest_workarounds.c|  6 +++---
 8 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
index 89d4dc8b60c6..4f4cde55f621 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
@@ -1171,7 +1171,7 @@ __sseu_finish(const char *name,
int ret = 0;
 
if (flags & TEST_RESET) {
-   ret = intel_engine_reset(ce->engine, "sseu");
+   ret = intel_gt_engine_recover(ce->engine, "sseu");
if (ret)
goto out;
}
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c 
b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 21829439e686..9485a622a704 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -2404,7 +2404,7 @@ static void execlists_reset(struct intel_engine_cs 
*engine, const char *msg)
 
ring_set_paused(engine, 1); /* Freeze the current request in place */
execlists_capture(engine);
-   intel_engine_reset(engine, msg);
+   intel_gt_engine_recover(engine, msg);
 
tasklet_enable(&engine->sched_engine->tasklet);
clear_and_wake_up_bit(bit, lock);
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index b825daace58e..6504e8ba9c58 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1348,7 +1348,7 @@ int __intel_engine_reset_bh(struct intel_engine_cs 
*engine, const char *msg)
 }
 
 /**
- * intel_engine_reset - reset GPU engine to recover from a hang
+ * intel_gt_engine_recover - reset GPU engine to recover from a hang
  * @engine: engine to reset
  * @msg: reason for GPU reset; or NULL for no drm_notice()
  *
@@ -1360,7 +1360,7 @@ int __intel_engine_reset_bh(struct intel_engine_cs 
*engine, const char *msg)
  *  - reset engine (which will force the engine to idle)
  *  - re-init/configure engine
  */
-int intel_engine_reset(struct intel_engine_cs *engine, const char *msg)
+int intel_gt_engine_recover(struct intel_engine_cs *engine, const char *msg)
 {
int err;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h 
b/drivers/gpu/drm/i915/gt/intel_reset.h
index c00de353075c..be984357bf27 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.h
+++ b/drivers/gpu/drm/i915/gt/intel_reset.h
@@ -31,8 +31,8 @@ void intel_gt_handle_error(struct intel_gt *gt,
 void intel_gt_reset(struct intel_gt *gt,
intel_engine_mask_t stalled_mask,
const char *reason);
-int intel_engine_reset(struct intel_engine_cs *engine,
-  const char *reason);
+int intel_gt_engine_recover(struct intel_engine_cs *engine,
+   const char *reason);
 int __intel_engine_reset_bh(struct intel_engine_cs *engine,
const char *reason);
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c 
b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 9ce8ff1c04fe..9bfda3f2bd24 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -495,9 +495,9 @@ static int igt_reset_nop_engine(void *arg)
 
i915_request_add(rq);
}
-   err = intel_engine_reset(engine, NULL);
+   err = intel_gt_engine_recover(engine, NULL);
if (err) {
-   pr_err("intel_engine_reset(%s) failed, 
err:%d\n",
+   pr_err("intel_gt_engine_recover(%s) failed, 
err:%d\n",
   engine->name, err);
break;
}
@@ -574,7 +574,7 @@ static int igt_reset_fail_engine(void *arg)
&gt->reset.flags));
 
force_reset_timeout(engine);
-   err = intel_engine_reset(engine, NULL);
+   err = intel_gt_engine_recover(engine, NULL);
cancel_reset_timeout(engine);
if (err == 0) /* timeouts only generated on gen8+ */
goto skip;

[PATCH 1/3] drm/i915: Refactor confusing __intel_gt_reset()

2024-04-18 Thread Nirmoy Das
__intel_gt_reset() is really for resetting engines though
the name might suggest something else. So add two helper functions
to remove confusions with no functional changes.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +-
 .../drm/i915/gt/intel_execlists_submission.c  |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt.c|  2 +-
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |  2 +-
 drivers/gpu/drm/i915/gt/intel_reset.c | 43 ++-
 drivers/gpu/drm/i915/gt/intel_reset.h |  3 +-
 drivers/gpu/drm/i915/gt/selftest_reset.c  |  2 +-
 drivers/gpu/drm/i915/i915_driver.c|  2 +-
 8 files changed, 41 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 8c44af1c3451..5c8e9ee3b008 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -678,7 +678,7 @@ void intel_engines_release(struct intel_gt *gt)
 */
GEM_BUG_ON(intel_gt_pm_is_awake(gt));
if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
-   __intel_gt_reset(gt, ALL_ENGINES);
+   intel_gt_reset_all_engines(gt);
 
/* Decouple the backend; but keep the layout for late GPU resets */
for_each_engine(engine, gt, id) {
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c 
b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 355aab5b38ba..21829439e686 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -2898,7 +2898,7 @@ static void enable_error_interrupt(struct intel_engine_cs 
*engine)
drm_err(&engine->i915->drm,
"engine '%s' resumed still in error: %08x\n",
engine->name, status);
-   __intel_gt_reset(engine->gt, engine->mask);
+   intel_gt_reset_engine(engine);
}
 
/*
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c 
b/drivers/gpu/drm/i915/gt/intel_gt.c
index 580b5141ce1e..626b166e67ef 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -832,7 +832,7 @@ void intel_gt_driver_unregister(struct intel_gt *gt)
 
/* Scrub all HW state upon release */
with_intel_runtime_pm(gt->uncore->rpm, wakeref)
-   __intel_gt_reset(gt, ALL_ENGINES);
+   intel_gt_reset_all_engines(gt);
 }
 
 void intel_gt_driver_release(struct intel_gt *gt)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 220ac4f92edf..c08fdb65cc69 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -159,7 +159,7 @@ static bool reset_engines(struct intel_gt *gt)
if (INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
return false;
 
-   return __intel_gt_reset(gt, ALL_ENGINES) == 0;
+   return intel_gt_reset_all_engines(gt) == 0;
 }
 
 static void gt_sanitize(struct intel_gt *gt, bool force)
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index c8e9aa41fdea..b825daace58e 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -764,7 +764,7 @@ wa_14015076503_end(struct intel_gt *gt, intel_engine_mask_t 
engine_mask)
 HECI_H_GS1_ER_PREP, 0);
 }
 
-int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask)
+static int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t 
engine_mask)
 {
const int retries = engine_mask == ALL_ENGINES ? RESET_MAX_RETRIES : 1;
reset_func reset;
@@ -795,6 +795,34 @@ int __intel_gt_reset(struct intel_gt *gt, 
intel_engine_mask_t engine_mask)
return ret;
 }
 
+/**
+ * intel_gt_reset_all_engines() - Reset all engines in the given gt.
+ * @gt: the GT to reset all engines for.
+ *
+ * This function resets all engines within the given gt.
+ *
+ * Returns:
+ * Zero on success, negative error code on failure.
+ */
+int intel_gt_reset_all_engines(struct intel_gt *gt)
+{
+   return __intel_gt_reset(gt, ALL_ENGINES);
+}
+
+/**
+ * intel_gt_reset_engine() - Reset a specific engine within a gt.
+ * @engine: engine to be reset.
+ *
+ * This function resets the specified engine within a gt.
+ *
+ * Returns:
+ * Zero on success, negative error code on failure.
+ */
+int intel_gt_reset_engine(struct intel_engine_cs *engine)
+{
+   return __intel_gt_reset(engine->gt, engine->mask);
+}
+
 bool intel_has_gpu_reset(const struct intel_gt *gt)
 {
if (!gt->i915->params.reset)
@@ -978,7 +1006,7 @@ static void __intel_gt_set_wedged(struct intel_gt *gt)
 
/* Even if the GPU reset fails, it should still stop the engines */
if (!INTEL_INFO(gt->i915)->gpu_reset_clobb

Re: [PATCH v11 02/10] drm/ttm/tests: Delete unnecessary config option

2024-04-17 Thread Nirmoy Das



On 4/17/2024 3:03 PM, Karolina Stolarek wrote:

DRM KUnit helpers are selected automatically when TTM tests are enabled,
so there's no need to do it directly in the .kunitconfig file.

Signed-off-by: Karolina Stolarek 

Reviewed-by: Nirmoy Das 

---
  drivers/gpu/drm/ttm/tests/.kunitconfig | 1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/tests/.kunitconfig 
b/drivers/gpu/drm/ttm/tests/.kunitconfig
index 75fdce0cd98e..1ae1ffabd51e 100644
--- a/drivers/gpu/drm/ttm/tests/.kunitconfig
+++ b/drivers/gpu/drm/ttm/tests/.kunitconfig
@@ -1,4 +1,3 @@
  CONFIG_KUNIT=y
  CONFIG_DRM=y
-CONFIG_DRM_KUNIT_TEST_HELPERS=y
  CONFIG_DRM_TTM_KUNIT_TEST=y


Re: [PATCH v11 03/10] drm/ttm/tests: Set DMA mask in KUnit device

2024-04-17 Thread Nirmoy Das



On 4/17/2024 3:03 PM, Karolina Stolarek wrote:

In commit d393acce7b3f ("drm/tests: Switch to kunit devices"),
DRM test helpers migrated away from using a dummy platform driver
in favour of KUnit device. This means that DMA masks for the device
are not set but are required by ttm_pool_alloc tests.

Set the DMA mask for coherent mappings to unblock testing.

Signed-off-by: Karolina Stolarek 
---
  drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c 
b/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c
index 7b7c1fa805fc..cb1cd676f8ae 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c
@@ -98,6 +98,9 @@ struct ttm_test_devices *ttm_test_devices_basic(struct kunit 
*test)
devs->dev = drm_kunit_helper_alloc_device(test);
KUNIT_ASSERT_NOT_ERR_OR_NULL(test, devs->dev);
  
+	/* Set mask for alloc_coherent mappings to enable ttm_pool_alloc testing */

+   devs->dev->coherent_dma_mask = -1;


DMA_BIT_MASK() would be nice here. I wonder if it makes sense to move 
that into the KUnit device helper calls. Anyway, this is:


Reviewed-by: Nirmoy Das 


+
devs->drm = __drm_kunit_helper_alloc_drm_device(test, devs->dev,
sizeof(*devs->drm), 0,
DRIVER_GEM);
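
For reference, the DMA_BIT_MASK() suggestion above can be sketched in userspace as follows. The macro body mirrors the kernel's definition in <linux/dma-mapping.h>; struct toy_device and toy_set_full_mask() are hypothetical stand-ins for the KUnit device, not real helpers.

```c
#include <stdint.h>

/*
 * Same expression as the kernel macro, redefined here so the sketch
 * builds outside the kernel tree.
 */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical stand-in for the device's coherent DMA mask field. */
struct toy_device {
	uint64_t coherent_dma_mask;
};

/* Spells out the intent of "= -1": every address bit is allowed. */
static void toy_set_full_mask(struct toy_device *dev)
{
	dev->coherent_dma_mask = DMA_BIT_MASK(64);
}
```

For a 64-bit mask field, assigning -1 and assigning DMA_BIT_MASK(64) produce the same value. The macro form just makes the width explicit.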


Re: [PATCH] drm/i915/guc: Fix the fix for reset lock confusion

2024-04-03 Thread Nirmoy Das



On 3/30/2024 12:53 AM, john.c.harri...@intel.com wrote:

From: John Harrison 

The previous fix for the circular lock splat about the busyness
worker wasn't quite complete. Even though the reset-in-progress flag
is cleared at the start of intel_uc_reset_finish, the entire function
is still inside the reset mutex lock. Not sure why the patch appeared
to fix the issue both locally and in CI. However, it is now back
again.

There is a further complication: the wedge code path within
intel_gt_reset() jumps around so much that it results in nested
reset_prepare/_finish calls. That is, the call sequence is:
   intel_gt_reset
   | reset_prepare
   | __intel_gt_set_wedged
   | | reset_prepare
   | | reset_finish
   | reset_finish

The nested finish means that even if the clear of the in-progress flag
was moved to the end of _finish, it would still be clear for the
entire second call. Surprisingly, this does not seem to be causing any
other problems at present.

As an aside, a wedge on fini does not call the finish functions at
all. The reset_in_progress flag is left set (twice).

So instead of trying to cancel the worker anywhere at all in the reset
path, just add a cancel to intel_guc_submission_fini instead. Note
that it is not a problem if the worker is still active during a reset.
Either it will run before the reset path starts locking things and
will simply block the reset code for a tiny amount of time. Or it will
run after the locks have been acquired and will early exit due to the
try-lock.

Also, do not use the reset-in-progress flag to decide whether a
synchronous cancel is safe (from a lockdep perspective) or not.
Instead, use the actual reset mutex state (both the genuine one and
the custom rolled BACKOFF one).

Fixes: 0e00a8814eec ("drm/i915/guc: Avoid circular locking issue on busyness flush")
Signed-off-by: John Harrison 
Cc: Zhanjun Dong 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Daniel Vetter 
Cc: Daniel Vetter 
Cc: Rodrigo Vivi 
Cc: Nirmoy Das 
Cc: Tvrtko Ursulin 
Cc: Umesh Nerlige Ramappa 
Cc: Andrzej Hajda 
Cc: Matt Roper 
Cc: Jonathan Cavitt 
Cc: Prathap Kumar Valsan 
Cc: Alan Previn 
Cc: Madhumitha Tolakanahalli Pradeep 

Cc: Daniele Ceraolo Spurio 
Cc: Ashutosh Dixit 
Cc: Dnyaneshwar Bhadane 


Thanks for the details, looks good to me:

Reviewed-by: Nirmoy Das 
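
The cancel decision described in the commit message (look at the real reset lock state instead of the reset-in-progress flag) can be sketched as below. This is a toy model under stated assumptions: the toy_* names are hypothetical, and the two booleans only stand in for the reset mutex being held and the I915_RESET_BACKOFF bit being set.

```c
#include <stdbool.h>

enum toy_cancel {
	TOY_CANCEL_ASYNC,	/* do not wait for a running worker */
	TOY_CANCEL_SYNC,	/* wait for a running worker to finish */
};

/* Hypothetical snapshot of the reset state. */
struct toy_reset_state {
	bool mutex_locked;	/* stands in for the reset mutex being held */
	bool backoff;		/* stands in for the BACKOFF bit being set */
};

/*
 * A synchronous cancel is only picked when neither form of the reset
 * lock is held, which is the lockdep concern the commit message names.
 */
static enum toy_cancel toy_pick_cancel(const struct toy_reset_state *s)
{
	if (s->mutex_locked || s->backoff)
		return TOY_CANCEL_ASYNC;
	return TOY_CANCEL_SYNC;
}
```

Both conditions are checked because, per the patch comment, either one alone does not cover every reset path.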


---
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 ---
  drivers/gpu/drm/i915/gt/uc/intel_uc.c |  4 
  2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 16640d6dd0589..00757d6333e88 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1403,14 +1403,17 @@ static void guc_cancel_busyness_worker(struct intel_guc 
*guc)
 * Trying to pass a 'need_sync' or 'in_reset' flag all the way down 
through
 * every possible call stack is unfeasible. It would be too intrusive 
to many
 * areas that really don't care about the GuC backend. However, there 
is the
-* 'reset_in_progress' flag available, so just use that.
+* I915_RESET_BACKOFF flag and the gt->reset.mutex can be tested for 
is_locked.
+* So just use those. Note that testing both is required due to the 
hideously
+* complex nature of the i915 driver's reset code paths.
 *
 * And note that in the case of a reset occurring during driver unload
-* (wedge_on_fini), skipping the cancel in _prepare (when the reset 
flag is set
-* is fine because there is another cancel in _finish (when the reset 
flag is
-* not).
+* (wedged_on_fini), skipping the cancel in reset_prepare/reset_fini 
(when the
+* reset flag/mutex are set) is fine because there is another explicit 
cancel in
+* intel_guc_submission_fini (when the reset flag/mutex are not).
 */
-   if (guc_to_gt(guc)->uc.reset_in_progress)
+   if (mutex_is_locked(&guc_to_gt(guc)->reset.mutex) ||
+   test_bit(I915_RESET_BACKOFF, &guc_to_gt(guc)->reset.flags))
cancel_delayed_work(&guc->timestamp.work);
else
cancel_delayed_work_sync(&guc->timestamp.work);
@@ -1424,8 +1427,6 @@ static void __reset_guc_busyness_stats(struct intel_guc 
*guc)
unsigned long flags;
ktime_t unused;
  
-	guc_cancel_busyness_worker(guc);

-
spin_lock_irqsave(&guc->timestamp.lock, flags);
  
  	guc_update_pm_timestamp(guc, &unused);

@@ -2004,13 +2005,6 @@ void intel_guc_submission_cancel_requests(struct 
intel_guc *guc)
  
  void intel_guc_submission_reset_finish(struct intel_guc *guc)

  {
-   /*
-* Ensure the busyness worker gets cancelled even on a fatal wedge.
-* Note that reset_prepare is not allowed to because it confuses 
lockdep.
-*/
-   if (guc_submission_initialized(guc))
-   guc_cancel_busyness_

Re: [PATCH] drm/i915/gt: Limit the reserved VM space to only the platforms that need it

2024-03-28 Thread Nirmoy Das

Hi Andi,

On 3/27/2024 9:05 PM, Andi Shyti wrote:

Commit 9bb66c179f50 ("drm/i915: Reserve some kernel space per
vm") reduces the available VM space of one page in order to apply
Wa_16018031267 and Wa_16018063123.

This page was reserved indiscriminately on all platforms, even when
not needed. Limit it to DG2 onwards.


I would use "Limit it to platforms that need the WAs", as those WAs are only 
needed up to 12.71. Otherwise:


Reviewed-by: Nirmoy Das 



Fixes: 9bb66c179f50 ("drm/i915: Reserve some kernel space per vm")
Signed-off-by: Andi Shyti 
Cc: Andrzej Hajda 
Cc: Chris Wilson 
Cc: Jonathan Cavitt 
Cc: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/gen8_ppgtt.c | 3 +++
  drivers/gpu/drm/i915/gt/intel_gt.c   | 6 ++
  drivers/gpu/drm/i915/gt/intel_gt.h   | 9 +
  3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c 
b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
index 1bd0e041e15c..398d60a66410 100644
--- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
@@ -961,6 +961,9 @@ static int gen8_init_rsvd(struct i915_address_space *vm)
struct i915_vma *vma;
int ret;
  
+	if (!intel_gt_needs_wa_16018031267(vm->gt))

+   return 0;
+
/* The memory will be used only by GPU. */
obj = i915_gem_object_create_lmem(i915, PAGE_SIZE,
  I915_BO_ALLOC_VOLATILE |
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c 
b/drivers/gpu/drm/i915/gt/intel_gt.c
index 2c6d31b8fc1a..580b5141ce1e 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -1024,6 +1024,12 @@ enum i915_map_type intel_gt_coherent_map_type(struct 
intel_gt *gt,
return I915_MAP_WC;
  }
  
+bool intel_gt_needs_wa_16018031267(struct intel_gt *gt)

+{
+   /* Wa_16018031267, Wa_16018063123 */
+   return IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 55), IP_VER(12, 71));
+}
+
  bool intel_gt_needs_wa_22016122933(struct intel_gt *gt)
  {
return MEDIA_VER_FULL(gt->i915) == IP_VER(13, 0) && gt->type == 
GT_MEDIA;
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h 
b/drivers/gpu/drm/i915/gt/intel_gt.h
index 6e7cab60834c..b5e114d284ad 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt.h
@@ -82,17 +82,18 @@ struct drm_printer;
  ##__VA_ARGS__);   \
  } while (0)
  
-#define NEEDS_FASTCOLOR_BLT_WABB(engine) ( \

-   IS_GFX_GT_IP_RANGE(engine->gt, IP_VER(12, 55), IP_VER(12, 71)) && \
-   engine->class == COPY_ENGINE_CLASS && engine->instance == 0)
-
  static inline bool gt_is_root(struct intel_gt *gt)
  {
return !gt->info.id;
  }
  
+bool intel_gt_needs_wa_16018031267(struct intel_gt *gt);

  bool intel_gt_needs_wa_22016122933(struct intel_gt *gt);
  
+#define NEEDS_FASTCOLOR_BLT_WABB(engine) ( \

+   intel_gt_needs_wa_16018031267(engine->gt) && \
+   engine->class == COPY_ENGINE_CLASS && engine->instance == 0)
+
  static inline struct intel_gt *uc_to_gt(struct intel_uc *uc)
  {
return container_of(uc, struct intel_gt, uc);


Re: [PATCH] drm/i915/gem: Calculate object page offset for partial memory mapping

2024-03-26 Thread Nirmoy Das

Hi Andi,

On 3/26/2024 12:12 PM, Andi Shyti wrote:

Hi Nirmoy,

...


diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c 
b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index a2195e28b625..57a2dda2c3cc 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -276,7 +276,7 @@ static vm_fault_t vm_fault_cpu(struct vm_fault *vmf)
/* PTEs are revoked in obj->ops->put_pages() */
err = remap_io_sg(area,
  area->vm_start, area->vm_end - area->vm_start,
- obj->mm.pages->sgl, iomap);
+ obj->mm.pages->sgl, 0, iomap);

Why don't we need partial mmap for CPU but only for GTT ?

As far as I understood we don't. I have a version with the CPU
offset as well in trybot[*]

But without support for segmented buffer objects, I don't know
how much this has any effect.


You confused me more :) Why is a segmented buffer object needed for 
partial CPU mmap but not for GTT?


At a high level, both GTT and CPU should support partial mmap, unless I am 
missing something here.





Sounds like this also needs to be covered by an IGT test.

Yes, it does need some IGT work; I'm working on it.


Don't we need "Fixes" tag for this?

Why should we? I'm not fixing anything here,


If userspace expects partial mmap to work, then this is a bug/gap in 
i915, so we need to backport this as far back as possible. Could you share 
some information on the requirement, i.e. why we suddenly need this patch?



Regards,

Nirmoy


  I'm just
recalculating the mapping not starting from the beginning of the
scatter page.

Andi

[*] https://patchwork.freedesktop.org/patch/584474/?series=131539&rev=2


Re: [PATCH] drm/i915/gem: Calculate object page offset for partial memory mapping

2024-03-25 Thread Nirmoy Das

Hi Andi,

I have too many questions :) I think the patch makes sense but need more 
context, see below:


On 3/25/2024 2:40 PM, Andi Shyti wrote:

To enable partial memory mapping of GPU virtual memory, it's
necessary to introduce an offset to the object's memory
(obj->mm.pages) scatterlist. This adjustment compensates for
instances when userspace mappings do not start from the beginning
of the object.

Based on a patch by Chris Wilson.

Signed-off-by: Andi Shyti 
Cc: Chris Wilson 
Cc: Lionel Landwerlin 
---
  drivers/gpu/drm/i915/gem/i915_gem_mman.c |  8 +---
  drivers/gpu/drm/i915/i915_mm.c   | 12 +++-
  drivers/gpu/drm/i915/i915_mm.h   |  3 ++-
  3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c 
b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index a2195e28b625..57a2dda2c3cc 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -276,7 +276,7 @@ static vm_fault_t vm_fault_cpu(struct vm_fault *vmf)
/* PTEs are revoked in obj->ops->put_pages() */
err = remap_io_sg(area,
  area->vm_start, area->vm_end - area->vm_start,
- obj->mm.pages->sgl, iomap);
+ obj->mm.pages->sgl, 0, iomap);


Why do we need partial mmap only for GTT and not for CPU?

Sounds like this also needs to be covered by an IGT test. Don't we need a 
"Fixes" tag for this?


Regards,

Nirmoy

  
  	if (area->vm_flags & VM_WRITE) {

GEM_BUG_ON(!i915_gem_object_has_pinned_pages(obj));
@@ -302,14 +302,16 @@ static vm_fault_t vm_fault_gtt(struct vm_fault *vmf)
struct i915_ggtt *ggtt = to_gt(i915)->ggtt;
bool write = area->vm_flags & VM_WRITE;
struct i915_gem_ww_ctx ww;
+   unsigned long obj_offset;
intel_wakeref_t wakeref;
struct i915_vma *vma;
pgoff_t page_offset;
int srcu;
int ret;
  
-	/* We don't use vmf->pgoff since that has the fake offset */

+   obj_offset = area->vm_pgoff - drm_vma_node_start(&mmo->vma_node);
page_offset = (vmf->address - area->vm_start) >> PAGE_SHIFT;
+   page_offset += obj_offset;
  
  	trace_i915_gem_object_fault(obj, page_offset, true, write);
  
@@ -404,7 +406,7 @@ static vm_fault_t vm_fault_gtt(struct vm_fault *vmf)
  
  	/* Finally, remap it using the new GTT offset */

ret = remap_io_mapping(area,
-  area->vm_start + (vma->gtt_view.partial.offset 
<< PAGE_SHIFT),
+  area->vm_start + ((vma->gtt_view.partial.offset - 
obj_offset) << PAGE_SHIFT),
   (ggtt->gmadr.start + i915_ggtt_offset(vma)) >> 
PAGE_SHIFT,
   min_t(u64, vma->size, area->vm_end - 
area->vm_start),
   &ggtt->iomap);
diff --git a/drivers/gpu/drm/i915/i915_mm.c b/drivers/gpu/drm/i915/i915_mm.c
index 7998bc74ab49..f5c97a620962 100644
--- a/drivers/gpu/drm/i915/i915_mm.c
+++ b/drivers/gpu/drm/i915/i915_mm.c
@@ -122,13 +122,15 @@ int remap_io_mapping(struct vm_area_struct *vma,
   * @addr: target user address to start at
   * @size: size of map area
   * @sgl: Start sg entry
+ * @offset: offset from the start of the page
   * @iobase: Use stored dma address offset by this address or pfn if -1
   *
   *  Note: this is only safe if the mm semaphore is held when called.
   */
  int remap_io_sg(struct vm_area_struct *vma,
unsigned long addr, unsigned long size,
-   struct scatterlist *sgl, resource_size_t iobase)
+   struct scatterlist *sgl, unsigned long offset,
+   resource_size_t iobase)
  {
struct remap_pfn r = {
.mm = vma->vm_mm,
@@ -141,6 +143,14 @@ int remap_io_sg(struct vm_area_struct *vma,
/* We rely on prevalidation of the io-mapping to skip track_pfn(). */
GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS);
  
+	while (offset >= sg_dma_len(r.sgt.sgp) >> PAGE_SHIFT) {

+   offset -= sg_dma_len(r.sgt.sgp) >> PAGE_SHIFT;
+   r.sgt = __sgt_iter(__sg_next(r.sgt.sgp), use_dma(iobase));
+   if (!r.sgt.sgp)
+   return -EINVAL;
+   }
+   r.sgt.curr = offset << PAGE_SHIFT;
+
if (!use_dma(iobase))
flush_cache_range(vma, addr, size);
  
diff --git a/drivers/gpu/drm/i915/i915_mm.h b/drivers/gpu/drm/i915/i915_mm.h

index 04c8974d822b..69f9351b1a1c 100644
--- a/drivers/gpu/drm/i915/i915_mm.h
+++ b/drivers/gpu/drm/i915/i915_mm.h
@@ -30,6 +30,7 @@ int remap_io_mapping(struct vm_area_struct *vma,
  
  int remap_io_sg(struct vm_area_struct *vma,

unsigned long addr, unsigned long size,
-   struct scatterlist *sgl, resource_size_t iobase);
+   struct scatterlist *sgl, unsigned long offset,
+   resource_size_t iobase);
  
  #endif /* __I915_MM_H__ */
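
The scatterlist walk that the patch adds to remap_io_sg() (skip whole segments until the page offset falls inside one) can be sketched in userspace like this. It is a simplified model: struct toy_seg, TOY_PAGE_SHIFT and toy_skip_pages() are hypothetical names, and a plain array replaces the real scatterlist iterator.

```c
#include <stddef.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Stand-in for one scatterlist entry; len is a multiple of the page size. */
struct toy_seg {
	size_t len;	/* segment length in bytes */
};

/*
 * Skip page_offset pages from the start of the segment array.
 * On success returns 0 and reports the segment index plus the byte
 * offset inside that segment. Returns -1 if the offset runs past the
 * last segment, mirroring the -EINVAL path in the patch.
 */
static int toy_skip_pages(const struct toy_seg *segs, size_t nsegs,
			  size_t page_offset, size_t *seg_idx,
			  size_t *byte_off)
{
	size_t i = 0;

	while (i < nsegs && page_offset >= (segs[i].len >> TOY_PAGE_SHIFT)) {
		page_offset -= segs[i].len >> TOY_PAGE_SHIFT;
		i++;
	}
	if (i == nsegs)
		return -1;

	*seg_idx = i;
	*byte_off = page_offset << TOY_PAGE_SHIFT;
	return 0;
}
```

For segments of 2, 1 and 4 pages, an offset of 3 pages lands at the start of the third segment, and an offset of 7 pages is rejected.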


Re: [PATCH v2] drm/i915/gt: Report full vm address range

2024-03-21 Thread Nirmoy Das

Hi Andi,

On 3/21/2024 4:17 PM, Andi Shyti wrote:

Commit 9bb66c179f50 ("drm/i915: Reserve some kernel space per
vm") has reserved an object for kernel space usage.

Userspace, though, needs to know the full address range.

In the former patch the reserved space was subtracted from the
total amount of the VM space. Add it back when the user requests
the GTT size through ioctl (I915_CONTEXT_PARAM_GTT_SIZE).

Fixes: 9bb66c179f50 ("drm/i915: Reserve some kernel space per vm")
Signed-off-by: Andi Shyti 
Cc: Andrzej Hajda 
Cc: Chris Wilson 
Cc: Lionel Landwerlin 
Cc: Michal Mrozek 
Cc: Nirmoy Das 
Cc:  # v6.2+
Acked-by: Michal Mrozek 
Acked-by: Lionel Landwerlin 
---
Hi,

Just proposing a different implementation that doesn't affect
i915 internally but provides the same result. Instead of not
subtracting the space during the reservation, I add it back
during the ioctl call.

All the "vm->rsvd.vma->node.size" looks a bit ugly,


Yes, this needs to be documented. Also, vm->total should stay vm->total,
and maybe we should have a vm->usable that is used internally by the
kernel, while vm->total is what gets returned.

For me, the kernel change is fine as long as the UMD is aware of, and
fine with, the side effect if it ends up using the reserved page.
Basically we need to document this well :)

Also, maybe we should limit reserving this page to platforms where it
is required?



Regards,

Nirmoy


  but that's
how it is. Maybe a comment can help to understand better why
there is this addition.

I kept the Ack from Michal and Lionel, because the outcome from
userspace perspective doesn't really change.

Andi

  drivers/gpu/drm/i915/gem/i915_gem_context.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 81f65cab1330..60d9e7fe33b3 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -2454,7 +2454,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device 
*dev, void *data,
case I915_CONTEXT_PARAM_GTT_SIZE:
args->size = 0;
vm = i915_gem_context_get_eb_vm(ctx);
-   args->value = vm->total;
+   args->value = vm->total + vm->rsvd.vma->node.size;
i915_vm_put(vm);
  
  		break;
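
The arithmetic behind this v2 change can be written down as a small userspace model. This is a sketch under stated assumptions, not the driver code: `struct model_vm`, `model_init_rsvd()` and `model_gtt_size()` are hypothetical stand-ins for the relevant parts of the address space, and the 4096-byte reservation used in any example is an assumption based on the "reduced by a page" remark in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant fields of the address space. */
struct model_vm {
	uint64_t total;     /* range the driver's allocator may hand out */
	uint64_t rsvd_size; /* size of the kernel-reserved node */
};

/* Models the reservation step: carve the reserved node out of total. */
static void model_init_rsvd(struct model_vm *vm, uint64_t rsvd_size)
{
	assert(rsvd_size <= vm->total);
	vm->rsvd_size = rsvd_size;
	vm->total -= rsvd_size;
}

/* Models the v2 getparam change: report the full range to userspace. */
static uint64_t model_gtt_size(const struct model_vm *vm)
{
	return vm->total + vm->rsvd_size;
}
```

The internal total stays reduced, so the allocator never hands out the reserved node, while the value reported to userspace is the same as before the reservation was introduced.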


Re: [PATCH] drm/i915/gt: Report full vm address range

2024-03-14 Thread Nirmoy Das



On 3/14/2024 3:04 PM, Lionel Landwerlin wrote:

Hi Andi,

In Mesa we've been relying on I915_CONTEXT_PARAM_GTT_SIZE so as long 
as that is adjusted by the kernel


What do you mean by "adjusted"? Should it be an aligned size?

The I915_CONTEXT_PARAM_GTT_SIZE ioctl returns vm->total, which is
already adjusted (reduced by a page).

This patch might cause silent errors because it does not remove the
WABB, which uses the reserved page to add a dummy blt. If userspace
uses that page, it will be overwritten.


Regards,

Nirmoy


, we should be able to continue working without issues.

Acked-by: Lionel Landwerlin 

Thanks,

-Lionel

On 13/03/2024 21:39, Andi Shyti wrote:

Commit 9bb66c179f50 ("drm/i915: Reserve some kernel space per
vm") has reserved an object for kernel space usage.

Userspace, though, needs to know the full address range.

Fixes: 9bb66c179f50 ("drm/i915: Reserve some kernel space per vm")
Signed-off-by: Andi Shyti 
Cc: Andrzej Hajda 
Cc: Chris Wilson 
Cc: Lionel Landwerlin 
Cc: Michal Mrozek 
Cc: Nirmoy Das 
Cc:  # v6.2+
---
  drivers/gpu/drm/i915/gt/gen8_ppgtt.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c 
b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c

index fa46d2308b0e..d76831f50106 100644
--- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
@@ -982,8 +982,9 @@ static int gen8_init_rsvd(struct 
i915_address_space *vm)

    vm->rsvd.vma = i915_vma_make_unshrinkable(vma);
  vm->rsvd.obj = obj;
-    vm->total -= vma->node.size;
+
  return 0;
+
  unref:
  i915_gem_object_put(obj);
  return ret;





Re: [PATCH] drm/i915/selftests: Pick correct caching mode.

2024-03-12 Thread Nirmoy Das



On 3/12/2024 3:28 PM, Andi Shyti wrote:

Hi Nirmoy,

On Tue, Mar 12, 2024 at 12:18:15PM +0100, Nirmoy Das wrote:

Caching mode is HW dependent so pick a correct one using
intel_gt_coherent_map_type().

Cc: Andi Shyti 
Cc: Janusz Krzysztofik 
Cc: Jonathan Cavitt 
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10249
Signed-off-by: Nirmoy Das 

I think it's a good choice not to have the Fixes tag here either.


Yes, a Fixes tag isn't needed for selftests.




Reviewed-by: Andi Shyti 


Thanks,

Nirmoy



Thanks,
Andi


[PATCH] drm/i915/selftests: Pick correct caching mode.

2024-03-12 Thread Nirmoy Das
Caching mode is HW dependent so pick a correct one using
intel_gt_coherent_map_type().

Cc: Andi Shyti 
Cc: Janusz Krzysztofik 
Cc: Jonathan Cavitt 
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10249
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c 
b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
index d684a70f2c04..65a931ea80e9 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c
@@ -7,6 +7,7 @@
 #include "i915_drv.h"
 #include "i915_selftest.h"
 #include "gem/i915_gem_context.h"
+#include "gt/intel_gt.h"
 
 #include "mock_context.h"
 #include "mock_dmabuf.h"
@@ -155,6 +156,7 @@ static int verify_access(struct drm_i915_private *i915,
struct file *file;
u32 *vaddr;
int err = 0, i;
+   unsigned int mode;
 
file = mock_file(i915);
if (IS_ERR(file))
@@ -194,7 +196,8 @@ static int verify_access(struct drm_i915_private *i915,
if (err)
goto out_file;
 
-   vaddr = i915_gem_object_pin_map_unlocked(native_obj, I915_MAP_WB);
+   mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, true);
+   vaddr = i915_gem_object_pin_map_unlocked(native_obj, mode);
if (IS_ERR(vaddr)) {
err = PTR_ERR(vaddr);
goto out_file;
-- 
2.42.0



Re: [PATCH v7 2/3] drm/i915: Remove extra multi-gt pm-references

2024-03-05 Thread Nirmoy Das


On 3/5/2024 3:35 PM, Janusz Krzysztofik wrote:

There was an attempt to fix an issue of illegal attempts to free a still
active i915 VMA object when parking a GT believed to be idle, reported by
CI on 2-GT Meteor Lake.  As a solution, an extra wakeref for a Primary GT
was acquired from i915_gem_do_execbuffer() -- see commit f56fe3e91787
("drm/i915: Fix a VMA UAF for multi-gt platform").

However, that fix proved insufficient -- the issue was still reported by
CI.  That wakeref was released on exit from i915_gem_do_execbuffer(), then
potentially before completion of the request and deactivation of its
associated VMAs.  Moreover, CI reports indicated that single-GT platforms
also suffered sporadically from the same race.

Since the issue has now been fixed by a preceding patch "drm/i915/vma: Fix
UAF on destroy against retire race", drop the no longer useful changes
introduced by that insufficient fix.

v3: Also drop the no longer used .wakeref_gt0 field from struct
 i915_execbuffer.
v2: Avoid the word "revert" in commit message (Rodrigo),
   - update commit description reusing relevant chunks dropped from the
 description of the proper fix (Rodrigo).

Signed-off-by: Janusz Krzysztofik
Cc: Nirmoy Das
Cc: Rodrigo Vivi


Reviewed-by: Nirmoy Das 


---
  drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 18 --
  1 file changed, 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index d3a771afb083e..3f20fe3811999 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -255,7 +255,6 @@ struct i915_execbuffer {
struct intel_context *context; /* logical state for the request */
struct i915_gem_context *gem_context; /** caller's context */
intel_wakeref_t wakeref;
-   intel_wakeref_t wakeref_gt0;
  
  	/** our requests to build */

struct i915_request *requests[MAX_ENGINE_INSTANCE + 1];
@@ -2686,7 +2685,6 @@ static int
  eb_select_engine(struct i915_execbuffer *eb)
  {
struct intel_context *ce, *child;
-   struct intel_gt *gt;
unsigned int idx;
int err;
  
@@ -2710,17 +2708,10 @@ eb_select_engine(struct i915_execbuffer *eb)

}
}
eb->num_batches = ce->parallel.number_children + 1;
-   gt = ce->engine->gt;
  
  	for_each_child(ce, child)

intel_context_get(child);
eb->wakeref = intel_gt_pm_get(ce->engine->gt);
-   /*
-* Keep GT0 active on MTL so that i915_vma_parked() doesn't
-* free VMAs while execbuf ioctl is validating VMAs.
-*/
-   if (gt->info.id)
-   eb->wakeref_gt0 = intel_gt_pm_get(to_gt(gt->i915));
  
  	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {

err = intel_context_alloc_state(ce);
@@ -2759,9 +2750,6 @@ eb_select_engine(struct i915_execbuffer *eb)
return err;
  
  err:

-   if (gt->info.id)
-   intel_gt_pm_put(to_gt(gt->i915), eb->wakeref_gt0);
-
intel_gt_pm_put(ce->engine->gt, eb->wakeref);
for_each_child(ce, child)
intel_context_put(child);
@@ -2775,12 +2763,6 @@ eb_put_engine(struct i915_execbuffer *eb)
struct intel_context *child;
  
  	i915_vm_put(eb->context->vm);

-   /*
-* This works in conjunction with eb_select_engine() to prevent
-* i915_vma_parked() from interfering while execbuf validates vmas.
-*/
-   if (eb->gt->info.id)
-   intel_gt_pm_put(to_gt(eb->gt->i915), eb->wakeref_gt0);
intel_gt_pm_put(eb->context->engine->gt, eb->wakeref);
for_each_child(eb->context, child)
intel_context_put(child);

Re: [PATCH v7 1/3] drm/i915/vma: Fix UAF on destroy against retire race

2024-03-05 Thread Nirmoy Das


On 3/5/2024 3:35 PM, Janusz Krzysztofik wrote:

Object debugging tools were sporadically reporting illegal attempts to
free a still active i915 VMA object when parking a GT believed to be idle.

[161.359441] ODEBUG: free active (active state 0) object: 88811643b958 
object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915]
[161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 
debug_print_object+0x80/0xb0
...
[161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 
6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1
[161.360314] Hardware name: Intel Corporation Rocket Lake Client 
Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 
04/21/2022
[161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915]
[161.360592] RIP: 0010:debug_print_object+0x80/0xb0
...
[161.361347] debug_object_free+0xeb/0x110
[161.361362] i915_active_fini+0x14/0x130 [i915]
[161.361866] release_references+0xfe/0x1f0 [i915]
[161.362543] i915_vma_parked+0x1db/0x380 [i915]
[161.363129] __gt_park+0x121/0x230 [i915]
[161.363515] intel_wakeref_put_last+0x1f/0x70 [i915]

That has been tracked down to be happening when another thread is
deactivating the VMA inside __active_retire() helper, after the VMA's
active counter has been already decremented to 0, but before deactivation
of the VMA's object is reported to the object debugging tool.

We could prevent that race by serializing i915_active_fini() with
__active_retire() via ref->tree_lock, but that wouldn't stop the VMA from
being used, e.g. from __i915_vma_retire() called at the end of
__active_retire(), after that VMA has been already freed by a concurrent
i915_vma_destroy() on return from the i915_active_fini().  Then, we should
rather fix the issue at the VMA level, not in i915_active.

Since __i915_vma_parked() is called from __gt_park() on last put of the
GT's wakeref, the issue could be addressed by holding the GT wakeref long
enough for __active_retire() to complete before that wakeref is released
and the GT parked.

I believe the issue was introduced by commit d93939730347 ("drm/i915:
Remove the vma refcount") which moved a call to i915_active_fini() from
a dropped i915_vma_release(), called on last put of the removed VMA kref,
to i915_vma_parked() processing path called on last put of a GT wakeref.
However, its visibility to the object debugging tool was suppressed by a
bug in i915_active that was fixed two weeks later with commit e92eb246feb9
("drm/i915/active: Fix missing debug object activation").

A VMA associated with a request doesn't acquire a GT wakeref by itself.
Instead, it depends on a wakeref held directly by the request's active
intel_context for a GT associated with its VM, and indirectly on that
intel_context's engine wakeref if the engine belongs to the same GT as the
VMA's VM.  Those wakerefs are released asynchronously to VMA deactivation.

Fix the issue by getting a wakeref for the VMA's GT when activating it,
and putting that wakeref only after the VMA is deactivated.  However,
exclude global GTT from that processing path, otherwise the GPU never goes
idle.  Since __i915_vma_retire() may be called from atomic contexts, use
async variant of wakeref put.  Also, to avoid circular locking dependency,
take care of acquiring the wakeref before VM mutex when both are needed.
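
The ordering described in the paragraph above can be illustrated with a toy reference-count model. It is only a sketch: all `model_*` names are hypothetical, the put is synchronous here whereas the patch uses the async variant, and "parking" is reduced to a counter that ticks on the last put.

```c
#include <assert.h>
#include <stdbool.h>

struct model_gt {
	int wakerefs;   /* outstanding wakeref count */
	int park_count; /* how many times the last put parked the GT */
};

struct model_vma {
	struct model_gt *gt;
	bool is_ggtt;   /* global GTT VMAs are excluded, as in the patch */
	bool active;
};

static void model_gt_get(struct model_gt *gt)
{
	gt->wakerefs++;
}

static void model_gt_put(struct model_gt *gt)
{
	assert(gt->wakerefs > 0);
	if (--gt->wakerefs == 0)
		gt->park_count++; /* last put: idle VMAs would be reaped here */
}

/* Take a wakeref for the VMA's GT when the VMA becomes active. */
static void model_vma_activate(struct model_vma *vma)
{
	if (!vma->is_ggtt)
		model_gt_get(vma->gt);
	vma->active = true;
}

/* Drop that wakeref only after the VMA has been deactivated. */
static void model_vma_retire(struct model_vma *vma)
{
	vma->active = false;
	if (!vma->is_ggtt)
		model_gt_put(vma->gt);
}
```

In this model the GT cannot park while a non-GGTT VMA is still active, even if every other wakeref holder has already released its reference.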

v7: Add inline comments with justifications for:
 - using untracked variants of intel_gt_pm_get/put() (Nirmoy),
 - using async variant of _put(),
 - not getting the wakeref in case of a global GTT,
 - always getting the first wakeref outside vm->mutex.
v6: Since __i915_vma_active/retire() callbacks are not serialized, storing
 a wakeref tracking handle inside struct i915_vma is not safe, and
 there is no other good place for that.  Use untracked variants of
 intel_gt_pm_get/put_async().
v5: Replace "tile" with "GT" across commit description (Rodrigo),
   - avoid mentioning multi-GT case in commit description (Rodrigo),
   - explain why we need to take a temporary wakeref unconditionally inside
 i915_vma_pin_ww() (Rodrigo).
v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm
 wakerefs") (Andi),
   - for more easy backporting, split out removal of former insufficient
     workarounds and move them to separate patches (Nirmoy).
   - clean up commit message and description a bit.
v3: Identify root cause more precisely, and a commit to blame,
   - identify and drop former workarounds,
   - update commit message and description.
v2: Get the wakeref before VM mutex to avoid circular locking dependency,
   - drop questionable Fixes: tag.

Fixes: d93939730347 ("drm/i915: Remove the vma refcount")
Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875
Signed-off-by: Janusz Krzysztofik
Cc: Thomas Hellström
Cc: Nirmoy Das
Cc: Andi Shyti
Cc: Rodrigo Vivi
Cc: sta...@vger.kernel.org # v5.19+


Reviewed-by: Nirmoy Das 


---
  drivers/

Re: [PATCH v7 3/3] Revert "drm/i915: Wait for active retire before i915_active_fini()"

2024-03-05 Thread Nirmoy Das


On 3/5/2024 3:35 PM, Janusz Krzysztofik wrote:

This reverts commit 7a2280e8dcd2f1f436db9631287c0b21cf6a92b0, obsoleted
by "drm/i915/vma: Fix UAF on destroy against retire race".

Signed-off-by: Janusz Krzysztofik
Cc: Nirmoy Das


Reviewed-by: Nirmoy Das 


---
  drivers/gpu/drm/i915/i915_vma.c | 2 --
  1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index b70715b1411d6..d2f064d2525cc 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -1776,8 +1776,6 @@ static void release_references(struct i915_vma *vma, 
struct intel_gt *gt,
if (vm_ddestroy)
i915_vm_resv_put(vma->vm);
  
-   /* Wait for async active retire */
-   i915_active_wait(&vma->active);
i915_active_fini(&vma->active);
GEM_WARN_ON(vma->resource);
i915_vma_free(vma);

Re: [PATCH v6 1/3] drm/i915/vma: Fix UAF on destroy against retire race

2024-03-04 Thread Nirmoy Das



On 3/1/2024 8:29 AM, Janusz Krzysztofik wrote:

Object debugging tools were sporadically reporting illegal attempts to
free a still active i915 VMA object when parking a GT believed to be idle.

[161.359441] ODEBUG: free active (active state 0) object: 88811643b958 
object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915]
[161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 
debug_print_object+0x80/0xb0
...
[161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 
6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1
[161.360314] Hardware name: Intel Corporation Rocket Lake Client 
Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 
04/21/2022
[161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915]
[161.360592] RIP: 0010:debug_print_object+0x80/0xb0
...
[161.361347] debug_object_free+0xeb/0x110
[161.361362] i915_active_fini+0x14/0x130 [i915]
[161.361866] release_references+0xfe/0x1f0 [i915]
[161.362543] i915_vma_parked+0x1db/0x380 [i915]
[161.363129] __gt_park+0x121/0x230 [i915]
[161.363515] intel_wakeref_put_last+0x1f/0x70 [i915]

That has been tracked down to be happening when another thread is
deactivating the VMA inside __active_retire() helper, after the VMA's
active counter has been already decremented to 0, but before deactivation
of the VMA's object is reported to the object debugging tool.

We could prevent that race by serializing i915_active_fini() with
__active_retire() via ref->tree_lock, but that wouldn't stop the VMA from
being used, e.g. from __i915_vma_retire() called at the end of
__active_retire(), after that VMA has been already freed by a concurrent
i915_vma_destroy() on return from the i915_active_fini().  Then, we should
rather fix the issue at the VMA level, not in i915_active.

Since __i915_vma_parked() is called from __gt_park() on last put of the
GT's wakeref, the issue could be addressed by holding the GT wakeref long
enough for __active_retire() to complete before that wakeref is released
and the GT parked.

I believe the issue was introduced by commit d93939730347 ("drm/i915:
Remove the vma refcount") which moved a call to i915_active_fini() from
a dropped i915_vma_release(), called on last put of the removed VMA kref,
to i915_vma_parked() processing path called on last put of a GT wakeref.
However, its visibility to the object debugging tool was suppressed by a
bug in i915_active that was fixed two weeks later with commit e92eb246feb9
("drm/i915/active: Fix missing debug object activation").

A VMA associated with a request doesn't acquire a GT wakeref by itself.
Instead, it depends on a wakeref held directly by the request's active
intel_context for a GT associated with its VM, and indirectly on that
intel_context's engine wakeref if the engine belongs to the same GT as the
VMA's VM.  Those wakerefs are released asynchronously to VMA deactivation.

Fix the issue by getting a wakeref for the VMA's GT when activating it,
and putting that wakeref only after the VMA is deactivated.  However,
exclude global GTT from that processing path, otherwise the GPU never goes
idle.  Since __i915_vma_retire() may be called from atomic contexts, use
async variant of wakeref put.  Also, to avoid circular locking dependency,
take care of acquiring the wakeref before VM mutex when both are needed.

v6: Since __i915_vma_active/retire() callbacks are not serialized, storing
 a wakeref tracking handle inside struct i915_vma is not safe, and
 there is no other good place for that.  Use untracked variants of
 intel_gt_pm_get/put_async().
v5: Replace "tile" with "GT" across commit description (Rodrigo),
   - avoid mentioning multi-GT case in commit description (Rodrigo),
   - explain why we need to take a temporary wakeref unconditionally inside
 i915_vma_pin_ww() (Rodrigo).
v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm
 wakerefs") (Andi),
   - for more easy backporting, split out removal of former insufficient
 workarounds and move them to separate patches (Nirmoy).
   - clean up commit message and description a bit.
v3: Identify root cause more precisely, and a commit to blame,
   - identify and drop former workarounds,
   - update commit message and description.
v2: Get the wakeref before VM mutex to avoid circular locking dependency,
   - drop questionable Fixes: tag.

Fixes: d93939730347 ("drm/i915: Remove the vma refcount")
Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875
Signed-off-by: Janusz Krzysztofik 
Cc: Thomas Hellström 
Cc: Nirmoy Das 
Cc: Andi Shyti 
Cc: Rodrigo Vivi 
Cc: sta...@vger.kernel.org # v5.19+
---
  drivers/gpu/drm/i915/i915_vma.c | 26 +++---
  1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index d09aad34ba37f..ffe81fe338f7e 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/

Re: [PATCH] drm/i915: check before removing mm notifier

2024-02-28 Thread Nirmoy Das



On 2/28/2024 2:24 PM, Tvrtko Ursulin wrote:


On 27/02/2024 09:26, Nirmoy Das wrote:

Hi Tvrtko,

On 2/27/2024 10:04 AM, Tvrtko Ursulin wrote:


On 21/02/2024 11:52, Nirmoy Das wrote:

Merged it to drm-intel-gt-next with s/check/Check


Shouldn't this have had:

Fixes: ed29c2691188 ("drm/i915: Fix userptr so we do not have to 
worry about obj->mm.lock, v7.")

Cc:  # v5.13+

?


Yes. Sorry, I missed that. Can we still add the tag?


I've added them and force pushed the branch since commit was still at 
the top.


Thanks a lot, Tvrtko!




FYI + Jani, Joonas and Rodrigo

Regards,

Tvrtko




Thanks,

Nirmoy


Regards,

Tvrtko


On 2/19/2024 1:50 PM, Nirmoy Das wrote:

Error in mmu_interval_notifier_insert() can leave a NULL
notifier.mm pointer. Catch that and return early.

Cc: Andi Shyti 
Cc: Shawn Lee 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c

index 0e21ce9d3e5a..61abfb505766 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -349,6 +349,9 @@ i915_gem_userptr_release(struct 
drm_i915_gem_object *obj)

  {
  GEM_WARN_ON(obj->userptr.page_ref);
+    if (!obj->userptr.notifier.mm)
+    return;
+
mmu_interval_notifier_remove(&obj->userptr.notifier);
  obj->userptr.notifier.mm = NULL;
  }


Re: [PATCH] drm/i915: check before removing mm notifier

2024-02-27 Thread Nirmoy Das

Hi Tvrtko,

On 2/27/2024 10:04 AM, Tvrtko Ursulin wrote:


On 21/02/2024 11:52, Nirmoy Das wrote:

Merged it to drm-intel-gt-next with s/check/Check


Shouldn't this have had:

Fixes: ed29c2691188 ("drm/i915: Fix userptr so we do not have to worry 
about obj->mm.lock, v7.")

Cc:  # v5.13+

?


Yes. Sorry, I missed that. Can we still add the tag?


Thanks,

Nirmoy


Regards,

Tvrtko


On 2/19/2024 1:50 PM, Nirmoy Das wrote:

Error in mmu_interval_notifier_insert() can leave a NULL
notifier.mm pointer. Catch that and return early.

Cc: Andi Shyti 
Cc: Shawn Lee 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c

index 0e21ce9d3e5a..61abfb505766 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -349,6 +349,9 @@ i915_gem_userptr_release(struct 
drm_i915_gem_object *obj)

  {
  GEM_WARN_ON(obj->userptr.page_ref);
+    if (!obj->userptr.notifier.mm)
+    return;
+
mmu_interval_notifier_remove(&obj->userptr.notifier);
  obj->userptr.notifier.mm = NULL;
  }


Re: [PATCH] drm/i915: check before removing mm notifier

2024-02-21 Thread Nirmoy Das

Merged it to drm-intel-gt-next with s/check/Check

On 2/19/2024 1:50 PM, Nirmoy Das wrote:

Error in mmu_interval_notifier_insert() can leave a NULL
notifier.mm pointer. Catch that and return early.

Cc: Andi Shyti 
Cc: Shawn Lee 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 0e21ce9d3e5a..61abfb505766 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -349,6 +349,9 @@ i915_gem_userptr_release(struct drm_i915_gem_object *obj)
  {
GEM_WARN_ON(obj->userptr.page_ref);
  
+	if (!obj->userptr.notifier.mm)

+   return;
+
mmu_interval_notifier_remove(&obj->userptr.notifier);
obj->userptr.notifier.mm = NULL;
  }


Re: [PATCH] drm/i915: check before removing mm notifier

2024-02-20 Thread Nirmoy Das

Hi Rodrigo,

On 2/19/2024 9:12 PM, Rodrigo Vivi wrote:

On Mon, Feb 19, 2024 at 01:50:47PM +0100, Nirmoy Das wrote:

Error in mmu_interval_notifier_insert() can leave a NULL
notifier.mm pointer. Catch that and return early.

Cc: Andi Shyti
Cc: Shawn Lee
Signed-off-by: Nirmoy Das
---
  drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 0e21ce9d3e5a..61abfb505766 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -349,6 +349,9 @@ i915_gem_userptr_release(struct drm_i915_gem_object *obj)
  {
GEM_WARN_ON(obj->userptr.page_ref);
  
+	if (!obj->userptr.notifier.mm)

+   return;
+

hmmm... right, it looks that we need this protection. But...

I mean, feel free to use
Reviewed-by: Rodrigo Vivi

for this patch,

but I believe that if this mmu insert failed we might have other
deeper problems like when checking i915_gem_object_is_userptr() ?

No?!


We return an error if the mmu insert fails while creating a userptr
object, so the obj struct is only visible to the object cleanup methods.

As far as I can see, i915_gem_object_is_userptr() should not be called
on such an obj struct.

Thanks,
Nirmoy


mmu_interval_notifier_remove(&obj->userptr.notifier);
obj->userptr.notifier.mm = NULL;
  }
--
2.42.0


[PATCH] drm/i915: check before removing mm notifier

2024-02-19 Thread Nirmoy Das
Error in mmu_interval_notifier_insert() can leave a NULL
notifier.mm pointer. Catch that and return early.

Cc: Andi Shyti 
Cc: Shawn Lee 
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 0e21ce9d3e5a..61abfb505766 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -349,6 +349,9 @@ i915_gem_userptr_release(struct drm_i915_gem_object *obj)
 {
GEM_WARN_ON(obj->userptr.page_ref);
 
+   if (!obj->userptr.notifier.mm)
+   return;
+
mmu_interval_notifier_remove(&obj->userptr.notifier);
obj->userptr.notifier.mm = NULL;
 }
-- 
2.42.0
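
The guard added by this patch follows a common cleanup pattern: skip the teardown step when the matching setup step never completed. A minimal userspace sketch of that pattern is shown below; `struct model_notifier`, `model_notifier_remove()` and `model_userptr_release()` are hypothetical names, and the `removals` counter exists only so the behaviour can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the notifier embedded in a userptr object. */
struct model_notifier {
	void *mm;     /* NULL if the insert step failed or already released */
	int removals; /* instrumentation for this sketch only */
};

static void model_notifier_remove(struct model_notifier *n)
{
	/* Removing a notifier that was never inserted would be a bug. */
	assert(n->mm != NULL);
	n->removals++;
}

/* Models the patched release path: return early if nothing was inserted. */
static void model_userptr_release(struct model_notifier *n)
{
	if (!n->mm)
		return;

	model_notifier_remove(n);
	n->mm = NULL;
}
```

Because the release path clears `mm` after removal, the same check also makes a second call to the release function a no-op in this model.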



Re: [PATCH v5 0/3] drm/i915: Fix VMA UAF on destroy against deactivate race

2024-01-29 Thread Nirmoy Das

Hi Janusz,

There seems to be a regression in CI related to this:

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_129026v2/bat-dg1-7/igt@gem_lmem_swapping@random-engi...@lmem0.html#dmesg-warnings1053

Please have a look.


Regards,

Nirmoy

On 1/24/2024 6:13 PM, Janusz Krzysztofik wrote:

Object debugging tools were sporadically reporting illegal attempts to
free a still active i915 VMA object when parking a GT believed to be idle.

[161.359441] ODEBUG: free active (active state 0) object: 88811643b958 
object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915]
[161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 
debug_print_object+0x80/0xb0
...
[161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 
6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1
[161.360314] Hardware name: Intel Corporation Rocket Lake Client 
Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 
04/21/2022
[161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915]
[161.360592] RIP: 0010:debug_print_object+0x80/0xb0
...
[161.361347] debug_object_free+0xeb/0x110
[161.361362] i915_active_fini+0x14/0x130 [i915]
[161.361866] release_references+0xfe/0x1f0 [i915]
[161.362543] i915_vma_parked+0x1db/0x380 [i915]
[161.363129] __gt_park+0x121/0x230 [i915]
[161.363515] intel_wakeref_put_last+0x1f/0x70 [i915]

That has been tracked down to be happening when another thread is
deactivating the VMA inside __active_retire() helper, after the VMA's
active counter has been already decremented to 0, but before deactivation
of the VMA's object is reported to the object debugging tool.

We could prevent that race by serializing i915_active_fini() with
__active_retire() via ref->tree_lock, but that wouldn't stop the VMA from
being used, e.g. from __i915_vma_retire() called at the end of
__active_retire(), after that VMA has been already freed by a concurrent
i915_vma_destroy() on return from the i915_active_fini().  Then, we should
rather fix the issue at the VMA level, not in i915_active.

Since __i915_vma_parked() is called from __gt_park() on last put of the
GT's wakeref, the issue could be addressed by holding the GT wakeref long
enough for __active_retire() to complete before that wakeref is released
and the GT parked.

A VMA associated with a request doesn't acquire a GT wakeref by itself.
Instead, it depends on a wakeref held directly by the request's active
intel_context for a GT associated with its VM, and indirectly on that
intel_context's engine wakeref if the engine belongs to the same GT as the
VMA's VM.  Those wakerefs are released asynchronously to VMA deactivation.

In case of single-GT platforms, at least one of those wakerefs is usually
held long enough for the request's VMA to be deactivated on time, before
it is destroyed on last put of its VM GT wakeref.  However, on multi-GT
platforms, a request may use a VMA from a GT other than the one that hosts
the request's engine, then it is protected only with the intel_context's
VM GT wakeref.

There was an attempt to fix the issue on 2-GT Meteor Lake by acquiring an
extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see commit
f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform").  However,
that fix proved insufficient -- the issue was still reported by CI.
That wakeref was released on exit from i915_gem_do_execbuffer(), then
potentially before completion of the request and deactivation of its
associated VMAs.  Moreover, CI reports indicated that single-GT platforms
also suffered sporadically from the same race.

I believe the issue was introduced by commit d93939730347 ("drm/i915:
Remove the vma refcount") which moved a call to i915_active_fini() from
a dropped i915_vma_release(), called on last put of the removed VMA kref,
to i915_vma_parked() processing path called on last put of a GT wakeref.
However, its visibility to the object debugging tool was suppressed by a
bug in i915_active that was fixed two weeks later with commit e92eb246feb9
("drm/i915/active: Fix missing debug object activation").

Fix the issue by getting a wakeref for the VMA's GT when activating it,
and putting that wakeref only after the VMA is deactivated.  However,
exclude global GTT from that processing path, otherwise the GPU never goes
idle.  Since __i915_vma_retire() may be called from atomic contexts, use
async variant of wakeref put.  Also, to avoid circular locking dependency,
take care of acquiring the wakeref before VM mutex when both are needed.

Having that fixed, stop explicitly acquiring the extra GT0 wakeref from
inside i915_gem_do_execbuffer(), and also drop an extra call to
i915_active_wait(), introduced by commit 7a2280e8dcd2 ("drm/i915: Wait for
active retire before i915_active_fini()") as another insufficient fix for
this UAF race.

v5: Replace "tile" with "GT" across commit descriptions (Rodrigo),
   - reword commit message and description 

Re: [PATCH v3 4/4] drm/i915/guc: Use the ce_to_guc() wrapper whenever possible

2023-12-21 Thread Nirmoy Das



On 12/6/2023 9:46 PM, Andi Shyti wrote:

Get the guc reference from the ce using the ce_to_guc() helper.
Just a leftover from previous cleanups.

Signed-off-by: Andi Shyti 

Reviewed-by: Nirmoy Das 

---
  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 4f51cc5f1604..3c7821ae9f0d 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3513,7 +3513,7 @@ static inline void sub_context_inflight_prio(struct 
intel_context *ce,
  
  static inline void update_context_prio(struct intel_context *ce)

  {
-   struct intel_guc *guc = &ce->engine->gt->uc.guc;
+   struct intel_guc *guc = ce_to_guc(ce);
int i;
  
  	BUILD_BUG_ON(GUC_CLIENT_PRIORITY_KMD_HIGH != 0);


Re: [PATCH v3 3/4] drm/i915: Use the new gt_to_guc() wrapper

2023-12-21 Thread Nirmoy Das



On 12/6/2023 9:46 PM, Andi Shyti wrote:

Get the guc reference from the gt using the gt_to_guc() helper.

Signed-off-by: Andi Shyti 

Reviewed-by: Nirmoy Das 

---
  drivers/gpu/drm/i915/i915_debugfs_params.c   | 2 +-
  drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c | 4 ++--
  2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs_params.c 
b/drivers/gpu/drm/i915/i915_debugfs_params.c
index 8bca02025e09..74b7f2fd8b57 100644
--- a/drivers/gpu/drm/i915/i915_debugfs_params.c
+++ b/drivers/gpu/drm/i915/i915_debugfs_params.c
@@ -43,7 +43,7 @@ static int notify_guc(struct drm_i915_private *i915)
  
  	for_each_gt(gt, i915, i) {

if (intel_uc_uses_guc_submission(&gt->uc))
-   ret = intel_guc_global_policies_update(&gt->uc.guc);
+   ret = intel_guc_global_policies_update(gt_to_guc(gt));
}
  
  	return ret;

diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c 
b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
index 2990dd4d4a0d..d9d8f0336702 100644
--- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
+++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c
@@ -65,7 +65,7 @@ int intel_selftest_modify_policy(struct intel_engine_cs 
*engine,
if (!intel_engine_uses_guc(engine))
return 0;
  
-	err = intel_guc_global_policies_update(&engine->gt->uc.guc);

+   err = intel_guc_global_policies_update(gt_to_guc(engine->gt));
if (err)
intel_selftest_restore_policy(engine, saved);
  
@@ -84,7 +84,7 @@ int intel_selftest_restore_policy(struct intel_engine_cs *engine,

if (!intel_engine_uses_guc(engine))
return 0;
  
-	return intel_guc_global_policies_update(&engine->gt->uc.guc);

+   return intel_guc_global_policies_update(gt_to_guc(engine->gt));
  }
  
  int intel_selftest_wait_for_rq(struct i915_request *rq)


Re: [PATCH v3 2/4] drm/i915/guc: Use the new gt_to_guc() wrapper

2023-12-21 Thread Nirmoy Das



On 12/6/2023 9:46 PM, Andi Shyti wrote:

Get the guc reference from the gt using the gt_to_guc() helper.

Signed-off-by: Andi Shyti 

Reviewed-by: Nirmoy Das 

---
  drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c |  4 +--
  drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c  |  3 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  2 +-
  .../gpu/drm/i915/gt/uc/intel_guc_capture.c|  6 ++--
  .../gpu/drm/i915/gt/uc/intel_guc_hwconfig.c   |  2 +-
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 28 +--
  drivers/gpu/drm/i915/gt/uc/intel_huc.c|  4 +--
  drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c  |  4 +--
  drivers/gpu/drm/i915/gt/uc/selftest_guc.c |  2 +-
  9 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c 
b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
index e2e42b3e0d5d..3b69bc6616bd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
@@ -298,7 +298,7 @@ static int gsc_fw_load_prepare(struct intel_gsc_uc *gsc)
memcpy_toio(gsc->local_vaddr, src, gsc->fw.size);
memset_io(gsc->local_vaddr + gsc->fw.size, 0, gsc->local->size - 
gsc->fw.size);
  
-	intel_guc_write_barrier(&gt->uc.guc);

+   intel_guc_write_barrier(gt_to_guc(gt));
  
  	i915_gem_object_unpin_map(gsc->fw.obj);
  
@@ -351,7 +351,7 @@ static int gsc_fw_query_compatibility_version(struct intel_gsc_uc *gsc)

void *vaddr;
int err;
  
-	err = intel_guc_allocate_and_map_vma(&gt->uc.guc, GSC_VER_PKT_SZ * 2,
+   err = intel_guc_allocate_and_map_vma(gt_to_guc(gt), GSC_VER_PKT_SZ * 2,
 &vma, &vaddr);
if (err) {
gt_err(gt, "failed to allocate vma for GSC version query\n");
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c 
b/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c
index 40817ebcca71..a7d5465655f9 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c
@@ -358,7 +358,8 @@ static int proxy_channel_alloc(struct intel_gsc_uc *gsc)
void *vaddr;
int err;
  
-	err = intel_guc_allocate_and_map_vma(&gt->uc.guc, GSC_PROXY_CHANNEL_SIZE,
+   err = intel_guc_allocate_and_map_vma(gt_to_guc(gt),
+GSC_PROXY_CHANNEL_SIZE,
 &vma, &vaddr);
if (err)
return err;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index 63724e17829a..1ef470e64604 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -956,7 +956,7 @@ u32 intel_guc_engine_usage_offset(struct intel_guc *guc)
  
  struct iosys_map intel_guc_engine_usage_record_map(struct intel_engine_cs *engine)

  {
-   struct intel_guc *guc = &engine->gt->uc.guc;
+   struct intel_guc *guc = gt_to_guc(engine->gt);
u8 guc_class = engine_class_to_guc_class(engine->class);
size_t offset = offsetof(struct __guc_ads_blob,
 
engine_usage.engines[guc_class][ilog2(engine->logical_mask)]);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index a4da0208c883..84a8807391c5 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -1441,7 +1441,7 @@ int intel_guc_capture_print_engine_node(struct 
drm_i915_error_state_buf *ebuf,
if (!cap || !ee->engine)
return -ENODEV;
  
-	guc = &ee->engine->gt->uc.guc;

+   guc = gt_to_guc(ee->engine->gt);
  
  	i915_error_printf(ebuf, "global --- GuC Error Capture on %s command stream:\n",

  ee->engine->name);
@@ -1543,7 +1543,7 @@ bool intel_guc_capture_is_matching_engine(struct intel_gt 
*gt,
if (!gt || !ce || !engine)
return false;
  
-	guc = &gt->uc.guc;

+   guc = gt_to_guc(gt);
if (!guc->capture)
return false;
  
@@ -1573,7 +1573,7 @@ void intel_guc_capture_get_matching_node(struct intel_gt *gt,

if (!gt || !ee || !ce)
return;
  
-	guc = &gt->uc.guc;

+   guc = gt_to_guc(gt);
if (!guc->capture)
return;
  
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_hwconfig.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_hwconfig.c

index cc9569af7f0c..b67a15f74276 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_hwconfig.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_hwconfig.c
@@ -111,7 +111,7 @@ static bool has_table(struct drm_i915_private *i915)
  static int guc_hwconfig_init(struct intel_gt *gt)
  {
struct intel_hwconfig *hwconfig = &gt->info.hwconfig;
-   struct intel_guc *guc = &gt->uc.guc;
+   struct intel_guc *guc = gt_to_guc(gt);
int ret;
  
  	if (!has_

Re: [PATCH v3 1/4] drm/i915/gt: Create the gt_to_guc() wrapper

2023-12-21 Thread Nirmoy Das



On 12/6/2023 9:46 PM, Andi Shyti wrote:

We already have guc_to_gt(), and getting to the guc from the GT
requires some mental effort. Add gt_to_guc().

Given a reference to the "gt", gt_to_guc() returns the
pointer to the "guc".

Update all the files under the gt/ directory.

Signed-off-by: Andi Shyti 

Reviewed-by: Nirmoy Das 
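
For readers outside the driver, the two directions of the conversion can be sketched in self-contained userspace C. The struct layouts below are heavily simplified stand-ins (the real i915 structures carry many more members), and container_of() is re-created locally because the kernel macro is not available outside the kernel. gt_to_guc() and guc_to_i915() mirror the helpers shown in the patches in this archive. The body of guc_to_gt() is an assumption modelled on the gsc_to_gt() context line.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the nested i915 structures. */
struct intel_guc { int id; };
struct intel_uc { struct intel_guc guc; };
struct drm_i915_private { int devid; };
struct intel_gt {
	struct drm_i915_private *i915;
	struct intel_uc uc;
};

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* GT -> GuC: take the address of the embedded member. */
static inline struct intel_guc *gt_to_guc(struct intel_gt *gt)
{
	return &gt->uc.guc;
}

/* GuC -> GT: walk back from the embedded member to its container. */
static inline struct intel_gt *guc_to_gt(struct intel_guc *guc)
{
	return container_of(guc, struct intel_gt, uc.guc);
}

/* GuC -> i915: compose the two hops in one named helper. */
static inline struct drm_i915_private *guc_to_i915(struct intel_guc *guc)
{
	return guc_to_gt(guc)->i915;
}
```

The point of the wrappers is readability: a call such as gt_to_guc(engine->gt) states the intent directly, while the open-coded member chain forces the reader to know the struct nesting.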

---
  drivers/gpu/drm/i915/gt/intel_engine_cs.c | 4 ++--
  drivers/gpu/drm/i915/gt/intel_ggtt.c  | 9 +++--
  drivers/gpu/drm/i915/gt/intel_gt.h| 5 +
  drivers/gpu/drm/i915/gt/intel_gt_irq.c| 6 +++---
  drivers/gpu/drm/i915/gt/intel_gt_pm_debugfs.c | 2 +-
  drivers/gpu/drm/i915/gt/intel_gt_sysfs_pm.c   | 8 
  drivers/gpu/drm/i915/gt/intel_rc6.c   | 4 ++--
  drivers/gpu/drm/i915/gt/intel_rps.c   | 2 +-
  drivers/gpu/drm/i915/gt/intel_tlb.c   | 2 +-
  drivers/gpu/drm/i915/gt/selftest_slpc.c   | 6 +++---
  10 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 40687806d22a..bede7f09d4af 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -589,7 +589,7 @@ u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs 
*engine, u64 value)
 * NB: The GuC API only supports 32bit values. However, the limit is 
further
 * reduced due to internal calculations which would otherwise overflow.
 */
-   if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
+   if (intel_guc_submission_is_wanted(gt_to_guc(engine->gt)))
value = min_t(u64, value, guc_policy_max_preempt_timeout_ms());
  
  	value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));

@@ -610,7 +610,7 @@ u64 intel_clamp_timeslice_duration_ms(struct 
intel_engine_cs *engine, u64 value)
 * NB: The GuC API only supports 32bit values. However, the limit is 
further
 * reduced due to internal calculations which would otherwise overflow.
 */
-   if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
+   if (intel_guc_submission_is_wanted(gt_to_guc(engine->gt)))
value = min_t(u64, value, guc_policy_max_exec_quantum_ms());
  
  	value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c 
b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 21a7e3191c18..aa1e9249d393 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -230,11 +230,8 @@ static void guc_ggtt_ct_invalidate(struct intel_gt *gt)
struct intel_uncore *uncore = gt->uncore;
intel_wakeref_t wakeref;
  
-	with_intel_runtime_pm_if_active(uncore->rpm, wakeref) {

-   struct intel_guc *guc = &gt->uc.guc;
-
-   intel_guc_invalidate_tlb_guc(guc);
-   }
+   with_intel_runtime_pm_if_active(uncore->rpm, wakeref)
+   intel_guc_invalidate_tlb_guc(gt_to_guc(gt));
  }
  
  static void guc_ggtt_invalidate(struct i915_ggtt *ggtt)

@@ -245,7 +242,7 @@ static void guc_ggtt_invalidate(struct i915_ggtt *ggtt)
gen8_ggtt_invalidate(ggtt);
  
  	list_for_each_entry(gt, &ggtt->gt_list, ggtt_link) {
-   if (intel_guc_tlb_invalidation_is_available(&gt->uc.guc))
+   if (intel_guc_tlb_invalidation_is_available(gt_to_guc(gt)))
guc_ggtt_ct_invalidate(gt);
else if (GRAPHICS_VER(i915) >= 12)
intel_uncore_write_fw(gt->uncore,
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h 
b/drivers/gpu/drm/i915/gt/intel_gt.h
index b0e453e27ea8..d7c859039828 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt.h
@@ -118,6 +118,11 @@ static inline struct intel_gt *gsc_to_gt(struct intel_gsc 
*gsc)
return container_of(gsc, struct intel_gt, gsc);
  }
  
+static inline struct intel_guc *gt_to_guc(struct intel_gt *gt)

+{
+   return &gt->uc.guc;
+}
+
  void intel_gt_common_init_early(struct intel_gt *gt);
  int intel_root_gt_init_early(struct drm_i915_private *i915);
  int intel_gt_assign_ggtt(struct intel_gt *gt);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_irq.c 
b/drivers/gpu/drm/i915/gt/intel_gt_irq.c
index 77fb57223465..ad4c51f18d3a 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_irq.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_irq.c
@@ -68,9 +68,9 @@ gen11_other_irq_handler(struct intel_gt *gt, const u8 
instance,
struct intel_gt *media_gt = gt->i915->media_gt;
  
  	if (instance == OTHER_GUC_INSTANCE)

-   return guc_irq_handler(&gt->uc.guc, iir);
+   return guc_irq_handler(gt_to_guc(gt), iir);
if (instance == OTHER_MEDIA_GUC_INSTANCE && media_gt)
-   return guc_irq_handler(&media_gt->uc.guc, iir);
+   return guc_irq_handler(gt_to_guc(media_gt), iir);
  
  	if (instance == OTHER_GTPM_INSTANC

Re: [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace

2023-12-06 Thread Nirmoy Das

Hi John,

On 12/5/2023 8:50 PM, John Harrison wrote:

On 12/5/2023 02:39, Nirmoy Das wrote:

Hi John,

On 12/5/2023 10:10 AM, John Harrison wrote:

On 12/5/2023 00:52, Nirmoy Das wrote:

gen8_engine_reset_prepare() can fail when HW fails to set
RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
error as driver will retry.

Convert the log to a trace log for debugging without triggering
unnecessary concerns in CI or for end-users during non-fatal
scenarios.

I strongly disagree with this change. The hardware spec for the
RESET_CTL and GDRST registers is that they will self clear within a
matter of microseconds. If something is so badly wrong with the
hardware that it can't even manage to reset


This message reports a reset-readiness poll timeout, not a failed
reset, which doesn't sound so serious if the subsequent attempt
manages to reset the engine.

Not sure what the distinction is. The reset procedure is: poke
RESET_CTL and wait for it to clear, then poke GDRST and wait for it to
clear. Just because step one is failing rather than step two does not
mean that the reset as a whole has not failed.


Note that the purpose of RESET_CTL is to pause a bunch of stuff like 
the command streamers to prevent them from issuing new memory requests 
while the reset is in progress. If it fails, it likely means that a CS 
is refusing to stop. Most probably because it can't reach a stopping 
point because it is stuck waiting on a lost memory request or similar. 
And the point of stopping further memory requests during reset is that 
if the memory channel gets out of sync (because only the GT side is 
reset during a GT reset) then that can result in total system failure. 
As in potentially even the CPU can no longer get to memory if it is an 
integrated platform. So yes, it can be quite a serious failure indeed.
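
To make the "slow versus stuck" distinction concrete, here is a small userspace sketch of a bounded readiness poll wrapped in a retry loop. It is a toy model only: toy_reg, TOY_READY_TO_RESET, toy_reset_prepare() and toy_engine_reset() are hypothetical names, the bit position is made up, and counting reads stands in for the real time-based timeout.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_READY_TO_RESET (1u << 1)	/* hypothetical bit position */

/*
 * Fake register: reports ready once polls_until_ready reads have
 * been consumed, or never if polls_until_ready is negative.
 */
struct toy_reg {
	uint32_t value;
	int polls_until_ready;
};

static uint32_t toy_read(struct toy_reg *reg)
{
	if (reg->polls_until_ready == 0)
		reg->value |= TOY_READY_TO_RESET;
	else if (reg->polls_until_ready > 0)
		reg->polls_until_ready--;
	return reg->value;
}

/* Step one of the reset: poll for readiness, at most max_polls reads. */
static int toy_reset_prepare(struct toy_reg *reg, int max_polls)
{
	for (int i = 0; i < max_polls; i++)
		if (toy_read(reg) & TOY_READY_TO_RESET)
			return 0;
	return -1;
}

/*
 * Retry the prepare step. Returns how many attempts failed before one
 * succeeded, or -1 if every attempt timed out.
 */
static int toy_engine_reset(struct toy_reg *reg, int max_polls, int retries)
{
	for (int attempt = 0; attempt < retries; attempt++)
		if (toy_reset_prepare(reg, max_polls) == 0)
			return attempt;
	return -1;
}
```

A register that is merely slow succeeds on a later attempt, and a longer timeout would have the same effect. A register that never becomes ready fails every attempt regardless of the timeout, which is the case described above as serious.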




Thanks, bspec didn't explain those details. My intention was to
acknowledge that engine reset is a complicated process, which is why
the driver retries, and to avoid spooking CI/users if a subsequent
reset works, but I get your objection on this.




I couldn't get enough details on when it can happen that the HW takes
a very long time to set the readiness bit.

Is it simply 'taking a long time' or is it never clearing at all? If it
is just that the timeout is too short then the proper fix would be to
increase the timeout. But if it is taking seconds or longer or just
never succeeding at all, then something is very bad.


I tried a 10x timeout and it did not help, so I think the CS is stuck,
even though the retry works. I will try to get more details from the HW
team on this issue.







then that is something that very much warrants more than a 
completely silent trace event. It most certainly should be flagged 
as a failure in CI.


Just because the driver will retry does not mean that this is not a 
serious error. And if the first attempt failed, why would a 
subsequent attempt succeed?


The patch is not ignoring the failure. If the subsequent attempt
fails then driver load will fail, or it will be wedged if that happens
after driver load.

One thing I really hate about our driver is the total lack of
information when something goes wrong during load. The driver wedges
in total silence. There are many error paths that have no reporting at
all. Which means you are left with a totally useless bug report.






Escalating to FLR may have more success, but that is not something 
that i915 currently does.


Do we still need to do an FLR if a subsequent engine reset fails?

Assuming that we are talking about modern(ish) platforms, an engine 
reset failure would be hit by GuC rather than i915, but that would be 
escalated to an i915 based full GT reset. Generally speaking though, 
if the engine reset fails the GT reset isn't going to do much better. 
It would fix a dead GuC problem but it can't help with memory related 
issues. If the full GT reset fails then we are out of escalation 
routes as there is no FLR path at present (I think we have that at 
driver unload on MTL but not for general reset?). The FLR resets a lot 
more than just the GT, so it does have a chance to fix some issues 
that a GT reset can't. After driver-level FLR, there is PCI level FLR. 
Not sure if that involves a full power down and restart, but if not 
then that would be the last escalation possible. A power cycle really 
should fix any issues, if it doesn't then it's time to return the 
system as being totally dead!
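
The escalation ladder described above can be written down as a short userspace sketch. This models the sequence as discussed in this thread, not what i915 implements (the thread notes there is no general FLR path at present). The enum, the callback type and both functions are hypothetical names.

```c
#include <assert.h>

/* Reset levels in escalation order, as listed in the discussion. */
enum toy_reset_level {
	TOY_ENGINE_RESET,
	TOY_GT_RESET,
	TOY_DRIVER_FLR,
	TOY_PCI_FLR,
	TOY_POWER_CYCLE,
	TOY_RESET_LEVELS	/* number of levels, not a level itself */
};

/* Callback returns 0 if the reset at this level recovered the device. */
typedef int (*toy_reset_fn)(enum toy_reset_level level, void *data);

/* Try each level in order; return the level that worked, or -1. */
static int toy_escalate(toy_reset_fn try_reset, void *data)
{
	for (int level = TOY_ENGINE_RESET; level < TOY_RESET_LEVELS; level++)
		if (try_reset((enum toy_reset_level)level, data) == 0)
			return level;
	return -1;
}

/* Example policy: every level at or above the needed one succeeds. */
static int toy_fixed_at(enum toy_reset_level level, void *data)
{
	enum toy_reset_level needed = *(enum toy_reset_level *)data;

	return level >= needed ? 0 : -1;
}
```

A return value of -1 corresponds to the "totally dead" case: every level, including the power cycle, was tried and none recovered the device.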


My recollection is that the vast majority of engine reset failures 
I've looked at have been completely catastrophic and the system only 
recovered after a reboot. I.e. after the card was power cycled. Such 
issues were generally caused by bad memory. Once the path to memory 
has died, there really is not much of the GPU that can do anything at 
all and there isn't much that can be done to recover it.



Thanks,

Nirmoy



John.





Regards,

Nirmoy



John.




v2: Improve commit message(Tvrtko)

Cc: Tvrtko Ursulin 
Cc: John Harrison

Re: [Intel-gfx] [PATCH] drm/i915/gt: Reduce log severity on reset prepare.

2023-12-05 Thread Nirmoy Das

Hi Tvrtko,

On 12/5/2023 11:05 AM, Tvrtko Ursulin wrote:


On 05/12/2023 08:50, Nirmoy Das wrote:

Hi Tvrtko,

On 12/5/2023 9:34 AM, Tvrtko Ursulin wrote:


On 01/12/2023 15:44, Nirmoy Das wrote:

gen8_engine_reset_prepare() can fail when HW fails to set
RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
error as driver will retry.

Let the caller of gen8_engine_reset_prepare() decide if a
failure in gen8_engine_reset_prepare is an error or not.


No complaints per se but I don't see the caller deciding and it is 
not really reducing log level but converting to trace. So commit 
message and patch do not align for me which I think should be improved.



I meant the return value is checked by the caller, 
gen8_reset_engines(). I will resend with an improved commit message.


Ah okay, maybe my bad for not figuring out that possibility. I guess 
it might be passable as is, but yes, clearer commit text would be better.


I sent a v2 already :)




Trace is good enough - we are not usually interested in seeing those 
as dbg/info/notice?



The idea is that all GT-related events are recorded in the trace, and
dmesg can be noisy at times.



Regards,

Nirmoy



Regards,

Tvrtko



Thanks,

Nirmoy



Regards,

Tvrtko


Cc: Tvrtko Ursulin 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Andrzej Hajda 
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_reset.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c

index d5ed904f355d..e6fbc6202c80 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine)
 	ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
 					   700, 0, NULL);
 	if (ret)
-		gt_err(engine->gt,
-		       "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
-		       engine->name, request,
-		       intel_uncore_read_fw(uncore, reg));
+		GT_TRACE(engine->gt,
+			 "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
+			 engine->name, request,
+			 intel_uncore_read_fw(uncore, reg));
 
 	return ret;
 }


Re: [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace

2023-12-05 Thread Nirmoy Das

Hi John,

On 12/5/2023 10:10 AM, John Harrison wrote:

On 12/5/2023 00:52, Nirmoy Das wrote:

gen8_engine_reset_prepare() can fail when HW fails to set
RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
error as driver will retry.

Convert the log to a trace log for debugging without triggering
unnecessary concerns in CI or for end-users during non-fatal scenarios.
I strongly disagree with this change. The hardware spec for the 
RESET_CTL and GDRST registers are that they will self clear within a 
matter of microseconds. If something is so badly wrong with the 
hardware that it can't even manage to reset



This message reports a reset-readiness poll timeout, not a failed
reset, which doesn't sound so serious if the subsequent attempt
manages to reset the engine.


I couldn't get enough details on when it can happen that the HW takes
a very long time to set the readiness bit.



then that is something that very much warrants more than a completely 
silent trace event. It most certainly should be flagged as a failure 
in CI.


Just because the driver will retry does not mean that this is not a 
serious error. And if the first attempt failed, why would a subsequent 
attempt succeed?


The patch is not ignoring the failure. If the subsequent attempt fails 
then driver load will fail or it will be wedged if that happens after 
driver load.



Escalating to FLR may have more success, but that is not something 
that i915 currently does.


Do we still need to do an FLR if a subsequent engine reset fails?


Regards,

Nirmoy



John.




v2: Improve commit message(Tvrtko)

Cc: Tvrtko Ursulin 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Andrzej Hajda 
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
Signed-off-by: Nirmoy Das 
Reviewed-by: Andi Shyti 
Reviewed-by: Andrzej Hajda 
---
  drivers/gpu/drm/i915/gt/intel_reset.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c

index d5ed904f355d..e6fbc6202c80 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine)
 	ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
 					   700, 0, NULL);
 	if (ret)
-		gt_err(engine->gt,
-		       "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
-		       engine->name, request,
-		       intel_uncore_read_fw(uncore, reg));
+		GT_TRACE(engine->gt,
+			 "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
+			 engine->name, request,
+			 intel_uncore_read_fw(uncore, reg));
 
 	return ret;
 }




[PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace

2023-12-05 Thread Nirmoy Das
gen8_engine_reset_prepare() can fail when HW fails to set
RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
error as driver will retry.

Convert the log to a trace log for debugging without triggering
unnecessary concerns in CI or for end-users during non-fatal scenarios.

v2: Improve commit message(Tvrtko)

Cc: Tvrtko Ursulin 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Andrzej Hajda 
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
Signed-off-by: Nirmoy Das 
Reviewed-by: Andi Shyti 
Reviewed-by: Andrzej Hajda 
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index d5ed904f355d..e6fbc6202c80 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine)
 	ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
 					   700, 0, NULL);
 	if (ret)
-		gt_err(engine->gt,
-		       "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
-		       engine->name, request,
-		       intel_uncore_read_fw(uncore, reg));
+		GT_TRACE(engine->gt,
+			 "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
+			 engine->name, request,
+			 intel_uncore_read_fw(uncore, reg));
 
 	return ret;
 }
-- 
2.42.0



Re: [Intel-gfx] [PATCH] drm/i915/gt: Reduce log severity on reset prepare.

2023-12-05 Thread Nirmoy Das

Hi Tvrtko,

On 12/5/2023 9:34 AM, Tvrtko Ursulin wrote:


On 01/12/2023 15:44, Nirmoy Das wrote:

gen8_engine_reset_prepare() can fail when HW fails to set
RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
error as driver will retry.

Let the caller of gen8_engine_reset_prepare() decide if a
failure in gen8_engine_reset_prepare is an error or not.


No complaints per se but I don't see the caller deciding and it is not 
really reducing log level but converting to trace. So commit message 
and patch do not align for me which I think should be improved.



I meant the return value is checked by the caller, gen8_reset_engines(). 
I will resend with an improved commit message.


Thanks,

Nirmoy



Regards,

Tvrtko


Cc: Tvrtko Ursulin 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Andrzej Hajda 
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_reset.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c

index d5ed904f355d..e6fbc6202c80 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine)
 	ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
 					   700, 0, NULL);
 	if (ret)
-		gt_err(engine->gt,
-		       "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
-		       engine->name, request,
-		       intel_uncore_read_fw(uncore, reg));
+		GT_TRACE(engine->gt,
+			 "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
+			 engine->name, request,
+			 intel_uncore_read_fw(uncore, reg));
 
 	return ret;
 }


[PATCH] drm/i915/gt: Reduce log severity on reset prepare.

2023-12-01 Thread Nirmoy Das
gen8_engine_reset_prepare() can fail when HW fails to set
RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
error as driver will retry.

Let the caller of gen8_engine_reset_prepare() decide if a
failure in gen8_engine_reset_prepare is an error or not.

Cc: Tvrtko Ursulin 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Andrzej Hajda 
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index d5ed904f355d..e6fbc6202c80 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine)
 	ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
 					   700, 0, NULL);
 	if (ret)
-		gt_err(engine->gt,
-		       "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
-		       engine->name, request,
-		       intel_uncore_read_fw(uncore, reg));
+		GT_TRACE(engine->gt,
+			 "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
+			 engine->name, request,
+			 intel_uncore_read_fw(uncore, reg));
 
 	return ret;
 }
-- 
2.42.0



Re: [PATCH v3] drm/i915: Flush WC GGTT only on required platforms

2023-10-31 Thread Nirmoy Das

This is now merged to gt-next


Thanks,

Nirmoy

On 10/18/2023 4:04 PM, Nirmoy Das wrote:


On 10/18/2023 3:00 PM, Andi Shyti wrote:

Hi Nirmoy,


gen8_ggtt_invalidate() is only needed for a limited set of platforms
where GGTT is mapped as WC. This was added as a way to fix WC based
GGTT in commit 0f9b91c754b7 ("drm/i915: flush system agent TLBs on
SNB") and there is no reference in the HW docs that forces us to use
this on non-WC backed GGTT.

This can also cause unwanted side-effects on XE_HP platforms where
GFX_FLSH_CNTL_GEN6 is not valid anymore.

v2: Add a func to detect WC GGTT (Ville)
v3: Improve commit log and add reference commit (Daniel)

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")

I'm wondering if this is the right Fixes, though. Should this
rather be:

Fixes: 6266992cf105 ("drm/i915/gt: remove GRAPHICS_VER == 10")

Hard to find a real Fixes for this. I just want to backport this to dg2
where we can have unwanted side-effects.

yes, this piece of code has moved around enough to make it
difficult to track its origin.

I think the one I found should be the correct one,


That just removes a graphics ver, not related to WC GGTT map or XE_HP.


  but the dg2
force probe removal can also become a placeholder for DG2 fixes.


Yes, I have no better ideas either.


Regards,

Nirmoy



I won't complain.

Andi


Re: [Intel-gfx] [PATCH v2 1/5] drm/i915/guc: Create the guc_to_i915() wrapper

2023-10-25 Thread Nirmoy Das



On 10/25/2023 4:53 PM, Andi Shyti wrote:

Hi Nirmoy,


+static inline struct drm_i915_private *guc_to_i915(struct intel_guc *guc)
+{
+   return guc_to_gt(guc)->i915;
+}
+

We don't want inline functions in header files [1]. Otherwise the series
looks fine to me:

the reason for that patch is that we were including the
i915_drv.h just for that inline function and we were doing it
inside the gt/.

In this patch I am not changing any header dependency.


Sounds fair, thanks,

Nirmoy



I guess the original idea from Matt was to have a generic network
of intel_gt_needs_wa_xxx(), but it didn't develop further.


Reviewed-by: Nirmoy Das 

Thanks,
Andi


Re: [Intel-gfx] [PATCH v2 1/5] drm/i915/guc: Create the guc_to_i915() wrapper

2023-10-25 Thread Nirmoy Das

Hi Andi,

On 10/25/2023 4:35 PM, Andi Shyti wrote:

Given a reference to "guc", the guc_to_i915() returns the
pointer to "i915" private data.

Signed-off-by: Andi Shyti 
---
  drivers/gpu/drm/i915/gt/intel_gt.h|  5 +
  drivers/gpu/drm/i915/gt/uc/intel_guc.c|  2 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c|  2 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c |  2 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 10 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_rc.c |  2 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c   |  2 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  4 ++--
  8 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h 
b/drivers/gpu/drm/i915/gt/intel_gt.h
index 970bedf6b78a..12a638f05d63 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt.h
@@ -114,6 +114,11 @@ static inline struct intel_gt *gsc_to_gt(struct intel_gsc 
*gsc)
return container_of(gsc, struct intel_gt, gsc);
  }
  
+static inline struct drm_i915_private *guc_to_i915(struct intel_guc *guc)

+{
+   return guc_to_gt(guc)->i915;
+}
+


We don't want inline functions in header files [1]. Otherwise the series 
looks fine to me:


Reviewed-by: Nirmoy Das 

[1] https://patchwork.freedesktop.org/series/124069/


Regards,

Nirmoy


  void intel_gt_common_init_early(struct intel_gt *gt);
  int intel_root_gt_init_early(struct drm_i915_private *i915);
  int intel_gt_assign_ggtt(struct intel_gt *gt);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
index 3f3df1166b86..2b450c43bbd7 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
@@ -330,7 +330,7 @@ static u32 guc_ctl_wa_flags(struct intel_guc *guc)
  
  static u32 guc_ctl_devid(struct intel_guc *guc)

  {
-   struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
+   struct drm_i915_private *i915 = guc_to_i915(guc);
  
  	return (INTEL_DEVID(i915) << 16) | INTEL_REVID(i915);

  }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index a4da0208c883..a1cd40d80517 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -355,7 +355,7 @@ guc_capture_alloc_steered_lists(struct intel_guc *guc,
  static const struct __guc_mmio_reg_descr_group *
  guc_capture_get_device_reglist(struct intel_guc *guc)
  {
-   struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
+   struct drm_i915_private *i915 = guc_to_i915(guc);
const struct __guc_mmio_reg_descr_group *lists;
  
  	if (GRAPHICS_VER(i915) >= 12)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index 89e314b3756b..4e147de5118f 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -265,7 +265,7 @@ int intel_guc_ct_init(struct intel_guc_ct *ct)
u32 *cmds;
int err;
  
-	err = i915_inject_probe_error(guc_to_gt(guc)->i915, -ENXIO);

+   err = i915_inject_probe_error(guc_to_i915(guc), -ENXIO);
if (err)
return err;
  
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c

index 55bc8b55fbc0..bf16351c9349 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c
@@ -520,7 +520,7 @@ void intel_guc_log_init_early(struct intel_guc_log *log)
  static int guc_log_relay_create(struct intel_guc_log *log)
  {
struct intel_guc *guc = log_to_guc(log);
-   struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
+   struct drm_i915_private *i915 = guc_to_i915(guc);
struct rchan *guc_log_relay_chan;
size_t n_subbufs, subbuf_size;
int ret;
@@ -573,7 +573,7 @@ static void guc_log_relay_destroy(struct intel_guc_log *log)
  static void guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log)
  {
struct intel_guc *guc = log_to_guc(log);
-   struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
+   struct drm_i915_private *i915 = guc_to_i915(guc);
intel_wakeref_t wakeref;
  
  	_guc_log_copy_debuglogs_for_relay(log);
@@ -589,7 +589,7 @@ static void guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log)
  static u32 __get_default_log_level(struct intel_guc_log *log)
  {
struct intel_guc *guc = log_to_guc(log);
-   struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
+   struct drm_i915_private *i915 = guc_to_i915(guc);
  
  	/* A negative value means "use platform/config default" */
if (i915->params.guc_log_level < 0) {
@@ -664,7 +664,7 @@ void intel_guc_log_destroy(struct intel_guc_log *log)
  int intel_guc_log_set_level(struct intel_guc_log *log, u32 level)
 

[PATCH] drm/i915/gt: Use proper priority enum instead of 0

2023-10-23 Thread Nirmoy Das
I915_PRIORITY_NORMAL is 0 so use that instead for better
readability.

Cc: John Harrison 
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c 
b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 9a527e1f5be6..1a8e2b7db013 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -188,7 +188,7 @@ static void heartbeat(struct work_struct *wrk)
 * low latency and no jitter] the chance to naturally
 * complete before being preempted.
 */
-   attr.priority = 0;
+   attr.priority = I915_PRIORITY_NORMAL;
if (rq->sched.attr.priority >= attr.priority)
attr.priority = I915_PRIORITY_HEARTBEAT;
if (rq->sched.attr.priority >= attr.priority)
-- 
2.41.0
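
For readers unfamiliar with the surrounding code, the escalation pattern in the hunk above can be sketched as a standalone function. This is only an illustration under stated assumptions: the enum names and numeric values below are hypothetical (only "normal is 0" comes from the commit message), and the third level, PRIO_BARRIER, is an assumed continuation that the quoted diff context does not show.

```c
#include <assert.h>

/* Hypothetical priority levels; only PRIO_NORMAL == 0 is taken from the patch. */
enum prio {
	PRIO_NORMAL = 0,
	PRIO_HEARTBEAT = 100,   /* assumed value, for illustration only */
	PRIO_BARRIER = 1000,    /* assumed level and value, for illustration only */
};

/*
 * Pick the priority for the next heartbeat pulse: start at the named
 * "normal" level and step up only if the request already runs at or
 * above the level we were about to use.
 */
static int next_heartbeat_priority(int current_prio)
{
	int prio = PRIO_NORMAL;

	assert(PRIO_NORMAL < PRIO_HEARTBEAT && PRIO_HEARTBEAT < PRIO_BARRIER);

	if (current_prio >= prio)
		prio = PRIO_HEARTBEAT;
	if (current_prio >= prio)
		prio = PRIO_BARRIER;

	return prio;
}
```

Starting from the named constant rather than a literal 0 makes the first comparison read as "is the request at least normal priority", which is the readability gain the commit message refers to.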



Re: [PATCH v3] drm/i915: Flush WC GGTT only on required platforms

2023-10-18 Thread Nirmoy Das



On 10/18/2023 3:00 PM, Andi Shyti wrote:

Hi Nirmoy,


gen8_ggtt_invalidate() is only needed for limited set of platforms
where GGTT is mapped as WC. This was added as way to fix WC based GGTT in
commit 0f9b91c754b7 ("drm/i915: flush system agent TLBs on SNB") and
there are no reference in HW docs that forces us to use this on non-WC
backed GGTT.

This can also cause unwanted side-effects on XE_HP platforms where
GFX_FLSH_CNTL_GEN6 is not valid anymore.

v2: Add a func to detect wc ggtt detection (Ville)
v3: Improve commit log and add reference commit (Daniel)

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")

I'm wondering if this is the right Fixes, though. Should this
rather be:

Fixes: 6266992cf105 ("drm/i915/gt: remove GRAPHICS_VER == 10")

Hard to find a real Fixes for this. I just want to backport this to dg2
where we can have unwanted side-effects.

Yes, this piece of code has moved around enough to make it
difficult to track its origin.

I think the one I found should be the correct one,


That just removes a graphics ver, not related to WC GGTT map or XE_HP.


  but the dg2
force probe removal can also become a placeholder for DG2 fixes.


Yes, I have no better ideas either.


Regards,

Nirmoy



I won't complain.

Andi


Re: [PATCH v3] drm/i915: Flush WC GGTT only on required platforms

2023-10-18 Thread Nirmoy Das

Hi Andi,

On 10/18/2023 1:49 PM, Andi Shyti wrote:

Hi Nirmoy,

On Wed, Oct 18, 2023 at 11:38:15AM +0200, Nirmoy Das wrote:

gen8_ggtt_invalidate() is only needed for limited set of platforms
where GGTT is mapped as WC. This was added as way to fix WC based GGTT in
commit 0f9b91c754b7 ("drm/i915: flush system agent TLBs on SNB") and
there are no reference in HW docs that forces us to use this on non-WC
backed GGTT.

This can also cause unwanted side-effects on XE_HP platforms where
GFX_FLSH_CNTL_GEN6 is not valid anymore.

v2: Add a func to detect wc ggtt detection (Ville)
v3: Improve commit log and add reference commit (Daniel)

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")

I'm wondering if this is the right Fixes, though. Should this
rather be:

Fixes: 6266992cf105 ("drm/i915/gt: remove GRAPHICS_VER == 10")


Hard to find a real Fixes for this. I just want to backport this to dg2 
where we can have unwanted side-effects.



Regards,

Nirmoy



?

Andi


Re: [PATCH v2] drm/i915: Prevent potential null-ptr-deref in engine_init_common

2023-10-18 Thread Nirmoy Das

This is now merged. The CI errors are unrelated.

On 10/11/2023 2:25 PM, Nirmoy Das wrote:

If measure_breadcrumb_dw() returns an error and bce isn't created,
this commit ensures that intel_engine_destroy_pinned_context()
is not called with a NULL bce.

v2: Fix the subject s/UAF/null-ptr-deref(Jani)

Fixes: b35274993680 ("drm/i915: Create a kernel context for GGTT updates")
Cc: Oak Zeng 
Cc: Andi Shyti 
Cc: Jani Nikula 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_engine_cs.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 179d9546865b..4a11219e560e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1491,7 +1491,8 @@ static int engine_init_common(struct intel_engine_cs 
*engine)
return 0;
  
  err_bce_context:
-   intel_engine_destroy_pinned_context(bce);
+   if (bce)
+   intel_engine_destroy_pinned_context(bce);
  err_ce_context:
intel_engine_destroy_pinned_context(ce);
return ret;


[PATCH v3] drm/i915: Flush WC GGTT only on required platforms

2023-10-18 Thread Nirmoy Das
gen8_ggtt_invalidate() is only needed for limited set of platforms
where GGTT is mapped as WC. This was added as way to fix WC based GGTT in
commit 0f9b91c754b7 ("drm/i915: flush system agent TLBs on SNB") and
there are no reference in HW docs that forces us to use this on non-WC
backed GGTT.

This can also cause unwanted side-effects on XE_HP platforms where
GFX_FLSH_CNTL_GEN6 is not valid anymore.

v2: Add a func to detect wc ggtt detection (Ville)
v3: Improve commit log and add reference commit (Daniel)

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Joonas Lahtinen 
Cc: Jani Nikula 
Cc: Jonathan Cavitt 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Ville Syrjälä 
Cc: Daniel Vetter 
Cc:  # v6.2+
Suggested-by: Matt Roper 
Signed-off-by: Nirmoy Das 
Reviewed-by: Matt Roper 
Reviewed-by: Andi Shyti 
---
 drivers/gpu/drm/i915/gt/intel_ggtt.c | 35 +++-
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c 
b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 1c93e84278a0..15fc8e4703f4 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -195,6 +195,21 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt)
spin_unlock_irq(&uncore->lock);
 }
 
+static bool needs_wc_ggtt_mapping(struct drm_i915_private *i915)
+{
+   /*
+* On BXT+/ICL+ writes larger than 64 bit to the GTT pagetable range
+* will be dropped. For WC mappings in general we have 64 byte burst
+* writes when the WC buffer is flushed, so we can't use it, but have to
+* resort to an uncached mapping. The WC issue is easily caught by the
+* readback check when writing GTT PTE entries.
+*/
+   if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11)
+   return true;
+
+   return false;
+}
+
 static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt)
 {
struct intel_uncore *uncore = ggtt->vm.gt->uncore;
@@ -202,8 +217,12 @@ static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt)
/*
 * Note that as an uncached mmio write, this will flush the
 * WCB of the writes into the GGTT before it triggers the invalidate.
+*
+* Only perform this when GGTT is mapped as WC, see ggtt_probe_common().
 */
-   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN);
+   if (needs_wc_ggtt_mapping(ggtt->vm.i915))
+   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6,
+ GFX_FLSH_CNTL_EN);
 }
 
 static void guc_ggtt_ct_invalidate(struct intel_gt *gt)
@@ -1140,17 +1159,11 @@ static int ggtt_probe_common(struct i915_ggtt *ggtt, 
u64 size)
GEM_WARN_ON(pci_resource_len(pdev, GEN4_GTTMMADR_BAR) != 
gen6_gttmmadr_size(i915));
phys_addr = pci_resource_start(pdev, GEN4_GTTMMADR_BAR) + 
gen6_gttadr_offset(i915);
 
-   /*
-* On BXT+/ICL+ writes larger than 64 bit to the GTT pagetable range
-* will be dropped. For WC mappings in general we have 64 byte burst
-* writes when the WC buffer is flushed, so we can't use it, but have to
-* resort to an uncached mapping. The WC issue is easily caught by the
-* readback check when writing GTT PTE entries.
-*/
-   if (IS_GEN9_LP(i915) || GRAPHICS_VER(i915) >= 11)
-   ggtt->gsm = ioremap(phys_addr, size);
-   else
+   if (needs_wc_ggtt_mapping(i915))
ggtt->gsm = ioremap_wc(phys_addr, size);
+   else
+   ggtt->gsm = ioremap(phys_addr, size);
+
if (!ggtt->gsm) {
drm_err(&i915->drm, "Failed to map the ggtt page table\n");
return -ENOMEM;
-- 
2.41.0



Re: [Intel-gfx] [PATCH] drm/i915/mtl: Don't set PIPE_CONTROL_FLUSH_L3

2023-10-17 Thread Nirmoy Das



On 10/17/2023 4:42 PM, Andi Shyti wrote:

Hi Vinay,


This bit does not cause an explicit L3 flush. We already use

At all? Or only on newer hardware? And as a genuine spec change or as a
bug / workaround?

If the hardware has re-purposed the bit then it is probably worth at
least adding a comment to the bit definition to say that it is only
valid up to IP version 12.70.

At this point, this is a bug on MTL since this bit is not related to L3
flushes as per spec. Regarding older platforms, still checking the reason
why this was added (i.e if it fixed something and will regress if removed).
If not, we can extend the change for others as well in a separate patch. On
older platforms, this bit seems to cause an implicit flush at best.

PIPE_CONTROL_DC_FLUSH_ENABLE for that purpose.

Cc: Nirmoy Das 
Cc: Mikka Kuoppala 
Signed-off-by: Vinay Belgaumkar 
---
   drivers/gpu/drm/i915/gt/gen8_engine_cs.c | 8 ++--
   1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
index ba4c2422b340..abbc02f3e66e 100644
--- a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
@@ -247,6 +247,7 @@ static int mtl_dummy_pipe_control(struct
i915_request *rq)
   int gen12_emit_flush_rcs(struct i915_request *rq, u32 mode)
   {
   struct intel_engine_cs *engine = rq->engine;
+    struct intel_gt *gt = rq->engine->gt;
     /*
    * On Aux CCS platforms the invalidation of the Aux
@@ -278,7 +279,8 @@ int gen12_emit_flush_rcs(struct i915_request
*rq, u32 mode)
    * deals with Protected Memory which is not needed for
    * AUX CCS invalidation and lead to unwanted side effects.
    */
-    if (mode & EMIT_FLUSH)
+    if ((mode & EMIT_FLUSH) &&
+    !(IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 70), IP_VER(12, 71

Why stop at 12.71? Is the meaning only changed for 12.70 and the
old/correct version will be restored in later hardware?

Was trying to keep this limited to MTL for now until the above statements
are verified.

I'm not fully convinced here... this is not what the hardware spec
says. Am I reading the specs wrong?


The main issue is that we are relying on a side effect of that bit to flush L3, 
but that is not its primary task.

Unless there is a WA specifically documented for MTL that uses that bit to 
flush L3, I see no reason to use it on MTL or later platforms.



Regards,

Nirmoy





Is there any ongoing discussion with the hardware developers?

Andi


Re: [Intel-gfx] [PATCH] drm/i915/mtl: Don't set PIPE_CONTROL_FLUSH_L3

2023-10-17 Thread Nirmoy Das



On 10/17/2023 2:23 AM, Belgaumkar, Vinay wrote:


On 10/16/2023 4:24 PM, John Harrison wrote:

On 10/16/2023 15:55, Vinay Belgaumkar wrote:

This bit does not cause an explicit L3 flush. We already use
At all? Or only on newer hardware? And as a genuine spec change or as 
a bug / workaround?


If the hardware has re-purposed the bit then it is probably worth at 
least adding a comment to the bit definition to say that it is only 
valid up to IP version 12.70.
At this point, this is a bug on MTL since this bit is not related to 
L3 flushes as per spec. Regarding older platforms, still checking the 
reason why this was added (i.e if it fixed something and will regress 
if removed). If not, we can extend the change for others as well in a 
separate patch. On older platforms, this bit seems to cause an 
implicit flush at best.



PIPE_CONTROL_DC_FLUSH_ENABLE for that purpose.

Cc: Nirmoy Das 
Cc: Mikka Kuoppala 
Signed-off-by: Vinay Belgaumkar 
---
  drivers/gpu/drm/i915/gt/gen8_engine_cs.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c 
b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c

index ba4c2422b340..abbc02f3e66e 100644
--- a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
@@ -247,6 +247,7 @@ static int mtl_dummy_pipe_control(struct 
i915_request *rq)

  int gen12_emit_flush_rcs(struct i915_request *rq, u32 mode)
  {
  struct intel_engine_cs *engine = rq->engine;
+    struct intel_gt *gt = rq->engine->gt;
    /*
   * On Aux CCS platforms the invalidation of the Aux
@@ -278,7 +279,8 @@ int gen12_emit_flush_rcs(struct i915_request 
*rq, u32 mode)

   * deals with Protected Memory which is not needed for
   * AUX CCS invalidation and lead to unwanted side effects.
   */
-    if (mode & EMIT_FLUSH)
+    if ((mode & EMIT_FLUSH) &&
+    !(IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 70), IP_VER(12, 71
Why stop at 12.71? Is the meaning only changed for 12.70 and the 
old/correct version will be restored in later hardware?


Was trying to keep this limited to MTL for now until the above 
statements are verified.


If we don't need it for MTL, we are most likely not going to need it 
after MTL either. It should be fine if you make it >= MTL until
we get more clarity about older platforms.
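
To make the difference between the posted range check and the ">= MTL" suggestion concrete, here is a minimal standalone sketch. Everything in it is an assumption made for illustration: the IP_VER() packing, the flag bit values, and the helper names are hypothetical stand-ins rather than the i915 definitions, and the range check is assumed to be inclusive at both ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packing of an IP version as (version << 8) | release. */
#define IP_VER(ver, rel) ((uint32_t)(((ver) << 8) | (rel)))

/* Hypothetical flag bits standing in for the PIPE_CONTROL_* flags. */
#define FLAG_CS_STALL (1u << 0)
#define FLAG_DC_FLUSH (1u << 1)
#define FLAG_FLUSH_L3 (1u << 2)

/* Inclusive range check, assumed to mirror a [from, until] IP range test. */
static bool ip_in_range(uint32_t ip, uint32_t from, uint32_t until)
{
	assert(from <= until);
	return ip >= from && ip <= until;
}

/* As posted: drop the L3 bit only for IP versions 12.70 through 12.71. */
static uint32_t flush_flags_ranged(uint32_t ip)
{
	uint32_t flags = FLAG_CS_STALL | FLAG_DC_FLUSH;

	if (!ip_in_range(ip, IP_VER(12, 70), IP_VER(12, 71)))
		flags |= FLAG_FLUSH_L3;

	return flags;
}

/* As suggested in review: drop the L3 bit for 12.70 and everything newer. */
static uint32_t flush_flags_open_ended(uint32_t ip)
{
	uint32_t flags = FLAG_CS_STALL | FLAG_DC_FLUSH;

	if (ip < IP_VER(12, 70))
		flags |= FLAG_FLUSH_L3;

	return flags;
}
```

The two variants agree for everything up to 12.71 and differ only for later IP versions, where the closed range would silently bring the bit back.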


With that,  this is

Reviewed-by: Nirmoy Das 



Thanks,

Vinay.



John.



  bit_group_1 |= PIPE_CONTROL_FLUSH_L3;
    bit_group_1 |= PIPE_CONTROL_TILE_CACHE_FLUSH;
@@ -812,12 +814,14 @@ u32 *gen12_emit_fini_breadcrumb_rcs(struct 
i915_request *rq, u32 *cs)

  u32 flags = (PIPE_CONTROL_CS_STALL |
   PIPE_CONTROL_TLB_INVALIDATE |
   PIPE_CONTROL_TILE_CACHE_FLUSH |
- PIPE_CONTROL_FLUSH_L3 |
   PIPE_CONTROL_RENDER_TARGET_CACHE_FLUSH |
   PIPE_CONTROL_DEPTH_CACHE_FLUSH |
   PIPE_CONTROL_DC_FLUSH_ENABLE |
   PIPE_CONTROL_FLUSH_ENABLE);
  +    if (!(IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 70), IP_VER(12, 71
+    flags |= PIPE_CONTROL_FLUSH_L3;
+
  /* Wa_14016712196 */
  if (IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 70), IP_VER(12, 71)) || 
IS_DG2(i915))

  /* dummy PIPE_CONTROL + depth flush */




Re: [PATCH v2] drm/i915: Flush WC GGTT only on required platforms

2023-10-16 Thread Nirmoy Das

Hi Andi,

On 10/14/2023 10:51 AM, Andi Shyti wrote:

Hi Nirmoy,

On Fri, Oct 13, 2023 at 03:44:39PM +0200, Nirmoy Das wrote:

gen8_ggtt_invalidate() is only needed for limited set of platforms
where GGTT is mapped as WC otherwise this can cause unwanted
side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not
valid.

v2: Add a func to detect wc ggtt detection (Ville)

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Joonas Lahtinen 
Cc: Jani Nikula 
Cc: Jonathan Cavitt 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Ville Syrjälä 
Cc:  # v6.2+
Suggested-by: Matt Roper 
Signed-off-by: Nirmoy Das 
Acked-by: Andi Shyti 

I took some time to look at this and you can swap the a-b with
an r-b:

Reviewed-by: Andi Shyti 


Thanks! I am going to resend one more revision that cites the commit which 
started using this register for WC mappings.



Regards,

Nirmoy



Thanks,
Andi


Re: [PATCH] drm/i915: Flush WC GGTT only on required platforms

2023-10-16 Thread Nirmoy Das



On 10/13/2023 6:15 PM, Daniel Vetter wrote:

On Fri, Oct 13, 2023 at 02:28:21PM +0200, Nirmoy Das wrote:

Hi Ville,

On 10/13/2023 12:50 PM, Ville Syrjälä wrote:

On Fri, Oct 13, 2023 at 12:31:40PM +0200, Nirmoy Das wrote:

gen8_ggtt_invalidate() is only needed for limitted set of platforms
where GGTT is mapped as WC

I know there is supposed to be some kind hw snooping of the ggtt
pte writes to invalidate the tlb, but are we sure GFX_FLSH_CNTL
has no other side effects we depend on?

I spent some time searching through the gfxspec. This GFX_FLSH_CNTL register
only seems to be for invalidating the TLB for the GUnit, and (from git log) we
started to do that to enable WC-based GGTT updates.

Might be good to cite the relevant git commits in the commit message to
make this clear.


Yes, I should. It took me a while to find it. Going to add that and 
resend the patch.



Thanks,

Nirmoy


-Sima



So if I am not missing anything obvious then this should be safe.


Regards,

Nirmoy


otherwise this can cause unwanted
side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not
valid.

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Joonas Lahtinen 
Cc: Jani Nikula 
Cc: Jonathan Cavitt 
Cc: John Harrison 
Cc: Andi Shyti 
Cc:  # v6.2+
Suggested-by: Matt Roper 
Signed-off-by: Nirmoy Das 
---
   drivers/gpu/drm/i915/gt/intel_ggtt.c | 6 +-
   1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c 
b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 4d7d88b92632..c2858d434bce 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -197,13 +197,17 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt)
   static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt)
   {
+   struct drm_i915_private *i915 = ggtt->vm.i915;
struct intel_uncore *uncore = ggtt->vm.gt->uncore;
/*
 * Note that as an uncached mmio write, this will flush the
 * WCB of the writes into the GGTT before it triggers the invalidate.
+*
+* Only perform this when GGTT is mapped as WC, see ggtt_probe_common().
 */
-   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN);
+   if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11)
+   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, 
GFX_FLSH_CNTL_EN);
   }
   static void guc_ggtt_invalidate(struct i915_ggtt *ggtt)
--
2.41.0


[PATCH v2] drm/i915: Flush WC GGTT only on required platforms

2023-10-13 Thread Nirmoy Das
gen8_ggtt_invalidate() is only needed for limited set of platforms
where GGTT is mapped as WC otherwise this can cause unwanted
side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not
valid.

v2: Add a func to detect wc ggtt detection (Ville)

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Joonas Lahtinen 
Cc: Jani Nikula 
Cc: Jonathan Cavitt 
Cc: John Harrison 
Cc: Andi Shyti 
Cc: Ville Syrjälä 
Cc:  # v6.2+
Suggested-by: Matt Roper 
Signed-off-by: Nirmoy Das 
Acked-by: Andi Shyti 
---
 drivers/gpu/drm/i915/gt/intel_ggtt.c | 35 +++-
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c 
b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 4d7d88b92632..401667f83f96 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -195,6 +195,21 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt)
spin_unlock_irq(&uncore->lock);
 }
 
+static bool needs_wc_ggtt_mapping(struct drm_i915_private *i915)
+{
+   /*
+* On BXT+/ICL+ writes larger than 64 bit to the GTT pagetable range
+* will be dropped. For WC mappings in general we have 64 byte burst
+* writes when the WC buffer is flushed, so we can't use it, but have to
+* resort to an uncached mapping. The WC issue is easily caught by the
+* readback check when writing GTT PTE entries.
+*/
+   if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11)
+   return true;
+
+   return false;
+}
+
 static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt)
 {
struct intel_uncore *uncore = ggtt->vm.gt->uncore;
@@ -202,8 +217,12 @@ static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt)
/*
 * Note that as an uncached mmio write, this will flush the
 * WCB of the writes into the GGTT before it triggers the invalidate.
+*
+* Only perform this when GGTT is mapped as WC, see ggtt_probe_common().
 */
-   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN);
+   if (needs_wc_ggtt_mapping(ggtt->vm.i915))
+   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6,
+ GFX_FLSH_CNTL_EN);
 }
 
 static void guc_ggtt_invalidate(struct i915_ggtt *ggtt)
@@ -1126,17 +1145,11 @@ static int ggtt_probe_common(struct i915_ggtt *ggtt, 
u64 size)
GEM_WARN_ON(pci_resource_len(pdev, GEN4_GTTMMADR_BAR) != 
gen6_gttmmadr_size(i915));
phys_addr = pci_resource_start(pdev, GEN4_GTTMMADR_BAR) + 
gen6_gttadr_offset(i915);
 
-   /*
-* On BXT+/ICL+ writes larger than 64 bit to the GTT pagetable range
-* will be dropped. For WC mappings in general we have 64 byte burst
-* writes when the WC buffer is flushed, so we can't use it, but have to
-* resort to an uncached mapping. The WC issue is easily caught by the
-* readback check when writing GTT PTE entries.
-*/
-   if (IS_GEN9_LP(i915) || GRAPHICS_VER(i915) >= 11)
-   ggtt->gsm = ioremap(phys_addr, size);
-   else
+   if (needs_wc_ggtt_mapping(i915))
ggtt->gsm = ioremap_wc(phys_addr, size);
+   else
+   ggtt->gsm = ioremap(phys_addr, size);
+
if (!ggtt->gsm) {
drm_err(&i915->drm, "Failed to map the ggtt page table\n");
return -ENOMEM;
-- 
2.41.0



Re: [PATCH] drm/i915: Flush WC GGTT only on required platforms

2023-10-13 Thread Nirmoy Das



On 10/13/2023 2:55 PM, Ville Syrjälä wrote:

On Fri, Oct 13, 2023 at 02:28:21PM +0200, Nirmoy Das wrote:

Hi Ville,

On 10/13/2023 12:50 PM, Ville Syrjälä wrote:

On Fri, Oct 13, 2023 at 12:31:40PM +0200, Nirmoy Das wrote:

gen8_ggtt_invalidate() is only needed for limitted set of platforms
where GGTT is mapped as WC

I know there is supposed to be some kind hw snooping of the ggtt
pte writes to invalidate the tlb, but are we sure GFX_FLSH_CNTL
has no other side effects we depend on?

I spent some time searching through the gfxspec. This GFX_FLSH_CNTL register
only seems to be for invalidating the TLB for the GUnit, and (from git log) we
started to do that to enable WC-based GGTT updates.


So if I am not missing anything obvious then this should be safe.

OK.

The only code related complaint I have is that you are now duplicating
that same platform check in two different places. It's always better to
have a single point of truth instead of two or more, so that there is
no risk of introducing bugs due to mismatches.


I agree. I will resend with a static helper function to detect that.
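
The "single point of truth" concern can be illustrated with a small standalone sketch. It is not the driver code: struct fake_device, enum map_type, and the two caller functions are hypothetical names invented here, and only the platform condition itself is taken from the patch under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the device description; not the i915 struct. */
struct fake_device {
	int graphics_ver;  /* e.g. 9, 11, 12 */
	bool is_gen9_lp;   /* low-power gen9 parts */
};

enum map_type { MAP_UNCACHED, MAP_WC };

/* The one place that encodes which platforms map the GGTT as WC. */
static bool needs_wc_ggtt_mapping(const struct fake_device *dev)
{
	assert(dev != NULL);
	return !dev->is_gen9_lp && dev->graphics_ver < 11;
}

/* Probe path: choose the mapping type from the shared predicate. */
static enum map_type pick_ggtt_mapping(const struct fake_device *dev)
{
	return needs_wc_ggtt_mapping(dev) ? MAP_WC : MAP_UNCACHED;
}

/* Invalidate path: flush only when the mapping is WC, using the same predicate. */
static bool should_flush_on_invalidate(const struct fake_device *dev)
{
	return needs_wc_ggtt_mapping(dev);
}
```

Because both callers go through the same predicate, the mapping choice and the flush decision cannot drift apart if the platform list changes later.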


Thanks,

Nirmoy





Regards,

Nirmoy


otherwise this can cause unwanted
side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not
valid.

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Joonas Lahtinen 
Cc: Jani Nikula 
Cc: Jonathan Cavitt 
Cc: John Harrison 
Cc: Andi Shyti 
Cc:  # v6.2+
Suggested-by: Matt Roper 
Signed-off-by: Nirmoy Das 
---
   drivers/gpu/drm/i915/gt/intel_ggtt.c | 6 +-
   1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c 
b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 4d7d88b92632..c2858d434bce 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -197,13 +197,17 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt)
   
   static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt)

   {
+   struct drm_i915_private *i915 = ggtt->vm.i915;
struct intel_uncore *uncore = ggtt->vm.gt->uncore;
   
   	/*

 * Note that as an uncached mmio write, this will flush the
 * WCB of the writes into the GGTT before it triggers the invalidate.
+*
+* Only perform this when GGTT is mapped as WC, see ggtt_probe_common().
 */
-   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN);
+   if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11)
+   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, 
GFX_FLSH_CNTL_EN);
   }
   
   static void guc_ggtt_invalidate(struct i915_ggtt *ggtt)

--
2.41.0


Re: [PATCH] drm/i915: Flush WC GGTT only on required platforms

2023-10-13 Thread Nirmoy Das

Hi Ville,

On 10/13/2023 12:50 PM, Ville Syrjälä wrote:

On Fri, Oct 13, 2023 at 12:31:40PM +0200, Nirmoy Das wrote:

gen8_ggtt_invalidate() is only needed for limitted set of platforms
where GGTT is mapped as WC

I know there is supposed to be some kind hw snooping of the ggtt
pte writes to invalidate the tlb, but are we sure GFX_FLSH_CNTL
has no other side effects we depend on?


I spent some time searching through the gfxspec. This GFX_FLSH_CNTL 
register only seems to be for invalidating the TLB for the GUnit, and 
(from git log) we started to do that to enable WC-based GGTT updates.



So if I am not missing anything obvious then this should be safe.


Regards,

Nirmoy




otherwise this can cause unwanted
side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not
valid.

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Joonas Lahtinen 
Cc: Jani Nikula 
Cc: Jonathan Cavitt 
Cc: John Harrison 
Cc: Andi Shyti 
Cc:  # v6.2+
Suggested-by: Matt Roper 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_ggtt.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c 
b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 4d7d88b92632..c2858d434bce 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -197,13 +197,17 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt)
  
  static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt)

  {
+   struct drm_i915_private *i915 = ggtt->vm.i915;
struct intel_uncore *uncore = ggtt->vm.gt->uncore;
  
  	/*

 * Note that as an uncached mmio write, this will flush the
 * WCB of the writes into the GGTT before it triggers the invalidate.
+*
+* Only perform this when GGTT is mapped as WC, see ggtt_probe_common().
 */
-   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN);
+   if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11)
+   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, 
GFX_FLSH_CNTL_EN);
  }
  
  static void guc_ggtt_invalidate(struct i915_ggtt *ggtt)

--
2.41.0


Re: [PATCH] drm/i915: Flush WC GGTT only on required platforms

2023-10-13 Thread Nirmoy Das



On 10/13/2023 12:35 PM, Andi Shyti wrote:

Hi Nirmoy,

On Fri, Oct 13, 2023 at 12:31:40PM +0200, Nirmoy Das wrote:

gen8_ggtt_invalidate() is only needed for limitted set of platforms

/limitted/limited/


Added "autocmd FileType gitcommit setlocal spell" to my vim config. 
Wish I had known about it before.





where GGTT is mapped as WC otherwise this can cause unwanted
side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not
valid.

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Joonas Lahtinen 
Cc: Jani Nikula 
Cc: Jonathan Cavitt 
Cc: John Harrison 
Cc: Andi Shyti 
Cc:  # v6.2+
Suggested-by: Matt Roper 
Signed-off-by: Nirmoy Das 

Acked-by: Andi Shyti 


Thanks,

Nirmoy



Andi


[PATCH] drm/i915: Flush WC GGTT only on required platforms

2023-10-13 Thread Nirmoy Das
gen8_ggtt_invalidate() is only needed for limitted set of platforms
where GGTT is mapped as WC otherwise this can cause unwanted
side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not
valid.

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Joonas Lahtinen 
Cc: Jani Nikula 
Cc: Jonathan Cavitt 
Cc: John Harrison 
Cc: Andi Shyti 
Cc:  # v6.2+
Suggested-by: Matt Roper 
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_ggtt.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c 
b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 4d7d88b92632..c2858d434bce 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -197,13 +197,17 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt)
 
 static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt)
 {
+   struct drm_i915_private *i915 = ggtt->vm.i915;
struct intel_uncore *uncore = ggtt->vm.gt->uncore;
 
/*
 * Note that as an uncached mmio write, this will flush the
 * WCB of the writes into the GGTT before it triggers the invalidate.
+*
+* Only perform this when GGTT is mapped as WC, see ggtt_probe_common().
 */
-   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN);
+   if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11)
+   intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, 
GFX_FLSH_CNTL_EN);
 }
 
 static void guc_ggtt_invalidate(struct i915_ggtt *ggtt)
-- 
2.41.0



[PATCH v2] drm/i915: Prevent potential null-ptr-deref in engine_init_common

2023-10-11 Thread Nirmoy Das
If measure_breadcrumb_dw() returns an error and bce isn't created,
this commit ensures that intel_engine_destroy_pinned_context()
is not called with a NULL bce.

v2: Fix the subject s/UAF/null-ptr-deref(Jani)

Fixes: b35274993680 ("drm/i915: Create a kernel context for GGTT updates")
Cc: Oak Zeng 
Cc: Andi Shyti 
Cc: Jani Nikula 
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 179d9546865b..4a11219e560e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1491,7 +1491,8 @@ static int engine_init_common(struct intel_engine_cs 
*engine)
return 0;
 
 err_bce_context:
-   intel_engine_destroy_pinned_context(bce);
+   if (bce)
+   intel_engine_destroy_pinned_context(bce);
 err_ce_context:
intel_engine_destroy_pinned_context(ce);
return ret;
-- 
2.41.0



Re: [PATCH] drm/i915: Prevent potential UAF in engine_init_common

2023-10-11 Thread Nirmoy Das

Hi Andi,

On 10/11/2023 2:22 PM, Andi Shyti wrote:

Hi Nirmoy,

On Wed, Oct 11, 2023 at 01:54:51PM +0200, Nirmoy Das wrote:

If measure_breadcrumb_dw() returns an error and bce isn't created,
this commit ensures that intel_engine_destroy_pinned_context()
is not called with a NULL bce.

Fixes: b35274993680 ("drm/i915: Create a kernel context for GGTT updates")
Cc: Oak Zeng 
Cc: Andi Shyti 
Signed-off-by: Nirmoy Das 

Reviewed-by: Andi Shyti 


Resent with a fixed subject. Please check again.


Thanks,

Nirmoy



Andi


Re: [PATCH] drm/i915: Prevent potential UAF in engine_init_common

2023-10-11 Thread Nirmoy Das



On 10/11/2023 2:20 PM, Jani Nikula wrote:

On Wed, 11 Oct 2023, Nirmoy Das  wrote:

If measure_breadcrumb_dw() returns an error and bce isn't created,
this commit ensures that intel_engine_destroy_pinned_context()
is not called with a NULL bce.

So it's a potential NULL pointer dereference, not use after free like
the subject says. Please fix the subject.


ah right. I will resend.
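
The distinction matters because the failure mode is calling a destroy helper on a pointer that was never set, not touching freed memory. A minimal standalone sketch of the guarded unwinding follows; the struct, the helper names, and the two boolean knobs are hypothetical and only mimic the shape of the error path in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical context object standing in for a pinned context. */
struct ctx { int id; };

static int live_contexts; /* counts contexts that have not been destroyed */

static struct ctx *ctx_create(int id)
{
	struct ctx *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->id = id;
	live_contexts++;
	return c;
}

/* Dereferences its argument, so passing NULL here is a bug. */
static void ctx_destroy(struct ctx *c)
{
	assert(c != NULL);
	c->id = -1;
	free(c);
	live_contexts--;
}

/*
 * want_second: whether the optional second context is created at all.
 * fail_late:   simulate a failure after both creation steps.
 */
static int engine_init(bool want_second, bool fail_late)
{
	struct ctx *ce, *bce = NULL;
	int ret;

	ce = ctx_create(1);
	if (!ce)
		return -1;

	if (want_second) {
		bce = ctx_create(2);
		if (!bce) {
			ret = -1;
			goto err_ce;
		}
	}

	if (fail_late) {
		ret = -2;
		goto err_bce;
	}

	/* A real driver would keep the contexts; the sketch releases them. */
	if (bce)
		ctx_destroy(bce);
	ctx_destroy(ce);
	return 0;

err_bce:
	if (bce) /* bce is optional, so it may legitimately still be NULL */
		ctx_destroy(bce);
err_ce:
	ctx_destroy(ce);
	return ret;
}
```

Without the NULL check at the err_bce label, the late-failure path with want_second == false would hand NULL to ctx_destroy().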


Thanks,

Nirmoy



BR,
Jani.



Fixes: b35274993680 ("drm/i915: Create a kernel context for GGTT updates")
Cc: Oak Zeng 
Cc: Andi Shyti 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_engine_cs.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 179d9546865b..4a11219e560e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1491,7 +1491,8 @@ static int engine_init_common(struct intel_engine_cs 
*engine)
return 0;
  
  err_bce_context:
-   intel_engine_destroy_pinned_context(bce);
+   if (bce)
+   intel_engine_destroy_pinned_context(bce);
  err_ce_context:
intel_engine_destroy_pinned_context(ce);
return ret;


[PATCH] drm/i915: Prevent potential UAF in engine_init_common

2023-10-11 Thread Nirmoy Das
If measure_breadcrumb_dw() returns an error and bce isn't created,
this commit ensures that intel_engine_destroy_pinned_context()
is not called with a NULL bce.

Fixes: b35274993680 ("drm/i915: Create a kernel context for GGTT updates")
Cc: Oak Zeng 
Cc: Andi Shyti 
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 179d9546865b..4a11219e560e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1491,7 +1491,8 @@ static int engine_init_common(struct intel_engine_cs *engine)
return 0;
 
 err_bce_context:
-   intel_engine_destroy_pinned_context(bce);
+   if (bce)
+   intel_engine_destroy_pinned_context(bce);
 err_ce_context:
intel_engine_destroy_pinned_context(ce);
return ret;
-- 
2.41.0
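
For readers following the thread, the guard added by this patch can be sketched in plain userspace C. This is only an illustration of the goto-unwind pattern: struct ctx, destroy_ctx(), init_common() and the fail_stage parameter are hypothetical names invented here, not i915 code. The point is that the error label can be reached before bce exists, and the teardown helper dereferences its argument, so the label needs a NULL check.

```c
#include <stddef.h>

/* Hypothetical stand-in for a pinned context; not the i915 type. */
struct ctx {
	int pinned;
};

/* Counts teardown calls so the unwind path can be observed. */
static int destroyed;

/* Like the destroy helper in the patch, this dereferences its argument. */
static void destroy_ctx(struct ctx *c)
{
	c->pinned = 0;
	destroyed++;
}

/*
 * fail_stage selects where initialisation fails:
 *   0 = success, 1 = before bce is created, 2 = after bce is created.
 */
static int init_common(int fail_stage)
{
	struct ctx ce_storage = { 1 }, bce_storage = { 1 };
	struct ctx *ce = &ce_storage;
	struct ctx *bce = NULL;
	int ret = 0;

	if (fail_stage == 1) {
		ret = -1;
		goto err_bce_context;
	}

	bce = &bce_storage;
	if (fail_stage == 2) {
		ret = -2;
		goto err_bce_context;
	}

	return 0;

err_bce_context:
	if (bce)	/* the guard: skip teardown when bce was never created */
		destroy_ctx(bce);
	destroy_ctx(ce);
	return ret;
}
```

Calling init_common(1) tears down only ce, while init_common(2) tears down both contexts. Without the if (bce) check, stage 1 would pass NULL to destroy_ctx(), which is the NULL pointer dereference Jani describes.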



Re: [Intel-gfx] [PATCH v2] drm/i915: Reduce MCR lock surface

2023-10-04 Thread Nirmoy Das

Hi Rodrigo,

On 10/4/2023 4:37 PM, Rodrigo Vivi wrote:

On Wed, Oct 04, 2023 at 03:54:59PM +0200, Nirmoy Das wrote:

Hi Rodrigo,

On 10/4/2023 2:44 PM, Rodrigo Vivi wrote:

On Wed, Oct 04, 2023 at 02:04:07PM +0200, Nirmoy Das wrote:

Take the MCR lock only when the driver needs to write to MCR-based
TLB registers.

To prevent GT reset interference, employ gt->reset.mutex instead, since
intel_gt_mcr_multicast_write relies on gt->uncore->lock not being held.

This looks a lot like protecting code and not protecting data [1]

But to be really honest I'm afraid we were already doing this before
this patch but with 2 other locks instead.

I haven't thought about that but yes, the issue was there already.



[1] - https://blog.ffwll.ch/2022/07/locking-engineering.html


v2: remove unused var, flags.

Signed-off-by: Nirmoy Das 
---
   drivers/gpu/drm/i915/gt/intel_tlb.c | 13 +
   1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.c 
b/drivers/gpu/drm/i915/gt/intel_tlb.c
index 139608c30d97..0ad905df4a98 100644
--- a/drivers/gpu/drm/i915/gt/intel_tlb.c
+++ b/drivers/gpu/drm/i915/gt/intel_tlb.c
@@ -52,15 +52,13 @@ static void mmio_invalidate_full(struct intel_gt *gt)
struct intel_engine_cs *engine;
intel_engine_mask_t awake, tmp;
enum intel_engine_id id;
-   unsigned long flags;
if (GRAPHICS_VER(i915) < 8)
return;
intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL);
-   intel_gt_mcr_lock(gt, &flags);
-   spin_lock(&uncore->lock); /* serialise invalidate with GT reset */
+   mutex_lock(&gt->reset.mutex);/* serialise invalidate with GT reset */

I'm still looking at this and the commit message above and trying to understand
why we are doing this and changing the previous 2 by this other one. why?


We need the MCR lock only for intel_gt_mcr_multicast_*(), so I am not
replacing the two locks here but moving the MCR lock down to where we
were doing intel_gt_mcr_multicast_write_fw().


Why s/spin_lock(&uncore->lock)/mutex_lock(&gt->reset.mutex):

intel_gt_mcr_multicast_*() expects gt->uncore->lock to be not held

is there any lockdep assert or primitive that we could/should do
that to avoid this same issue in the future?


We have lockdep asserts for those mcr functions.



anyway, this is also another thing that it is important for the
commit message.

and why is that? what I have in mind goes along with the comment
above intel_de_read_fw():
"""
Access to registers should
  * therefore generally be serialised, by either the dev_priv->uncore.lock or
"""


Yes, the commit message should've been more clear.


Anyway, please ignore this patch. I need to find a better way, and it
also didn't completely fix the issue I was working on.



Thanks,

Nirmoy




and to
achieve this, I could do something like:

if (engine->tlb_inv.mcr) {
    spin_unlock(&uncore->lock);
    intel_gt_mcr_lock(gt, &flags);
    intel_gt_mcr_multicast_write_fw
    intel_gt_mcr_unlock(gt, flags);
    spin_lock(&uncore->lock);
}

Or take gt->reset.mutex instead which should block any concurrent gt reset.

If this is not acceptable then I can pick the first option above, but I am not
sure how safe it is to release uncore->lock and then take it back again.

hmm... probably the gt_reset one is better than releasing and grabbing it
again.


awake = 0;
for_each_engine(engine, gt, id) {
@@ -68,9 +66,9 @@ static void mmio_invalidate_full(struct intel_gt *gt)
continue;
if (engine->tlb_inv.mcr)
-   intel_gt_mcr_multicast_write_fw(gt,
-                   engine->tlb_inv.reg.mcr_reg,
-                   engine->tlb_inv.request);
+   intel_gt_mcr_multicast_write(gt,
+                engine->tlb_inv.reg.mcr_reg,
+                engine->tlb_inv.request);

you are already taking the forcewake_all domain above, so you wouldn't
need to convert this to the variant that grabs the forcewake underneath.

Also this is not mentioned in the commit message above.

intel_gt_mcr_multicast_write() takes the mcr lock for us, which helps replace
multiple lines with one.
Will there be any side effects from that?

hmm... I can't foresee side-effects here... but I'm asking myself why on the non
MCR ones we are using the global forcewake_all and the _fw to start with.
Maybe there was a reason for that? Because in general we should prefer the
non-_fw variants to start with. Maybe we should dig into the history there to
understand why the line below started with intel_uncore_write_fw?


I should've added that to the commit message.

I'm even wondering if this should be 2 separated patches?!


Regards,
Nirmoy



else
intel_

Re: [PATCH v2] drm/i915: Reduce MCR lock surface

2023-10-04 Thread Nirmoy Das

Hi Rodrigo,

On 10/4/2023 2:44 PM, Rodrigo Vivi wrote:

On Wed, Oct 04, 2023 at 02:04:07PM +0200, Nirmoy Das wrote:

Take the MCR lock only when the driver needs to write to MCR-based
TLB registers.

To prevent GT reset interference, employ gt->reset.mutex instead, since
intel_gt_mcr_multicast_write relies on gt->uncore->lock not being held.

This looks a lot like protecting code and not protecting data [1]

But to be really honest I'm afraid we were already doing this before
this patch but with 2 other locks instead.


I haven't thought about that but yes, the issue was there already.




[1] - https://blog.ffwll.ch/2022/07/locking-engineering.html


v2: remove unused var, flags.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_tlb.c | 13 +
  1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.c 
b/drivers/gpu/drm/i915/gt/intel_tlb.c
index 139608c30d97..0ad905df4a98 100644
--- a/drivers/gpu/drm/i915/gt/intel_tlb.c
+++ b/drivers/gpu/drm/i915/gt/intel_tlb.c
@@ -52,15 +52,13 @@ static void mmio_invalidate_full(struct intel_gt *gt)
struct intel_engine_cs *engine;
intel_engine_mask_t awake, tmp;
enum intel_engine_id id;
-   unsigned long flags;
  
  	if (GRAPHICS_VER(i915) < 8)

return;
  
  	intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL);
  
-   intel_gt_mcr_lock(gt, &flags);
-   spin_lock(&uncore->lock); /* serialise invalidate with GT reset */
+   mutex_lock(&gt->reset.mutex);/* serialise invalidate with GT reset */

I'm still looking at this and the commit message above and trying to understand
why we are doing this and changing the previous 2 by this other one. why?



We need the MCR lock only for intel_gt_mcr_multicast_*(), so I am not
replacing the two locks here but moving the MCR lock down to where we
were doing intel_gt_mcr_multicast_write_fw().


Why s/spin_lock(&uncore->lock)/mutex_lock(&gt->reset.mutex):

intel_gt_mcr_multicast_*() expects gt->uncore->lock to be not held and 
to achieve this, I could do something like:


if (engine->tlb_inv.mcr) {
    spin_unlock(&uncore->lock);
    intel_gt_mcr_lock(gt, &flags);
    intel_gt_mcr_multicast_write_fw
    intel_gt_mcr_unlock(gt, flags);
    spin_lock(&uncore->lock);
}

Or take gt->reset.mutex instead which should block any concurrent gt reset.

If this is not acceptable then I can pick the first option above, but I am
not sure how safe it is to release uncore->lock and then take it back again.




  
  	awake = 0;

for_each_engine(engine, gt, id) {
@@ -68,9 +66,9 @@ static void mmio_invalidate_full(struct intel_gt *gt)
continue;
  
  		if (engine->tlb_inv.mcr)
-   intel_gt_mcr_multicast_write_fw(gt,
-                   engine->tlb_inv.reg.mcr_reg,
-                   engine->tlb_inv.request);
+   intel_gt_mcr_multicast_write(gt,
+                engine->tlb_inv.reg.mcr_reg,
+                engine->tlb_inv.request);

you are already taking the forcewake_all domain above, so you wouldn't
need to convert this to the variant that grabs the forcewake underneath.

Also this is not mentioned in the commit message above.


intel_gt_mcr_multicast_write() takes the mcr lock for us, which helps replace
multiple lines with one.
Will there be any side effects from that?

I should've added that to the commit message.

Regards,
Nirmoy





else
intel_uncore_write_fw(uncore,
  engine->tlb_inv.reg.reg,
@@ -90,8 +88,7 @@ static void mmio_invalidate_full(struct intel_gt *gt)
 IS_ALDERLAKE_P(i915)))
intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1);
  
-   spin_unlock(&uncore->lock);
-   intel_gt_mcr_unlock(gt, flags);
+   mutex_unlock(&gt->reset.mutex);
  
  	for_each_engine_masked(engine, gt, awake, tmp) {

if (wait_for_invalidate(engine))
--
2.41.0



[PATCH v2] drm/i915: Reduce MCR lock surface

2023-10-04 Thread Nirmoy Das
Take the MCR lock only when the driver needs to write to MCR-based
TLB registers.

To prevent GT reset interference, employ gt->reset.mutex instead, since
intel_gt_mcr_multicast_write relies on gt->uncore->lock not being held.

v2: remove unused var, flags.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_tlb.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.c 
b/drivers/gpu/drm/i915/gt/intel_tlb.c
index 139608c30d97..0ad905df4a98 100644
--- a/drivers/gpu/drm/i915/gt/intel_tlb.c
+++ b/drivers/gpu/drm/i915/gt/intel_tlb.c
@@ -52,15 +52,13 @@ static void mmio_invalidate_full(struct intel_gt *gt)
struct intel_engine_cs *engine;
intel_engine_mask_t awake, tmp;
enum intel_engine_id id;
-   unsigned long flags;
 
if (GRAPHICS_VER(i915) < 8)
return;
 
intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL);
 
-   intel_gt_mcr_lock(gt, &flags);
-   spin_lock(&uncore->lock); /* serialise invalidate with GT reset */
+   mutex_lock(&gt->reset.mutex);/* serialise invalidate with GT reset */
 
awake = 0;
for_each_engine(engine, gt, id) {
@@ -68,9 +66,9 @@ static void mmio_invalidate_full(struct intel_gt *gt)
continue;
 
if (engine->tlb_inv.mcr)
-   intel_gt_mcr_multicast_write_fw(gt,
-                   engine->tlb_inv.reg.mcr_reg,
-                   engine->tlb_inv.request);
+   intel_gt_mcr_multicast_write(gt,
+                engine->tlb_inv.reg.mcr_reg,
+                engine->tlb_inv.request);
else
intel_uncore_write_fw(uncore,
  engine->tlb_inv.reg.reg,
@@ -90,8 +88,7 @@ static void mmio_invalidate_full(struct intel_gt *gt)
 IS_ALDERLAKE_P(i915)))
intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1);
 
-   spin_unlock(&uncore->lock);
-   intel_gt_mcr_unlock(gt, flags);
+   mutex_unlock(&gt->reset.mutex);
 
for_each_engine_masked(engine, gt, awake, tmp) {
if (wait_for_invalidate(engine))
-- 
2.41.0
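
The lock-ordering constraint discussed in the review (the MCR write helpers expect the uncore lock not to be held, and lockdep asserts check this) can be illustrated with a toy, single-threaded model. Everything below is a sketch under stated assumptions: the boolean flags stand in for real locks, ordering_violations stands in for a lockdep report, and none of the function names are i915 APIs.

```c
#include <stdbool.h>

/* Toy, single-threaded model: these flags stand in for real locks. */
static bool uncore_lock_held;
static bool mcr_lock_held;
static int ordering_violations;	/* what a lockdep report would flag */
static unsigned int mcr_reg;	/* stands in for a multicast register */

static void mcr_lock(void)
{
	/* Models a "must not hold the uncore lock here" assertion. */
	if (uncore_lock_held)
		ordering_violations++;
	mcr_lock_held = true;
}

static void mcr_unlock(void)
{
	mcr_lock_held = false;
}

/* Locked multicast write: takes and drops the MCR lock itself. */
static void mcr_multicast_write(unsigned int val)
{
	mcr_lock();
	mcr_reg = val;
	mcr_unlock();
}

/* Problematic shape: the locked write runs with the uncore lock held. */
static void invalidate_under_uncore_lock(unsigned int val)
{
	uncore_lock_held = true;
	mcr_multicast_write(val);
	uncore_lock_held = false;
}

/* Shape the patch aims for: no uncore lock around the MCR write. */
static void invalidate_without_uncore_lock(unsigned int val)
{
	mcr_multicast_write(val);
}
```

In this model, invalidate_without_uncore_lock() leaves ordering_violations at zero, while invalidate_under_uncore_lock() increments it. The model says nothing about whether gt->reset.mutex is the right replacement; that question is what the review above is debating.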



[PATCH] drm/i915: Reduce MCR lock surface

2023-10-04 Thread Nirmoy Das
Take the MCR lock only when the driver needs to write to MCR-based
TLB registers.

To prevent GT reset interference, employ gt->reset.mutex instead, since
intel_gt_mcr_multicast_write relies on gt->uncore->lock not being held.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_tlb.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.c 
b/drivers/gpu/drm/i915/gt/intel_tlb.c
index 139608c30d97..7b2d9549e595 100644
--- a/drivers/gpu/drm/i915/gt/intel_tlb.c
+++ b/drivers/gpu/drm/i915/gt/intel_tlb.c
@@ -59,8 +59,7 @@ static void mmio_invalidate_full(struct intel_gt *gt)
 
intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL);
 
-   intel_gt_mcr_lock(gt, &flags);
-   spin_lock(&uncore->lock); /* serialise invalidate with GT reset */
+   mutex_lock(&gt->reset.mutex);/* serialise invalidate with GT reset */
 
awake = 0;
for_each_engine(engine, gt, id) {
@@ -68,9 +67,9 @@ static void mmio_invalidate_full(struct intel_gt *gt)
continue;
 
if (engine->tlb_inv.mcr)
-   intel_gt_mcr_multicast_write_fw(gt,
-                   engine->tlb_inv.reg.mcr_reg,
-                   engine->tlb_inv.request);
+   intel_gt_mcr_multicast_write(gt,
+                engine->tlb_inv.reg.mcr_reg,
+                engine->tlb_inv.request);
else
intel_uncore_write_fw(uncore,
  engine->tlb_inv.reg.reg,
@@ -90,8 +89,7 @@ static void mmio_invalidate_full(struct intel_gt *gt)
 IS_ALDERLAKE_P(i915)))
intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1);
 
-   spin_unlock(&uncore->lock);
-   intel_gt_mcr_unlock(gt, flags);
+   mutex_unlock(&gt->reset.mutex);
 
for_each_engine_masked(engine, gt, awake, tmp) {
if (wait_for_invalidate(engine))
-- 
2.41.0



Re: [PATCH] drm/i915: Invalidate all GTs in flush_tlb_invalidate()

2023-10-02 Thread Nirmoy Das

Hi Andi,

On 10/2/2023 3:45 PM, Andi Shyti wrote:

Hi Nirmoy,

On Mon, Oct 02, 2023 at 02:20:32PM +0200, Nirmoy Das wrote:

Don't return early if one of the GTs doesn't require
any flushing.

Fixes: d6c531ab4820 ("drm/i915: Invalidate the TLBs on each GT")
Cc: Chris Wilson 
Cc: Fei Yang 
Cc: Mauro Carvalho Chehab 
Cc: Andi Shyti 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Nirmoy Das 
Cc: Matthew Auld 
Cc: "Ville Syrjälä" 
Cc: Andrzej Hajda 
Cc: "Christian König" 
Cc: Jonathan Cavitt 
Cc: Matt Roper 
Cc: intel-...@lists.freedesktop.org
Signed-off-by: Nirmoy Das 

This patch has been sent already here:

https://patchwork.freedesktop.org/patch/560309/?series=124472&rev=1

Can we take it from there?


I missed it. Sure, that works for me.


Thanks,

Nirmoy



Andi


Re: [Intel-gfx] [PATCH] drm/i915: Invalidate the TLBs on each GT

2023-10-02 Thread Nirmoy Das

+dri-devel

On 10/2/2023 4:07 PM, Jonathan Cavitt wrote:

From: Chris Wilson 

With multi-GT devices, the object may have been bound on each GT and so
we need to invalidate the TLBs across all GT before releasing the pages
back to the system.

Fixes: d6c531ab4820 ("drm/i915: Invalidate the TLBs on each GT")
Signed-off-by: Chris Wilson 
Signed-off-by: Jonathan Cavitt 
CC: Matt Roper 
CC: Andi Shyti 
Reviewed-by: Andi Shyti 


Reviewed-by: Nirmoy Das 

I can't find it on dri-devel; if you didn't already, please Cc
dri-devel@lists.freedesktop.org for changes that touch gt/gem files.



Regards,

Nirmoy


---
  drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c 
b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
index 6b6d22c194110..0ba955611dfb5 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
@@ -198,7 +198,7 @@ static void flush_tlb_invalidate(struct drm_i915_gem_object 
*obj)
  
  	for_each_gt(gt, i915, id) {

if (!obj->mm.tlb[id])
-   return;
+   continue;
  
  		intel_gt_invalidate_tlb_full(gt, obj->mm.tlb[id]);

obj->mm.tlb[id] = 0;


[PATCH] drm/i915: Invalidate all GTs in flush_tlb_invalidate()

2023-10-02 Thread Nirmoy Das
Don't return early if one of the GTs doesn't require
any flushing.

Fixes: d6c531ab4820 ("drm/i915: Invalidate the TLBs on each GT")
Cc: Chris Wilson 
Cc: Fei Yang 
Cc: Mauro Carvalho Chehab 
Cc: Andi Shyti 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Nirmoy Das 
Cc: Matthew Auld 
Cc: "Ville Syrjälä" 
Cc: Andrzej Hajda 
Cc: "Christian König" 
Cc: Jonathan Cavitt 
Cc: Matt Roper 
Cc: intel-...@lists.freedesktop.org
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c 
b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
index 6b6d22c19411..0ba955611dfb 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
@@ -198,7 +198,7 @@ static void flush_tlb_invalidate(struct drm_i915_gem_object 
*obj)
 
for_each_gt(gt, i915, id) {
if (!obj->mm.tlb[id])
-   return;
+   continue;
 
intel_gt_invalidate_tlb_full(gt, obj->mm.tlb[id]);
obj->mm.tlb[id] = 0;
-- 
2.41.0
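
The one-word change in this patch is easy to misread, so here is a small userspace sketch of the difference. The array tlb[] stands in for obj->mm.tlb[], clearing an entry stands in for the per-GT invalidation, and the return value (a count of flushed GTs) is added purely so the behaviour can be observed; the real function returns void. All names are hypothetical.

```c
#include <stddef.h>

/* Old shape: the first GT with nothing pending ends the whole walk. */
static int flush_with_return(unsigned int *tlb, size_t ngt)
{
	int flushed = 0;

	for (size_t id = 0; id < ngt; id++) {
		if (!tlb[id])
			return flushed;	/* later GTs are never visited */
		tlb[id] = 0;		/* stands in for the invalidation */
		flushed++;
	}
	return flushed;
}

/* Patched shape: skip the idle GT and keep walking. */
static int flush_with_continue(unsigned int *tlb, size_t ngt)
{
	int flushed = 0;

	for (size_t id = 0; id < ngt; id++) {
		if (!tlb[id])
			continue;
		tlb[id] = 0;
		flushed++;
	}
	return flushed;
}
```

With tlb[] = { 0, 7 }, the first variant leaves the second entry untouched, while the second variant clears it. That is the multi-GT case the commit message describes: GT0 has nothing pending but GT1 does.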



Re: [Intel-gfx] [PATCH v7 1/4] drm/i915: Introduce intel_gt_mcr_lock_sanitize()

2023-09-29 Thread Nirmoy Das



On 9/28/2023 11:42 PM, Andrzej Hajda wrote:

On 28.09.2023 15:00, Nirmoy Das wrote:

Implement intel_gt_mcr_lock_sanitize() to provide a mechanism
for cleaning the steer semaphore when absolutely necessary.

v2: remove unnecessary lock(Andi, Matt)
 improve the kernel doc(Matt)
 s/intel_gt_mcr_lock_clear/intel_gt_mcr_lock_sanitize

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 22 ++
  drivers/gpu/drm/i915/gt/intel_gt_mcr.h |  1 +
  2 files changed, 23 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c

index bf4a933de03a..326c2ed1d99b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -419,6 +419,28 @@ void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags)
  intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);
  }
  
+/**
+ * intel_gt_mcr_lock_sanitize - Sanitize MCR steering lock
+ * @gt: GT structure
+ *
+ * This will be used to sanitize the initial status of the hardware lock
+ * during driver load and resume since there won't be any concurrent access
+ * from other agents at those times, but it's possible that boot firmware
+ * may have left the lock in a bad state.
+ *
+ */
+void intel_gt_mcr_lock_sanitize(struct intel_gt *gt)
+{
+    /*
+     * This gets called at load/resume time, so we shouldn't be
+     * racing with other driver threads grabbing the mcr lock.
+     */
+    lockdep_assert_not_held(&gt->mcr_lock);
+
+    if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
+        intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);


I wonder if it wouldn't be useful to check and report whether it is locked
before the unconditional release; no strong feelings.

Not so useful for the user, but maybe as a debug log if we need it.


Reviewed-by: Andrzej Hajda 



Thanks,

Nirmoy



Regards
Andrzej



+}
+
  /**
   * intel_gt_mcr_read - read a specific instance of an MCR register
   * @gt: GT structure
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h

index 41684495b7da..01ac565a56a4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
@@ -11,6 +11,7 @@
  void intel_gt_mcr_init(struct intel_gt *gt);
  void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags);
  void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags);
+void intel_gt_mcr_lock_sanitize(struct intel_gt *gt);
    u32 intel_gt_mcr_read(struct intel_gt *gt,
    i915_mcr_reg_t reg,




Re: [PATCH v7 1/4] drm/i915: Introduce intel_gt_mcr_lock_sanitize()

2023-09-29 Thread Nirmoy Das
Thanks for reviewing this series. I merged it into gt-next, so hopefully we
have a bit greener CI for MTL now.



Regards,

Nirmoy

On 9/28/2023 3:00 PM, Nirmoy Das wrote:

Implement intel_gt_mcr_lock_sanitize() to provide a mechanism
for cleaning the steer semaphore when absolutely necessary.

v2: remove unnecessary lock(Andi, Matt)
 improve the kernel doc(Matt)
 s/intel_gt_mcr_lock_clear/intel_gt_mcr_lock_sanitize

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 22 ++
  drivers/gpu/drm/i915/gt/intel_gt_mcr.h |  1 +
  2 files changed, 23 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index bf4a933de03a..326c2ed1d99b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -419,6 +419,28 @@ void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags)
intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);
  }
  
+/**

+ * intel_gt_mcr_lock_sanitize - Sanitize MCR steering lock
+ * @gt: GT structure
+ *
+ * This will be used to sanitize the initial status of the hardware lock
+ * during driver load and resume since there won't be any concurrent access
+ * from other agents at those times, but it's possible that boot firmware
+ * may have left the lock in a bad state.
+ *
+ */
+void intel_gt_mcr_lock_sanitize(struct intel_gt *gt)
+{
+   /*
+    * This gets called at load/resume time, so we shouldn't be
+    * racing with other driver threads grabbing the mcr lock.
+    */
+   lockdep_assert_not_held(&gt->mcr_lock);
+
+   if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
+   intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);
+}
+
  /**
   * intel_gt_mcr_read - read a specific instance of an MCR register
   * @gt: GT structure
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
index 41684495b7da..01ac565a56a4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
@@ -11,6 +11,7 @@
  void intel_gt_mcr_init(struct intel_gt *gt);
  void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags);
  void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags);
+void intel_gt_mcr_lock_sanitize(struct intel_gt *gt);
  
  u32 intel_gt_mcr_read(struct intel_gt *gt,

  i915_mcr_reg_t reg,


[PATCH v7 4/4] drm/i915/mtl: Skip MCR ops for ring fault register

2023-09-28 Thread Nirmoy Das
On MTL, GEN12_RING_FAULT_REG is not replicated, so don't
do MCR-based operations on this register.

v2: use MEDIA_VER() instead of GRAPHICS_VER()(Matt).
v3: s/"MEDIA_VER(i915) == 13"/"MEDIA_VER(i915) >= 13"(Matt)
improve comment.
v4: improve the comment further(Andi)

Signed-off-by: Nirmoy Das 
Reviewed-by: Matt Roper 
Reviewed-by: Andi Shyti 
Reviewed-by: Andrzej Hajda 
---
 drivers/gpu/drm/i915/gt/intel_gt.c  | 13 -
 drivers/gpu/drm/i915/gt/intel_gt_regs.h |  1 +
 drivers/gpu/drm/i915/i915_gpu_error.c   | 11 ++-
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c 
b/drivers/gpu/drm/i915/gt/intel_gt.c
index 93062c35e072..dff8bba1f5d4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -262,10 +262,21 @@ intel_gt_clear_error_registers(struct intel_gt *gt,
   I915_MASTER_ERROR_INTERRUPT);
}
 
-   if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) {
+   /*
+* For the media GT, this ring fault register is not replicated,
+* so don't do multicast/replicated register read/write operation on it.
+*/
+   if (MEDIA_VER(i915) >= 13 && gt->type == GT_MEDIA) {
+   intel_uncore_rmw(uncore, XELPMP_RING_FAULT_REG,
+RING_FAULT_VALID, 0);
+   intel_uncore_posting_read(uncore,
+ XELPMP_RING_FAULT_REG);
+
+   } else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) {
intel_gt_mcr_multicast_rmw(gt, XEHP_RING_FAULT_REG,
   RING_FAULT_VALID, 0);
intel_gt_mcr_read_any(gt, XEHP_RING_FAULT_REG);
+
} else if (GRAPHICS_VER(i915) >= 12) {
intel_uncore_rmw(uncore, GEN12_RING_FAULT_REG, RING_FAULT_VALID, 0);
intel_uncore_posting_read(uncore, GEN12_RING_FAULT_REG);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index cca4bac8f8b0..eecd0a87a647 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1084,6 +1084,7 @@
 
 #define GEN12_RING_FAULT_REG   _MMIO(0xcec4)
 #define XEHP_RING_FAULT_REG    MCR_REG(0xcec4)
+#define XELPMP_RING_FAULT_REG  _MMIO(0xcec4)
 #define   GEN8_RING_FAULT_ENGINE_ID(x) (((x) >> 12) & 0x7)
 #define   RING_FAULT_GTTSEL_MASK   (1 << 11)
 #define   RING_FAULT_SRCID(x)  (((x) >> 3) & 0xff)
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
b/drivers/gpu/drm/i915/i915_gpu_error.c
index f4ebcfb70289..b4e31e59c799 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1234,7 +1234,16 @@ static void engine_record_registers(struct 
intel_engine_coredump *ee)
if (GRAPHICS_VER(i915) >= 6) {
ee->rc_psmi = ENGINE_READ(engine, RING_PSMI_CTL);
 
-   if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50))
+   /*
+* For the media GT, this ring fault register is not replicated,
+* so don't do multicast/replicated register read/write
+* operation on it.
+*/
+   if (MEDIA_VER(i915) >= 13 && engine->gt->type == GT_MEDIA)
+   ee->fault_reg = intel_uncore_read(engine->uncore,
+                 XELPMP_RING_FAULT_REG);
+
+   else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50))
ee->fault_reg = intel_gt_mcr_read_any(engine->gt,
                  XEHP_RING_FAULT_REG);
else if (GRAPHICS_VER(i915) >= 12)
-- 
2.41.0
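
The ordering of the checks in this patch matters: the media-GT test has to come before the graphics-version test, otherwise a platform with graphics version 12.70 would always take the multicast path. A minimal sketch of that decision ladder follows. The enum names, pick_fault_reg(), and the local IP_VER() packing are illustrative assumptions, not the driver's definitions.

```c
/* Local version packing for this sketch; treat the encoding as an assumption. */
#define IP_VER(ver, rel) (((ver) << 8) | (rel))

enum gt_type { GT_PRIMARY, GT_MEDIA };

enum fault_reg_access {
	FAULT_REG_OTHER,		/* older platforms, not modelled here */
	FAULT_REG_GEN12_UNICAST,	/* plain MMIO access */
	FAULT_REG_XEHP_MULTICAST,	/* MCR (replicated) access */
	FAULT_REG_XELPMP_UNICAST,	/* media GT: register is not replicated */
};

static enum fault_reg_access
pick_fault_reg(int graphics_ver_full, int media_ver, enum gt_type type)
{
	/* Must be tested first: the next check would also match such a device. */
	if (media_ver >= 13 && type == GT_MEDIA)
		return FAULT_REG_XELPMP_UNICAST;

	if (graphics_ver_full >= IP_VER(12, 50))
		return FAULT_REG_XEHP_MULTICAST;

	if (graphics_ver_full >= IP_VER(12, 0))
		return FAULT_REG_GEN12_UNICAST;

	return FAULT_REG_OTHER;
}
```

For a device with graphics version 12.70 and media version 13, the media GT resolves to the unicast access and the primary GT to the multicast one, which matches the split made in both intel_gt_clear_error_registers() and engine_record_registers() in the diff above.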



[PATCH v7 3/4] drm/i915: Clean steer semaphore on resume

2023-09-28 Thread Nirmoy Das
During resume, the steer semaphore on GT1 was observed to be held. The
hardware team has confirmed the safety of clearing steer semaphores
for all GTs during driver load/resume, as no lock acquisitions can occur
in this process by other agents.

v2: reset on resume not in intel_gt_init().
v3: do the reset on intel_gt_resume_early()
v4: do general sanitization for all GTs(Matt)

Signed-off-by: Nirmoy Das 
Reviewed-by: Andi Shyti 
---
 drivers/gpu/drm/i915/gt/intel_gt_pm.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index dab73980c9f1..3461f3e74277 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -13,6 +13,7 @@
 #include "intel_engine_pm.h"
 #include "intel_gt.h"
 #include "intel_gt_clock_utils.h"
+#include "intel_gt_mcr.h"
 #include "intel_gt_pm.h"
 #include "intel_gt_print.h"
 #include "intel_gt_requests.h"
@@ -218,6 +219,15 @@ void intel_gt_pm_fini(struct intel_gt *gt)
 
 void intel_gt_resume_early(struct intel_gt *gt)
 {
+   /*
+* Sanitize steer semaphores during driver resume. This is necessary
+* to address observed cases of steer semaphores being
+* held after a suspend operation. Confirmation from the hardware team
+* assures the safety of this operation, as no lock acquisitions
+* by other agents occur during driver load/resume process.
+*/
+   intel_gt_mcr_lock_sanitize(gt);
+
intel_uncore_resume_early(gt->uncore);
intel_gt_check_and_clear_faults(gt);
 }
-- 
2.41.0



[PATCH v7 2/4] drm/i915: Introduce the intel_gt_resume_early()

2023-09-28 Thread Nirmoy Das
Move early resume functions of gt to a proper file.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_gt_pm.c | 6 ++
 drivers/gpu/drm/i915/gt/intel_gt_pm.h | 1 +
 drivers/gpu/drm/i915/i915_driver.c| 6 ++
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 5a942af0a14e..dab73980c9f1 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -216,6 +216,12 @@ void intel_gt_pm_fini(struct intel_gt *gt)
intel_rc6_fini(&gt->rc6);
 }
 
+void intel_gt_resume_early(struct intel_gt *gt)
+{
+   intel_uncore_resume_early(gt->uncore);
+   intel_gt_check_and_clear_faults(gt);
+}
+
 int intel_gt_resume(struct intel_gt *gt)
 {
struct intel_engine_cs *engine;
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index 6c9a46452364..b1eeb5b33918 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -78,6 +78,7 @@ void intel_gt_pm_fini(struct intel_gt *gt);
 void intel_gt_suspend_prepare(struct intel_gt *gt);
 void intel_gt_suspend_late(struct intel_gt *gt);
 int intel_gt_resume(struct intel_gt *gt);
+void intel_gt_resume_early(struct intel_gt *gt);
 
 void intel_gt_runtime_suspend(struct intel_gt *gt);
 int intel_gt_runtime_resume(struct intel_gt *gt);
diff --git a/drivers/gpu/drm/i915/i915_driver.c 
b/drivers/gpu/drm/i915/i915_driver.c
index d50347e5773a..78501a83ba10 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1327,10 +1327,8 @@ static int i915_drm_resume_early(struct drm_device *dev)
drm_err(&dev_priv->drm,
"Resume prepare failed: %d, continuing anyway\n", ret);
 
-   for_each_gt(gt, dev_priv, i) {
-   intel_uncore_resume_early(gt->uncore);
-   intel_gt_check_and_clear_faults(gt);
-   }
+   for_each_gt(gt, dev_priv, i)
+   intel_gt_resume_early(gt);
 
intel_display_power_resume_early(dev_priv);
 
-- 
2.41.0



[PATCH v7 1/4] drm/i915: Introduce intel_gt_mcr_lock_sanitize()

2023-09-28 Thread Nirmoy Das
Implement intel_gt_mcr_lock_sanitize() to provide a mechanism
for cleaning the steer semaphore when absolutely necessary.

v2: remove unnecessary lock(Andi, Matt)
improve the kernel doc(Matt)
s/intel_gt_mcr_lock_clear/intel_gt_mcr_lock_sanitize

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 22 ++
 drivers/gpu/drm/i915/gt/intel_gt_mcr.h |  1 +
 2 files changed, 23 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index bf4a933de03a..326c2ed1d99b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -419,6 +419,28 @@ void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags)
intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);
 }
 
+/**
+ * intel_gt_mcr_lock_sanitize - Sanitize MCR steering lock
+ * @gt: GT structure
+ *
+ * This will be used to sanitize the initial status of the hardware lock
+ * during driver load and resume since there won't be any concurrent access
+ * from other agents at those times, but it's possible that boot firmware
+ * may have left the lock in a bad state.
+ *
+ */
+void intel_gt_mcr_lock_sanitize(struct intel_gt *gt)
+{
+   /*
+    * This gets called at load/resume time, so we shouldn't be
+    * racing with other driver threads grabbing the mcr lock.
+    */
+   lockdep_assert_not_held(&gt->mcr_lock);
+
+   if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
+   intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);
+}
+
 /**
  * intel_gt_mcr_read - read a specific instance of an MCR register
  * @gt: GT structure
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
index 41684495b7da..01ac565a56a4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
@@ -11,6 +11,7 @@
 void intel_gt_mcr_init(struct intel_gt *gt);
 void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags);
 void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags);
+void intel_gt_mcr_lock_sanitize(struct intel_gt *gt);
 
 u32 intel_gt_mcr_read(struct intel_gt *gt,
  i915_mcr_reg_t reg,
-- 
2.41.0
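
To make the purpose of the sanitize step concrete, here is a toy model of a hardware semaphore that was left held by an earlier agent. The diff only shows that writing 0x1 releases the semaphore; the read-to-acquire behaviour modelled in sem_read() is an assumption made for this sketch, and struct sem_reg and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy hardware semaphore; 'taken' models state left behind by firmware. */
struct sem_reg {
	bool taken;
};

/* Assumed behaviour: a read returns 1 and takes the semaphore if it is free. */
static uint32_t sem_read(struct sem_reg *s)
{
	if (s->taken)
		return 0;
	s->taken = true;
	return 1;
}

/* Writing 0x1 releases the semaphore, as the unlock path in the diff does. */
static void sem_write(struct sem_reg *s, uint32_t val)
{
	if (val == 0x1)
		s->taken = false;
}

/* Bounded acquire: gives up after 'tries' polls instead of spinning forever. */
static bool sem_lock(struct sem_reg *s, int tries)
{
	while (tries-- > 0) {
		if (sem_read(s) == 1)
			return true;
	}
	return false;
}

/* Unconditional release, only safe when no other agent can hold the lock. */
static void sem_sanitize(struct sem_reg *s)
{
	sem_write(s, 0x1);
}
```

If the semaphore starts out taken, sem_lock() times out no matter how often it polls; after sem_sanitize() the next acquire succeeds. The unconditional release is only reasonable under the condition the kernel-doc states: it runs at load/resume, when no other agent is expected to hold the lock.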



Re: [PATCH] drm/i915: Don't set PIPE_CONTROL_FLUSH_L3 for aux inval

2023-09-28 Thread Nirmoy Das

Hi Tapani,

On 9/27/2023 6:13 AM, Tapani Pälli wrote:
This fixes all the regressions we saw. I also ran some extra Vulkan and GL
workloads; no regressions observed.


Tested-by: Tapani Pälli 



Thanks for testing it. The patch is now merged with the "# v5.8+" tag, so it
should trickle down to v6.4.10 through the normal stable release process.


Thanks,

Nirmoy



On 26.9.2023 17.24, Nirmoy Das wrote:

PIPE_CONTROL_FLUSH_L3 is not needed for aux invalidation
so don't set that.

Fixes: 78a6ccd65fa3 ("drm/i915/gt: Ensure memory quiesced before invalidation")

Cc: Jonathan Cavitt 
Cc: Andi Shyti 
Cc:  # v5.8+
Cc: Andrzej Hajda 
Cc: Tvrtko Ursulin 
Cc: Matt Roper 
Cc: Tejas Upadhyay 
Cc: Lucas De Marchi 
Cc: Prathap Kumar Valsan 
Cc: Tapani Pälli 
Cc: Mark Janes 
Cc: Rodrigo Vivi 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/gen8_engine_cs.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c 
b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c

index 0143445dba83..ba4c2422b340 100644
--- a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
@@ -271,8 +271,17 @@ int gen12_emit_flush_rcs(struct i915_request *rq, u32 mode)

  if (GRAPHICS_VER_FULL(rq->i915) >= IP_VER(12, 70))
  bit_group_0 |= PIPE_CONTROL_CCS_FLUSH;
  +    /*
+ * L3 fabric flush is needed for AUX CCS invalidation
+ * which happens as part of pipe-control so we can
+ * ignore PIPE_CONTROL_FLUSH_L3. Also PIPE_CONTROL_FLUSH_L3
+ * deals with Protected Memory which is not needed for
+ * AUX CCS invalidation and lead to unwanted side effects.
+ */
+    if (mode & EMIT_FLUSH)
+    bit_group_1 |= PIPE_CONTROL_FLUSH_L3;
+
  bit_group_1 |= PIPE_CONTROL_TILE_CACHE_FLUSH;
-    bit_group_1 |= PIPE_CONTROL_FLUSH_L3;
  bit_group_1 |= PIPE_CONTROL_RENDER_TARGET_CACHE_FLUSH;
  bit_group_1 |= PIPE_CONTROL_DEPTH_CACHE_FLUSH;
  /* Wa_1409600907:tgl,adl-p */


Re: [PATCH v6 2/4] drm/i915: Introduce the intel_gt_resume_early()

2023-09-28 Thread Nirmoy Das

Hi Andi,

On 9/28/2023 9:24 AM, Andi Shyti wrote:

Hi Nirmoy,

On Wed, Sep 27, 2023 at 11:03:55PM +0200, Nirmoy Das wrote:

Move early resume functions of gt to a proper file.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_gt_pm.c | 6 ++
  drivers/gpu/drm/i915/gt/intel_gt_pm.h | 1 +
  drivers/gpu/drm/i915/i915_driver.c| 6 ++
  3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 5a942af0a14e..dab73980c9f1 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -216,6 +216,12 @@ void intel_gt_pm_fini(struct intel_gt *gt)
intel_rc6_fini(&gt->rc6);
  }
  
+void intel_gt_resume_early(struct intel_gt *gt)

+{
+   intel_uncore_resume_early(gt->uncore);
+   intel_gt_check_and_clear_faults(gt);
+}
+

should this go into the gt/ directory?



I didn't want to add anything new to i915_driver that can be done at
lower levels, as suggested by Jani, and this seems like it should go in the gt dir.


Besides, if we don't need
spinlocks in the whole reset function, we could directly have the

intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);

here, so that we avoid having one line functions.


I think steer specific functions should stay in one file/header.


Thanks,

Nirmoy




Andi


Re: [PATCH v6 1/4] drm/i915: Introduce intel_gt_mcr_lock_reset()

2023-09-28 Thread Nirmoy Das



On 9/28/2023 9:18 AM, Andi Shyti wrote:

Hi Nirmoy,

your client is still missing my e-mails? :)


I did reply with a question!





+void intel_gt_mcr_lock_reset(struct intel_gt *gt)
+{
+   unsigned long __flags;
+
+   lockdep_assert_not_held(&gt->uncore->lock);
+
+   spin_lock_irqsave(&gt->mcr_lock, __flags);
+
+   if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
+   intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);
+
+   spin_unlock_irqrestore(&gt->mcr_lock, __flags);

I commented here that probably we don't need the locks. And I
asked you whether you had any reason for locking here.



I asked the question back to you, and I sent this before you could reply.

Will remove the lock.


Thanks,

Nirmoy



Andi


+}


Re: [PATCH v6 1/4] drm/i915: Introduce intel_gt_mcr_lock_reset()

2023-09-28 Thread Nirmoy Das



On 9/28/2023 12:23 AM, Matt Roper wrote:

On Wed, Sep 27, 2023 at 11:03:54PM +0200, Nirmoy Das wrote:

Implement intel_gt_mcr_lock_reset() to provide a mechanism
for resetting the steer semaphore when absolutely necessary.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 29 ++
  drivers/gpu/drm/i915/gt/intel_gt_mcr.h |  1 +
  2 files changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index bf4a933de03a..d98e0d2fc2ee 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -419,6 +419,35 @@ void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned 
long flags)
intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);
  }
  
+/**

+ * intel_gt_mcr_lock_reset - Reset MCR steering lock
+ * @gt: GT structure
+ *
+ * Performs a steer semaphore reset operation. On MTL and beyond, a hardware
+ * lock will also be taken to serialize access not only for the driver,
+ * but also for external hardware and firmware agents.

The text here makes it sound like this reset function is going to take
the lock.  Since we have the same language in the lock() function's
kerneldoc, I think you can just delete this whole sentence.


+ * However, there may be situations where the driver must reset the semaphore
+ * but only when it is absolutely certain that no other agent should own the
+ * lock at that given time.

This part leads to questions about what such situations would be and how
we'd know it's safe to use.  Maybe it's best to just say something like
"This will be used to sanitize the initial status of the hardware lock
during driver load and resume since there won't be any concurrent access
from other agents at those times, but it's possible that boot firmware
may have left the lock in a bad state."

That sounds better than mine. I will just keep that as the description and
remove the rest.



+ *
+ * Context: Takes gt->mcr_lock.  uncore->lock should *not* be held when this
+ *  function is called, although it may be acquired after this
+ *  function call.
+ */
+void intel_gt_mcr_lock_reset(struct intel_gt *gt)
+{
+   unsigned long __flags;
+
+   lockdep_assert_not_held(&gt->uncore->lock);
+
+   spin_lock_irqsave(&gt->mcr_lock, __flags);

If we're doing this to sanitize at load/resume, then presumably we
shouldn't ever be racing with other driver threads either, right?



Driver load and suspend/resume are single-threaded afaik, but I think I
just needed double confirmation.


Will remove the lock.


Thanks,

Nirmoy


  If it
was possible for some other thread to already be grabbing the MCR lock,
then that would mean it also isn't safe for us to reset it here either.


Matt


+
+   if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
+   intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);
+
+   spin_unlock_irqrestore(&gt->mcr_lock, __flags);
+}
+
  /**
   * intel_gt_mcr_read - read a specific instance of an MCR register
   * @gt: GT structure
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
index 41684495b7da..485c7711f2e8 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
@@ -11,6 +11,7 @@
  void intel_gt_mcr_init(struct intel_gt *gt);
  void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags);
  void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags);
+void intel_gt_mcr_lock_reset(struct intel_gt *gt);
  
  u32 intel_gt_mcr_read(struct intel_gt *gt,

  i915_mcr_reg_t reg,
--
2.41.0
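
The review above settles on dropping the spinlock, since driver load and resume run without concurrent driver threads or other agents touching steering. A minimal, self-contained sketch of what such a lockless variant could look like follows. It is a toy model, not the i915 code: struct toy_gt, TOY_IP_VER() and toy_gt_mcr_lock_sanitize() are hypothetical names, and the convention that a semaphore value of 0 means "held" is an assumption made for the illustration. Only the write of 0x1 on IP version 12.70 and later is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy packing of a major/minor IP version into one comparable integer. */
#define TOY_IP_VER(ver, rel) (((ver) << 8) | (rel))

/* Toy stand-in for struct intel_gt; only what the sketch needs. */
struct toy_gt {
	unsigned int graphics_ver_full; /* e.g. TOY_IP_VER(12, 70) */
	uint32_t steer_semaphore;       /* assumed: 0 = held, 1 = free */
};

/*
 * Release the steer semaphore without taking any software lock.
 * Only valid when no other thread or agent can be using steering,
 * i.e. during driver load and resume.
 */
static inline void toy_gt_mcr_lock_sanitize(struct toy_gt *gt)
{
	assert(gt != NULL);

	/* Platforms older than 12.70 have no hardware semaphore to reset. */
	if (gt->graphics_ver_full >= TOY_IP_VER(12, 70))
		gt->steer_semaphore = 0x1;
}
```

The function name mirrors the intel_gt_mcr_lock_sanitize() declaration that shows up in a newer revision of the header hunk in this archive; everything else in the sketch is illustrative.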



Re: [PATCH v6 3/4] drm/i915: Reset steer semaphore for media GT on resume

2023-09-28 Thread Nirmoy Das



On 9/28/2023 9:19 AM, Andi Shyti wrote:

Hi Nirmoy,

On Wed, Sep 27, 2023 at 11:03:56PM +0200, Nirmoy Das wrote:

During resume, the steer semaphore on GT1 was observed to be held. The
hardware team has confirmed the safety of clearing the steer semaphore
during driver load/resume, as no lock acquisitions can occur in this
process by other agents.

v2: reset on resume not in intel_gt_init().
v3: do the reset on intel_gt_resume_early()

Signed-off-by: Nirmoy Das 

In the previous version I added my r-b here.

I moved the code to a different function, so I wanted to be sure :)

  Please consider it
for the next version:

Reviewed-by: Andi Shyti 

Even though there are still some questions coming from Matt.


I will make it generic for all GTs in the next version.


Thanks,

Nirmoy



Thanks,
Andi


---
  drivers/gpu/drm/i915/gt/intel_gt_pm.c | 12 
  1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index dab73980c9f1..59cebf205b72 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -13,6 +13,7 @@
  #include "intel_engine_pm.h"
  #include "intel_gt.h"
  #include "intel_gt_clock_utils.h"
+#include "intel_gt_mcr.h"
  #include "intel_gt_pm.h"
  #include "intel_gt_print.h"
  #include "intel_gt_requests.h"
@@ -218,6 +219,17 @@ void intel_gt_pm_fini(struct intel_gt *gt)
  
  void intel_gt_resume_early(struct intel_gt *gt)

  {
+   /*
+* Reset the steer semaphore on GT1, as we have observed it
+* remaining held after a suspend operation. Confirmation
+* from the hardware team ensures the safety of resetting
+* the steer semaphore during driver load/resume, as there
+* are no lock acquisitions during this process by other
+* agents.
+*/
+   if (MEDIA_VER(gt->i915) >= 13 && gt->type == GT_MEDIA)
+   intel_gt_mcr_lock_reset(gt);
+
intel_uncore_resume_early(gt->uncore);
intel_gt_check_and_clear_faults(gt);
  }
--
2.41.0


Re: [PATCH v6 3/4] drm/i915: Reset steer semaphore for media GT on resume

2023-09-28 Thread Nirmoy Das



On 9/28/2023 12:35 AM, Matt Roper wrote:

On Wed, Sep 27, 2023 at 11:03:56PM +0200, Nirmoy Das wrote:

During resume, the steer semaphore on GT1 was observed to be held. The
hardware team has confirmed the safety of clearing the steer semaphore
during driver load/resume, as no lock acquisitions can occur in this
process by other agents.

I guess the question is whether we just want to handle the one case
where we've already seen a BIOS snapshot screw up (i.e., specifically on
GT1 during resume), or do we want to make this a general sanitization
that we do on both GTs at both load and resume, just to be safe?  Given
that the hardware team has indicated no external agents would be
expected to be using steering at the point the driver is
loading/resuming, maybe it's best to always do the sanitization on
platforms that have a hardware semaphore?


Agreed, will make it for all GTs.


Thanks,

Nirmoy




Matt


v2: reset on resume not in intel_gt_init().
v3: do the reset on intel_gt_resume_early()

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gt/intel_gt_pm.c | 12 
  1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index dab73980c9f1..59cebf205b72 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -13,6 +13,7 @@
  #include "intel_engine_pm.h"
  #include "intel_gt.h"
  #include "intel_gt_clock_utils.h"
+#include "intel_gt_mcr.h"
  #include "intel_gt_pm.h"
  #include "intel_gt_print.h"
  #include "intel_gt_requests.h"
@@ -218,6 +219,17 @@ void intel_gt_pm_fini(struct intel_gt *gt)
  
  void intel_gt_resume_early(struct intel_gt *gt)

  {
+   /*
+* Reset the steer semaphore on GT1, as we have observed it
+* remaining held after a suspend operation. Confirmation
+* from the hardware team ensures the safety of resetting
+* the steer semaphore during driver load/resume, as there
+* are no lock acquisitions during this process by other
+* agents.
+*/
+   if (MEDIA_VER(gt->i915) >= 13 && gt->type == GT_MEDIA)
+   intel_gt_mcr_lock_reset(gt);
+
intel_uncore_resume_early(gt->uncore);
intel_gt_check_and_clear_faults(gt);
  }
--
2.41.0
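
Matt's suggestion, which the reply accepts, is to stop special-casing GT1 and sanitize the semaphore on every GT of a platform that has one. The sketch below models that direction under stated assumptions: struct toy_gt, its has_hw_semaphore flag (standing in for the platform version check), and both helper functions are hypothetical, and the "resumed" field is only a placeholder for the real uncore resume and fault clearing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_gt_type { TOY_GT_PRIMARY, TOY_GT_MEDIA };

/* Toy stand-in for struct intel_gt. */
struct toy_gt {
	enum toy_gt_type type;
	int has_hw_semaphore;     /* stands in for the IP version check */
	uint32_t steer_semaphore; /* assumed: 0 = held, 1 = free */
	int resumed;              /* placeholder for uncore resume + fault clear */
};

/* Per-GT early resume: sanitize first, regardless of GT type. */
static inline void toy_gt_resume_early(struct toy_gt *gt)
{
	assert(gt != NULL);

	if (gt->has_hw_semaphore)
		gt->steer_semaphore = 0x1;

	gt->resumed = 1;
}

/* Device-level loop, analogous to iterating with for_each_gt(). */
static inline void toy_resume_early_all(struct toy_gt *gts, size_t count)
{
	for (size_t i = 0; i < count; i++)
		toy_gt_resume_early(&gts[i]);
}
```

Compared with the posted hunk, the only behavioural difference in this model is that the GT_MEDIA condition is gone, so a primary GT left with a held semaphore would be cleaned up as well.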



[PATCH v6 4/4] drm/i915/mtl: Skip MCR ops for ring fault register

2023-09-27 Thread Nirmoy Das
On MTL GEN12_RING_FAULT_REG is not replicated so don't
do mcr based operation for this register.

v2: use MEDIA_VER() instead of GRAPHICS_VER()(Matt).
v3: s/"MEDIA_VER(i915) == 13"/"MEDIA_VER(i915) >= 13"(Matt)
improve comment.
v4: improve the comment further(Andi)

Signed-off-by: Nirmoy Das 
Reviewed-by: Matt Roper 
Reviewed-by: Andi Shyti 
Reviewed-by: Andrzej Hajda 
---
 drivers/gpu/drm/i915/gt/intel_gt.c  | 13 -
 drivers/gpu/drm/i915/gt/intel_gt_regs.h |  1 +
 drivers/gpu/drm/i915/i915_gpu_error.c   | 11 ++-
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c 
b/drivers/gpu/drm/i915/gt/intel_gt.c
index 93062c35e072..dff8bba1f5d4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -262,10 +262,21 @@ intel_gt_clear_error_registers(struct intel_gt *gt,
   I915_MASTER_ERROR_INTERRUPT);
}
 
-   if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) {
+   /*
+* For the media GT, this ring fault register is not replicated,
+* so don't do multicast/replicated register read/write operation on it.
+*/
+   if (MEDIA_VER(i915) >= 13 && gt->type == GT_MEDIA) {
+   intel_uncore_rmw(uncore, XELPMP_RING_FAULT_REG,
+RING_FAULT_VALID, 0);
+   intel_uncore_posting_read(uncore,
+ XELPMP_RING_FAULT_REG);
+
+   } else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) {
intel_gt_mcr_multicast_rmw(gt, XEHP_RING_FAULT_REG,
   RING_FAULT_VALID, 0);
intel_gt_mcr_read_any(gt, XEHP_RING_FAULT_REG);
+
} else if (GRAPHICS_VER(i915) >= 12) {
intel_uncore_rmw(uncore, GEN12_RING_FAULT_REG, 
RING_FAULT_VALID, 0);
intel_uncore_posting_read(uncore, GEN12_RING_FAULT_REG);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index cca4bac8f8b0..eecd0a87a647 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1084,6 +1084,7 @@
 
 #define GEN12_RING_FAULT_REG   _MMIO(0xcec4)
 #define XEHP_RING_FAULT_REGMCR_REG(0xcec4)
+#define XELPMP_RING_FAULT_REG  _MMIO(0xcec4)
 #define   GEN8_RING_FAULT_ENGINE_ID(x) (((x) >> 12) & 0x7)
 #define   RING_FAULT_GTTSEL_MASK   (1 << 11)
 #define   RING_FAULT_SRCID(x)  (((x) >> 3) & 0xff)
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
b/drivers/gpu/drm/i915/i915_gpu_error.c
index f4ebcfb70289..b4e31e59c799 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1234,7 +1234,16 @@ static void engine_record_registers(struct 
intel_engine_coredump *ee)
if (GRAPHICS_VER(i915) >= 6) {
ee->rc_psmi = ENGINE_READ(engine, RING_PSMI_CTL);
 
-   if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50))
+   /*
+* For the media GT, this ring fault register is not replicated,
+* so don't do multicast/replicated register read/write
+* operation on it.
+*/
+   if (MEDIA_VER(i915) >= 13 && engine->gt->type == GT_MEDIA)
+   ee->fault_reg = intel_uncore_read(engine->uncore,
+ 
XELPMP_RING_FAULT_REG);
+
+   else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50))
ee->fault_reg = intel_gt_mcr_read_any(engine->gt,
  
XEHP_RING_FAULT_REG);
else if (GRAPHICS_VER(i915) >= 12)
-- 
2.41.0
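
The branch order in this patch matters: the media GT check has to come before the graphics version check, because a device-wide graphics version of 12.70 would otherwise select the MCR path for the media GT too. The following self-contained sketch models only that selection logic. All names (toy_platform, toy_access, toy_ring_fault_access) are hypothetical, and the version numbers used in the usage example assume an MTL-like device with graphics IP 12.70 and media IP 13.

```c
#include <assert.h>
#include <stddef.h>

/* Toy packing of a major/minor IP version into one comparable integer. */
#define TOY_IP_VER(ver, rel) (((ver) << 8) | (rel))

enum toy_gt_type { TOY_GT_PRIMARY, TOY_GT_MEDIA };

enum toy_access {
	TOY_ACCESS_OTHER, /* older paths, not shown in the hunk */
	TOY_ACCESS_MMIO,  /* plain, non-replicated register access */
	TOY_ACCESS_MCR,   /* multicast/replicated register access */
};

struct toy_platform {
	unsigned int graphics_ver_full; /* e.g. TOY_IP_VER(12, 70) */
	unsigned int media_ver;         /* e.g. 13 */
};

/* Pick how the ring fault register should be accessed for a given GT. */
static inline enum toy_access
toy_ring_fault_access(const struct toy_platform *p, enum toy_gt_type type)
{
	assert(p != NULL);

	/* Media GT first: the register is not replicated there. */
	if (p->media_ver >= 13 && type == TOY_GT_MEDIA)
		return TOY_ACCESS_MMIO;

	if (p->graphics_ver_full >= TOY_IP_VER(12, 50))
		return TOY_ACCESS_MCR;

	if ((p->graphics_ver_full >> 8) >= 12)
		return TOY_ACCESS_MMIO;

	return TOY_ACCESS_OTHER;
}
```

With the assumed MTL-like values, the media GT resolves to plain MMIO while the primary GT still resolves to MCR, which is the split the patch introduces in both intel_gt_clear_error_registers() and engine_record_registers().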



[PATCH v6 3/4] drm/i915: Reset steer semaphore for media GT on resume

2023-09-27 Thread Nirmoy Das
During resume, the steer semaphore on GT1 was observed to be held. The
hardware team has confirmed the safety of clearing the steer semaphore
during driver load/resume, as no lock acquisitions can occur in this
process by other agents.

v2: reset on resume not in intel_gt_init().
v3: do the reset on intel_gt_resume_early()

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_gt_pm.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index dab73980c9f1..59cebf205b72 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -13,6 +13,7 @@
 #include "intel_engine_pm.h"
 #include "intel_gt.h"
 #include "intel_gt_clock_utils.h"
+#include "intel_gt_mcr.h"
 #include "intel_gt_pm.h"
 #include "intel_gt_print.h"
 #include "intel_gt_requests.h"
@@ -218,6 +219,17 @@ void intel_gt_pm_fini(struct intel_gt *gt)
 
 void intel_gt_resume_early(struct intel_gt *gt)
 {
+   /*
+* Reset the steer semaphore on GT1, as we have observed it
+* remaining held after a suspend operation. Confirmation
+* from the hardware team ensures the safety of resetting
+* the steer semaphore during driver load/resume, as there
+* are no lock acquisitions during this process by other
+* agents.
+*/
+   if (MEDIA_VER(gt->i915) >= 13 && gt->type == GT_MEDIA)
+   intel_gt_mcr_lock_reset(gt);
+
intel_uncore_resume_early(gt->uncore);
intel_gt_check_and_clear_faults(gt);
 }
-- 
2.41.0



[PATCH v6 2/4] drm/i915: Introduce the intel_gt_resume_early()

2023-09-27 Thread Nirmoy Das
Move early resume functions of gt to a proper file.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_gt_pm.c | 6 ++
 drivers/gpu/drm/i915/gt/intel_gt_pm.h | 1 +
 drivers/gpu/drm/i915/i915_driver.c| 6 ++
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 5a942af0a14e..dab73980c9f1 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -216,6 +216,12 @@ void intel_gt_pm_fini(struct intel_gt *gt)
intel_rc6_fini(&gt->rc6);
 }
 
+void intel_gt_resume_early(struct intel_gt *gt)
+{
+   intel_uncore_resume_early(gt->uncore);
+   intel_gt_check_and_clear_faults(gt);
+}
+
 int intel_gt_resume(struct intel_gt *gt)
 {
struct intel_engine_cs *engine;
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
index 6c9a46452364..b1eeb5b33918 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h
@@ -78,6 +78,7 @@ void intel_gt_pm_fini(struct intel_gt *gt);
 void intel_gt_suspend_prepare(struct intel_gt *gt);
 void intel_gt_suspend_late(struct intel_gt *gt);
 int intel_gt_resume(struct intel_gt *gt);
+void intel_gt_resume_early(struct intel_gt *gt);
 
 void intel_gt_runtime_suspend(struct intel_gt *gt);
 int intel_gt_runtime_resume(struct intel_gt *gt);
diff --git a/drivers/gpu/drm/i915/i915_driver.c 
b/drivers/gpu/drm/i915/i915_driver.c
index d50347e5773a..78501a83ba10 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1327,10 +1327,8 @@ static int i915_drm_resume_early(struct drm_device *dev)
drm_err(&dev_priv->drm,
"Resume prepare failed: %d, continuing anyway\n", ret);
 
-   for_each_gt(gt, dev_priv, i) {
-   intel_uncore_resume_early(gt->uncore);
-   intel_gt_check_and_clear_faults(gt);
-   }
+   for_each_gt(gt, dev_priv, i)
+   intel_gt_resume_early(gt);
 
intel_display_power_resume_early(dev_priv);
 
-- 
2.41.0



[PATCH v6 1/4] drm/i915: Introduce intel_gt_mcr_lock_reset()

2023-09-27 Thread Nirmoy Das
Implement intel_gt_mcr_lock_reset() to provide a mechanism
for resetting the steer semaphore when absolutely necessary.

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 29 ++
 drivers/gpu/drm/i915/gt/intel_gt_mcr.h |  1 +
 2 files changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index bf4a933de03a..d98e0d2fc2ee 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -419,6 +419,35 @@ void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned 
long flags)
intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);
 }
 
+/**
+ * intel_gt_mcr_lock_reset - Reset MCR steering lock
+ * @gt: GT structure
+ *
+ * Performs a steer semaphore reset operation. On MTL and beyond, a hardware
+ * lock will also be taken to serialize access not only for the driver,
+ * but also for external hardware and firmware agents.
+ * However, there may be situations where the driver must reset the semaphore
+ * but only when it is absolutely certain that no other agent should own the
+ * lock at that given time.
+ *
+ * Context: Takes gt->mcr_lock.  uncore->lock should *not* be held when this
+ *  function is called, although it may be acquired after this
+ *  function call.
+ */
+void intel_gt_mcr_lock_reset(struct intel_gt *gt)
+{
+   unsigned long __flags;
+
+   lockdep_assert_not_held(&gt->uncore->lock);
+
+   spin_lock_irqsave(&gt->mcr_lock, __flags);
+
+   if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
+   intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1);
+
+   spin_unlock_irqrestore(&gt->mcr_lock, __flags);
+}
+
 /**
  * intel_gt_mcr_read - read a specific instance of an MCR register
  * @gt: GT structure
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h 
b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
index 41684495b7da..485c7711f2e8 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h
@@ -11,6 +11,7 @@
 void intel_gt_mcr_init(struct intel_gt *gt);
 void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags);
 void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags);
+void intel_gt_mcr_lock_reset(struct intel_gt *gt);
 
 u32 intel_gt_mcr_read(struct intel_gt *gt,
  i915_mcr_reg_t reg,
-- 
2.41.0



[PATCH v5 3/3] drm/i915/mtl: Skip MCR ops for ring fault register

2023-09-27 Thread Nirmoy Das
On MTL GEN12_RING_FAULT_REG is not replicated so don't
do mcr based operation for this register.

v2: use MEDIA_VER() instead of GRAPHICS_VER()(Matt).
v3: s/"MEDIA_VER(i915) == 13"/"MEDIA_VER(i915) >= 13"(Matt)
improve comment.
v4: improve the comment further(Andi)

Signed-off-by: Nirmoy Das 
Reviewed-by: Matt Roper 
Reviewed-by: Andi Shyti 
Reviewed-by: Andrzej Hajda 
---
 drivers/gpu/drm/i915/gt/intel_gt.c  | 13 -
 drivers/gpu/drm/i915/gt/intel_gt_regs.h |  1 +
 drivers/gpu/drm/i915/i915_gpu_error.c   | 11 ++-
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c 
b/drivers/gpu/drm/i915/gt/intel_gt.c
index 93062c35e072..dff8bba1f5d4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -262,10 +262,21 @@ intel_gt_clear_error_registers(struct intel_gt *gt,
   I915_MASTER_ERROR_INTERRUPT);
}
 
-   if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) {
+   /*
+* For the media GT, this ring fault register is not replicated,
+* so don't do multicast/replicated register read/write operation on it.
+*/
+   if (MEDIA_VER(i915) >= 13 && gt->type == GT_MEDIA) {
+   intel_uncore_rmw(uncore, XELPMP_RING_FAULT_REG,
+RING_FAULT_VALID, 0);
+   intel_uncore_posting_read(uncore,
+ XELPMP_RING_FAULT_REG);
+
+   } else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) {
intel_gt_mcr_multicast_rmw(gt, XEHP_RING_FAULT_REG,
   RING_FAULT_VALID, 0);
intel_gt_mcr_read_any(gt, XEHP_RING_FAULT_REG);
+
} else if (GRAPHICS_VER(i915) >= 12) {
intel_uncore_rmw(uncore, GEN12_RING_FAULT_REG, 
RING_FAULT_VALID, 0);
intel_uncore_posting_read(uncore, GEN12_RING_FAULT_REG);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h 
b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index cca4bac8f8b0..eecd0a87a647 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1084,6 +1084,7 @@
 
 #define GEN12_RING_FAULT_REG   _MMIO(0xcec4)
 #define XEHP_RING_FAULT_REGMCR_REG(0xcec4)
+#define XELPMP_RING_FAULT_REG  _MMIO(0xcec4)
 #define   GEN8_RING_FAULT_ENGINE_ID(x) (((x) >> 12) & 0x7)
 #define   RING_FAULT_GTTSEL_MASK   (1 << 11)
 #define   RING_FAULT_SRCID(x)  (((x) >> 3) & 0xff)
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
b/drivers/gpu/drm/i915/i915_gpu_error.c
index f4ebcfb70289..b4e31e59c799 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1234,7 +1234,16 @@ static void engine_record_registers(struct 
intel_engine_coredump *ee)
if (GRAPHICS_VER(i915) >= 6) {
ee->rc_psmi = ENGINE_READ(engine, RING_PSMI_CTL);
 
-   if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50))
+   /*
+* For the media GT, this ring fault register is not replicated,
+* so don't do multicast/replicated register read/write
+* operation on it.
+*/
+   if (MEDIA_VER(i915) >= 13 && engine->gt->type == GT_MEDIA)
+   ee->fault_reg = intel_uncore_read(engine->uncore,
+ 
XELPMP_RING_FAULT_REG);
+
+   else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50))
ee->fault_reg = intel_gt_mcr_read_any(engine->gt,
  
XEHP_RING_FAULT_REG);
else if (GRAPHICS_VER(i915) >= 12)
-- 
2.41.0



[PATCH v5 2/3] drm/i915: Reset steer semaphore for media GT on resume

2023-09-27 Thread Nirmoy Das
During resume, the steer semaphore on GT1 was observed to be held. The
hardware team has confirmed the safety of clearing the steer semaphore
during driver load/resume, as no lock acquisitions can occur in this
process by other agents.

v2: reset on resume not in intel_gt_init().

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/i915/gt/intel_gt_pm.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index 5a942af0a14e..b19062e30b9b 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -13,6 +13,7 @@
 #include "intel_engine_pm.h"
 #include "intel_gt.h"
 #include "intel_gt_clock_utils.h"
+#include "intel_gt_mcr.h"
 #include "intel_gt_pm.h"
 #include "intel_gt_print.h"
 #include "intel_gt_requests.h"
@@ -228,6 +229,17 @@ int intel_gt_resume(struct intel_gt *gt)
 
GT_TRACE(gt, "\n");
 
+   /*
+* Reset the steer semaphore on GT1, as we have observed it
+* remaining held after a suspend operation. Confirmation
+* from the hardware team ensures the safety of resetting
+* the steer semaphore during driver load/resume, as there
+* are no lock acquisitions during this process by other
+* agents.
+*/
+   if (MEDIA_VER(gt->i915) >= 13 && gt->type == GT_MEDIA)
+   intel_gt_mcr_lock_reset(gt);
+
/*
 * After resume, we may need to poke into the pinned kernel
 * contexts to paper over any damage caused by the sudden suspend.
-- 
2.41.0


